# Using the Semantic Information G Measure to Explain and Extend Rate-Distortion Functions and Maximum Entropy Distributions


## Abstract


## 1. Introduction

An NEF such as exp[sd(x, y)] is not a probability function because its sums, ∑_{x} exp[sd(x, y)] and ∑_{y} exp[sd(x, y)], are not 1. Its main feature is that its maximum value is exp(0) = 1. An NEF is more like a membership function [10], similarity function, or Distribution Constraint Function (DCF), which softly constrains a probability distribution. Why can we put an NEF as a probability function into a Bayes-like formula? Can we find a simple explanation for why these NEFs and partition functions exist in different areas? Although the relationship between the rate-distortion function and the maximum entropy method has been studied [11], we still need a more general probability theory or probability framework to explain the above phenomenon.

- The distortion function d(x, y) is subjectively defined, lacking an objective standard;
- It is hard to define the distortion function using distances when we use labels to replace instances in machine learning, where the number of possible labels should be much less than the number of possible instances. For example, we need to use “Light rain”, “Moderate rain”, “Heavy rain”, “Light to moderate rain”, etc., to replace daily precipitations in millimeters, or use “Child”, “Youth”, “Adult”, “Elder”, etc., to replace people’s ages. In these cases, it is not easy to construct distortion functions;
- We cannot apply the rate-distortion function for semantic compression, e.g., data compression according to labels’ semantic meanings.

Suppose y_{j} denotes one of the labels: y_{1} = “Non-adult”, y_{2} = “Youth”, y_{3} = “Adult”, and y_{4} = “Elder”. All x values that make y_{j} true form a fuzzy set. The truth functions of these labels are also the membership functions of the fuzzy sets (see Figure 1). According to Davidson’s truth-conditional semantics [15], the truth function T(y_{j}|x) ascertains the semantic meaning of y_{j}. For a given P(x) and the constraint, we need to find the Shannon channel P(y|x) that minimizes the mutual information between y and x. The Minimum Mutual Information (MMI) should be the lower limit of the average codeword length for coding x to y. The constraint condition states that labels’ selections should accord with the rules of the languages used, as expressed by the truth functions. In this case, the rate-distortion function cannot work well because it is not easy to define a distortion function that is compatible with the labels’ semantic meanings.

- the truth function is a learning function; it can be obtained from a sample or sampling distribution [31] and hence is not subjectively defined;
- using the transformation relation, we can indirectly express the distortion function between any instance x and any label y by the truth function that may come from machine learning;
- truth functions indicate labels’ semantic meanings and, hence, can be used as the constraint condition for semantic compression.

- the NEF and the partition functions are the truth function and the logical probability, respectively;
- the formula, such as Equation (2), for the distribution of Minimum Mutual Information (MMI) or maximum entropy is the semantic Bayes’ formula;
- MMI R(D) can be expressed by the semantic mutual information formula;
- maximum entropy is equal to extremely maximum entropy minus semantic mutual information.

- help readers understand rate-distortion functions and maximum entropy distributions from the new perspective;
- combine machine learning (for the distortion function) and data compression;
- show that the rate-distortion function can be extended to the rate-truth function and the rate-verisimilitude function for communication data’s semantic compression.

## 2. Background

#### 2.1. Shannon’s Entropies and Mutual Information

**Definition 1.**

- Variable x denotes an instance; X denotes a discrete random variable taking a value x ∈ U = {x_{1}, x_{2}, …, x_{m}}.
- Variable y denotes a hypothesis or label; Y denotes a discrete random variable taking a value y ∈ V = {y_{1}, y_{2}, …, y_{n}}.

We call P(y_{j}|x) (with a certain y_{j} and variable x) a Transition Probability Function (TPF) [34] (p. 18). A Shannon channel is a transition probability matrix or a group of TPFs: P(y_{j}|x), j = 1, 2, …, n. When y_{j} is given, I(X; Y) becomes the Kullback–Leibler (KL) divergence

I(X; y_{j}) = ∑_{i} P(x_{i}|y_{j}) log[P(x_{i}|y_{j})/P(x_{i})],

which is 0 when P(x|y_{j}) = P(x).

#### 2.2. Rate-Distortion Function R(D)

Let the distortion function be d_{ij} = d(x_{i}, y_{j}), i = 1, 2, …; j = 1, 2, …; let $\overline{d}$ be the average of d(x, y); and let D be the upper limit of $\overline{d}$. An often-used distortion function is d(x, y) = (x − y)^{2}. This function fits cases where x and y have the same universe (e.g., U = V), and distortion only depends on the distance between x and y.

The multiplier λ_{i} is defined with λ_{i} ≡ exp(μ_{i}/P(x_{i})). To optimize P(y), we fix P(y_{j}|x_{i}) in F and set $\partial F/\partial P({y}_{j})=0$. Hence, we derive α = 1 and the optimized P(y):

P(y_{j}) = ∑_{i} P(x_{i})P(y_{j}|x_{i}), j = 1, 2, …, n. (11)

We may start with P(y_{1}) = P(y_{2}) = … = 1/n and then repeat Equations (10) and (11) until P(y) is unchanged. We call this iteration the MMI iteration.

We want the TPFs P(y_{j}|x), j = 1, 2, …, but it is difficult to find them directly because of the constraint conditions in Equations (7) and (8). Therefore, we can only find the proper posterior distributions P(y|x_{i}) of y, i = 1, 2, …, by the MMI iteration to obtain the TPFs indirectly. Here Z_{i} is the partition function, and exp(sd_{ij}) is a negative exponential function. Thus, a larger |s| will result in a narrower exp(sd_{ij}), a larger R, and a smaller D.

#### 2.3. The Maximum Entropy Method

For the constraint functions f_{k}(x, y), k = 1, 2, …, the constraint conditions are:

## 3. The Author’s Related Work

#### 3.1. The P-T Probability Framework

**Definition 2.**

- The y_{j} is a label or a hypothesis; y_{j}(x_{i}) is a proposition. The θ_{j} is a fuzzy subset of universe U, whose elements make y_{j} true. We have y_{j}(x) ≡ “x ∈ θ_{j}” ≡ “x belongs to θ_{j}” (“≡” means the two sides are logically equivalent according to the definition). The θ_{j} may also be a model or a group of model parameters.
- A probability that is defined with “=”, such as P(y_{j}) ≡ P(Y = y_{j}), is a statistical probability. A probability that is defined with “∈”, such as P(X ∈ θ_{j}), is a logical probability. To distinguish P(Y = y_{j}) from P(X ∈ θ_{j}), we define T(y_{j}) ≡ T(θ_{j}) ≡ P(X ∈ θ_{j}) as the logical probability of y_{j}.
- T(y_{j}|x) ≡ T(θ_{j}|x) ≡ P(X ∈ θ_{j}|X = x) is the truth function of y_{j} and the membership function of θ_{j}. It varies between 0 and 1, and its maximum is 1.

A semantic channel is a group of truth functions: T(y_{j}|x), j = 1, 2, …, n. If θ_{j} is a crisp set, T(θ_{j}) becomes a cumulative probability or the difference of two cumulative probabilities [36]. In general, T(y_{1}) + T(y_{2}) + … + T(y_{n}) > 1. For example, the sum of the logical probabilities of the four labels in Figure 1 is greater than 1 since T(y_{1}) + T(y_{3}) = 1 already. Detailed discussions about the distinctions and relations between statistical probability and logical probability can be found in [32].

We can put T(y_{j}|x) and P(x) into the Bayes’ formula to obtain a likelihood function:

P(x|θ_{j}) = T(θ_{j}|x)P(x)/T(θ_{j}), with T(θ_{j}) = ∑_{i} T(θ_{j}|x_{i})P(x_{i}). (19)

P(x|θ_{j}) is called the semantic Bayes’ prediction. It is often written as P(x|y_{j}, θ) in popular methods. We call the above formula the semantic Bayes’ formula. Using P(x|θ_{j}), we can obtain:

T(θ_{j}|x) = T(θ_{j})P(x|θ_{j})/P(x), where T(θ_{j}) = 1/max[P(x|θ_{j})/P(x)]. (20)

The choice of T(θ_{j}) in Equation (20) ensures that truth functions are symmetrical, e.g., T(x|y) = T(y|x), as distortion functions are. It also allows comparing two truth functions T(y_{j}|x) and T(y_{k}|x) for classification according to the correlation between x and y (see Equation (23)). Since P(x|θ_{j}) in Equation (19) is unchanged when T(θ_{j}|x) is replaced with cT(θ_{j}|x) (where c is a positive constant), this choice does not influence the other uses of T(θ_{j}|x).
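The semantic Bayes’ formula is straightforward to compute from a prior P(x) and a matrix of truth values (a sketch; the truth-value matrix in the usage below is invented for illustration):

```python
import numpy as np

def semantic_bayes(Px, T):
    """Semantic Bayes' prediction P(x|theta_j) = T(theta_j|x) P(x) / T(theta_j).

    Px: prior P(x), shape (m,)
    T:  truth values T[j, i] = T(theta_j|x_i), each in [0, 1], shape (n, m)
    Returns the predictions P(x|theta_j) (rows sum to 1) and the
    logical probabilities T(theta_j) = sum_i P(x_i) T(theta_j|x_i).
    """
    T_theta = T @ Px                    # logical probabilities
    Pxt = T * Px / T_theta[:, None]     # likelihood functions P(x|theta_j)
    return Pxt, T_theta
```

Note that the logical probabilities T(θ_j) may sum to more than 1, unlike statistical probabilities P(y_j).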

#### 3.2. The Semantic Information G Measure

We define the semantic information conveyed by y_{j} in relation to x_{i} with the log-normalized-likelihood:

I(x_{i}; θ_{j}) = log[P(x_{i}|θ_{j})/P(x_{i})] = log[T(θ_{j}|x_{i})/T(θ_{j})].

I(x_{i}; θ_{j}), or its average, is the semantic information G measure or, simply, the G measure. If T(θ_{j}|x) is always 1, the G measure becomes Carnap and Bar-Hillel’s semantic information measure [21]. I(x_{i}; θ_{j}) is also explained as the verisimilitude between y_{j} and x_{i} [32].

The truth function T(θ_{j}|x) can also be interpreted as the confusion probability function of x_{j} with x, or the discrimination function of x_{j} [31].

Averaging I(x_{i}; θ_{j}) over i, we obtain the generalized KL information:

I(X; θ_{j}) = ∑_{i} P(x_{i}|y_{j}) log[T(θ_{j}|x_{i})/T(θ_{j})],

where P(x_{i}|y_{j}), i = 1, 2, …, is the sampling distribution, which may be unsmooth or discontinuous. It is easy to prove I(X; θ_{j}) ≤ I(X; y_{j}) [31].

When P(x|y_{j}) is smooth, we may let P(x|θ_{j}) = P(x|y_{j}) or T(θ_{j}|x) ∝ P(y_{j}|x) to obtain the optimized truth function:

T*(θ_{j}|x) = [P(y_{j}|x)/P(y_{j})] / max[P(y_{j}|x)/P(y_{j})].

Because of the symmetry mentioned above, T*(y_{j}|x_{i}) = T*(θ_{xi}|y_{j}) or T*(y_{j}|x_{i}) = T*(x_{i}|y_{j}), where θ_{xi} is a fuzzy subset of V. Furthermore, we have:

T*(y_{j}|x) = cP(y_{j}|x), with c = 1/max[P(y_{j}|x)],

which means that T*(y_{j}|x) is proportional to the TPF P(y_{j}|x). This formula accords with Wittgenstein’s thought that meaning lies in use [38] (p. 80).
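The one-step computation of the optimized truth functions from a sampling distribution can be sketched as follows (the joint distribution in the usage is invented, and `learn_truth_functions` is a hypothetical helper name, not from the paper):

```python
import numpy as np

def learn_truth_functions(Pxy):
    """Optimized truth functions T*(theta_j|x) from a sampling distribution.

    Pxy[i, j] = P(x_i, y_j). Returns T[j, i] = [P(y_j|x_i)/P(y_j)] / max_i[...],
    i.e., each truth function is proportional to its TPF P(y_j|x),
    rescaled so its maximum value is 1.
    """
    Px = Pxy.sum(axis=1)
    Py = Pxy.sum(axis=0)
    Pyx = Pxy / Px[:, None]          # TPFs P(y_j|x_i)
    ratio = Pyx / Py                 # P(y_j|x)/P(y_j) = P(x|y_j)/P(x)
    return (ratio / ratio.max(axis=0)).T
```

Because dividing by P(y_j) cancels in the normalization, the result equals P(y_j|x)/max P(y_j|x), i.e., T* = cP(y_j|x) as stated above.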

When P(x|y_{j}) is unsmooth, we may achieve a smooth T*(θ_{j}|x) with parameters by maximizing the generalized KL information I(X; θ_{j}) over the parameters. Averaging I(x_{i}; θ_{j}) over different y_{j}, we obtain the semantic mutual information:

I(X; Y_{θ}) = ∑_{j} ∑_{i} P(x_{i}, y_{j}) log[T(θ_{j}|x_{i})/T(θ_{j})] = H(Y_{θ}) − H(Y_{θ}|X),

where H(Y_{θ}) = −∑_{j} P(y_{j}) log T(θ_{j}) is a cross-entropy. Since ∑_{j} T(θ_{j}) ≥ 1, we also call H(Y_{θ}) a generalized entropy or a semantic entropy. H(Y_{θ}|X) = −∑_{j} ∑_{i} P(x_{i}, y_{j}) log T(θ_{j}|x_{i}) is called a fuzzy entropy.

When we let P(x|θ_{j}) = P(x|y_{j}) or T(θ_{j}|x) ∝ P(y_{j}|x) for every j (Matching I), I(X; Y_{θ}) reaches its maximum I(X; Y). If we use a group of truth functions or a semantic channel T(y|x) as the constraint function to seek the MMI, we need to let P(x|y_{j}) = P(x|θ_{j}) or P(y_{j}|x) ∝ T(θ_{j}|x) as far as possible for every j (Matching II). Section 4.3 and Section 4.4 further discuss Matching II.
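Matching I is easy to verify numerically: when the truth functions are proportional to the TPFs, the semantic mutual information equals the Shannon mutual information (a sketch; the joint distribution below is invented):

```python
import numpy as np

# A hypothetical joint distribution P(x_i, y_j) with all entries positive
Pxy = np.array([[0.18, 0.06, 0.01],
                [0.08, 0.22, 0.05],
                [0.02, 0.08, 0.30]])
Px = Pxy.sum(axis=1)
Py = Pxy.sum(axis=0)
Pyx = Pxy / Px[:, None]              # Shannon channel P(y_j|x_i)

# Matching I: T(theta_j|x) proportional to P(y_j|x), with maximum value 1
T = (Pyx / Pyx.max(axis=0)).T        # T[j, i] = T(theta_j|x_i)
T_theta = T @ Px                     # logical probabilities T(theta_j)

I_shannon = np.sum(Pxy * np.log2(Pxy / np.outer(Px, Py)))
# Semantic MI: sum_ij P(x_i, y_j) log [T(theta_j|x_i) / T(theta_j)]
I_semantic = np.sum(Pxy * np.log2(T.T / T_theta))
```

The proportionality constant c_j cancels between T(θ_j|x_i) and T(θ_j) = c_j P(y_j), so the two quantities agree exactly.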

If T(θ_{j}|x) = exp[−(x − x_{j})^{2}/(2σ^{2})], we have:

I(X; Y_{θ}) = H(Y_{θ}) − (log e) ∑_{j} ∑_{i} P(x_{i}, y_{j})(x_{i} − x_{j})^{2}/(2σ^{2}),

where H(Y_{θ}) is like the regularization term, and the next term is the relative squared error term. Therefore, we can treat the maximum semantic mutual information criterion as a particular Regularized Least Squares (RLS) criterion.

## 4. Theoretical Results

#### 4.1. The New Explanations of the MMI Distribution and the Rate-Distortion Function

We can treat exp[sd(x_{i}, y)] in the rate-distortion function as a truth function and Z_{i} as a logical probability. We can let θ_{xi} be a fuzzy subset of V (V = {y_{1}, y_{2}, …}), let T(x_{i}|y) ≡ T(θ_{xi}|y) be the truth function of y(x_{i}), and let T(x_{i}) = T(θ_{xi}) = ∑_{j} P(y_{j})T(y_{j}|x_{i}) be the logical probability of x_{i}. Hence, we can observe that the MMI distribution P(y|x) in the rate-distortion function is produced by the semantic Bayes’ formula:

P(y|x_{i}) = P(y)exp[sd(x_{i}, y)]/Z_{i} = P(y)T(x_{i}|y)/T(x_{i}), i = 1, 2, …, m.

#### 4.2. Setting Up the Relation between the Truth Function and the Distortion Function

Consider using label y_{j} to represent x_{i}. If the truth value of y_{j}(x_{i}) is 1, the distortion d(x_{i}, y_{j}) should be 0, and the distortion should increase as the truth value decreases. Hence, we use the following definition:

**Definition 3.**

d(x_{i}, y_{j}) ≡ log[1/T(y_{j}|x_{i})] = −log T(y_{j}|x_{i}).

With this definition, H(Y_{θ}|X) = $\overline{d}$. Therefore, we can use H(Y_{θ}|X) to replace $\overline{d}$ for the constraint condition.

#### 4.3. Rate-Truth Function R(Θ)

Replacing d(x, y) in the rate-distortion function with −log T(y|x), we obtain the optimized posterior distribution P(y|x_{i}) of y:

P(y|x_{i}) = P(y)T(y|x_{i})^{|s|} / ∑_{j} P(y_{j})T(y_{j}|x_{i})^{|s|}, i = 1, 2, …, m. (34)

Putting P(y|x_{i}) in Equation (34) into the mutual information formula, we obtain the rate-truth function R(Θ):

R(Θ) = H(X_{θ}) − H(X_{θ}|Y) = I(Y; X_{θ}), which means that the MMI can be expressed by the semantic mutual information I(Y; X_{θ}). Compared with |s| = 1, the constraint is tighter when |s| > 1 and looser when |s| < 1. When the constraint condition is the original truth function without s or with s = −1, we have:

R(Θ) = I(X; Y) = I(Y; X_{θ}) ≥ I(X; Y_{θ}).

The reason for I(Y; X_{θ}) ≥ I(X; Y_{θ}) is that the iteration can only let P(y_{j}|x) be approximately proportional to T(y_{j}|x), and not exactly proportional in general (see Equation (34) and Section 5.2). Since H(X_{θ}|Y) can be understood as the average distortion, R(Θ) also means the lower limit of the average codeword length.
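Under Definition 3, exp[s·d(x, y)] becomes a power of the truth function, so the MMI iteration for R(Θ) can run directly on truth values. The sketch below uses invented logistic/Gaussian truth functions for the four age labels of Figure 1 (the paper’s exact parameters are not reproduced, so the numbers differ from Example 1); it still reproduces the qualitative effect that the dominated label “Elder”, whose truth values never exceed those of “Adult” here, loses its probability share:

```python
import numpy as np

def rate_truth_iteration(Px, T, s=1.0, n_iter=1000):
    """MMI iteration for the rate-truth function R(Theta).

    Px: prior P(x), shape (m,).
    T:  truth values T[i, j] = T(y_j|x_i), strictly positive (floor tiny values).
    s:  plays the role of |s| in the text; exp(s*d) with d = -log T becomes T**s.
    """
    n = T.shape[1]
    Py = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        W = Py * T**s                             # P(y_j) T(y_j|x_i)^s
        Pyx = W / W.sum(axis=1, keepdims=True)    # channel P(y_j|x_i)
        Py = Px @ Pyx                             # updated P(y_j)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(Pyx > 0, Pyx * np.log2(Pyx / Py), 0.0)
    R = np.sum(Px[:, None] * terms)               # mutual information in bits
    return Pyx, Py, R

# Invented truth functions for "Non-adult", "Youth", "Adult", "Elder"
ages = np.arange(100.0)
T = np.column_stack([
    1.0 / (1.0 + np.exp(0.5 * (ages - 18.0))),    # Non-adult
    np.exp(-((ages - 22.0) ** 2) / 50.0),         # Youth
    1.0 / (1.0 + np.exp(-0.5 * (ages - 18.0))),   # Adult
    1.0 / (1.0 + np.exp(-0.3 * (ages - 60.0))),   # Elder (implies Adult here)
])
T = np.maximum(T, 1e-6)                           # keep the iteration well defined
Pyx, Py, R = rate_truth_iteration(np.full(100, 0.01), T, s=1.0)
```

Because T(“Elder”|x) ≤ T(“Adult”|x) for every x in this setup, the iteration drives P(y_4) toward 0, as in Example 1.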

#### 4.4. Rate-Verisimilitude Function R(G)

We replace d_{ij} = d(x_{i}, y_{j}) with I_{ij} = I(x_{i}; θ_{j}). The constraint condition $\overline{d}\le D$ becomes I(X; Y_{θ}) ≥ G, where G denotes the lower limit of the semantic mutual information. Following the rate-distortion function’s derivation, we can obtain:

P(y|x_{i}) = P(y)[T(y|x_{i})/T(y)]^{s} / ∑_{j} P(y_{j})[T(y_{j}|x_{i})/T(y_{j})]^{s}, i = 1, 2, …, m.

P(y|x_{i}) is now like a Softmax function, in which the numerator P(y)[T(y|x_{i})/T(y)]^{s} may be greater than 1. In the E-step of the EM algorithm for mixture models, there is a similar formula, which is also used for MMI [40].
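For R(G), the only change to the iteration is that each label is weighted by [T(y_j|x_i)/T(y_j)]^s; the numerator may exceed 1, which is harmless after normalization. A self-contained sketch (with the same kind of invented truth functions used for the age labels earlier):

```python
import numpy as np

def rate_verisimilitude_iteration(Px, T, s=1.0, n_iter=1000):
    """MMI iteration for the rate-verisimilitude function R(G).

    Px: prior P(x), shape (m,); T: truth values T[i, j] = T(y_j|x_i) > 0.
    The Softmax-like weight is P(y_j) [T(y_j|x_i)/T(y_j)]^s with
    T(y_j) = sum_i P(x_i) T(y_j|x_i).
    """
    n = T.shape[1]
    Ty = Px @ T                                   # logical probabilities T(y_j)
    Py = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        W = Py * (T / Ty) ** s                    # numerator may exceed 1
        Pyx = W / W.sum(axis=1, keepdims=True)    # channel P(y_j|x_i)
        Py = Px @ Pyx
    J = Px[:, None] * Pyx                         # joint P(x_i, y_j)
    G = np.sum(J * np.log2(T / Ty))               # semantic mutual information
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(Pyx > 0, np.log2(Pyx / Py), 0.0)
    R = np.sum(J * terms)                         # Shannon mutual information
    return Pyx, Py, R, G
```

Since I(X; y_j) ≥ I(X; θ_j) for every j, the resulting R is always at least G; and unlike R(Θ), a low-logical-probability label keeps a nonzero share of P(y).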

#### 4.5. The New Explanation of the Maximum Entropy Distribution and the Extension

We can treat the NEF in the maximum entropy distribution as a truth function, Z_{i}′ as a logical probability, and the above formula as a semantic Bayes’ formula. In the resulting expression, I(X; Y_{θ}) is the semantic mutual information, and H(X) + H(Y) is the extremely maximum value of H(X, Y). Therefore, we can explain that the maximum entropy is equal to the extremely maximum entropy minus the semantic mutual information.

#### 4.6. The New Explanation of the Boltzmann Distribution and the Maximum Entropy Law

Here x_{i} means state i, N is the number of particles, and P(x_{i}|T) denotes the probability of a particle (or the density of particles) in state i for the given absolute temperature T. If x_{i} means energy e_{i}, G_{i} is the number of states with energy e_{i}, and G is the total number of all states, then P(x_{i}) = G_{i}/G is the prior probability of x_{i}. Hence, Equations (43) and (44) become:

P(x_{i}|T) = P(x_{i})exp[−e_{i}/(kT)]/Z′, with Z′ = ∑_{i} P(x_{i})exp[−e_{i}/(kT)]. (46)

We can treat exp[−e_{i}/(kT)] as a truth function T(θ_{j}|x), Z′ as a logical probability T(θ_{j}), and Equation (46) as a semantic Bayes’ formula.
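This reading of the Boltzmann distribution can be made concrete: exp[−e_i/(kT)] acts as a truth function, its average under the prior P(x_i) = G_i/G acts as the logical probability Z′, and the semantic Bayes’ formula returns the particle distribution (a sketch with invented energy levels and degeneracies):

```python
import numpy as np

k = 1.380649e-23             # Boltzmann constant (J/K)
T_abs = 300.0                # absolute temperature (K)
e = np.array([0.0, 1.0, 2.0, 4.0]) * 1e-21   # invented energy levels (J)
G_i = np.array([1.0, 3.0, 5.0, 7.0])         # invented numbers of states per level

Px = G_i / G_i.sum()                         # prior P(x_i) = G_i / G
truth = np.exp(-e / (k * T_abs))             # "truth function" exp[-e_i/(kT)], max 1
Z_prime = Px @ truth                         # "logical probability" Z'
P_boltz = Px * truth / Z_prime               # semantic Bayes' formula, Eq. (46)

# The same distribution written the usual way, with weights G_i exp(-e_i/(kT)):
w = G_i * np.exp(-e / (k * T_abs))
P_usual = w / w.sum()
```

The total number of states G cancels between the prior and Z′, so the semantic-Bayes form and the usual degeneracy-weighted form coincide exactly.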

Suppose the subsystems y_{j}, j = 1, 2, …, have different temperatures T_{j}, j = 1, 2, …; the above G then becomes G_{j}. We can then derive Equation (47) (see Appendix A for the details), whose last term is the semantic mutual information I(X; Y_{θ}). This formula indicates that the maximum entropy law in physics can be equivalently stated as the MMI law, and this MMI can be expressed as the semantic mutual information I(X; Y_{θ}).

## 5. Experimental Results

#### 5.1. An Example Shows the Shannon Channel’s Changes in the MMI Iterations for R(Θ) and R(G)

- the MMI iteration lets the Shannon channel match the semantic channel, e.g., lets P(y_{j}|x) ∝ T(y_{j}|x), j = 1, 2, …, as far as possible for the functions R(D), R(Θ), and R(G);
- the MMI iteration can reduce the mutual information.

**Example 1.**

The convergent P(y) is {P(y_{1}), P(y_{2}), P(y_{3}), P(y_{4})} = {0.3499, 0.0022, 0.6367, 0}. The MMI is 0.845 bits. The iterative process not only lets every P(y_{j}|x) be approximately proportional to T(y_{j}|x) but also makes P(y_{4}) closer to 0. The value of P(y_{4}) became 0 because y_{4} implies y_{3} and hence can be replaced with y_{3}, which has a larger logical probability. By replacing y_{4} with y_{3}, we can save on the average codeword length.

For R(G), d(x_{i}, y_{j}) is replaced with I(x_{i}; θ_{j}) = log[T(y_{j}|x_{i})/T(y_{j})] for every i and j. Figure 4 shows the iterative results and the process (see Supplementary Materials to find how the data for Figure 4 are produced).

The convergent P(y) is {P(y_{1}), P(y_{2}), P(y_{3}), P(y_{4})} = {0.3619, 0.0200, 0.6120, 0.0057}. The MMI is 0.883 bits. This P(y_{4}) is not 0. Both P(y_{2}) and P(y_{4}) are larger than those for R(Θ). The reason is that a label with a smaller logical probability can convey more semantic information and, hence, should be selected more frequently if we use the semantic information criterion. For the same s = 1 and Θ, R(G) = 0.883 bits is greater than R(Θ) = 0.845 bits.

#### 5.2. An Example about Greyscale Image Compression Shows the Convergent P(y|x) for the R(Θ)

**Example 2.**

Compress an 8-bit grey image (with grey levels denoted by x_{i}, i = 0, 1, …, 255) into a 3-bit grey image with 8 grey levels (denoted by y_{j}, j = 1, 2, …, 8) [42]. Considering that human visual discrimination changes with grey levels (the higher the grey level is, the lower the discrimination is), we use 8 truth functions, as shown in Figure 5a, to represent 8 fuzzy classes. Appendix C shows how these curves are produced. The task is to solve the MMI R and the corresponding channel P(y|x) with s = 1.

Figure 5 shows how I(X; Y), I(X; Y_{θ}), and H(Y) change in the iterative process (see Supplementary Materials to find how the data for Figure 5 are produced).

If we use a distortion function d(x, y_{j}) instead of the truth function T(y_{j}|x), j = 1, 2, …, 8, it is not easy to design d(x, y_{j}). It is also difficult to predict the convergent P(y|x) according to d(x, y).

It is difficult for I(X; Y_{θ}) to approach I(Y; X_{θ}) because P(y_{j}|x) cannot be strictly proportional to T(y_{j}|x) for j = 2, 3, …, 6.

Every P(y_{j}|x) covers an area in U identical to the area covered by T(y_{j}|x) when s = 1; hence, P(y|x) satisfies the constraint condition defined by T(y|x).

## 6. Discussions

#### 6.1. Connecting Machine Learning and Data Compression by Truth Functions

We can use T(y|x)^{|s|} with |s| > 1 to reduce the average distortion, or with |s| < 1 to loosen the distortion limit. In this way, we can overcome the rate-distortion function’s three disadvantages mentioned in Section 1. In addition, a group of truth functions describes a group of labels’ extensions and semantic meanings and is therefore intuitive. Truth functions have stronger descriptive power than distortion functions because we can always use a truth function to replace a distortion function by T(y_{j}|x) = exp[−d(x, y_{j})], whereas we may not be able to use a distortion function to replace a truth function, such as the truth function of y_{3} = “Adult” in Figure 1.

Replacing the distortion constraint with the semantic information constraint I(X; Y_{θ}) ≥ G, we obtain the rate-verisimilitude function R(G). The function R(G) uses the likelihood criterion or the semantic information criterion. It can reduce the underreporting of events with smaller logical probabilities. For example, the selected probabilities P(y_{2}) and P(y_{4}) in Figure 1 for R(G) (see Figure 4a) are higher than P(y_{2}) and P(y_{4}) for R(Θ) (see Figure 3c). The function R(G) is more suitable than R(D) and R(Θ) when y is a prediction, where information is more important than truth.

#### 6.2. Viewing Rate-Distortion Functions and Maximum Entropy Distributions from the Perspective of Semantic Information G Theory

- For the rate-distortion function R(D), we seek MMI I(X; Y) = H(X) − H(X|Y), which is equivalent to maximizing the posterior entropy H(X|Y) of X. For R(D), we use an iterative algorithm to find the proper P(y). However, in the maximum entropy method, we maximize H(X, Y) or H(Y|X) for given P(x) without the iteration for P(y);
- The rate-distortion function can be expressed by the semantic information G measure (see Equation (31)). In contrast, the maximum entropy is equal or proportional to the extremely maximum entropy minus semantic mutual information (see Equations (41) and (47)).

#### 6.3. How the Experimental Results Support the Explanation for the MMI Iterations

The MMI iteration should let P(y_{j}|x) ∝ T(y_{j}|x), j = 1, 2, …, as far as possible. Example 1 in Section 5.1 shows that the iterative processes for R(Θ) and R(G) not only find the proper probability distribution P(y) (which means a label y_{j} with a larger logical probability will be selected more frequently) but also modify every TPF P(y_{j}|x) so that P(y_{j}|x) is approximately proportional to T(y_{j}|x). Therefore, the results in Figure 3 support the above theoretical analyses.

For R(G), the labels y_{2} and y_{4}, with smaller logical probabilities, have larger P(y_{2}) and P(y_{4}) than those for R(Θ). This result also accords with the theoretical analysis, which indicates that the semantic information criterion can reduce the underreporting of events with smaller logical probabilities.

#### 6.4. Considering Other Kinds of Semantic Information

Consider the semantic information between two labels y_{j} and y_{k}. The corresponding formula uses truth functions, P(x), and logical probabilities (e.g., T(y_{j}, y_{k})), and it does not require that the subsets θ_{1}, θ_{2}, …, θ_{n} form a partition of U. We use an example to explain the reason for using P(x). If P(x) is a population age distribution, we can use two labels y_{1} = “Person in his 50 s” and y_{2} = “Elder person” for a person at age x. The average life span (denoted by t) in different areas and eras might change from 50 years to 80 years. The above formula can ensure that the semantic information between y_{1} and y_{2} for t = 50 is more than that for t = 80.

Averaging I(y_{j}; θ_{k}), we obtain the corresponding semantic mutual information. In the context of control, T(y_{j}|x) becomes the DCF, P(x|y) becomes the realized distribution, information means control complexity or control amount, G means the effective control amount, and R means the control cost. To increase G = I(X; Y_{θ}), we need to fix T(θ_{j}|x) and optimize the control to improve P(x|y). The R(G) function tells us that we can improve P(y|x) with the MMI iteration and increase G by amplifying s. In [32], the author improperly used a different information formula (Equation (24) in [32]) for the above purpose; that formula seemingly only fits cases where the control results are continuous distributions. The author will further discuss this topic in another paper.

- The G measure is only related to labels’ extensions, not to labels’ intensions. For example, “Old” means senility and closeness to death, whereas the G measure is only related to T(“Old”|x), which represents the extension of “Old”;
- The G measure is not enough for measuring the semantic information from fuzzy reasoning according to the context, although the author has discussed how to calculate compound propositions’ truth functions and fuzzy reasoning [32].

#### 6.5. About the Importance of Semantic Information Studies

“These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages. The system must be designed to operate for each possible selection, not just the one which will actually be chosen since this is unknown at the time of design.”


## 7. Conclusions

- Why does the Bayes-like formula (see Equation (2)) with NEFs and partition functions widely exist in the rate-distortion theory, statistical mechanics, the maximum entropy method, and machine learning?
- Can we combine machine learning (for the distortion function) and data compression (with the rate-distortion function)?
- Can we use the rate-distortion function or similar functions for semantic compression?

We can extend R(D) to the rate-truth function R(Θ) by replacing the average distortion constraint with the fuzzy entropy H(Y_{θ}|X) (see Section 4.3). We can also extend R(D) to the rate-verisimilitude function R(G) by replacing the upper limit D of the average distortion with the lower limit G of the semantic mutual information (see Section 4.4). We have used two examples to show how the iteration algorithm lets the Shannon channel P(y|x) match the semantic channel T(y|x) under the semantic constraints to achieve the MMI R. Example 1 reveals that the iteration algorithm can use the logical implication between labels to reduce the mutual information. Example 2 indicates that it is easy to control the Shannon channel P(y|x) with the semantic channel T(y|x) for data reduction.

## Supplementary Materials

## Funding

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Appendix A. The Derivation of Equation (47)

## Appendix B

## Appendix C

| y_{j} | T(y_{2}\|x) | T(y_{3}\|x) | T(y_{4}\|x) | T(y_{5}\|x) | T(y_{6}\|x) | T(y_{7}\|x) |
|---|---|---|---|---|---|---|
| μ_{j} | 14 | 30 | 52 | 80 | 120 | 170 |
| σ_{j}^{2} | 16 | 24 | 50 | 80 | 160 | 240 |

The formula for T(y_{j}|x), j = 2, 3, …, 6, used above has not been used by others. It represents a group of trapezoidal curves with rounded corners. Such curves are often used as membership functions and are usually constructed with piecewise functions; with this formula, the piecewise functions are unnecessary.

## References

- Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J.
**1948**, 27, 379–429+623–656. [Google Scholar] [CrossRef] [Green Version] - Shannon, C.E. Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv. Rec.
**1959**, 4, 142–163. [Google Scholar] - Berger, T. Rate Distortion Theory; Prentice-Hall: Englewood Cliffs, NJ, USA, 1971. [Google Scholar]
- Jaynes, E.T. Information Theory and Statistical Mechanics. Phys. Rev.
**1957**, 106, 620. [Google Scholar] [CrossRef] - Jaynes, E.T. Information Theory and Statistical Mechanics II. Phys. Rev.
**1957**, 108, 171. [Google Scholar] [CrossRef] - Smolensky, P. Chapter 6: Information Processing in Dynamical Systems: Foundations of Harmony Theory. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations; Rumelhart, D.E., McLelland, J.L., Eds.; MIT Press: Cambridge, MA, USA, 1986; pp. 194–281. [Google Scholar]
- Hinton, G.E. A Practical Guide to Training Restricted Boltzmann Machines. In Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science; Montavon, G., Orr, G.B., Müller, K.R., Eds.; Springer: Berlin, Germany, 2012; Volume 7700, pp. 599–619. [Google Scholar] [CrossRef]
- Salakhutdinov, R.; Hinton, G.E. Replicated softmax: An undirected topic model. Neural Inf. Process. Syst.
**2009**, 22, 1607–1614. [Google Scholar] [CrossRef] - Goodfellow, I.; Bengio, Y.; Courville, A. Softmax Units for Multinoulli Output Distributions. In Deep Learning; MIT Press: Cambridge, MA, USA, 2016; pp. 180–184. [Google Scholar]
- Zadeh, L.A. Fuzzy Sets. Inf. Control.
**1965**, 8, 338–353. [Google Scholar] [CrossRef] [Green Version] - Harremoës, P. Maximum Entropy on Compact Groups. Entropy
**2009**, 11, 222–237. [Google Scholar] [CrossRef] [Green Version] - Berger, T.; Gibson, J.D. Lossy Source Coding. IEEE Trans. Inf. Theory
**1998**, 44, 2693–2723. [Google Scholar] [CrossRef] [Green Version] - Gibson, J. Special Issue on Rate Distortion Theory and Information Theory. Entropy
**2018**, 20, 825. [Google Scholar] [CrossRef] [Green Version] - Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: New York, NY, USA, 2006. [Google Scholar]
- Davidson, D. Truth and meaning. Synthese
**1967**, 17, 304–323. [Google Scholar] [CrossRef] - Willems, F.M.J.; Kalker, T. Semantic compaction, transmission, and compression codes. In Proceedings of the International Symposium on Information Theory ISIT 2005, Adelaide, Australia, 4–9 September 2005; pp. 214–218. [Google Scholar] [CrossRef]
- Babu, S.; Garofalakis, M.; Rastogi, R. SPARTAN: A Model-Based Semantic Compression System for Massive Data Tables. ACM SIGMOD Rec.
**2001**, 30, 283–294. [Google Scholar] [CrossRef] - Ceglarek, D.; Haniewicz, K.; Rutkowski, W. Semantic Compression for Specialised Information Retrieval Systems. Adv. Intell. Inf. Database Syst.
**2010**, 283, 111–121. [Google Scholar] - Blau, Y.; Michaeli, T. The perception-distortion tradeoff. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18 June 2018; pp. 6228–6237. [Google Scholar]
- Bardera, A.; Bramon, R.; Ruiz, M.; Boada, I. Rate-Distortion Theory for Clustering in the Perceptual Space. Entropy
**2017**, 19, 438. [Google Scholar] [CrossRef] [Green Version] - Carnap, R.; Bar-Hillel, Y. An Outline of a Theory of Semantic Information. Available online: http://dspace.mit.edu/bitstream/handle/1721.1/4821/RLE-TR-247-03150899.pdf;sequence=1 (accessed on 1 July 2021).
- Klir, G. Generalized information theory. Fuzzy Sets Syst.
**1991**, 40, 127–142. [Google Scholar] [CrossRef] - Floridi, L. Outline of a theory of strongly semantic information. Minds Mach.
**2004**, 14, 197–221. [Google Scholar] [CrossRef] [Green Version] - Zhong, Y.X. A theory of semantic information. China Commun.
**2017**, 14, 1–17. [Google Scholar] [CrossRef] - D’Alfonso, S. On Quantifying Semantic Information. Information
**2011**, 2, 61–101. [Google Scholar] [CrossRef] [Green Version] - Bhandari, D.; Pal, N.R. Some new information measures of fuzzy sets. Inf. Sci.
**1993**, 67, 209–228. [Google Scholar] [CrossRef] - Dębowski, Ł. Approximating Information Measures for Fields. Entropy
**2020**, 22, 79. [Google Scholar] [CrossRef] [Green Version] - Lu, C. A Generalized Information Theory; China Science and Technology University Press: Hefei, China, 1993; ISBN 7-312-00501-2. (In Chinese) [Google Scholar]
- Lu, C. Meanings of generalized entropy and generalized mutual information for coding. J. China Inst. Commun.
**1994**, 15, 37–44. (In Chinese) [Google Scholar] - Lu, C. A generalization of Shannon’s information theory. Int. J. Gen. Syst.
**1999**, 28, 453–490. [Google Scholar] [CrossRef] - Lu, C. Semantic information G theory and logical Bayesian inference for machine learning. Information
**2019**, 10, 261. [Google Scholar] [CrossRef] [Green Version] - Lu, C. The P–T probability framework for semantic communication, falsification, confirmation, and Bayesian reasoning. Philosophies
**2020**, 5, 25. [Google Scholar] [CrossRef] - Lu, C. Channels’ Confirmation and Predictions’ Confirmation: From the Medical Test to the Raven Paradox. Entropy
**2020**, 22, 384. [Google Scholar] [CrossRef] [Green Version] - Shannon, C.E.; Weaver, W. The Mathematical Theory of Communication; The University of Illinois Press: Urbana, IL, USA, 1963. [Google Scholar]
- Zadeh, L.A. Probability measures of fuzzy events. J. Math. Anal. Appl.
**1968**, 23, 421–427. [Google Scholar] [CrossRef] [Green Version] - Cumulative_Distribution_Function. Available online: https://en.wikipedia.org/wiki/Cumulative_distribution_function (accessed on 10 April 2021).
- Popper, K. Conjectures and Refutations; Routledge: London, UK; New York, NY, USA, 2002. [Google Scholar]
- Wittgenstein, L. Philosophical Investigations; Basil Blackwell Ltd.: Oxford, UK, 1958. [Google Scholar]
- Sow, D.M.; Eleftheriadis, A. Complexity distortion theory. IEEE Trans. Inf. Theory
**2003**, 49, 604–608. [Google Scholar] [CrossRef] [Green Version] - Lu, C. Understanding and Accelerating EM Algorithm’s Convergence by Fair Competition Principle and Rate-Verisimilitude Function. arXiv
**2021**, arXiv:2104.12592. [Google Scholar] - Boltzmann Distribution. Available online: https://en.wikipedia.org/wiki/Boltzmann_distribution (accessed on 10 April 2021).
- Binary Images. Available online: https://www.cis.rit.edu/people/faculty/pelz/courses/SIMG203/res.pdf (accessed on 25 June 2021).
- Kutyniok, G. A Rate-Distortion Framework for Explaining Deep Learning. Available online: https://maths-of-data.github.io/Talk_Edinburgh_2020.pdf (accessed on 30 June 2021).
- Nokleby, M.; Beirami, A.; Calderbank, R. A rate-distortion framework for supervised learning. In Proceedings of the 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), Boston, MA, USA, 17–20 September 2015; pp. 1–6. [Google Scholar] [CrossRef]
- John, S.; Gadde, A.; Adsumilli, B. Rate Distortion Optimization Over Large Scale Video Corpus with Machine Learning. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 1286–1290. [Google Scholar] [CrossRef]
- Song, J.; Yuan, C. Learning Boltzmann Machine with EM-like Method. arXiv
**2016**, arXiv:1609.01840. [Google Scholar]

**Figure 1.** Four labels’ (fuzzy) truth functions about people’s ages. T(y_{j}|x) denotes the truth function of the proposition function y_{j}(x), j = 1, 2, 3, 4, used as the constraint condition.

**Figure 3.** P(y|x) and I(X; Y) change in the iterative process for R(Θ). (**a**) P(y|x) after the first iteration (this P(y|x) also results in the maximum entropies H(Y|X) and H(X, Y)); (**b**) P(y|x) after the second iteration; (**c**) P(y|x) after eight iterations; (**d**) I(X; Y), I(Y; X_{θ}), and H(Y) change in the iterative process.

**Figure 4.** The iterative results for the R(G) function. (**a**) The convergent Shannon channel P(y|x); (**b**) I(X; Y_{θ}) and I(X; Y) = I(Y; X_{θ}) change in the iterative process.

**Figure 5.** The results of Example 2. (**a**) Eight truth functions, or the semantic channel T(y|x); (**b**) the convergent Shannon channel P(y|x); (**c**) I(X; Y_{θ}), I(X; Y), and H(Y) change in the iterative process.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Lu, C.
Using the Semantic Information G Measure to Explain and Extend Rate-Distortion Functions and Maximum Entropy Distributions. *Entropy* **2021**, *23*, 1050.
https://doi.org/10.3390/e23081050
