Using the Semantic Information G Measure to Explain and Extend Rate-Distortion Functions and Maximum Entropy Distributions

In the rate-distortion function and the Maximum Entropy (ME) method, Minimum Mutual Information (MMI) distributions and ME distributions are expressed by Bayes-like formulas, including Negative Exponential Functions (NEFs) and partition functions. Why do these non-probability functions exist in Bayes-like formulas? On the other hand, the rate-distortion function has three disadvantages: (1) the distortion function is subjectively defined; (2) the definition of the distortion function between instances and labels is often difficult; (3) it cannot be used for data compression according to the labels' semantic meanings. The author previously proposed the semantic information G measure, which involves both statistical probability and logical probability. We can now explain NEFs as truth functions, partition functions as logical probabilities, Bayes-like formulas as semantic Bayes' formulas, MMI as Semantic Mutual Information (SMI), and ME as extreme ME minus SMI. To overcome the above disadvantages, this paper sets up the relationship between truth functions and distortion functions, obtains truth functions from samples by machine learning, and constructs constraint conditions with truth functions to extend rate-distortion functions. Two examples are used to help readers understand the MMI iteration and to support the theoretical results. Using truth functions and the semantic information G measure, we can combine machine learning and data compression, including semantic compression. We need further studies to explore general data compression and recovery according to semantic meaning.


Introduction
Bayes' formula is used for probability predictions. Using Bayes' formula, from a joint (probability) distribution P(x, y), or distributions P(x|y) and P(y) (where x and y are variables), we can obtain the posterior distribution of y:

P(y|x) = P(x, y)/∑_y P(x, y) = P(y)P(x|y)/∑_y P(y)P(x|y) = P(y)P(x|y)/P(x).
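A minimal numeric sketch of this prediction, using a hypothetical joint distribution:

```python
# Bayes' prediction from a hypothetical joint distribution P(x, y).
# Rows index x in {x1, x2}; columns index y in {y1, y2, y3}.
P_xy = [[0.10, 0.20, 0.05],
        [0.15, 0.30, 0.20]]

def posterior(P_xy, i):
    """P(y|x_i) = P(x_i, y) / sum_y P(x_i, y)."""
    P_xi = sum(P_xy[i])                      # marginal P(x_i)
    return [p / P_xi for p in P_xy[i]]

for i in range(2):
    row = posterior(P_xy, i)
    assert abs(sum(row) - 1.0) < 1e-12       # each posterior sums to 1
```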
Similar expressions with Negative Exponential Functions (NEFs) and partition functions often appear in the rate-distortion theory [1][2][3], statistical mechanics, the maximum entropy method [4,5], and machine learning (see the Restricted Boltzmann Machine (RBM) [6,7] and the Softmax function [8,9]). For example, in the rate-distortion theory, the Minimum Mutual Information (MMI) distribution is:

P(y_j|x_i) = P(y_j)exp(sd_ij)/Z_i, Z_i = ∑_j P(y_j)exp(sd_ij),

where s ≤ 0 and Z_i is the partition function. However, the rate-distortion function has three disadvantages:
• The distortion function d(x, y) is subjectively defined, lacking an objective standard;
• It is hard to define the distortion function using distances when we use labels to replace instances in machine learning, where the possible labels' number should be much less than the possible instances' number. For example, we need to use "Light rain", "Moderate rain", "Heavy rain", "Light to moderate rain", etc., to replace daily precipitations in millimeters, or use "Child", "Youth", "Adult", "Elder", etc., to replace people's ages. In these cases, it is not easy to construct distortion functions;
• We cannot apply the rate-distortion function to semantic compression, e.g., data compression according to labels' semantic meanings.
The first two disadvantages remind us that we need to obtain allowable distortion functions from samples or sampling distributions by machine learning.
We consider the example of semantic compression related to ages. For instance, x denotes an age (instance) and y_j denotes the labels: y_1 = "Non-adult", y_2 = "Youth", y_3 = "Adult", and y_4 = "Elder". All x values that make y_j true form a fuzzy set. The truth functions of these labels are also the membership functions of the fuzzy sets (see Figure 1). According to Davidson's truth-conditional semantics [15], truth function T(y_j|x) ascertains the semantic meaning of y_j. For a given P(x) and the constraint, we need to find the Shannon channel P(y|x) that minimizes the mutual information between y and x. The Minimum Mutual Information (MMI) should be the lower limit of the average codeword length for coding x to y. The constraint condition states that labels' selections should accord with the rules of the languages used, as expressed by the truth functions. In this case, the rate-distortion function cannot work well because it is not easy to define a distortion function that is compatible with the labels' semantic meanings. There have been meaningful studies on semantic compression [16][17][18]. However, in these studies, either the information-theoretic method related to the rate-distortion function has not been adopted [18], or no semantic information measure has been used. Data compression or clustering related to perception has been studied in [19,20]; however, discrimination functions (like the truth function) and the sensory information measure have not been adopted. For semantic compression, we need a proper information measure for semantic information and sensory information. We also want a function, like the rate-distortion function, related to labels' semantic meanings.
To measure semantic information, researchers have proposed many semantic information measures [21][22][23][24][25] or information measures related to semantics [26,27]. However, it is not easy to use them for machine learning or semantic compression. For the similar purpose, the author proposed the semantic information G theory, or simply the G theory, in the 1990s [28][29][30]. The letter "G" means the generalization of Shannon's information theory. The semantic information measure, e.g., the G measure, is defined with log(truth_function/logical_probability) or its average. The truth function and the logical probability are similar to the NEF and the partition function, respectively. The truth function can be expressed by not only the NEF, but also the Logistic function and other functions between 0 and 1. The semantic information G measure can also be used to measure semantic information conveyed not only by natural languages but also by sensory organs and measuring instruments, such as thermometers, scales, and GPS devices [31]. For sensory information, truth functions become discrimination functions or confusion probability functions [30].
The G measure measures labels' semantic information only according to their extensions ascertained by truth functions, without considering their intensions. Therefore, we may call this semantic information "formally semantic information". For simplicity, this paper mainly considers the (formally) semantic information between label y and instance x. The semantic information between two labels or sentences is briefly discussed in Section 6.4.
The G theory uses the P-T probability framework [32], which consists of both statistical probability (denoted by P) and logical probability (by T). The P-T probability framework and the G theory have been applied to several areas, such as data compression related to visual discrimination [30], machine learning [31], Bayesian confirmation [33], and the philosophy of science [32].
For overcoming the rate-distortion function's disadvantages, this paper sets up a transformation relation between the truth function and the distortion function and uses the truth function to replace the distortion function in the constraint condition to obtain the rate-truth function. Since a truth function can also be explained as a membership function, similarity function, confusion probability function, or DCF [32], it can be used as a learning function. It is often expressed as an NEF or Logistic function. In this way, we can overcome the rate-distortion function's three disadvantages because:
• the truth function is a learning function; it can be obtained from a sample or sampling distribution [31] and hence is not subjectively defined;
• using the transformation relation, we can indirectly express the distortion function between any instance x and any label y by the truth function that may come from machine learning;
• truth functions indicate labels' semantic meanings and, hence, can be used as the constraint condition for semantic compression.
Combining the author's previous studies [31,32], we can explain that:
• the NEF and the partition function are the truth function and the logical probability, respectively;
• the formula, such as Equation (2), for the distribution of Minimum Mutual Information (MMI) or maximum entropy is the semantic Bayes' formula;
• the MMI R(D) can be expressed by the semantic mutual information formula;
• maximum entropy is equal to extremely maximum entropy minus semantic mutual information.
The new explanations are not against the existing explanations but complement them. We can use the fuzzy truth criterion to replace the distortion criterion so that the rate-distortion function becomes the rate-truth function R(Θ), where Θ is a group of fuzzy sets or (fuzzy) truth functions. We can also use the semantic information criterion, which is compatible with the likelihood criterion, to replace the distortion criterion so that the rate-distortion function becomes the rate-verisimilitude function R(G), where G is the lower limit of semantic mutual information. Both functions can be used for semantic compression. The rate-verisimilitude function has been introduced before [30,31], whereas the rate-truth function is first provided in this paper.
This paper mainly aims to:
• help readers understand rate-distortion functions and maximum entropy distributions from the new perspective;
• combine machine learning (for the distortion function) and data compression;
• show that the rate-distortion function can be extended to the rate-truth function and the rate-verisimilitude function for communication data's semantic compression.
This paper provides an example to show how the Shannon channel matches the semantic channel to achieve MMI R(Θ) and R(G) for a given source P(x) and a group of truth functions. It provides another example to show that the rate-truth function can be used for data reduction (e.g., the compression of decreasing data resolution). The results support the theoretical analyses.
The new explanations should more cohesively combine classical information theory, semantic information G theory, maximum entropy theory, the likelihood method, and fuzzy set theory with each other. Moreover, the rate-distortion function's extensions should be practical for semantic compression and helpful for explaining machine learning with NEFs and Softmax functions. In turn, the new explanations and extensions also support the P-T probability framework and the semantic information G theory.

• Variable x denotes an instance; X denotes a discrete random variable taking a value x ∈ U = {x_1, x_2, . . . , x_m}.
• Variable y denotes a hypothesis or label; Y denotes a discrete random variable taking a value y ∈ V = {y_1, y_2, . . . , y_n}.
Shannon calls P(X) the source, P(Y) the destination, P(Y|X) the channel, and P(y j |x) (with a certain y j and variable x) a Transition Probability Function (TPF) [34] (p. 18). A Shannon channel is a transition probability matrix or a group of TPFs: P(y|x): P(y j |x), j = 1, 2, . . . , n.
Shannon's mutual information is:

I(X; Y) = ∑_i ∑_j P(x_i, y_j) log [P(x_i|y_j)/P(x_i)] = H(X) − H(X|Y) = H(Y) − H(Y|X),

where H(X) and H(Y) are Shannon's entropies of X and Y; H(X|Y) and H(Y|X) are Shannon's conditional entropies of X and Y.
When Y = y_j is given, I(X; Y) becomes the Kullback-Leibler (KL) divergence:

I(X; y_j) = ∑_i P(x_i|y_j) log [P(x_i|y_j)/P(x_i)].

It is greater than or equal to 0, and it is equal to 0 only when P(x|y_j) = P(x).
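A small numeric sketch of this non-negativity, with a hypothetical prior and posterior:

```python
import math

# KL divergence between a hypothetical posterior P(x|y_j) and prior P(x).
P_x = [0.5, 0.3, 0.2]
P_x_yj = [0.7, 0.2, 0.1]

kl = sum(q * math.log2(q / p) for q, p in zip(P_x_yj, P_x))
assert kl >= 0.0                             # always non-negative

# Equality holds only when the prediction equals the prior.
kl_same = sum(p * math.log2(p / p) for p in P_x)
assert kl_same == 0.0
```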

Rate-Distortion Function R(D)
Shannon proposed the information rate-distortion function [1,2]. Since the rate-distortion function for an i.i.d. source P(X) and bounded function d(x, y) is equal to the associated information rate-distortion function, Cover and Thomas, in [14] (p. 307), do not distinguish the two functions in most cases. They call both functions the rate-distortion function and use R(D) to denote them. We follow their example. The following is the definition of the (information) rate-distortion function.
Let the distortion function be d(x, y) or d_ij = d(x_i, y_j), i = 1, 2, . . .; j = 1, 2, . . .; let d̄ be the average of d(x, y); and let D be the upper limit of d̄. An often-used distortion function is d(x, y) = (x − y)². This function fits cases where x and y have the same universe (e.g., U = V), and distortion only depends on the distance between x and y.
The MMI for the given P(X) and D is defined as:

R(D) = min_{P(y|x): d̄ ≤ D} I(X; Y).

We can obtain the parameter solution of the rate-distortion function by the variational method [2,3]. The constraint conditions are:

∑_j P(y_j|x_i) = 1, i = 1, 2, . . .; d̄ = ∑_i ∑_j P(x_i)P(y_j|x_i)d_ij ≤ D; ∑_j P(y_j) = 1.
The Lagrange function is therefore:

F = I(X; Y) − sd̄ + ∑_i μ_i ∑_j P(y_j|x_i) + α ∑_j P(y_j).

Since P(y|x) and P(y) are interdependent, we need to fix one to optimize the other. To optimize P(y|x), we fix P(y) and set ∂F/∂P(y_j|x_i) = 0. Then we derive the optimized TPFs or channel:

P(y_j|x_i) = P(y_j)exp(sd_ij)/Z_i, Z_i = ∑_j P(y_j)exp(sd_ij), (10)

where exp( ) is the inverse function of log( ), and λ_i ≡ exp(μ_i/P(x_i)) = 1/Z_i. To optimize P(y), we fix P(y_j|x_i) in F and set ∂F/∂P(y_j) = 0. Hence, we derive α = 1 and the optimized P(y):

P(y_j) = ∑_i P(x_i)P(y_j|x_i), j = 1, 2, . . . , n. (11)

Since P(y|x) and P(y) are interdependent, we first suppose P(y_1) = P(y_2) = . . . = 1/n and then repeat Equations (10) and (11) until P(y) is unchanged. We call this iteration the MMI iteration.
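The MMI iteration can be sketched as follows; the source, reproduction values, and s value are hypothetical, and the squared-error distortion is chosen only for illustration:

```python
import math

# Hypothetical source over 4 instances and 2 reproduction labels.
P_x = [0.25, 0.25, 0.25, 0.25]
X = [0.0, 1.0, 2.0, 3.0]
Y = [0.5, 2.5]
s = -1.0                                     # s = dR/dD <= 0

def d(x, y):                                 # squared-error distortion
    return (x - y) ** 2

P_y = [0.5, 0.5]                             # start from a uniform P(y)
for _ in range(200):
    # P(y_j|x_i) = P(y_j) exp(s d_ij) / Z_i
    P_y_x = []
    for i, x in enumerate(X):
        w = [P_y[j] * math.exp(s * d(x, Y[j])) for j in range(len(Y))]
        Z = sum(w)
        P_y_x.append([v / Z for v in w])
    # P(y_j) = sum_i P(x_i) P(y_j|x_i)
    P_y = [sum(P_x[i] * P_y_x[i][j] for i in range(len(X)))
           for j in range(len(Y))]

# Mutual information I(X;Y) in bits after convergence.
R = sum(P_x[i] * P_y_x[i][j] * math.log2(P_y_x[i][j] / P_y[j])
        for i in range(len(X)) for j in range(len(Y)) if P_y_x[i][j] > 0)
assert R >= 0.0
```

This alternating update of P(y|x) and P(y) is the Blahut-Arimoto scheme; each repetition can only decrease (or keep) the mutual information.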
It is worth noting that our purpose is to find proper TPFs P(y j |x), j = 1, 2, . . . , but it is difficult to find them directly because of the constraint conditions in Equations (7) and (8).
Therefore, we can only find the proper posterior distributions P(y|x i ) of y, i = 1, 2, . . . , by the MMI iteration to obtain the TPFs indirectly.
The rate-distortion function R(D) with parameter s is [3] (p. 32):

D(s) = ∑_i ∑_j P(x_i)P(y_j)exp(sd_ij)d_ij/Z_i, R(s) = sD(s) − ∑_i P(x_i)log Z_i,

where Z_i is the partition function. The parameter s = dR/dD is negative, and hence exp(sd_ij) is a negative exponential function. Thus, a larger |s| results in a narrower exp(sd_ij), a larger R, and a smaller D.
Shannon has proved that R(D) is the lower limit of the average codeword length for i.i.d. source P(X) with an average distortion limit, D. The rate-distortion theory is the basic theory of data compression for digital communication.
Since I(X; Y) = H(X) − H(X|Y) and P(X) is unchanged, minimizing R = I(X; Y) is equivalent to maximizing H(X|Y). Therefore, the MMI distribution P(y|x) also maximizes posterior entropy H(X|Y).

The Maximum Entropy Method
Jaynes [4,5] first expounded the maximum entropy method and argued that entropy in statistical mechanics should simply be viewed as a particular application of entropy in information theory.
Supposing that we need to maximize joint entropy H(X, Y) for a given source P(x), since H(X, Y) = H(X) + H(Y|X) and H(X) is fixed, maximizing H(X, Y) is equivalent to maximizing H(Y|X):

max H(Y|X) = max [−∑_i ∑_j P(x_i)P(y_j|x_i)log P(y_j|x_i)].

Moreover, supposing there are feature functions f_k(x, y), k = 1, 2, . . ., the constraint conditions are:

∑_j P(y_j|x_i) = 1, i = 1, 2, . . .; ∑_i ∑_j P(x_i)P(y_j|x_i)f_k(x_i, y_j) = F_k, k = 1, 2, . . .

The Lagrange function is therefore:

F = H(Y|X) + ∑_i μ_i ∑_j P(y_j|x_i) + ∑_k λ_k ∑_i ∑_j P(x_i)P(y_j|x_i)f_k(x_i, y_j).

By setting ∂F/∂P(y_j|x_i) = 0, we derive the optimized channel P(y|x):

P(y_j|x_i) = exp[∑_k λ_k f_k(x_i, y_j)]/Z_i, Z_i = ∑_j exp[∑_k λ_k f_k(x_i, y_j)]. (17)

This P(y|x) can maximize H(X, Y).

The P-T Probability Framework
The semantic information G theory is based on the P-T probability framework [29,30]. This framework includes two types of probabilities: the statistical probability denoted by P and the logical probability by T.

Definition 2 (for the P-T probability framework):

• The y_j is a label or a hypothesis; y_j(x_i) is a proposition. The θ_j is a fuzzy subset of universe U, whose elements make y_j true. We have y_j(x) ≡ "x ∈ θ_j" ≡ "x belongs to θ_j" ("≡" means they are logically equivalent according to the definition). The θ_j may also be a model or a group of model parameters.
• A probability that is defined with "=", such as P(y_j) ≡ P(Y = y_j), is a statistical probability. A probability that is defined with "∈", such as P(X ∈ θ_j), is a logical probability. To distinguish P(Y = y_j) and P(X ∈ θ_j), we define T(y_j) ≡ T(θ_j) ≡ P(X ∈ θ_j) as the logical probability of y_j.
• T(y_j|x) ≡ T(θ_j|x) is the truth function of y_j and the membership function of θ_j. It changes between 0 and 1, and the maximum of T(y|x) is 1.
A semantic channel consists of a group of truth functions:

T(y|x): T(y_j|x), j = 1, 2, . . . , n.

According to the above definition, we have the logical probability:

T(θ_j) = P(X ∈ θ_j) = ∑_i P(x_i)T(θ_j|x_i).

Zadeh calls this probability the fuzzy event's probability [35]. If θ_j is a crisp set, T(θ_j) becomes the cumulative probability or two cumulative probabilities' difference [36].
Generally, T(y_1) + T(y_2) + . . . + T(y_n) > 1. For example, the sum of the logical probabilities of the four labels in Figure 1 is greater than 1 since T(y_1) + T(y_3) = 1. The detailed discussions about the distinctions and relations between statistical probability and logical probability can be found in [32].
We can put T(θ_j|x) and P(x) into Bayes' formula to obtain a likelihood function:

P(x|θ_j) = T(θ_j|x)P(x)/T(θ_j), T(θ_j) = ∑_i P(x_i)T(θ_j|x_i). (19)

P(x|θ_j) is called the semantic Bayes' prediction. It is often written as P(x|y_j, θ) in popular methods. We call the above formula the semantic Bayes' formula.
Since the maximum of T(y|x) is 1, from P(x) and P(x|θ_j), we can obtain:

T(θ_j|x) = [P(x|θ_j)/P(x)]/max[P(x|θ)/P(x)], (20)

where max[P(x|θ)/P(x)] means the maximum of P(x|θ)/P(x) for different x and y. In the author's earlier articles [31], T(θ_j) = 1/max[P(x|θ_j)/P(x)]. The change in Equation (20) ensures that truth functions are symmetrical, e.g., T(x|y) = T(y|x), as distortion functions are. This change also allows comparing two truth functions T(y_j|x) and T(y_k|x) for classification according to the correlation between x and y (see Equation (23)). Since P(x|θ_j) in Equation (19) is unchanged when T(θ_j|x) is replaced with cT(θ_j|x) (where c is a positive constant), this change does not influence the other uses of T(θ_j|x). Equations (19) and (20) form the third Bayes' theorem [31], which can be used to convert the likelihood function and the truth function into each other.
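The two directions of this conversion can be sketched numerically; the truth values and the source are hypothetical, and for simplicity the normalization is taken over x for a single label:

```python
# Third Bayes' theorem sketch: truth function <-> likelihood function.
P_x = [0.2, 0.3, 0.5]                        # hypothetical source P(x)
T_x = [1.0, 0.6, 0.1]                        # hypothetical truth values T(theta_j|x_i)

# T(theta_j) = sum_i P(x_i) T(theta_j|x_i);
# P(x|theta_j) = T(theta_j|x) P(x) / T(theta_j)  (semantic Bayes' formula).
T_theta = sum(p * t for p, t in zip(P_x, T_x))
P_x_theta = [t * p / T_theta for p, t in zip(P_x, T_x)]
assert abs(sum(P_x_theta) - 1.0) < 1e-12     # a proper likelihood function

# Going back: normalize P(x|theta_j)/P(x) so that its maximum is 1.
ratio = [l / p for l, p in zip(P_x_theta, P_x)]
T_back = [r / max(ratio) for r in ratio]
assert all(abs(a - b) < 1e-12 for a, b in zip(T_back, T_x))
```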

The Semantic Information G Measure
The author [28] defines the (amount of) semantic information conveyed by y_j in relation to x_i with the log-normalized-likelihood:

I(x_i; θ_j) = log [T(θ_j|x_i)/T(θ_j)] = log [P(x_i|θ_j)/P(x_i)]. (21)

The value I(x_i; θ_j), or its average, is the semantic information G measure or, simply, the G measure. If T(θ_j|x) is always 1, the G measure becomes Carnap and Bar-Hillel's semantic information measure [21].
The above formula is illustrated in Figure 2. Figure 2 indicates that the less the logical probability is (e.g., the lower the horizontal line is), the more information there is available; the larger the deviation is, the less information there is available; a wrong hypothesis conveys negative information. These conclusions accord with Popper's thoughts [37] (p. 294). For this reason, I(x i ; θ j ) is also explained as the verisimilitude between y j and x i [32]. We can also use the above formula to measure sensory information, for which T(θ j |x) is the confusion probability function of x j with x or the discrimination function of x j [31].
By averaging I(x_i; θ_j), we obtain the generalized KL information:

I(X; θ_j) = ∑_i P(x_i|y_j) log [T(θ_j|x_i)/T(θ_j)], (22)

where P(x_i|y_j), i = 1, 2, . . ., is the sampling distribution, which may be unsmooth or discontinuous. It is easy to prove I(X; θ_j) ≤ I(X; y_j) [31]. When the sample is enormous, so that P(x|y_j) is smooth, we may let P(x|θ_j) = P(x|y_j) or T(θ_j|x) ∝ P(y_j|x) to obtain the optimized truth function:

T*(θ_j|x) = [P(x|y_j)/P(x)]/max[P(x|y)/P(x)]. (23)

According to this equation, T*(θ_j|x_i) = T*(θ_xi|y_j) or T*(y_j|x_i) = T*(x_i|y_j), where θ_xi is a fuzzy subset of V. Furthermore, we have:

T*(θ_j|x) = cP(y_j|x)/P(y_j), (24)

where c is a constant. The above formula means the optimized truth function T*(θ_j|x) is proportional to the TPF P(y_j|x). This formula accords with Wittgenstein's thoughts: meaning lies in uses [38] (p. 80).
If P(x|y_j) is unsmooth, we may achieve a smooth T*(θ_j|x) with parameters by:

T*(θ_j|x) = arg max_{T(θ_j|x)} ∑_i P(x_i|y_j) log [T(θ_j|x_i)/T(θ_j)]. (25)

By averaging I(X; θ_j) for different y, we obtain the semantic mutual information:

I(X; Y_θ) = ∑_j ∑_i P(x_i, y_j) log [T(θ_j|x_i)/T(θ_j)] = H(Y_θ) − H(Y_θ|X),

where:

H(Y_θ) = −∑_j ∑_i P(x_i, y_j) log T(θ_j), H(Y_θ|X) = −∑_j ∑_i P(x_i, y_j) log T(θ_j|x_i).

H(Y_θ) is a cross-entropy. Since ∑_j T(θ_j) ≥ 1, we also call H(Y_θ) a generalized entropy or a semantic entropy. H(Y_θ|X) is called a fuzzy entropy.
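Computing I(X; Y_θ) can be sketched as below; the source, Shannon channel, and semantic channel are hypothetical, and the Shannon mutual information is computed alongside for comparison:

```python
import math

# Hypothetical: 3 instances, 2 labels.
P_x = [0.5, 0.3, 0.2]
P_y_x = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]      # Shannon channel P(y|x)
T = [[1.0, 0.2], [0.5, 0.5], [0.2, 1.0]]          # semantic channel T(theta_j|x_i)

# Logical probabilities T(theta_j) = sum_i P(x_i) T(theta_j|x_i).
T_theta = [sum(P_x[i] * T[i][j] for i in range(3)) for j in range(2)]

# I(X;Y_theta) = sum_ij P(x_i, y_j) log2 [T(theta_j|x_i) / T(theta_j)]
I_smi = sum(P_x[i] * P_y_x[i][j] * math.log2(T[i][j] / T_theta[j])
            for i in range(3) for j in range(2))

# Shannon mutual information for comparison; I(X;Y_theta) <= I(X;Y) always.
P_y = [sum(P_x[i] * P_y_x[i][j] for i in range(3)) for j in range(2)]
I_xy = sum(P_x[i] * P_y_x[i][j] * math.log2(P_y_x[i][j] / P_y[j])
           for i in range(3) for j in range(2))
assert I_smi <= I_xy + 1e-12
```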
When we fix the Shannon channel P(y|x) and let P(x|θ j ) = P(x|y j ) or T(θ j |x) ∝ P(y j |x) for every j (Matching I), I(X; Y θ ) reaches its maximum I(X; Y). If we use a group of truth functions or a semantic channel T(y|x) as the constraint function to seek MMI, we need to let P(x|y j ) = P(x|θ j ) or P(y j |x) ∝ T(θ j |x) as far as possible for every j (Matching II). Sections 4.3 and 4.4 further discuss Matching II.
Letting T(θ_j|x) = exp[−(x − y_j)²/(2σ_j²)], we have:

I(X; Y_θ) = H(Y_θ) − ∑_j ∑_i P(x_i, y_j)(x_i − y_j)²/(2σ_j²) · log e.

It is easy to find that the above mutual information is like the Regularized Least Squares (RLS) criterion: H(Y_θ) is like the regularization term, and the second term is like the relative squared error term. Therefore, we can treat the maximum semantic mutual information criterion as a particular RLS criterion.

The New Explanations of the MMI Distribution and the Rate-Distortion Function
It is easy to regard exp[sd(x_i, y)] in the rate-distortion function as a truth function and Z_i as a logical probability. We can let θ_xi be a fuzzy subset of V (V = {y_1, y_2, . . . }), T(x_i|y) ≡ T(θ_xi|y) = exp[sd(x_i, y)] be the truth function of y(x_i), and T(x_i) = T(θ_xi) = ∑_j P(y_j)T(y_j|x_i) = Z_i be the logical probability of x_i. Hence, we can observe that the MMI distribution P(y|x) in the rate-distortion function is produced by the semantic Bayes' formula:

P(y|x_i) = P(y)T(x_i|y)/T(x_i).

R(D) can be expressed by the semantic mutual information formula because:

R(D) = ∑_i ∑_j P(x_i)P(y_j|x_i) log [P(y_j|x_i)/P(y_j)] = ∑_i ∑_j P(x_i, y_j) log [T(x_i|y_j)/T(x_i)] = I(Y; X_θ).

Setting Up the Relation between the Truth Function and the Distortion Function
We can now improve the G theory by setting up the relation between the truth function and the distortion function.
Four truth functions in Figure 1 also reveal the distortion when we use y_j to represent x_i. If the truth value of y_j(x_i) is 1, the distortion d(x_i, y_j) should be 0. The distortion increases as the truth value decreases. Hence, we use the following definition:

Definition 3. The transformation relation between the distortion function and the truth function is defined as:

d(x, y_j) ≡ log [1/T(y_j|x)] = −log T(y_j|x). (32)

According to this definition, we have T(y|x) = exp[−d(x, y)] and H(Y_θ|X) = d̄. Therefore, we can use H(Y_θ|X) to replace d̄ for the constraint condition.
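The transformation and its round trip can be sketched with illustrative truth values:

```python
import math

# Definition 3: d(x, y) = log[1/T(y|x)], so T(y|x) = exp[-d(x, y)].
truth_values = [1.0, 0.5, 0.1]               # hypothetical T(y|x) values
distortions = [math.log(1.0 / t) for t in truth_values]

# A truth value of 1 means zero distortion; distortion grows as truth falls.
assert distortions[0] == 0.0
assert distortions[1] < distortions[2]

# Round trip: exp(-d) recovers the truth function.
assert all(abs(math.exp(-d) - t) < 1e-12
           for d, t in zip(distortions, truth_values))
```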

Rate-Truth Function R(Θ)
The author previously proposed the rate-of-limiting-error function in [30], which was an immature study. That function is more akin to an extension of the complexity-distortion function [39], unlike the rate-truth function, which is an extension of the rate-distortion function.
In the following, we use almost the same method used for R(D) to obtain the rate-truth function R(Θ), where Θ means a group of truth functions or fuzzy sets. The constraint condition d̄ ≤ D becomes:

H(Y_θ|X) = −∑_i ∑_j P(x_i, y_j) log T(y_j|x_i) ≤ D. (33)

Following the rate-distortion function's derivation (see Equation (10)), we obtain the optimized posterior distribution P(y|x_i) of y:

P(y|x_i) = P(y)T(y|x_i)^|s| / ∑_j P(y_j)T(y_j|x_i)^|s|, (34)

where we replace −s with |s|, which is positive, since exp(sd_ij) = exp(−|s|d_ij) = T(y_j|x_i)^|s|. The larger the |s| value is, the clearer the boundaries of the fuzzy sets are, and hence the larger the R is.
We can obtain the Shannon channel P(y|x) that minimizes R by repeating Equations (11) and (34) until P(y) converges, e.g., by the MMI iteration.
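This R(Θ) iteration differs from the R(D) iteration only in that exp(sd_ij) is replaced by T(y_j|x_i)^|s|. A sketch with a uniform source and two hypothetical trapezoidal truth functions over ages:

```python
import math

# Hypothetical: ages 0..80 in steps of 10; two labels with fuzzy truth functions.
X = list(range(0, 90, 10))
P_x = [1.0 / len(X)] * len(X)                # uniform source for illustration

def T_young(x): return max(0.0, min(1.0, (40 - x) / 20))
def T_adult(x): return max(0.0, min(1.0, (x - 10) / 20))
T = [[T_young(x), T_adult(x)] for x in X]
s_abs = 1.0

P_y = [0.5, 0.5]
for _ in range(200):
    # Equation (34): P(y_j|x_i) proportional to P(y_j) T(y_j|x_i)^|s|
    P_y_x = []
    for i in range(len(X)):
        w = [P_y[j] * T[i][j] ** s_abs for j in range(2)]
        Z = sum(w)
        P_y_x.append([v / Z for v in w])
    P_y = [sum(P_x[i] * P_y_x[i][j] for i in range(len(X))) for j in range(2)]

# R(Theta) as the mutual information of the convergent channel, in bits.
R = sum(P_x[i] * P_y_x[i][j] * math.log2(P_y_x[i][j] / P_y[j])
        for i in range(len(X)) for j in range(2) if P_y_x[i][j] > 0)
assert R >= 0.0
```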
Bringing P(y|x_i) in Equation (34) into the mutual information formula, we obtain the rate-truth function R(Θ):

R(Θ) = ∑_i ∑_j P(x_i)P(y_j|x_i) log [P(y_j|x_i)/P(y_j)] = ∑_i ∑_j P(x_i, y_j) log [T(y_j|x_i)^|s|/T(x_i)],

where:

T(x_i) = ∑_j P(y_j)T(y_j|x_i)^|s|.

We have R(Θ) = H(X_θ) − H(X_θ|Y) = I(Y; X_θ), which means that the MMI can be expressed by the semantic mutual information I(Y; X_θ). The constraint is tighter when |s| > 1 and looser when |s| < 1 than when |s| = 1. When the constraint condition is the original truth function without s or with s = −1, we have:

R(Θ) = I(Y; X_θ) ≥ I(X; Y_θ).

The reason for I(Y; X_θ) ≥ I(X; Y_θ) is that the iteration can only let P(y_j|x) be approximately proportional to T(y_j|x), not exactly proportional in general (see Equation (34) and Section 5.2).
According to Shannon's lossy coding theorem of discrete memoryless sources [14] (p. 307), for the given sources P(X) and D, we can use block coding to achieve a minimum average codeword length, whose lower limit is R(D). Since H(X θ |Y) can be understood as the average distortion, R(Θ) also means the lower limit of average codeword length.
For any distortion function d(x, y), we can always express the corresponding truth function as exp[−d(x, y)]. But for truth functions, such as those in Figure 1, we may not be able to express them by a distortion function. Therefore, the rate-distortion function may be regarded as the rate-truth function's particular case, as the truth function is exp[−d(x, y)].

Rate-Verisimilitude Function R(G)
If we change the average distortion criterion into the semantic mutual information criterion, which is compatible with the likelihood criterion, then the rate-distortion function becomes the rate-verisimilitude function [31]. In this case, we replace d_ij = d(x_i, y_j) with I_ij = I(x_i; θ_j). The constraint condition d̄ ≤ D becomes I(X; Y_θ) ≥ G, where G denotes the lower limit of the semantic mutual information. Following the rate-distortion function's derivation, we can obtain:

P(y|x_i) = P(y)[T(y|x_i)/T(y)]^s / Z_i,

where s is positive, and:

Z_i = ∑_j P(y_j)[T(y_j|x_i)/T(y_j)]^s.

We also need the MMI iteration to optimize P(y) and P(y|x). The function P(y|x_i) is now like a Softmax function, in which the numerator P(y)[T(y|x_i)/T(y)]^s may be greater than 1. In the E-step of the EM algorithm for mixture models, there is a similar formula, which is also used for MMI [40].
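For R(G), the iteration weight becomes the Softmax-like ratio [T(y_j|x_i)/T(y_j)]^s, which rewards labels with small logical probabilities. A sketch reusing hypothetical trapezoidal truth functions and a uniform source:

```python
import math

X = list(range(0, 90, 10))
P_x = [1.0 / len(X)] * len(X)
def T_young(x): return max(0.0, min(1.0, (40 - x) / 20))
def T_adult(x): return max(0.0, min(1.0, (x - 10) / 20))
T = [[T_young(x), T_adult(x)] for x in X]
s = 1.0

# Logical probabilities T(theta_j) are fixed by P(x) and the truth functions.
T_theta = [sum(P_x[i] * T[i][j] for i in range(len(X))) for j in range(2)]

P_y = [0.5, 0.5]
for _ in range(200):
    # R(G) weight: P(y_j|x_i) proportional to P(y_j) [T(y_j|x_i)/T(y_j)]^s
    P_y_x = []
    for i in range(len(X)):
        w = [P_y[j] * (T[i][j] / T_theta[j]) ** s for j in range(2)]
        Z = sum(w)
        P_y_x.append([v / Z for v in w])
    P_y = [sum(P_x[i] * P_y_x[i][j] for i in range(len(X))) for j in range(2)]

R_G = sum(P_x[i] * P_y_x[i][j] * math.log2(P_y_x[i][j] / P_y[j])
          for i in range(len(X)) for j in range(2) if P_y_x[i][j] > 0)
assert R_G >= 0.0
```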
R(G) is more suitable than R(D) and R(Θ) when y is a prediction, such as a weather prediction, where information is more important than truth. More discussions and applications of the R(G) function can be found in [30,31].

The New Explanation of the Maximum Entropy Distribution and the Extension
We consider maximizing the joint entropy H(X, Y) for the given P(x). Consider Equation (17) for the maximum entropy distribution. We may assume that P(y) is a constant (1/n) so that H(Y) is the maximum. Then Equation (17) becomes:

P(y|x_i) = P(y)exp[∑_k λ_k f_k(x_i, y)]/Z_i, Z_i = ∑_j P(y_j)exp[∑_k λ_k f_k(x_i, y_j)], P(y_j) = 1/n.

The above exp(·) is an NEF. We can explain exp(·) as a truth function, Z_i as a logical probability, and the above formula as a semantic Bayes' formula.
Then, the maximum joint entropy can be expressed as:

H(X, Y) = H(X) + H(Y) − I(X; Y_θ),

where H(Y) is equal to log n, I(X; Y_θ) is the semantic mutual information, and H(X) + H(Y) is the extremely maximum value of H(X, Y). Therefore, we can explain that maximum entropy is equal to extremely maximum entropy minus semantic mutual information.
If we use a group of truth functions or membership functions, such as those in Figure 1, as the constraint condition to extend the maximum entropy method, the above distribution becomes:

P(y|x_i) = T(y|x_i)^|s| / ∑_j T(y_j|x_i)^|s|.

This distribution is the same as that in Equation (34) with P(y) = 1/n.

The New Explanation of the Boltzmann Distribution and the Maximum Entropy Law
Using Stirling's formula ln N! = N ln N − N as N → ∞, Jaynes proved that Boltzmann's entropy and Shannon's entropy have a simple relation [4,5]:

S = k ln W = −kN ∑_i P(x_i|T) ln P(x_i|T), (43)

where k is the Boltzmann constant, W is the microstate number, x_i means state i, N is the particles' number, and P(x_i|T) denotes the probability of a particle (or the density of particles) in state i for the given absolute temperature T. The Boltzmann distribution [41] is:

P(x_i|T) = exp[−e_i/(kT)]/Z, (44)

where Z is the partition function. If x_i means energy e_i, G_i is the number of states with e_i, and G is the total number of all states, then P(x_i) = G_i/G is the prior probability of x_i. Hence, Equations (43) and (44) become:

S = −kN ∑_i P(x_i|T) ln [P(x_i|T)/G_i], (45)
P(x_i|T) = P(x_i)exp[−e_i/(kT)]/Z, Z = ∑_i P(x_i)exp[−e_i/(kT)]. (46)

Now, we can explain exp[−e_i/(kT)] as a truth function T(θ_j|x), Z as a logical probability T(θ_j), and Equation (46) as a semantic Bayes' formula.
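A numeric sketch of reading the Boltzmann distribution as a semantic Bayes' prediction; the energy levels, degeneracies, and temperature are illustrative, with units chosen so that k = 1:

```python
import math

# Hypothetical energy levels e_i with degeneracies G_i; k = 1, T = 2.
e = [0.0, 1.0, 2.0]
G_i = [1, 2, 3]
G = sum(G_i)
kT = 2.0

P_x = [g / G for g in G_i]                   # prior P(x_i) = G_i / G
truth = [math.exp(-ei / kT) for ei in e]     # "truth function" exp[-e_i/(kT)]

# Semantic Bayes' form: P(x_i|T) = P(x_i) exp[-e_i/(kT)] / Z,
# where Z plays the role of a logical probability.
Z = sum(p * t for p, t in zip(P_x, truth))
P_x_T = [p * t / Z for p, t in zip(P_x, truth)]
assert abs(sum(P_x_T) - 1.0) < 1e-12

# Per state (dividing out degeneracy), lower-energy states are more probable.
assert P_x_T[0] / G_i[0] > P_x_T[2] / G_i[2]
```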
Assuming that a local equilibrium system's different areas y_j, j = 1, 2, . . ., have different temperatures T_j, j = 1, 2, . . ., the above G becomes G_j. Then we can derive (see Appendix A for the details):

S = kN[∑_j P(y_j) ln G_j − I(X; Y_θ)], (47)

which means that the thermodynamic entropy S is proportional to the extremely maximum entropy ∑_j P(y_j) ln G_j minus the semantic mutual information I(X; Y_θ). This formula indicates that the maximum entropy law in physics can be equivalently stated as the MMI law; this MMI can be expressed as the semantic mutual information I(X; Y_θ).

An Example Shows the Shannon Channel's Changes in the MMI Iterations for R(Θ) and R(G)
The author experimented with Example 1 to test two theoretical results:
• the MMI iteration lets the Shannon channel match the semantic channel, e.g., lets P(y_j|x) ∝ T(y_j|x), j = 1, 2, . . ., as far as possible for the functions R(D), R(Θ), and R(G);
• the MMI iteration can reduce the mutual information.
Example 1. The four truth functions and the population age distribution P(x) are shown in Figure 1 (see Appendix B for the formulas producing these lines). The task is to use the above truth functions as the constraint condition to obtain P(y|x) for R(Θ) and P(y|x) for R(G).
First, the author uses the above truth functions as the constraint condition for R(Θ) with |s| = 1. The iteration process for P(y|x) is shown in Figure 3 (see Supplementary Materials for how the data for Figure 3 are produced). The convergent P(y) is {P(y_1), P(y_2), P(y_3), P(y_4)} = {0.3499, 0.0022, 0.6367, 0}. The MMI is 0.845 bits. The iterative process not only lets every P(y_j|x) be approximately proportional to T(y_j|x) but also makes P(y_4) closer to 0. The value of P(y_4) became 0 because y_4 implies y_3 and hence can be replaced with y_3. The latter has a larger logical probability. By replacing y_4 with y_3, we can save the average codeword length. Figure 3c indicates that H(Y) also decreases with the iteration. Since H(Y) is much less than H(X), we can also simply replace the instance x with the label y to compress data.
To maximize the entropy H(X, Y) or H(Y|X) for a given P(x), we do not need the iteration for P(y|x). The function P(y|x) in Figure 3a also results in the maximum entropies H(Y|X) and H(X, Y), where P(y|x) matches T(y|x) only once.
Next, the author uses the above truth functions as the constraint condition for R(G) with |s| = 1. For R(G), d(x_i, y_j) is replaced with I(x_i; θ_j) = log[T(y_j|x_i)/T(y_j)] for every i and j. Figure 4 shows the iterative results and the process (see Supplementary Materials for how the data for Figure 4 are produced). Figure 4a intuitively displays that P(y|x) accords with the four labels' semantic meanings. The convergent distribution P(y) for R(G) is {P(y_1), P(y_2), P(y_3), P(y_4)} = {0.3619, 0.0200, 0.6120, 0.0057}. The MMI is 0.883 bits. This P(y_4) is not 0. Both P(y_2) and P(y_4) are larger than those for R(Θ). The reason is that a label with a smaller logical probability can convey more semantic information and, hence, should be more frequently selected if we use the semantic information criterion. For the same s = 1 and Θ, R(G) = 0.883 bits is greater than R(Θ) = 0.845 bits.

An Example about Greyscale Image Compression Shows the Convergent P(y|x) for the R(Θ)
This example explains how we replace the distortion function with the truth function to obtain R(Θ) and the corresponding channel P(y|x) for image compression that decreases the resolution of pixels' grey levels. The conclusion is also suitable for data reduction, where too high a resolution is unnecessary. Considering that human visual discrimination changes with grey levels (the higher the grey level is, the lower the discrimination is), we use 8 truth functions, as shown in Figure 5a, to represent 8 fuzzy classes. Appendix C shows how these lines are produced. The task is to solve the MMI R and the corresponding channel P(y|x) with s = 1.
The author obtains the convergent P(y|x), as shown in Figure 5b. Figure 5c shows that R(Θ), I(X; Y θ ), and H(Y) change in the iterative process (see Supplementary Materials to find how the data for Figure 5 are produced). Figure 5 shows how the Shannon channel matches the semantic channel for MMI. Comparing Figure 5a,b, we find that it is easy to control P(y|x) with T(y|x). If we use the distortion function d(x, y j ) instead of the truth function T(y j |x), j = 1, 2, . . . , 8, it is not easy to design d(x, y j ). It is also difficult to predict the convergent P(y|x) according to d(x, y).
In this example, it is difficult for I(X; Y) = I(Y; X_θ) to approach I(X; Y_θ) because P(y_j|x) cannot be strictly proportional to T(y_j|x) for j = 2, 3, . . . , 6.
The author has used different prior distributions and semantic channels to calculate Shannon channels for MMI using the above method and has achieved similar results. Every convergent TPF P(y_j|x) covers an area in U identical to the area covered by T(y_j|x) as s = 1; hence, P(y|x) satisfies the constraint condition defined by T(y|x).

Connecting Machine Learning and Data Compression by Truth Functions
Researchers have applied the rate-distortion theory to machine learning and achieved meaningful results [43][44][45]. However, we also need to obtain the distortion function from machine learning.
In the rate-distortion theory, the distortion function d(x, y) is subjectively defined, lacking an objective standard. When the instance x and label y have different universes U and V with |U| >> |V|, defining the distortion between x and y is problematic. Now we can obtain labels' truth functions from sampling distributions by machine learning (see Equations (23)–(25)) and indirectly express distortion functions using truth functions (see Equation (32)). After the source P(x) is changed, the semantic channel T(y|x) still works [31]. We can use T(y|x) and the new P(x) to obtain the new Shannon channel P(y|x) for the MMI R(Θ). We can also use T(y|x)^|s| with |s| > 1 to reduce the average distortion or with |s| < 1 to loosen the distortion limit. In this way, we can overcome the rate-distortion function's three disadvantages mentioned in Section 1. In addition, a group of truth functions describes a group of labels' extensions and semantic meanings and is therefore intuitive. Truth functions have stronger descriptive power than distortion functions because we can always replace a distortion function with a truth function by T(y_j|x) = exp[−d(x, y_j)], whereas we may not be able to replace a truth function with a distortion function, such as the truth function of y_3 = "Adult" in Figure 1.
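The asymmetry above can be made concrete with a short sketch: converting a distortion function into a truth function always works, while the reverse direction produces an infinite distortion wherever the truth value is 0 (the function names are illustrative).

```python
import numpy as np

def truth_from_distortion(d):
    """T(y_j|x) = exp(-d(x, y_j)): any distortion d >= 0 yields a truth
    function with values in (0, 1]."""
    return np.exp(-d)

def distortion_from_truth(T):
    """d(x, y_j) = log(1/T(y_j|x)): infinite wherever T = 0, which is why
    a truth function like that of "Adult" cannot always be replaced by a
    finite distortion function."""
    with np.errstate(divide='ignore'):
        return -np.log(T)
```

The round trip distortion → truth → distortion recovers the original values; only the truth → distortion direction can fail.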
Two often-used learning functions are the truth function (or the similarity function, or the membership function) and the likelihood function. We can also replace the distortion function d(x, y) with a log-likelihood function log P(x|θ), or equivalently replace d ≤ D with I(X; Y_θ) ≥ G, to obtain the rate-verisimilitude function R(G). The function R(G) uses the likelihood criterion or the semantic information criterion. It can reduce the underreporting of events with smaller logical probabilities. For example, the selected probabilities P(y_2) and P(y_4) in Figure 1 for R(G) (see Figure 4a) are higher than P(y_2) and P(y_4) for R(Θ) (see Figure 3c). The function R(G) is more suitable than R(D) and R(Θ) when y is a prediction, where information is more important than truth.
Like R(D), R(Θ) and R(G) also give the lower limit of the average codeword length.

Viewing Rate-Distortion Functions and Maximum Entropy Distributions from the Perspective of Semantic Information G Theory
There are many similarities between the MMI distribution in the rate-distortion function and the maximum entropy distribution. According to the analyses in Sections 4.1 and 4.5, we can regard NEFs as truth functions and partition functions as logical probabilities. These distributions or Shannon's channels, such as P(y|x) in Equation (2), are semantic Bayes' distributions produced by the semantic Bayes' formula (see Equation (19)).
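The semantic Bayes' formula mentioned above can be sketched as follows, with the NEF appearing as a truth-function matrix and the partition function as the logical probability. This is a minimal sketch; the names are illustrative, and the exact form of Equation (19) is as described in the main text.

```python
import numpy as np

def semantic_bayes(Px, T):
    """Sketch of the semantic Bayes' formula:
    P(x|theta_j) = T(y_j|x) P(x) / T(theta_j), where the partition function
    T(theta_j) = sum_x P(x) T(y_j|x) is the label's logical probability.

    Px : (n,) prior P(x)
    T  : (n, m) truth-function matrix, T[i, j] = T(y_j | x_i)
    """
    Tlogical = Px @ T                         # logical probabilities
    Pxtheta = (T * Px[:, None]) / Tlogical    # column j is P(x | theta_j)
    return Pxtheta, Tlogical
```

Each column of the result is a proper probability distribution over x, even though the truth functions themselves are not probabilities; the partition function provides exactly the normalization needed.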
The main differences between the rate-distortion theory and the maximum entropy method are:

• For the rate-distortion function R(D), we seek the MMI I(X; Y) = H(X) − H(X|Y), which is equivalent to maximizing the posterior entropy H(X|Y) of X. For R(D), we use an iterative algorithm to find the proper P(y). However, in the maximum entropy method, we maximize H(X, Y) or H(Y|X) for given P(x) without the iteration for P(y);
• The rate-distortion function can be expressed by the semantic information G measure (see Equation (31)). In contrast, the maximum entropy is equal or proportional to the extremely maximum entropy minus the semantic mutual information (see Equations (41) and (47)).

How the Experimental Results Support the Explanation for the MMI Iterations
The theoretical analyses in Sections 4.2–4.4 indicate that the MMI iterations for R(D), R(Θ), and R(G) impel the Shannon channel to match the semantic channel, e.g., impel P(y_j|x) ∝ T(y_j|x), j = 1, 2, . . ., as far as possible. Example 1 in Section 5.1 shows that the iterative processes for R(Θ) and R(G) not only find the proper probability distribution P(y) (meaning that a label y_j with a larger logical probability is selected more frequently) but also modify every TPF P(y_j|x) so that P(y_j|x) is approximately proportional to T(y_j|x). Therefore, the results in Figure 3 support the above theoretical analyses. Figure 4 indicates that the Shannon channel P(y|x) for R(G) is a little different from that for R(Θ). For example, the two labels y_2 and y_4, with smaller logical probabilities, have larger P(y_2) and P(y_4) than those for R(Θ). This result also accords with the theoretical analysis, which indicates that the semantic information criterion can reduce the underreporting of events with smaller logical probabilities. Figure 5 shows an example of data reduction. For this example, it is hard to define a distortion function d(x, y), but it is easy to use truth functions to represent fuzzy classes.
The results indicate that we can easily control the convergent Shannon channel P(y|x) with the semantic channel T(y|x).

Considering Other Kinds of Semantic Information
Wyner and Debowski have independently defined an information measure (see Equation (1) in [27]) that is meaningful for measuring the semantic information between two sentences or labels. Combining it with the G theory, we may develop a measure for the semantic information between two labels y_j and y_k that differs from the above. Unlike Equation (1) in [27], this formula includes both the statistical probabilities (P(x) and P(x|y_j, y_k)) and the logical probabilities, and it does not require that the subsets θ_1, θ_2, . . ., θ_n form a partition of U. We use an example to explain the reason for using P(x). If P(x) is a population age distribution, we can use two labels, y_1 = "Person in his 50s" and y_2 = "Elder person", for a person at age x. The average life span (denoted by t) in different areas and eras might change from 50 years to 80 years. The above formula can ensure that the semantic information between y_1 and y_2 for t = 50 is more than that for t = 80.
Averaging I(θ_j; θ_k), we obtain the semantic mutual information. This formula can ensure that wrong translations or replacements may convey negative semantic information.
The distortion function between the two labels changes accordingly. We can use this function for the data compression of a sequence of labels or sentences. Considering the semantic information of designs, decisions, and control plans, the G measure and the R(G) function are also useful. In these cases, the truth function T(θ_j|x) becomes the DCF, P(x|y) becomes the realized distribution, information means control complexity or control amount, G means the effective control amount, and R means the control cost. To increase G = I(X; Y_θ), we need to fix T(θ_j|x) and optimize the control to improve P(x|y). The R(G) function tells us that we can improve P(y|x) with the MMI iteration and increase G by amplifying s. In [32], the author improperly used a different information formula (Equation (24) in [32]) for the above purpose. That formula seemingly only fits cases where the control results are continuous distributions. The author will discuss this topic further in another paper.
The G measure also has its limitations, which include:
• The G measure is only related to labels' extensions, not to labels' intensions. For example, "Old" means senility and closeness to death, whereas the G measure is only related to T("Old"|x), which represents the extension of "Old";
• The G measure is not enough for measuring the semantic information from fuzzy reasoning according to the context, although the author has discussed how to calculate compound propositions' truth functions and fuzzy reasoning [32].
Therefore, we need more studies to measure more kinds of semantic information.

About the Importance of Semantic Information Studies
Shannon does not consider semantic information. He writes [34] (p. 3): "These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages. The system must be designed to operate for each possible selection, not just the one which will actually be chosen since this is unknown at the time of design." However, the author of this paper does not think that Shannon opposed researching semantic information because, in the book co-written by Shannon and Weaver, Weaver [34] (pp. 95–97) initiated the study of semantic information. If we extend Shannon's information theory for machine learning, which frequently deals with the selection of messages, we must consider semantic information.
Section 4.2 reveals that distortion is related to truth, truth is related to semantic meaning, and hence, the rate-distortion function is related to semantic information. Equation (31) indicates that the rate-distortion function R(D) can be expressed as semantic mutual information I(Y; X_θ).
With the development of machine learning, semantic information theory is becoming more important. From the perspective of the semantic information G theory, the EM algorithm for mixture models, where the E-step uses a formula like Equation (39), can be explained by the mutual matching of the semantic channel and the Shannon channel [40], so that the two kinds of information are approximately equal. In the Restricted Boltzmann Machine (RBM), the Softmax function, with its NEF and partition function, is used as the learning function [7]. Some researchers have discussed the similarity between the RBM and the EM algorithm [46]. It seems that we can also explain the RBM by the two channels' mutual matching. We need more studies of the G theory's applications to machine learning with neural networks.
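The Softmax function mentioned above is itself an NEF divided by a partition function, which is the same Bayes-like form this paper discusses. A minimal sketch (the max-shift is a standard numerical-stability trick that leaves the result unchanged):

```python
import numpy as np

def softmax(z):
    """Softmax as an NEF exp(z_j) normalized by a partition function
    Z = sum_k exp(z_k). Subtracting max(z) before exponentiating avoids
    overflow without changing the output."""
    e = np.exp(z - np.max(z))
    return e / e.sum()
```

Because the partition function cancels any common shift of the inputs, softmax(z + c) = softmax(z) for any constant c, just as partition functions absorb normalization in the Bayes-like formulas above.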

Conclusions
In the introduction section, we raised three questions:

• Why does the Bayes-like formula (see Equation (2)) with NEFs and partition functions widely exist in the rate-distortion theory, statistical mechanics, the maximum entropy method, and machine learning?
• Can we combine machine learning (for the distortion function) and data compression (with the rate-distortion function)?
• Can we use the rate-distortion function or similar functions for semantic compression?
Using the semantic information G measure based on the P-T probability framework, we have explained that the NEFs are truth functions, the partition functions are logical probabilities, and the Bayes-like formula is a semantic Bayes' formula. We have also explained that the semantic mutual information formula can express the MMI R (see Equation (35)), and that maximum entropy (in statistical mechanics or the maximum entropy method) is equal or proportional to the extremely maximum entropy minus the semantic mutual information (see Equations (41) and (47)).
We have shown that we can obtain truth functions from sampling distributions by machine learning (see Equations (23) and (25)). Furthermore, after establishing the relationship between the truth function and the distortion function (see Section 4.2), we can replace the distortion function with log(1/truth_function). Therefore, we can combine machine learning (for the distortion function) and data compression.
Since truth functions represent labels' semantic meanings [15], we can extend the rate-distortion function R(D) to the rate-truth function R(Θ) by replacing the average distortion d with the fuzzy entropy H(Y_θ|X) (see Section 4.3). We can also extend R(D) to the rate-verisimilitude function R(G) by replacing the upper limit D of the average distortion with the lower limit G of the semantic mutual information (see Section 4.4). We have used two examples to show how the iteration algorithm lets the Shannon channel P(y|x) match the semantic channel T(y|x) under the semantic constraints to achieve the MMI R. Example 1 reveals that the iteration algorithm can use the logical implication between labels to reduce the mutual information. Example 2 indicates that it is easy to control the Shannon channel P(y|x) with the semantic channel T(y|x) for data reduction.
However, we also need to compress and recover text data (according to the context), image data, and speech data (according to the results of pattern recognition or species recognition). The above extension is useful, but it is far from sufficient, and so further studies are necessary.
Funding: This research received no external funding.

Data Availability Statement: Not applicable.
Acknowledgments: This article's initial version was reviewed by two reviewers; the later version was reviewed by three reviewers. Their comments resulted in many meaningful revisions. Especially, the first two reviewers' criticisms reminded me to stress the combination of machine learning and data compression. I thank five reviewers for their revision opinions, criticisms, and support. I also thank the guest editor and the assistant editor for encouraging me to resubmit the improved manuscript.

Conflicts of Interest:
The author declares no conflict of interest.

Appendix A. The Derivation of Equation (47)
From Equation (45), we derive the mutual information between X and Y. Further, according to Equation (46), we obtain Equation (47). Only the formula for T(y_j|x), j = 2, 3, . . ., 6, has not been used by others in the above formulas. This formula represents a group of trapezoidal curves with rounded corners. Such curves are often used as membership functions and are usually constructed from piecewise functions. With this formula, the piecewise functions are unnecessary.
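A trapezoid with rounded corners can indeed be written as a single closed-form expression rather than a piecewise function. The sketch below uses a product of two logistic functions; this is one common illustrative form, not necessarily the paper's Appendix C formula, and the parameters `a`, `b`, and `k` are assumptions.

```python
import numpy as np

def smooth_trapezoid(x, a, b, k=1.0):
    """A trapezoid-like truth function with rounded corners in one closed-form
    expression: a rising logistic centered at `a` times a falling logistic
    centered at `b`; `k` controls the corner sharpness. No piecewise cases."""
    rise = 1.0 / (1.0 + np.exp(-k * (x - a)))
    fall = 1.0 / (1.0 + np.exp(k * (x - b)))
    return rise * fall
```

For a wide plateau (b − a large relative to 1/k), the function is close to 1 in the middle, falls smoothly to 0 on both sides, and has no corner points, which matches the description of the curves used for T(y_j|x), j = 2, . . ., 6.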
From Equation (45), we derive the mutual information between X and Y: Further, according to Equation (46) Only the formula for T(y j |x), j = 2, 3, . . . , 6, has not been used by others in the above formulas. This formula represents a group of trapezoidal curves with rounded corners. These curves are often used as membership functions and constructed by some piecewise functions. With this formula, the piecewise functions are not necessary.