Article

Using the Semantic Information G Measure to Explain and Extend Rate-Distortion Functions and Maximum Entropy Distributions

1 School of Computer Engineering and Applied Mathematics, Changsha University, Changsha 410000, China
2 Institute of Intelligence Engineering and Mathematics, Liaoning Technical University, Fuxin 123000, China
Entropy 2021, 23(8), 1050; https://doi.org/10.3390/e23081050
Submission received: 28 June 2021 / Revised: 8 August 2021 / Accepted: 13 August 2021 / Published: 15 August 2021
(This article belongs to the Special Issue Measures of Information)

Abstract

In the rate-distortion function and the Maximum Entropy (ME) method, Minimum Mutual Information (MMI) distributions and ME distributions are expressed by Bayes-like formulas, including Negative Exponential Functions (NEFs) and partition functions. Why do these non-probability functions appear in Bayes-like formulas? On the other hand, the rate-distortion function has three disadvantages: (1) the distortion function is subjectively defined; (2) the distortion function between instances and labels is often difficult to define; (3) it cannot be used for data compression according to the labels’ semantic meanings. The author previously proposed the semantic information G measure, which involves both statistical probability and logical probability. We can now explain NEFs as truth functions, partition functions as logical probabilities, Bayes-like formulas as semantic Bayes’ formulas, MMI as Semantic Mutual Information (SMI), and ME as extremely maximum entropy minus SMI. To overcome the above disadvantages, this paper sets up the relationship between truth functions and distortion functions, obtains truth functions from samples by machine learning, and constructs constraint conditions with truth functions to extend rate-distortion functions. Two examples are used to help readers understand the MMI iteration and to support the theoretical results. Using truth functions and the semantic information G measure, we can combine machine learning and data compression, including semantic compression. Further studies are needed to explore general data compression and recovery according to semantic meaning.

1. Introduction

Bayes’ formula is used for probability predictions. Using Bayes’ formula, from a joint (probability) distribution P(x, y), or distributions P(x|y) and P(y) (where x and y are variables), we can obtain the posterior distribution of y:
$$P(y|x) = \frac{P(x,y)}{\sum_y P(x,y)} = \frac{P(y)P(x|y)}{\sum_y P(y)P(x|y)} = \frac{P(y)P(x|y)}{P(x)}.$$
Similar expressions with Negative Exponential Functions (NEFs) and partition functions often appear in the rate-distortion theory [1,2,3], statistical mechanics, the maximum entropy method [4,5], and machine learning (see the Restricted Boltzmann Machine (RBM) [6,7] and the Softmax function [8,9]). For example, in the rate-distortion theory, the Minimum Mutual Information (MMI) distribution is:
$$P(y|x) = \frac{P(y)\exp[s\,d(x,y)]}{\sum_y P(y)\exp[s\,d(x,y)]},$$
where s is a negative parameter, and d(x, y) is the distortion function.
However, an NEF, such as exp[sd(x, y)], is not a (statistical) probability function because its sums, ∑x exp[sd(x, y)] and ∑y exp[sd(x, y)], are not 1. Its main feature is that its maximum value is exp(0) = 1. An NEF is more like a membership function [10], similarity function, or Distribution Constraint Function (DCF), which softly constrains a probability distribution. Why can an NEF be placed into a Bayes-like formula as if it were a probability function? Can we find a simple explanation for why these NEFs and partition functions appear in such different areas? Although the relationship between the rate-distortion function and the maximum entropy method has been studied [11], we still need a more general probability theory or probability framework to explain the above phenomenon.
On the other hand, although the rate-distortion theory has achieved great successes [2,3,12,13,14], it still has the following disadvantages:
  • The distortion function d(x, y) is subjectively defined and lacks an objective standard;
  • It is hard to define the distortion function using distances when we use labels to replace instances in machine learning, where the number of possible labels should be much smaller than the number of possible instances. For example, we need to use “Light rain”, “Moderate rain”, “Heavy rain”, “Light to moderate rain”, etc., to replace daily precipitations in millimeters, or use “Child”, “Youth”, “Adult”, “Elder”, etc., to replace people’s ages. In these cases, it is not easy to construct distortion functions;
  • We cannot apply the rate-distortion function to semantic compression, i.e., data compression according to labels’ semantic meanings.
The first two disadvantages remind us that we need to obtain allowable distortion functions from samples or sampling distributions by machine learning.
Consider an example of semantic compression related to ages. For instance, x denotes an age (instance) and yj denotes a label: y1 = “Non-adult”, y2 = “Youth”, y3 = “Adult”, and y4 = “Elder”. All x values that make yj true form a fuzzy set. The truth functions of these labels are also the membership functions of the fuzzy sets (see Figure 1). According to Davidson’s truth-conditional semantics [15], truth function T(yj|x) ascertains the semantic meaning of yj. For a given P(x) and the constraint, we need to find the Shannon channel P(y|x) that minimizes the mutual information between y and x. The Minimum Mutual Information (MMI) should be the lower limit of the average codeword length for coding x to y. The constraint condition states that labels’ selections should accord with the rules of the languages used, expressed by the truth functions. In this case, the rate-distortion function cannot work well because it is not easy to define a distortion function that is compatible with the labels’ semantic meanings.
There have been meaningful studies on semantic compression [16,17,18]. However, these studies either do not adopt information-theoretic methods related to the rate-distortion function [18] or do not use a semantic information measure. Data compression or clustering related to perception has been studied in [19,20]; however, discrimination functions (like the truth function) and the sensory information measure have not been adopted. For semantic compression, we need a proper information measure to measure semantic information and sensory information. We also want a function, like the rate-distortion function, related to labels’ semantic meanings.
To measure semantic information, researchers have proposed many semantic information measures [21,22,23,24,25] or information measures related to semantics [26,27]. However, it is not easy to use them for machine learning or semantic compression. For a similar purpose, the author proposed the semantic information G theory, or simply the G theory, in the 1990s [28,29,30]. The letter “G” means the generalization of Shannon’s information theory. The semantic information measure, i.e., the G measure, is defined with log(truth_function/logical_probability) or its average. The truth function and the logical probability are similar to the NEF and the partition function, respectively. The truth function can be expressed not only by the NEF but also by the Logistic function and other functions between 0 and 1. The semantic information G measure can be used to measure semantic information conveyed not only by natural languages but also by sensory organs and measuring instruments, such as thermometers, scales, and GPS devices [31]. For sensory information, truth functions become discrimination functions or confusion probability functions [30].
The G measure measures labels’ semantic information only according to their extensions ascertained by truth functions, without considering their intensions. Therefore, we may call this kind of semantic information formal semantic information. For simplicity, this paper mainly considers the (formal) semantic information between label y and instance x. The semantic information between two labels or sentences is briefly discussed in Section 6.4.
The G theory uses the P-T probability framework [32], which consists of both statistical probability (denoted by P) and logical probability (by T). The P-T probability framework and the G theory have been applied to several areas, such as data compression related to visual discrimination [30], machine learning [31], Bayesian confirmation [33], and the philosophy of science [32].
To overcome the rate-distortion function’s disadvantages, this paper sets up a transformation relation between the truth function and the distortion function and uses the truth function to replace the distortion function in the constraint condition to obtain the rate-truth function. Since a truth function can also be explained as a membership function, similarity function, confusion probability function, or DCF [32], it can be used as a learning function. It is often expressed as an NEF or Logistic function. In this way, we can overcome the rate-distortion function’s three disadvantages because:
  • the truth function is a learning function; it can be obtained from a sample or sampling distribution [31] and hence is not subjectively defined;
  • using the transformation relation, we can indirectly express the distortion function between any instance x and any label y by the truth function that may come from machine learning;
  • truth functions indicate labels’ semantic meanings and, hence, can be used as the constraint condition for semantic compression.
Combining the author’s previous studies [31,32], we can explain that:
  • the NEF and the partition functions are the truth function and the logical probability, respectively;
  • the formula, such as Equation (2), for the distribution of Minimum Mutual Information (MMI) or maximum entropy is the semantic Bayes’ formula;
  • MMI R(D) can be expressed by the semantic mutual information formula;
  • maximum entropy is equal to extremely maximum entropy minus semantic mutual information.
The new explanations do not contradict the existing explanations but complement them. We can use the fuzzy truth criterion to replace the distortion criterion so that the rate-distortion function becomes the rate-truth function R(Θ), where Θ is a group of fuzzy sets or (fuzzy) truth functions. We can also use the semantic information criterion, which is compatible with the likelihood criterion, to replace the distortion criterion so that the rate-distortion function becomes the rate-verisimilitude function R(G), where G is the lower limit of semantic mutual information. Both functions can be used for semantic compression. The rate-verisimilitude function has been introduced before [30,31], whereas the rate-truth function is first provided in this paper.
This paper mainly aims to:
  • help readers understand rate-distortion functions and maximum entropy distributions from the new perspective;
  • combine machine learning (for the distortion function) and data compression;
  • show that the rate-distortion function can be extended to the rate-truth function and the rate-verisimilitude function for communication data’s semantic compression.
This paper provides an example to show how the Shannon channel matches the semantic channel to achieve MMI R(Θ) and R(G) for a given source P(x) and a group of truth functions. It provides another example to show that the rate-truth function can be used for data reduction (e.g., compression by decreasing data resolution). The results support the theoretical analyses.
The new explanations should more cohesively combine classical information theory, semantic information G theory, maximum entropy theory, the likelihood method, and fuzzy set theory. Moreover, the rate-distortion function’s extensions should be practical for semantic compression and helpful for explaining machine learning with NEFs and Softmax functions. In turn, the new explanations and extensions also support the P-T probability framework and the semantic information G theory.

2. Background

2.1. Shannon’s Entropies and Mutual Information

Definition 1.
  • Variable x denotes an instance; X denotes a discrete random variable taking a value x ∈ U = {x1, x2, …, xm}.
  • Variable y denotes a hypothesis or label; Y denotes a discrete random variable taking a value y ∈ V = {y1, y2, …, yn}.
Shannon calls P(X) the source, P(Y) the destination, P(Y|X) the channel, and P(yj|x) (with a certain yj and variable x) a Transition Probability Function (TPF) [34] (p. 18). A Shannon channel is a transition probability matrix or a group of TPFs:
P(y|x): P(yj|x), j = 1, 2, …, n.
Shannon’s mutual information is:
$$I(X;Y) = \sum_j \sum_i P(x_i, y_j)\log\frac{P(x_i|y_j)}{P(x_i)} = H(X) - H(X|Y) = \sum_j \sum_i P(x_i, y_j)\log\frac{P(y_j|x_i)}{P(y_j)} = H(Y) - H(Y|X),$$
where H(X) and H(Y) are Shannon’s entropies of X and Y; H(X|Y) and H(Y|X) are Shannon’s conditional entropies of X and Y.
When Y = yj is given, I(X; Y) becomes the Kullback–Leibler (KL) divergence:
$$I(X; y_j) = \sum_i P(x_i|y_j)\log\frac{P(x_i|y_j)}{P(x_i)}.$$
It is greater than or equal to 0, and it equals 0 when P(x|yj) = P(x).

2.2. Rate-Distortion Function R(D)

Shannon proposed the information rate-distortion function [1,2]. Since the rate-distortion function for an i.i.d. source P(X) and bounded function d(x, y) is equal to the associated information rate-distortion function, Cover and Thomas, in [14] (p. 307), do not distinguish the two functions in most cases. They call both functions the rate-distortion function and use R(D) to denote them. We follow their example. The following is the definition of the (information) rate-distortion function.
Let the distortion function be d(x, y) or dij = d(xi, yj), i = 1, 2, …; j = 1, 2, …; let $\bar{d}$ be the average of d(x, y); and let D be the upper limit of $\bar{d}$. An often-used distortion function is d(x, y) = (x − y)². This function fits cases where x and y have the same universe (e.g., U = V), and distortion only depends on the distance between x and y.
The MMI for the given P(X) and D is defined as:
$$R(D) = \min_{P(y|x):\ \bar{d} \le D} I(X;Y).$$
We can obtain the parameter solution of the rate-distortion function by the variational method [2,3]. The constraint conditions are:
$$D = \sum_j \sum_i P(x_i) P(y_j|x_i) d_{ij},$$
$$\sum_j P(y_j|x_i) = 1, \quad i = 1, 2, \ldots, m,$$
$$\sum_j P(y_j) = 1.$$
The Lagrange function is therefore:
$$F = I(X;Y) - sD - \sum_i \mu_i \sum_j P(y_j|x_i) - \alpha \sum_j P(y_j).$$
Since P(y|x) and P(y) are interdependent, we need to fix one to optimize the other. To optimize P(y|x), we fix P(y) and set $\partial F/\partial P(y_j|x_i) = 0$. Then we derive the optimized TPFs or channel:
$$P(y_j|x_i) = P(y_j)\lambda_i \exp(s d_{ij}), \quad i = 1, 2, \ldots;\ j = 1, 2, \ldots, \qquad \lambda_i = 1\Big/\sum_k P(y_k)\exp(s d_{ik}),$$
where exp( ) is the inverse function of log( ), and λi is defined by λi ≡ exp(μi/P(xi)). To optimize P(y), we fix P(yj|xi) in F and set $\partial F/\partial P(y_j) = 0$. Hence, we derive α = 1 and the optimized P(y):
$$P(y_j) = \sum_i P(x_i) P(y_j|x_i).$$
Since P(y|x) and P(y) are interdependent, we first suppose P(y1) = P(y2) = … = 1/n and then repeat Equations (10) and (11) until P(y) is unchanged. We call this iteration the MMI iteration.
It is worth noting that our purpose is to find proper TPFs P(yj|x), j = 1, 2, …, but it is difficult to find them directly because of the constraint conditions in Equations (7) and (8). Therefore, we can only find the proper posterior distributions P(y|xi) of y, i = 1, 2, …, by the MMI iteration to obtain the TPFs indirectly.
The rate-distortion function R(D) with parameter s is [3] (p. 32):
$$D(s) = \sum_i \sum_j d_{ij} P(x_i) P(y_j)\exp(s d_{ij})/Z_i, \quad R(s) = sD(s) - \sum_i P(x_i)\log Z_i, \quad Z_i = 1/\lambda_i = \sum_k P(y_k)\exp(s d_{ik}),$$
where Zi is the partition function.
The parameter s = dR/dD is negative, and hence exp(sdij) is a negative exponential function. Thus, a larger |s| results in a narrower exp(sdij), a larger R, and a smaller D.
Shannon has proved that R(D) is the lower limit of the average codeword length for i.i.d. source P(X) with an average distortion limit, D. The rate-distortion theory is the basic theory of data compression for digital communication.
Since I(X; Y) = H(X) − H(X|Y) and P(X) is unchanged, minimizing R = I(X; Y) is equivalent to maximizing H(X|Y). Therefore, the MMI distribution P(y|x) also maximizes posterior entropy H(X|Y).
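To make the MMI iteration concrete, the following Python sketch alternates Equations (10) and (11) for a given source and distortion matrix; the squared-error distortion and the uniform toy source in it are made-up illustrations, not data used elsewhere in this paper.

```python
# A minimal sketch of the MMI iteration (Equations (10) and (11)) for R(D),
# assuming a made-up squared-error distortion and a uniform toy source.
import numpy as np

def mmi_iteration(Px, d, s, n_iter=200, tol=1e-10):
    """Alternate Equations (10) and (11) to find the MMI channel P(y|x).

    Px: shape (m,) source P(x); d: shape (m, n) distortion d(x_i, y_j);
    s: the negative parameter s = dR/dD.
    """
    m, n = d.shape
    Py = np.full(n, 1.0 / n)                    # start from a uniform P(y)
    for _ in range(n_iter):
        w = Py * np.exp(s * d)                  # P(y_j) exp(s d_ij)
        Pyx = w / w.sum(axis=1, keepdims=True)  # Equation (10): P(y_j|x_i)
        Py_new = Px @ Pyx                       # Equation (11): new P(y_j)
        if np.max(np.abs(Py_new - Py)) < tol:
            Py = Py_new
            break
        Py = Py_new
    Pxy = Px[:, None] * Pyx                     # joint P(x_i, y_j)
    R = np.sum(Pxy * np.log2(Pyx / Py))         # I(X; Y) in bits
    D = np.sum(Pxy * d)                         # average distortion
    return Pyx, Py, R, D

# Hypothetical example: 8 instances, 3 labels, d(x, y) = (x - y)^2.
x = np.arange(8.0); y = np.array([1.0, 3.5, 6.0])
Pyx, Py, R, D = mmi_iteration(np.full(8, 1 / 8), (x[:, None] - y[None, :])**2, s=-0.5)
print(R, D)
```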

2.3. The Maximum Entropy Method

Jaynes [4,5] first expounded the maximum entropy method and argued that entropy in statistical mechanics should simply be viewed as a particular application of entropy in information theory.
Suppose that we need to maximize the joint entropy H(X, Y) for a given source P(x); maximizing H(X, Y) is equivalent to maximizing H(Y|X):
$$H(Y|X) = -\sum_i \sum_j P(x_i) P(y_j|x_i)\log P(y_j|x_i).$$
Moreover, supposing there are feature functions fk(x, y), k = 1, 2, …, the constraint conditions are:
$$\sum_i \sum_j P(x_i, y_j) f_k(x_i, y_j) = F_k, \quad k = 1, 2, \ldots,$$
$$\sum_j P(y_j|x_i) = 1, \quad i = 1, 2, \ldots$$
The Lagrange function is therefore:
$$F = H(Y|X) - \sum_k \alpha_k \sum_i \sum_j P(x_i) P(y_j|x_i) f_k(x_i, y_j) - \sum_i \mu_i \sum_j P(y_j|x_i).$$
By setting $\partial F/\partial P(y|x_i) = 0$, we derive the optimized channel P(y|x):
$$P(y|x_i) = \exp\Big[\sum_k \alpha_k f_k(x_i, y)\Big]\Big/Z_i, \quad i = 1, 2, \ldots, \qquad Z_i = \sum_j \exp\Big[\sum_k \alpha_k f_k(x_i, y_j)\Big].$$
This P(y|x) can maximize H(X, Y).
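As a minimal numerical illustration of Equation (17), the following sketch computes the maximum-entropy channel from given feature functions; the single feature and the Lagrange multiplier value are made-up assumptions, since finding the multipliers that satisfy Equation (14) is a separate step not shown here.

```python
# A minimal sketch of the maximum-entropy channel in Equation (17), assuming
# the Lagrange multipliers alpha_k are already known (here they are made up).
import numpy as np

def max_entropy_channel(f, alpha):
    """f: shape (K, m, n) feature functions f_k(x_i, y_j); alpha: shape (K,)."""
    logits = np.tensordot(alpha, f, axes=1)     # sum_k alpha_k f_k(x_i, y_j)
    w = np.exp(logits)                          # the NEF in Equation (17)
    Z = w.sum(axis=1, keepdims=True)            # partition function Z_i
    return w / Z                                # P(y_j|x_i)

# Hypothetical example with one feature f_1(x_i, y_j) = -(x_i - y_j)^2.
x = np.arange(5.0); y = np.arange(3.0)
f1 = -(x[:, None] - y[None, :])**2
Pyx = max_entropy_channel(f1[None, :, :], np.array([0.8]))
print(Pyx.sum(axis=1))                          # each row sums to 1
```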

3. The Author’s Related Work

3.1. The P-T Probability Framework

The semantic information G theory is based on the P-T probability framework [29,30]. This framework includes two types of probabilities: the statistical probability denoted by P and the logical probability by T.
Definition 2.
(for the P-T probability framework):
  • The yj is a label or a hypothesis; yj(xi) is a proposition. The θj is a fuzzy subset of universe U, whose elements make yj true. We have yj(x) ≡ “x ∈ θj” ≡ “x belongs to θj” (“≡” means they are logically equivalent according to the definition). The θj may also be a model or a group of model parameters.
  • A probability that is defined with “=”, such as P(yj) ≡ P(Y = yj), is a statistical probability. A probability that is defined with “∈”, such as P(X ∈ θj), is a logical probability. To distinguish P(Y = yj) and P(X ∈ θj), we define T(yj) ≡ T(θj) ≡ P(X ∈ θj) as the logical probability of yj.
  • T(yj|x) ≡ T(θj|x) ≡ P(X ∈ θj|X = x) is the truth function of yj and the membership function of θj. It changes between 0 and 1, and the maximum of T(y|x) is 1.
A semantic channel consists of a group of truth functions:
T(y|x): T(θj|x), j = 1, 2, …, n.
According to the above definition, we have the logical probability:
$$T(y_j) \equiv T(\theta_j) \equiv P(X \in \theta_j) = \sum_i P(x_i) T(\theta_j|x_i).$$
Zadeh calls this probability the fuzzy event’s probability [35]. If θj is a crisp set, T(θj) becomes a cumulative probability or the difference between two cumulative probabilities [36].
Generally, T(y1) + T(y2) + … + T(yn) > 1. For example, the sum of the logical probabilities of the four labels in Figure 1 is greater than 1 since T(y1) + T(y3) = 1 already. Detailed discussions of the distinctions and relations between statistical probability and logical probability can be found in [32].
We can put T(θj|x) and P(x) into the Bayes’ formula to obtain a likelihood function:
$$P(x|\theta_j) = \frac{T(\theta_j|x)P(x)}{T(\theta_j)}, \quad T(\theta_j) = \sum_i T(\theta_j|x_i)P(x_i).$$
P(x|θj) is called the semantic Bayes’ prediction. It is often written as P(x|yj, θ) in popular methods. We call the above formula the semantic Bayes’ formula.
Since the maximum of T(y|x) is 1, from P(x) and P(x|θj), we can obtain:
$$T(\theta_j|x) = \frac{T(\theta_j)P(x|\theta_j)}{P(x)}, \quad T(\theta_j) = 1\big/\max[P(x|\theta)/P(x)],$$
where max[P(x|θ)/P(x)] means the maximum of P(x|θ)/P(x) over different x and y. In the author’s earlier articles [31], T(θj) = 1/max[P(x|θj)/P(x)]. The change in Equation (20) ensures that truth functions are symmetrical, e.g., T(x|y) = T(y|x), as distortion functions are. This change also allows comparing two truth functions T(yj|x) and T(yk|x) for classification according to the correlation between x and y (see Equation (23)). Since P(x|θj) in Equation (19) is unchanged when T(θj|x) is replaced with cT(θj|x) (where c is a positive constant), this change does not influence the other uses of T(θj|x).
Equations (19) and (20) form the third Bayes’ theorem [31], which can be used to convert the likelihood function and the truth function into each other.
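The following sketch illustrates this mutual conversion numerically; the Gaussian-like truth function and the prior below are made-up examples of the kind of curves shown in Figure 1, not the exact data of this paper.

```python
# A minimal sketch of the semantic Bayes' formula (Equation (19)) and its
# inverse (Equation (20)); the prior and the truth function are made up.
import numpy as np

x = np.arange(0, 100.0)                              # ages as instances
Px = np.exp(-x**2 / (2 * 37**2)); Px /= Px.sum()     # a prior P(x)

T_youth = np.exp(-(x - 22.0)**2 / (2 * 5.0**2))      # a truth function T(theta|x)

# Equation (19): P(x|theta) = T(theta|x) P(x) / T(theta)
T_logical = np.sum(T_youth * Px)                     # logical probability T(theta)
Px_theta = T_youth * Px / T_logical

# Equation (20): recover the truth function from P(x|theta) and P(x)
ratio = Px_theta / Px
T_back = ratio / ratio.max()
print(np.allclose(T_back, T_youth))                  # True: the conversion is reversible
```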

3.2. The Semantic Information G Measure

The author [28] defines the (amount of) semantic information conveyed by yj in relation to xi with log-normalized-likelihood:
$$I(x_i; \theta_j) = \log\frac{P(x_i|\theta_j)}{P(x_i)} = \log\frac{T(\theta_j|x_i)}{T(\theta_j)}.$$
The value I(xi; θj), or its average, is the semantic information G measure or, simply, the G measure. If T(θj|x) is always 1, the G measure becomes Carnap and Bar-Hillel’s semantic information measure [21].
The above formula is illustrated in Figure 2. Figure 2 indicates that the less the logical probability is (e.g., the lower the horizontal line is), the more information there is available; the larger the deviation is, the less information there is available; a wrong hypothesis conveys negative information. These conclusions accord with Popper’s thoughts [37] (p. 294). For this reason, I(xi; θj) is also explained as the verisimilitude between yj and xi [32].
We can also use the above formula to measure sensory information, for which T(θj|x) is the confusion probability function of xj with x or the discrimination function of xj [31].
By averaging I(xi; θj), we obtain generalized KL information:
$$I(X; \theta_j) = \sum_i P(x_i|y_j)\log\frac{P(x_i|\theta_j)}{P(x_i)} = \sum_i P(x_i|y_j)\log\frac{T(\theta_j|x_i)}{T(\theta_j)},$$
where P(xi|yj), i = 1, 2, …, is the sampling distribution, which may be unsmooth or discontinuous. It is easy to prove I(X; θj) ≤ I(X; yj) [31].
When the sample is enormous, so that P(x|yj) is smooth, we may let P(x|θj) = P(x|yj) or T(θj|x) ∝ P(yj|x) to obtain the optimized truth function:
$$T^*(\theta_j|x) = \frac{P^*(x|\theta_j)}{P(x)}\Big/\max\frac{P^*(x|\theta)}{P(x)} = \frac{P(x|y_j)}{P(x)}\Big/\max\frac{P(x|y)}{P(x)} = \frac{P(x, y_j)}{P(x)P(y_j)}\Big/\max\frac{P(x, y)}{P(x)P(y)}.$$
According to this equation, T*(θj|xi) = T*(θxi|yj) or T*(yj|xi) = T*(xi|yj), where θxi is a fuzzy subset of V. Furthermore, we have:
T*(θj|x) = cP(yj|x)
where c is a constant. The above formula means the optimized truth function T*(θj|x) is proportional to TPF P(yj|x). This formula accords with Wittgenstein’s thoughts: meaning lies in uses [38] (p. 80).
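The following sketch implements Equation (23) for a smooth sampling distribution; the joint distribution used here is a made-up illustration.

```python
# A minimal sketch of Equation (23): optimized truth functions from a smooth
# sampling distribution P(x, y); the joint distribution below is made up.
import numpy as np

def optimized_truth_functions(Pxy):
    """Pxy: shape (m, n) joint sampling distribution P(x_i, y_j)."""
    Px = Pxy.sum(axis=1, keepdims=True)
    Py = Pxy.sum(axis=0, keepdims=True)
    lift = Pxy / (Px * Py)                  # P(x, y) / (P(x) P(y))
    return lift / lift.max()                # normalize so the global maximum is 1

# Hypothetical joint distribution over 6 instances and 2 labels.
Pxy = np.array([[0.20, 0.01], [0.15, 0.02], [0.10, 0.05],
                [0.05, 0.10], [0.02, 0.15], [0.01, 0.14]])
T = optimized_truth_functions(Pxy)
print(T.max())                              # 1.0, as required of a truth function
```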
If P(x|yj) is unsmooth, we may achieve a smooth T*(θj|x) with parameters by:
$$T^*(\theta_j|x) = \arg\max_{T(\theta_j|x)} \sum_i P(x_i|y_j)\log\frac{T(\theta_j|x_i)}{T(\theta_j)}.$$
By averaging I(X; θj) for different y, we obtain semantic mutual information:
$$I(X; Y_\theta) = \sum_j P(y_j)\sum_i P(x_i|y_j)\log\frac{P(x_i|\theta_j)}{P(x_i)} = \sum_i \sum_j P(x_i)P(y_j|x_i)\log\frac{T(\theta_j|x_i)}{T(\theta_j)} = H(Y_\theta) - H(Y_\theta|X),$$
where:
$$H(Y_\theta) = -\sum_j P(y_j)\log T(\theta_j),$$
$$H(Y_\theta|X) = -\sum_j \sum_i P(x_i, y_j)\log T(\theta_j|x_i).$$
H(Yθ) is a cross-entropy. Since ∑j T(θj) ≥ 1, we also call H(Yθ) a generalized entropy or a semantic entropy. H(Yθ|X) is called a fuzzy entropy.
When we fix the Shannon channel P(y|x) and let P(x|θj) = P(x|yj) or T(θj|x) ∝ P(yj|x) for every j (Matching I), I(X; Yθ) reaches its maximum I(X; Y). If we use a group of truth functions or a semantic channel T(y|x) as the constraint function to seek MMI, we need to let P(x|yj) = P(x|θj) or P(yj|x) ∝ T(θj|x) as far as possible for every j (Matching II). Section 4.3 and Section 4.4 further discuss Matching II.
Letting T(θj|x) = exp[−(x − μj)²/(2σj²)], we have:
$$I(X; Y_\theta) = H(Y_\theta) - H(Y_\theta|X) = -\sum_j P(y_j)\log T(\theta_j) - \sum_j \sum_i P(x_i, y_j)(x_i - \mu_j)^2/(2\sigma_j^2).$$
It is easy to find that the above mutual information is like the Regularized Least Square (RLS). H(Yθ) is like the regularization term, and the next one is the relative squared error term. Therefore, we can treat the maximum semantic mutual information criterion as a particular RLS criterion.
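As a small numerical check of Equation (26), the following sketch computes I(X; Yθ) from a source, a Shannon channel, and a semantic channel; the three arrays are made-up toy inputs.

```python
# A minimal sketch of the semantic mutual information in Equation (26);
# the source, Shannon channel, and semantic channel below are made up.
import numpy as np

def semantic_mutual_information(Px, Pyx, T):
    """I(X; Y_theta) = sum_ij P(x_i, y_j) log2[T(theta_j|x_i)/T(theta_j)], in bits."""
    Pxy = Px[:, None] * Pyx                   # joint P(x_i, y_j)
    T_logical = Px @ T                        # T(theta_j) = sum_i P(x_i) T(theta_j|x_i)
    info = np.log2(T / T_logical)             # I(x_i; theta_j), Equation (21)
    return np.sum(Pxy * info)

Px  = np.array([0.5, 0.3, 0.2])
T   = np.array([[1.0, 0.1],                   # semantic channel T(theta_j|x_i)
                [0.6, 0.5],
                [0.1, 1.0]])
Pyx = np.array([[0.9, 0.1],                   # Shannon channel P(y_j|x_i)
                [0.5, 0.5],
                [0.1, 0.9]])
print(semantic_mutual_information(Px, Pyx, T))
```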

4. Theoretical Results

4.1. The New Explanations of the MMI Distribution and the Rate-Distortion Function

It is easy to regard exp[sd(xi, y)] in the rate-distortion function as a truth function and Zi as a logical probability. We can let θxi be a fuzzy subset of V (V = {y1, y2, …}) and T(xi|y) ≡ T(θxi|y) be the truth function of y(xi), and T(xi) = T(θxi) = ∑j P(yj)T(yj|xi) be the logical probability of xi. Hence, we can observe that the MMI distribution P(y|x) in the rate-distortion function is produced by the semantic Bayes’ formula:
P(y|xi) = P(y)exp[sd(xi, y)]/Zi = P(y)T(xi|y)/T(xi), i = 1, 2, …, m.
R(D) can be expressed by the semantic mutual information formula because:
$$I(Y; X_\theta) = \sum_j \sum_i P(x_i, y_j)\log\frac{T(x_i|y_j)}{T(x_i)} = \sum_j \sum_i P(x_i, y_j)\log\exp[s d(x_i, y_j)] - \sum_i P(x_i)\log Z_i = sD(s) - \sum_i P(x_i)\log Z_i = R(D).$$

4.2. Setting Up the Relation between the Truth Function and the Distortion Function

We can now improve the G theory by setting up the relation between the truth function and the distortion function.
The four truth functions in Figure 1 also reveal the distortion when we use yj to represent xi. If the truth value of yj(xi) is 1, the distortion d(xi, yj) should be 0. The distortion increases as the truth value decreases. Hence, we use the following definition:
Definition 3.
The transformation relation between the distortion function and the truth function is defined as:
d(x, y) ≡ log[1/T(y|x)].
According to this definition, we have T(y|x) = exp[−d(x, y)] and H(Yθ|X) = $\bar{d}$. Therefore, we can use H(Yθ|X) to replace $\bar{d}$ for the constraint condition.
Since T(y|x) = T(x|y), we also have d(x, y) = log[1/T(x|y)].
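The following sketch verifies this relation numerically on made-up toy arrays: converting the truth function into a distortion function by Definition 3 makes the fuzzy entropy H(Yθ|X) equal to the average distortion.

```python
# A minimal sketch of Definition 3 and the identity H(Y_theta|X) = d-bar;
# the truth functions and the Shannon channel below are made up (logs in nats).
import numpy as np

T = np.array([[1.0, 0.2],            # T(y_j|x_i) for two instances and two labels
              [0.3, 1.0]])
d = np.log(1.0 / T)                  # Definition 3: d(x, y) = log[1/T(y|x)]
assert np.allclose(T, np.exp(-d))    # and back: T(y|x) = exp[-d(x, y)]

Px  = np.array([0.6, 0.4])
Pyx = np.array([[0.9, 0.1],          # a toy Shannon channel P(y_j|x_i)
                [0.2, 0.8]])
Pxy = Px[:, None] * Pyx
avg_distortion = np.sum(Pxy * d)                # d-bar
fuzzy_entropy = -np.sum(Pxy * np.log(T))        # H(Y_theta|X)
print(np.isclose(avg_distortion, fuzzy_entropy))    # True
```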

4.3. Rate-Truth Function R(Θ)

The author previously proposed the rate-of-limiting-error function in [30], which was an immature study. That function is more akin to an extension of the complexity-distortion function [39], unlike the rate-truth function, which is an extension of the rate-distortion function.
In the following, we use almost the same method used for R(D) to obtain the rate-truth function R(Θ), where Θ means a group of truth functions or fuzzy sets. The constraint condition $\bar{d} \le D$ becomes:
$$H(Y_\Theta|X) = -\sum_i \sum_j P(x_i)P(y_j|x_i)\log T(y_j|x_i) \le D.$$
Following the rate-distortion function’s derivation (see Equation (10)), we obtain the optimized posterior distribution P(y|xi) of y:
$$P(y|x_i) = P(y)T(y|x_i)^{|s|}\Big/\sum_k P(y_k)T(y_k|x_i)^{|s|} = P(y)T(x_i|y)^{|s|}/T(x_i) = P(y|\theta_{x_i}), \quad i = 1, 2, \ldots,$$
where we replace −s with |s|, which is positive. The larger the |s| value is, the clearer the boundaries of the fuzzy sets are, and hence the larger the R is.
Next, we obtain the optimized P(y) (see Equation (11)).
We can obtain the Shannon channel P(y|x) that minimizes R by repeating Equations (11) and (34) until P(y) converges, e.g., by the MMI iteration.
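The following sketch implements this iteration (Equation (34) alternated with Equation (11)); the source P(x) and the two sigmoidal truth functions are made-up stand-ins for the kind of semantic channel shown in Figure 1.

```python
# A minimal sketch of the MMI iteration for the rate-truth function R(Theta);
# the source P(x) and the two truth functions below are made up.
import numpy as np

def rate_truth_iteration(Px, T, s_abs=1.0, n_iter=500, tol=1e-12):
    """Px: shape (m,); T: shape (m, n) semantic channel T(y_j|x_i); s_abs = |s|."""
    m, n = T.shape
    Py = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        w = Py * T**s_abs                         # P(y_j) T(y_j|x_i)^|s|
        Pyx = w / w.sum(axis=1, keepdims=True)    # Equation (34)
        Py_new = Px @ Pyx                         # Equation (11)
        if np.max(np.abs(Py_new - Py)) < tol:
            Py = Py_new
            break
        Py = Py_new
    Pxy = Px[:, None] * Pyx
    R = np.sum(Pxy * np.log2(Pyx / Py))           # MMI R(Theta), in bits
    return Pyx, Py, R

x = np.arange(0, 100.0)
Px = np.exp(-x**2 / (2 * 37**2)); Px /= Px.sum()
T = np.stack([1 / (1 + np.exp(1.5 * (x - 18))),   # a "non-adult"-like truth function
              1 / (1 + np.exp(-1.5 * (x - 18)))], # an "adult"-like truth function
             axis=1)
Pyx, Py, R = rate_truth_iteration(Px, T, s_abs=1.0)
print(Py, R)
```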
Bringing P(y|xi) in Equation (34) into the mutual information formula, we obtain the rate-truth function R(Θ):
$$R(\Theta) = \sum_i \sum_j P(x_i)P(y_j|x_i)\log[P(y_j|x_i)/P(y_j)] = \sum_i \sum_j P(x_i)P(y_j|x_i)\log[T(x_i|y_j)^{|s|}/T(x_i)] = H(X_\theta) - H(X_\theta|Y) = I(Y; X_\theta),$$
where:
$$H(X_\theta|Y) = -\sum_i \sum_j P(x_i, y_j)\log T(x_i|y_j)^{|s|}, \quad T(x_i) = \sum_j T(x_i|y_j)^{|s|} P(y_j), \quad H(X_\theta) = -\sum_i P(x_i)\log T(x_i).$$
We have R(Θ) = H(Xθ) − H(Xθ|Y) = I(Y; Xθ), which means that the MMI can be expressed by the semantic mutual information I(Y; Xθ). The constraint is tighter when |s| > 1 and looser when |s| < 1 than when |s| = 1. When the constraint condition is the original truth function without s or with s = −1, we have:
R(Θ) = I(X; Y) = I(Y; Xθ) ≥ I(X; Yθ).
The reason for I(Y; Xθ) ≥ I(X; Yθ) is that the iteration can only let P(yj|x) be approximately proportional to T(yj|x), and not exactly in general (see Equation (34) and Section 5.2).
According to Shannon’s lossy coding theorem of discrete memoryless sources [14] (p. 307), for the given sources P(X) and D, we can use block coding to achieve a minimum average codeword length, whose lower limit is R(D). Since H(Xθ|Y) can be understood as the average distortion, R(Θ) also means the lower limit of average codeword length.
For any distortion function d(x, y), we can always express the corresponding truth function as exp[−d(x, y)]. But for truth functions, such as those in Figure 1, we may not be able to express them by a distortion function. Therefore, the rate-distortion function may be regarded as the rate-truth function’s particular case, as the truth function is exp[−d(x, y)].

4.4. Rate-Verisimilitude Function R(G)

If we change the average distortion criterion into the semantic mutual information criterion, which is compatible with the likelihood criterion, then the rate-distortion function becomes the rate-verisimilitude function [31]. In this case, we replace dij = d(xi, yj) with Iij = I(xi; θj). The constraint condition $\bar{d} \le D$ becomes I(X; Yθ) ≥ G, where G denotes the lower limit of semantic mutual information. Following the rate-distortion function’s derivation, we can obtain:
$$G(s) = \sum_i \sum_j P(x_i)P(y_j|x_i) I_{ij} = \sum_i \sum_j I_{ij} P(x_i)P(y_j)\exp(s I_{ij})/Z_i, \quad R(s) = sG(s) - \sum_i P(x_i)\log Z_i,$$
where s is positive, and:
$$P(y_j|x_i) = P(y_j)\left[\frac{T(y_j|x_i)}{T(y_j)}\right]^s\Big/Z_i, \quad i = 1, 2, \ldots;\ j = 1, 2, \ldots, \qquad Z_i = \sum_k P(y_k)\left[\frac{T(y_k|x_i)}{T(y_k)}\right]^s.$$
We also need the MMI iteration to optimize P(y) and P(y|x). The function P(y|xi) is now like a Softmax function, in which the numerator P(y)[T(y|xi)/T(y)]^s may be greater than 1. In the E-step of the EM algorithm for mixture models, there is a similar formula, which is also used for MMI [40].
R(G) is more suitable than R(D) and R(Θ) when y is a prediction, such as a weather prediction, where information is more important than truth. More discussions and applications of the R(G) function can be found in [30,31].
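The following sketch implements the corresponding iteration (Equation (39) alternated with Equation (11)); it reuses the same made-up two-label setup as the R(Θ) sketch above and is only an illustration of the procedure.

```python
# A minimal sketch of the MMI iteration for the rate-verisimilitude function
# R(G); the source and truth functions are the same made-up ones as above.
import numpy as np

def rate_verisimilitude_iteration(Px, T, s=1.0, n_iter=500, tol=1e-12):
    m, n = T.shape
    T_logical = Px @ T                            # T(theta_j), fixed by P(x)
    lift = T / T_logical                          # T(y_j|x_i) / T(y_j)
    Py = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        w = Py * lift**s                          # numerator of Equation (39)
        Pyx = w / w.sum(axis=1, keepdims=True)
        Py_new = Px @ Pyx                         # Equation (11)
        if np.max(np.abs(Py_new - Py)) < tol:
            Py = Py_new
            break
        Py = Py_new
    Pxy = Px[:, None] * Pyx
    R = np.sum(Pxy * np.log2(Pyx / Py))           # MMI R, in bits
    G = np.sum(Pxy * np.log2(lift))               # semantic mutual information I(X; Y_theta)
    return Pyx, Py, R, G

x = np.arange(0, 100.0)
Px = np.exp(-x**2 / (2 * 37**2)); Px /= Px.sum()
T = np.stack([1 / (1 + np.exp(1.5 * (x - 18))),
              1 / (1 + np.exp(-1.5 * (x - 18)))], axis=1)
print(rate_verisimilitude_iteration(Px, T, s=1.0)[2:])
```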

4.5. The New Explanation of the Maximum Entropy Distribution and the Extension

We consider maximizing joint entropy H(X, Y) for the given P(x).
Consider Equation (17) for the maximum entropy distribution. We may assume that P(y) is a constant (1/n) so that H(Y) is the maximum. Then Equation (17) becomes:
$$P(y|x_i) = P(y)\exp\Big[\sum_k \alpha_k f_k(x_i, y)\Big]\Big/Z_i', \quad i = 1, 2, \ldots, \qquad Z_i' = \sum_j P(y)\exp\Big[\sum_k \alpha_k f_k(x_i, y_j)\Big].$$
The above exp(.) is an NEF. We can explain exp(.) as a truth function, Zi′ as a logical probability, and the above formula as a semantic Bayes’ formula.
Then, maximum joint entropy can be expressed as:
$$H(X, Y) = H(X) + H(Y|X) = H(X) - \sum_i P(x_i)\sum_j P(y_j|x_i)\log P(y_j|x_i) = H(X) - \sum_i \sum_j P(x_i, y_j)\log\Big[P(y_j)\exp\Big(\sum_k \alpha_k f_k(x_i, y_j)\Big)\Big/Z_i'\Big] = H(X) + H(Y) - I(Y; X_\theta),$$
where H(Y) is equal to log n, I(X; Yθ) is semantic mutual information, and H(X) + H(Y) is the extremely maximum value of H(X, Y). Therefore, we can explain that maximum entropy is equal to extremely maximum entropy minus semantic mutual information.
If we use a group of truth functions or membership functions, such as those in Figure 1, as the constraint condition to extend the maximum entropy method, the above distribution becomes:
$$P(y|x_i) = T(y|x_i)^{|s|}\Big/\sum_k T(y_k|x_i)^{|s|}, \quad i = 1, 2, \ldots$$
This distribution is the same as in Equation (34) with P(y) = 1/n.

4.6. The New Explanation of the Boltzmann Distribution and the Maximum Entropy Law

Using Stirling’s formula ln N! ≈ N ln N − N as N → ∞, Jaynes proved that Boltzmann’s entropy and Shannon’s entropy have a simple relation [4,5]:
$$S = k\ln W = k\ln\frac{N!}{\prod_i N_i!} = -kN\sum_i P(x_i|T)\ln P(x_i|T) = kN\,H(X|T),$$
where k is the Boltzmann constant, W is the microstate number, xi means state i, N is the particles’ number, and P(xi|T) denotes the probability of a particle (or the density of particles) in state i for the given absolute temperature T.
The Boltzmann distribution [41] is:
$$P(x_i|T) = \exp\Big(-\frac{e_i}{kT}\Big)\Big/Z, \quad Z = \sum_i \exp\Big(-\frac{e_i}{kT}\Big),$$
where Z is the partition function.
If xi means energy ei, Gi is the number of states with ei, and G is the total number of all states, then P(xi) = Gi/G is the prior probability of xi. Hence, Equations (43) and (44) become:
$$S = k\ln\frac{N!}{\prod_i (N_i!/G_i^{N_i})} = -kN\sum_i P(x_i|T)\ln\frac{P(x_i|T)}{G_i} = -kN\sum_i P(x_i|T)\ln\frac{P(x_i|T)}{P(x_i)} + kN\ln G,$$
$$P(x_i|T) = P(x_i)\exp\Big(-\frac{e_i}{kT}\Big)\Big/Z', \quad Z' = \sum_i P(x_i)\exp\Big(-\frac{e_i}{kT}\Big).$$
Now, we can explain exp[−ei/(kT)] as a truth function T(θj|x), Z′ as a logical probability T(θj), and Equation (46) as a semantic Bayes’ formula.
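The following sketch evaluates Equation (46) numerically; the energy levels, degeneracies, and temperature are made-up values used only to show that the formula behaves as a semantic Bayes' formula with the NEF as a truth function.

```python
# A minimal sketch of Equation (46): a Boltzmann distribution with the state
# degeneracies as a prior; all physical inputs below are made up.
import numpy as np

k_B = 1.380649e-23                                # Boltzmann constant (J/K)
e = np.array([0.0, 1.0, 2.0, 3.0]) * 1.602e-21    # hypothetical energy levels (J)
G = np.array([1.0, 3.0, 5.0, 7.0])                # hypothetical degeneracies G_i
Px = G / G.sum()                                  # prior P(x_i) = G_i / G
T_kelvin = 300.0

nef = np.exp(-e / (k_B * T_kelvin))               # the NEF exp[-e_i/(kT)], a truth function
Z = np.sum(Px * nef)                              # logical probability (partition function Z')
P_x_given_T = Px * nef / Z                        # Equation (46), a semantic Bayes' formula
print(P_x_given_T, P_x_given_T.sum())             # a distribution summing to 1
```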
Assume that a local equilibrium system’s different areas yj, j = 1, 2, …, have different temperatures Tj, j = 1, 2, …; the above G then becomes Gj. Then we can derive (see Appendix A for the details):
$$S/(kN) = \sum_j P(y_j)\ln G_j - I(X; Y_\theta),$$
which means that the thermodynamic entropy S is proportional to the extremely maximum entropy j P ( y j ) ln G j minus the semantic mutual information I(X; Yθ). This formula indicates that the maximum entropy law in physics can be equivalently stated as the MMI law; this MMI can be expressed as semantic mutual information I(X; Yθ).

5. Experimental Results

5.1. An Example Shows the Shannon Channel’s Changes in the MMI Iterations for R(Θ) and R(G)

The author experimented with Example 1 to test two theoretical results:
  • the MMI iteration lets the Shannon channel match the semantic channel, e.g., lets P(yj|x) ∝ T(yj|x), j = 1, 2, …, as far as possible for the functions R(D), R(Θ), and R(G);
  • the MMI iteration can reduce mutual information.
Example 1.
The four truth functions and the population age distribution P(x) are shown in Figure 1 (see Appendix B for the formulas producing these lines). The task is to use the above truth functions as the constraint condition to obtain P(y|x) for R(Θ) and P(y|x) for R(G).
First, the author uses the above truth functions as the constraint condition for the R(Θ) with |s| = 1. The iteration process for P(y|x) is shown in Figure 3 (see Supplementary Materials to find how the data for Figure 3 are produced).
The convergent P(y) is {P(y1), P(y2), P(y3), P(y4)} = {0.3499, 0.0022, 0.6367, 0}. The MMI is 0.845 bits. The iterative process not only lets every P(yj|x) be approximately proportional to T(yj|x) but also makes P(y4) closer to 0. The value of P(y4) became 0 because y4 implies y3 and hence can be replaced with y3; the latter has a larger logical probability. By replacing y4 with y3, we can reduce the average codeword length.
Figure 3c indicates that H(Y) also decreases with the iteration. Since H(Y) is much less than H(X), we can also simply replace the instance x with the label y to compress data.
To maximize the entropy H(X, Y) or H(Y|X) for a given P(x), we do not need the iteration for P(y|x). The function P(y|x) in Figure 3a also results in the maximum entropies H(Y|X) and H(X, Y), where P(y|x) matches T(y|x) only once.
Next, the author uses the above truth functions as the constraint condition for the R(G) with |s| = 1. For R(G), d(xi, yj) is replaced with I(xi; θj) = log[T(yj|xi)/T(yj)] for every i and j. Figure 4 shows the iterative results and the process (see Supplementary Materials to find how the data for Figure 4 are produced).
Figure 4a intuitively displays that P(y|x) accords with the four labels’ semantic meanings. The convergent distribution P(y) for R(G) is {P(y1), P(y2), P(y3), P(y4)} = {0.3619, 0.0200, 0.6120, 0.0057}. The MMI is 0.883 bits. This P(y4) is not 0. Both P(y2) and P(y4) are larger than those for R(Θ). The reason is that a label with less logical probability can convey more semantic information and, hence, should be more frequently selected if we use the semantic information criterion. For the same s = 1 and Θ, R(G) = 0.883 bits is greater than R(Θ) = 0.845 bits.

5.2. An Example about Greyscale Image Compression Shows the Convergent P(y|x) for the R(Θ)

This example explains how we replace the distortion function with the truth function to obtain R(Θ) and the corresponding channel P(y|x) for image compression by decreasing the resolution of pixels’ grey levels. The conclusion also applies to data reduction, where too high a resolution is unnecessary.
Example 2.
The goal is to compress an 8-bit greyscale image with 256 grey levels (denoted by xi, i = 0, 1, …, 255) into a 3-bit grey image with 8 grey levels (denoted by yj, j = 1, 2, …, 8) [42]. Considering that human visual discrimination changes with grey levels (the higher the grey level is, the lower the discrimination is), we use 8 truth functions, as shown in Figure 5a, to represent 8 fuzzy classes. Appendix C shows how these lines are produced. The task is to solve the MMI R and the corresponding channel P(y|x) with s = 1.
The author obtains the convergent P(y|x), as shown in Figure 5b. Figure 5c shows how R(Θ), I(X; Yθ), and H(Y) change in the iterative process (see Supplementary Materials to find how the data for Figure 5 are produced).
Figure 5 shows how the Shannon channel matches the semantic channel for MMI. Comparing Figure 5a,b, we find that it is easy to control P(y|x) with T(y|x). If we use the distortion function d(x, yj) instead of the truth function T(yj|x), j = 1, 2, …, 8, it is not easy to design d(x, yj). It is also difficult to predict the convergent P(y|x) according to d(x, y).
In this example, it is difficult for I(X; Y) = I(Y; Xθ) to approach I(X; Yθ) because P(yj|x) cannot be strictly proportional to T(yj|x) for j = 2, 3, …, 6.
The author has used different prior distributions and semantic channels to calculate Shannon channels for MMI using the above method and achieved similar results. Every convergent TPF P(yj|x) covers an area in U identical to the area covered by T(yj|x) when s = 1, and hence P(y|x) satisfies the constraint condition defined by T(y|x).

6. Discussions

6.1. Connecting Machine Learning and Data Compression by Truth Functions

Researchers have applied the rate-distortion theory to machine learning and achieved meaningful results [43,44,45]. However, we also need to obtain the distortion function from machine learning.
In the rate-distortion theory, the distortion function d(x, y) is subjectively defined and lacks an objective standard. When the instance x and label y have different universes U and V with |U| >> |V|, the definition of the distortion between x and y is problematic. Now we can obtain labels’ truth functions from sampling distributions by machine learning (see Equations (23)–(25)) and indirectly express distortion functions using truth functions (see Equation (32)). After the source P(x) is changed, the semantic channel T(y|x) is still functional [31]. We can use T(y|x) and the new P(x) to obtain the new Shannon channel P(y|x) for MMI R(Θ). We can also use T(y|x)^|s| with |s| > 1 to reduce the average distortion or with |s| < 1 to loosen the distortion limit. In this way, we can overcome the rate-distortion function’s three disadvantages mentioned in Section 1. In addition, a group of truth functions describe a group of labels’ extensions and semantic meanings and are therefore intuitive. They have stronger descriptive power than distortion functions because we can always use a truth function to replace a distortion function by T(yj|x) = exp[−d(x, yj)]. However, we may not be able to use a distortion function to replace a truth function, such as the truth function of y3 = “Adult” in Figure 1.
Two often-used learning functions are the truth function (or the similarity function, or the membership function) and the likelihood function. We can also replace the distortion function d(x, y) with a log-likelihood function logP(x|θ), or equivalently replace d ¯ D with I(X; Yθ) ≥ G to obtain the rate-verisimilitude function R(G). The function R(G) uses the likelihood criterion or the semantic information criterion. It can reduce the underreports of events with less logical probabilities. For example, the selected probabilities P(y2) and P(y4) in Figure 1 for R(G) (see Figure 4a) are higher than P(y2) and P(y4) for R(Θ) (see Figure 3c). The function R(G) is more suitable than R(D) and R(Θ) when y is a prediction where information is more important than truth.
Like R(D), R(Θ) and R(G) also mean the lower limit of average codeword length.

6.2. Viewing Rate-Distortion Functions and Maximum Entropy Distributions from the Perspective of Semantic Information G Theory

There are many similarities between the MMI distribution in the rate-distortion function and the maximum entropy distribution. According to the analyses in Section 4.1 and Section 4.5, we can regard NEFs as truth functions and partition functions as logical probabilities. These distributions or Shannon’s channels, such as P(y|x) in Equation (2), are semantic Bayes’ distributions produced by the semantic Bayes’ formula (see Equation (19)).
The main differences between the rate-distortion theory and the maximum entropy method are:
  • For the rate-distortion function R(D), we seek MMI I(X; Y) = H(X) − H(X|Y), which is equivalent to maximizing the posterior entropy H(X|Y) of X. For R(D), we use an iterative algorithm to find the proper P(y). However, in the maximum entropy method, we maximize H(X, Y) or H(Y|X) for given P(x) without the iteration for P(y);
  • The rate-distortion function can be expressed by the semantic information G measure (see Equation (31)). In contrast, the maximum entropy is equal or proportional to the extremely maximum entropy minus semantic mutual information (see Equations (41) and (47)).

6.3. How the Experimental Results Support the Explanation for the MMI Iterations

The theoretical analyses in Section 4.2, Section 4.3 and Section 4.4 indicate that the MMI iterations for R(D), R(Θ), and R(G) impel the Shannon channel to match the semantic channel, e.g., impel P(yj|x) ∝ T(yj|x), j = 1, 2, …, as far as possible. Example 1 in Section 5.1 shows that the iterative processes for R(Θ) and R(G) not only find the proper probability distribution P(y) (which means a label yj with larger logical probability will be selected more frequently) but also modify every TPF P(yj|x) so that P(yj|x) is approximately proportional to T(yj|x). Therefore, the results in Figure 3 support the above theoretical analyses.
Figure 4 indicates that the Shannon channel P(y|x) for R(G) is a little different from that for R(Θ). For example, two labels y2 and y4, with less logical probabilities, have larger P(y2) and P(y4) than those for R(Θ). This result also accords with the theoretical analysis, which indicates that the semantic information criterion can reduce the underreports of events with less logical probabilities.
Figure 5 shows an example of data reduction. For this example, it is hard to define distortion function d(x, y), but it is easy to use truth functions to represent fuzzy classes. The results indicate that we can easily control the convergent Shannon channel P(y|x) with the semantic channel T(y|x).

6.4. Considering Other Kinds of Semantic Information

Wyner and Debowski have independently defined an information measure (see Equation (1) in [27]), which is meaningful for measuring the semantic information between two sentences or labels. Combining this with the G theory, we may develop a different measure from the above for the semantic information between two labels yj and yk:
$$I(\theta_j; \theta_k) = \sum_i P(x_i|y_j, y_k)\log\frac{T(\theta_j \cap \theta_k|x_i)}{T(\theta_j)T(\theta_k)}.$$
Unlike Equation (1) in [27], the above formula includes both statistical probabilities (P(x) and P(x|yj, yk)) and logical probabilities, and it does not require that subsets θ1, θ2, …, θn form a partition of U. We use an example to explain the reason for using P(x). If P(x) is a population age distribution, we can use two labels y1 = “Person in his 50s” and y2 = “Elder person” for a person at age x. The average life span (denoted by t) in different areas and eras might change from 50 years to 80 years. The above formula can ensure that the semantic information between y1 and y2 for t = 50 is more than that for t = 80.
Averaging I(θj; θk), we have semantic mutual information:
$$I(Y_{\theta 1}; Y_{\theta 2}) = \sum_j \sum_k \sum_i P(x_i, y_j, y_k)\log\frac{T(\theta_j \cap \theta_k|x_i)}{T(\theta_j)T(\theta_k)}.$$
This formula can ensure that wrong translations or replacements may convey negative semantic information.
The distortion function between the two labels becomes:
$$d(y_j, y_k) = \sum_i P(x_i|y_j, y_k)\log[1/T(\theta_j \cap \theta_k|x_i)].$$
We can use this function for the data compression of a sequence of labels or sentences.
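The following sketch evaluates Equations (48) and (50) for one pair of labels. The joint truth function is taken as an input here and formed by a simple product, which is only one possible conjunction and is an assumption (compound propositions' truth functions are discussed in [32]); all numerical inputs are made-up illustrations.

```python
# A minimal sketch of Equations (48) and (50); the product used for the joint
# truth function and all numerical inputs are assumptions for illustration.
import numpy as np

def label_pair_information(Px_jk, Px, T_joint, T_j, T_k):
    """Equation (48): semantic information between labels y_j and y_k, in bits."""
    Tj = np.sum(Px * T_j)                         # logical probability T(theta_j)
    Tk = np.sum(Px * T_k)                         # logical probability T(theta_k)
    return np.sum(Px_jk * np.log2(T_joint / (Tj * Tk)))

def label_pair_distortion(Px_jk, T_joint):
    """Equation (50): distortion d(y_j, y_k) between the two labels."""
    return np.sum(Px_jk * np.log2(1.0 / T_joint))

x = np.arange(0, 100.0)
Px = np.exp(-x**2 / (2 * 37**2)); Px /= Px.sum()
T_j = np.exp(-(x - 55)**2 / (2 * 5**2))           # a "person in his 50s"-like truth function
T_k = 1 / (1 + np.exp(-0.4 * (x - 60)))           # an "elder"-like truth function
T_joint = T_j * T_k                               # assumed conjunction T(theta_j AND theta_k|x)
Px_jk = T_joint * Px / np.sum(T_joint * Px)       # an illustrative P(x|y_j, y_k)
print(label_pair_information(Px_jk, Px, T_joint, T_j, T_k))
print(label_pair_distortion(Px_jk, T_joint))
```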
Considering the semantic information of designs, decisions, and control plans, the G measure and the R(G) function are also useful. In these cases, the truth function T(θj|x) becomes the DCF, P(x|y) becomes the realized distribution, information means control complexity or control amount, G means the effective control amount, and R means the control cost. To increase G = I(X; Yθ), we need to fix T(θj|x) and optimize control to improve P(x|y). The R(G) function tells us that we can improve P(y|x) with the MMI iteration and increase G by amplifying s. In [32], the author improperly uses a different information formula (Equation (24) in [32]) for the above purpose. That formula seemingly only fits cases where the control results are continuous distributions. The author will further discuss this topic in another paper.
The G measure also has its limitations, which include:
  • The G measure is only related to labels’ extensions, not to labels’ intensions. For example, “Old” means senility and closeness to death, whereas the G measure is only related to T(“Old”|x), which represents the extension of “Old”;
  • The G measure is not enough for measuring the semantic information from fuzzy reasoning according to the context, although the author has discussed how to calculate compound propositions’ truth functions and fuzzy reasoning [32].
Therefore, we need more studies to measure more kinds of semantic information.

6.5. About the Importance of Semantic Information’s Studies

Shannon does not consider semantic information. He writes [34] (p. 3):
“These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages. The system must be designed to operate for each possible selection, not just the one which will actually be chosen since this is unknown at the time of design.”
However, the author of this paper does not think that Shannon opposed researching semantic information because, in the book co-written by Shannon and Weaver, Weaver [34] (pp. 95–97) initiated the study of semantic information. If we extend Shannon’s information theory for machine learning, which frequently deals with the selection of messages, we must consider semantic information.
Section 4.2 reveals that distortion is related to truth, truth is related to semantic meaning, and hence, the rate-distortion function is related to semantic information. Equation (31) indicates that the rate-distortion function R(D) can be expressed as semantic mutual information I(Y; Xθ).
With machine learning’s developments, semantic information theory is becoming more important. From the perspective of the semantic information G theory, the EM algorithm for mixture models, where the E-step uses a formula like Equation (39), can be explained by the mutual matching of the semantic channel and the Shannon channel [40], so that two kinds of information are approximately equal. In the Restricted Boltzmann Machine (RBM), the Softmax function with the NEF and the partition function is used as the learning function [7]. Some researchers have discussed the similarity between the RBM and the EM algorithm [46]. It seems that we can also explain the RBM by the two channels’ mutual matching. We need more studies for the G theory’s applications to machine learning with neural networks.

7. Conclusions

In the introduction section, we raised three questions:
  • Why does the Bayes-like formula (see Equation (2)) with NEFs and partition functions widely exist in the rate-distortion theory, statistical mechanics, the maximum entropy method, and machine learning?
  • Can we combine machine learning (for the distortion function) and data compression (with the rate-distortion function)?
  • Can we use the rate-distortion function or similar functions for semantic compression?
Using the semantic information G measure based on the P-T probability framework, we have explained that the NEFs are truth functions, the partition functions are logical probabilities, and the Bayes-like formula is a semantic Bayes’ formula. We have also explained that the semantic mutual information formula can express MMI R (see Equation (35)); maximum entropy (in statistical mechanics or the maximum entropy method) is equal or proportional to extremely maximum entropy minus semantic mutual information (see Equations (41) and (47)).
We have shown that we can obtain truth functions from sampling distributions by machine learning (see Equations (23) and (25)). Furthermore, after setting the relationship between the truth function and the distortion function (see Section 4.2), we can replace the distortion function with log(1/truth_function). Therefore, we can combine machine learning (for the distortion function) and data compression.
Since truth functions represent labels’ semantic meanings [15], we can extend the rate-distortion function R(D) to the rate-truth function R(Θ) by replacing the average distortion $\bar{d}$ with the fuzzy entropy H(Yθ|X) (see Section 4.3). We can also extend R(D) to the rate-verisimilitude function R(G) by replacing the upper limit D of average distortion with the lower limit G of semantic mutual information (see Section 4.4). We have used two examples to show how the iteration algorithm lets the Shannon channel P(y|x) match the semantic channel T(y|x) under the semantic constraints to achieve MMI R. Example 1 reveals that the iteration algorithm can use the logical implication between labels to reduce mutual information. Example 2 indicates that it is easy to control the Shannon channel P(y|x) with the semantic channel T(y|x) for data reduction.
However, we also need to compress and recover text data (according to the context), image data, and speech data (according to the results of pattern recognition or species recognition). The above extension is useful, but it is far from sufficient, and so further studies are necessary.

Supplementary Materials

Excel files that produced Figures S3–S5 are available online at http://survivor99.com/lcg/cm/mmi-iteration.zip (accessed on 8 August 2021).

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Acknowledgments

This article’s initial version was reviewed by two reviewers; the later version was reviewed by three reviewers. Their comments resulted in many meaningful revisions. Especially, the first two reviewers’ criticisms reminded me to stress the combination of machine learning and data compression. I thank five reviewers for their revision opinions, criticisms, and support. I also thank the guest editor and the assistant editor for encouraging me to resubmit the improved manuscript.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. The Derivation of Equation (47)

From Equation (45), we derive the mutual information between X and Y:
$$I(X;Y) = \sum_j P(y_j)\sum_i P(x_i|y_j)\ln\frac{P(x_i|y_j)}{P(x_i)} = \sum_j P(y_j)\ln G_j - S/(kN).$$
Further, according to Equation (46), we have MMI:
$$I(X;Y) = \sum_j \sum_i P(x_i, y_j)\ln\frac{\exp[-e_i/(kT_j)]}{Z'} = \sum_j \sum_i P(x_i, y_j)\ln\frac{T(\theta_j|x_i)}{T(\theta_j)} = H(Y_\theta) - H(Y_\theta|X) = I(X; Y_\theta).$$
From the above two equations, we obtain:
$$S/(kN) = \sum_j P(y_j)\ln G_j - I(X; Y_\theta).$$

Appendix B

Formulas that Produce Data for Figure 1 in Section 1 and Example 1 in Section 5.1
$$T(y_3|x) = T(\text{adult}|x) = \frac{1}{1 + \exp[-1.5(x - 18)]}, \quad T(y_1|x) = T(\text{non-adult}|x) = 1 - T(y_3|x),$$
$$T(y_2|x) = T(\text{youth}|x) = 1 - \Big[1 - \exp\Big(-\frac{(x - 22)^2}{2\times 5^2}\Big)\Big]^2, \quad T(y_4|x) = T(\text{elder}|x) = \frac{1}{1 + \exp[-(x - 60)]},$$
$$P(x) = \exp\Big(-\frac{x^2}{2\times 37^2}\Big)\Big/\sum_{x\ge 0}\exp\Big(-\frac{x^2}{2\times 37^2}\Big), \quad x \ge 0.$$
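For convenience, the following Python transcription of the above formulas can be used to reproduce the curves of Figure 1; the 0–100 age range is an assumption about the plotting interval.

```python
# A direct transcription of the Appendix B formulas (the 0-100 age range is assumed).
import numpy as np

x = np.arange(0, 101.0)                                        # ages

T_adult     = 1 / (1 + np.exp(-1.5 * (x - 18)))                # T(y3|x)
T_non_adult = 1 - T_adult                                      # T(y1|x)
T_youth     = 1 - (1 - np.exp(-(x - 22)**2 / (2 * 5**2)))**2   # T(y2|x)
T_elder     = 1 / (1 + np.exp(-(x - 60)))                      # T(y4|x)

Px = np.exp(-x**2 / (2 * 37**2))
Px /= Px.sum()                                                 # P(x), x >= 0
```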

Appendix C

Formulas and Parameters for Figure 5a in Section 5.2
$$T(y_0|x) = 1 - \frac{1}{1 + \exp[-(x - 2)]}, \quad T(y_j|x) = 1 - \Big[1 - \exp\Big(-\frac{(x - \mu_j)^2}{2\sigma_j^2}\Big)\Big]^3,\ j = 2, \ldots, 6, \quad T(y_7|x) = \frac{1}{1 + \exp[-0.2(x - 200)]},$$
$$P(x) = \exp\Big(-\frac{x^2}{2\times 37^2}\Big)\Big/\sum_{x\ge 0}\exp\Big(-\frac{x^2}{2\times 37^2}\Big), \quad x \ge 0.$$
Table A1. Six pairs of parameters for six truth functions T(yj|x), j = 2, 3, …, 7.
        T(y2|x)  T(y3|x)  T(y4|x)  T(y5|x)  T(y6|x)  T(y7|x)
μj      14       30       52       80       120      170
σj²     16       24       50       80       160      240
Among the above formulas, only the formula for T(yj|x), j = 2, 3, …, 6, has not been used by others. This formula represents a group of trapezoidal curves with rounded corners. Such curves are often used as membership functions and are usually constructed with piecewise functions; with this formula, piecewise functions are not necessary.

References

1. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–429+623–656.
2. Shannon, C.E. Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv. Rec. 1959, 4, 142–163.
3. Berger, T. Rate Distortion Theory; Prentice-Hall: Englewood Cliffs, NJ, USA, 1971.
4. Jaynes, E.T. Information Theory and Statistical Mechanics. Phys. Rev. 1957, 106, 620.
5. Jaynes, E.T. Information Theory and Statistical Mechanics II. Phys. Rev. 1957, 108, 171.
6. Smolensky, P. Chapter 6: Information Processing in Dynamical Systems: Foundations of Harmony Theory. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations; Rumelhart, D.E., McClelland, J.L., Eds.; MIT Press: Cambridge, MA, USA, 1986; pp. 194–281.
7. Hinton, G.E. A Practical Guide to Training Restricted Boltzmann Machines. In Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science; Montavon, G., Orr, G.B., Müller, K.R., Eds.; Springer: Berlin, Germany, 2012; Volume 7700, pp. 599–619.
8. Salakhutdinov, R.; Hinton, G.E. Replicated softmax: An undirected topic model. Neural Inf. Process. Syst. 2009, 22, 1607–1614.
9. Goodfellow, I.; Bengio, Y.; Courville, A. Softmax Units for Multinoulli Output Distributions. In Deep Learning; MIT Press: Cambridge, MA, USA, 2016; pp. 180–184.
10. Zadeh, L.A. Fuzzy Sets. Inf. Control 1965, 8, 338–353.
11. Harremoës, P. Maximum Entropy on Compact Groups. Entropy 2009, 11, 222–237.
12. Berger, T.; Gibson, J.D. Lossy Source Coding. IEEE Trans. Inf. Theory 1998, 44, 2693–2723.
13. Gibson, J. Special Issue on Rate Distortion Theory and Information Theory. Entropy 2018, 20, 825.
14. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: New York, NY, USA, 2006.
15. Davidson, D. Truth and meaning. Synthese 1967, 17, 304–323.
16. Willems, F.M.J.; Kalker, T. Semantic compaction, transmission, and compression codes. In Proceedings of the International Symposium on Information Theory (ISIT 2005), Adelaide, Australia, 4–9 September 2005; pp. 214–218.
17. Babu, S.; Garofalakis, M.; Rastogi, R. SPARTAN: A Model-Based Semantic Compression System for Massive Data Tables. ACM SIGMOD Rec. 2001, 30, 283–294.
18. Ceglarek, D.; Haniewicz, K.; Rutkowski, W. Semantic Compression for Specialised Information Retrieval Systems. Adv. Intell. Inf. Database Syst. 2010, 283, 111–121.
19. Blau, Y.; Michaeli, T. The perception-distortion tradeoff. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18 June 2018; pp. 6228–6237.
20. Bardera, A.; Bramon, R.; Ruiz, M.; Boada, I. Rate-Distortion Theory for Clustering in the Perceptual Space. Entropy 2017, 19, 438.
21. Carnap, R.; Bar-Hillel, Y. An Outline of a Theory of Semantic Information. Available online: http://dspace.mit.edu/bitstream/handle/1721.1/4821/RLE-TR-247-03150899.pdf;sequence=1 (accessed on 1 July 2021).
22. Klir, G. Generalized information theory. Fuzzy Sets Syst. 1991, 40, 127–142.
23. Floridi, L. Outline of a theory of strongly semantic information. Minds Mach. 2004, 14, 197–221.
24. Zhong, Y.X. A theory of semantic information. China Commun. 2017, 14, 1–17.
25. D’Alfonso, S. On Quantifying Semantic Information. Information 2011, 2, 61–101.
26. Bhandari, D.; Pal, N.R. Some new information measures of fuzzy sets. Inf. Sci. 1993, 67, 209–228.
27. Dębowski, Ł. Approximating Information Measures for Fields. Entropy 2020, 22, 79.
28. Lu, C. A Generalized Information Theory; China Science and Technology University Press: Hefei, China, 1993; ISBN 7-312-00501-2. (In Chinese)
29. Lu, C. Meanings of generalized entropy and generalized mutual information for coding. J. China Inst. Commun. 1994, 15, 37–44. (In Chinese)
30. Lu, C. A generalization of Shannon’s information theory. Int. J. Gen. Syst. 1999, 28, 453–490.
31. Lu, C. Semantic information G theory and logical Bayesian inference for machine learning. Information 2019, 10, 261.
32. Lu, C. The P–T probability framework for semantic communication, falsification, confirmation, and Bayesian reasoning. Philosophies 2020, 5, 25.
33. Lu, C. Channels’ Confirmation and Predictions’ Confirmation: From the Medical Test to the Raven Paradox. Entropy 2020, 22, 384.
34. Shannon, C.E.; Weaver, W. The Mathematical Theory of Communication; The University of Illinois Press: Urbana, IL, USA, 1963.
35. Zadeh, L.A. Probability measures of fuzzy events. J. Math. Anal. Appl. 1968, 23, 421–427.
36. Cumulative Distribution Function. Available online: https://en.wikipedia.org/wiki/Cumulative_distribution_function (accessed on 10 April 2021).
37. Popper, K. Conjectures and Refutations; Routledge: London, UK; New York, NY, USA, 2002.
38. Wittgenstein, L. Philosophical Investigations; Basil Blackwell Ltd.: Oxford, UK, 1958.
39. Sow, D.M.; Eleftheriadis, A. Complexity distortion theory. IEEE Trans. Inf. Theory 2003, 49, 604–608.
40. Lu, C. Understanding and Accelerating EM Algorithm’s Convergence by Fair Competition Principle and Rate-Verisimilitude Function. arXiv 2021, arXiv:2104.12592.
41. Boltzmann Distribution. Available online: https://en.wikipedia.org/wiki/Boltzmann_distribution (accessed on 10 April 2021).
42. Binary Images. Available online: https://www.cis.rit.edu/people/faculty/pelz/courses/SIMG203/res.pdf (accessed on 25 June 2021).
43. Kutyniok, G. A Rate-Distortion Framework for Explaining Deep Learning. Available online: https://maths-of-data.github.io/Talk_Edinburgh_2020.pdf (accessed on 30 June 2021).
44. Nokleby, M.; Beirami, A.; Calderbank, R. A rate-distortion framework for supervised learning. In Proceedings of the 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), Boston, MA, USA, 17–20 September 2015; pp. 1–6.
45. John, S.; Gadde, A.; Adsumilli, B. Rate Distortion Optimization Over Large Scale Video Corpus with Machine Learning. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 1286–1290.
46. Song, J.; Yuan, C. Learning Boltzmann Machine with EM-like Method. arXiv 2016, arXiv:1609.01840.
Figure 1. Four labels’ (fuzzy) truth functions about people’s ages. T(yj|x), j = 1, 2, 3, 4, denotes the truth function of the proposition function yj(x), used as the constraint condition.
Figure 2. The semantic information conveyed by yj about xi.
Figure 3. P(y|x) and I(X; Y) change in the iterative process for R(Θ). (a) P(y|x) after the first iteration (this P(y|x) also results in the maximum entropies H(Y|X) and H(X, Y)); (b) P(y|x) after the second iteration; (c) P(y|x) after eight iterations; (d) I(X; Y), I(Y; Xθ), and H(Y) change in the iterative process.
Figure 4. The iterative results for the R(G) function. (a) The convergent Shannon channel P(y|x); (b) I(X; Yθ) and I(X; Y) = I(Y; Xθ) change in the iterative process.
Figure 5. The results of Example 2. (a) Eight truth functions, i.e., the semantic channel T(y|x); (b) the convergent Shannon channel P(y|x); (c) I(X; Yθ), I(X; Y), and H(Y) change in the iterative process.