Article

Using the Semantic Information G Measure to Explain and Extend Rate-Distortion Functions and Maximum Entropy Distributions

1 School of Computer Engineering and Applied Mathematics, Changsha University, Changsha 410000, China
2 Institute of Intelligence Engineering and Mathematics, Liaoning Technical University, Fuxin 123000, China
Entropy 2021, 23(8), 1050; https://doi.org/10.3390/e23081050
Submission received: 28 June 2021 / Revised: 8 August 2021 / Accepted: 13 August 2021 / Published: 15 August 2021
(This article belongs to the Special Issue Measures of Information)

Abstract

In the rate-distortion function and the Maximum Entropy (ME) method, Minimum Mutual Information (MMI) distributions and ME distributions are expressed by Bayes-like formulas, including Negative Exponential Functions (NEFs) and partition functions. Why do these non-probability functions appear in Bayes-like formulas? On the other hand, the rate-distortion function has three disadvantages: (1) the distortion function is subjectively defined; (2) the distortion function between instances and labels is often difficult to define; (3) it cannot be used for data compression according to the labels’ semantic meanings. The author previously proposed the semantic information G measure, which involves both statistical probability and logical probability. We can now explain NEFs as truth functions, partition functions as logical probabilities, Bayes-like formulas as semantic Bayes’ formulas, MMI as Semantic Mutual Information (SMI), and ME as extremely maximum entropy minus SMI. To overcome the above disadvantages, this paper sets up the relationship between truth functions and distortion functions, obtains truth functions from samples by machine learning, and constructs constraint conditions with truth functions to extend rate-distortion functions. Two examples are used to help readers understand the MMI iteration and to support the theoretical results. Using truth functions and the semantic information G measure, we can combine machine learning and data compression, including semantic compression. Further studies are needed to explore general data compression and recovery according to semantic meaning.

1. Introduction

Bayes’ formula is used for probability predictions. Using Bayes’ formula, from a joint (probability) distribution P(x, y), or distributions P(x|y) and P(y) (where x and y are variables), we can obtain the posterior distribution of y:
$$P(y|x) = \frac{P(x,y)}{\sum_y P(x,y)} = \frac{P(y)P(x|y)}{\sum_y P(y)P(x|y)} = \frac{P(y)P(x|y)}{P(x)}.$$
Similar expressions with Negative Exponential Functions (NEFs) and partition functions often appear in the rate-distortion theory [1,2,3], statistical mechanics, the maximum entropy method [4,5], and machine learning (see the Restricted Boltzmann Machine (RBM) [6,7] and the Softmax function [8,9]). For example, in the rate-distortion theory, the Minimum Mutual Information (MMI) distribution is:
$$P(y|x) = \frac{P(y)\exp[s\,d(x,y)]}{\sum_y P(y)\exp[s\,d(x,y)]},$$
where s is a negative parameter, and d(x, y) is the distortion function.
However, an NEF, such as exp[sd(x, y)], is not a (statistical) probability function because its sums, ∑x exp[sd(x, y)] and ∑y exp[sd(x, y)], are not 1. Its main feature is that its maximum value is exp(0) = 1. An NEF is more like a membership function [10], similarity function, or Distribution Constraint Function (DCF), which softly constrains a probability distribution. Why can an NEF be placed into a Bayes-like formula as if it were a probability function? Can we find a simple explanation for why these NEFs and partition functions appear in such different areas? Although the relationship between the rate-distortion function and the maximum entropy method has been studied [11], we still need a more general probability theory or probability framework to explain the above phenomenon.
On the other hand, although the rate-distortion theory has achieved great successes [2,3,12,13,14], it still has the following disadvantages:
  • The distortion function d(x, y) is subjectively defined and lacks an objective standard;
  • It is hard to define the distortion function using distances when we use labels to replace instances in machine learning, where the number of possible labels should be much smaller than the number of possible instances. For example, we need to use “Light rain”, “Moderate rain”, “Heavy rain”, “Light to moderate rain”, etc., to replace daily precipitations in millimeters, or use “Child”, “Youth”, “Adult”, “Elder”, etc., to replace people’s ages. In these cases, it is not easy to construct distortion functions;
  • We cannot apply the rate-distortion function to semantic compression, i.e., data compression according to labels’ semantic meanings.
The first two disadvantages remind us that we need to obtain allowable distortion functions from samples or sampling distributions by machine learning.
Consider an example of semantic compression related to ages. For instance, x denotes an age (instance) and yj denotes a label: y1 = “Non-adult”, y2 = “Youth”, y3 = “Adult”, and y4 = “Elder”. All x values that make yj true form a fuzzy set. The truth functions of these labels are also the membership functions of the fuzzy sets (see Figure 1). According to Davidson’s truth-conditional semantics [15], truth function T(yj|x) ascertains the semantic meaning of yj. For a given P(x) and the constraint, we need to find the Shannon channel P(y|x) that minimizes the mutual information between y and x. The Minimum Mutual Information (MMI) should be the lower limit of the average codeword length for coding x to y. The constraint condition states that labels’ selections should accord with the rules of the languages used, expressed by the truth functions. In this case, the rate-distortion function cannot work well because it is not easy to define a distortion function that is compatible with the labels’ semantic meanings.
There have been meaningful studies on semantic compression [16,17,18]. However, these studies either do not adopt information-theoretic methods related to the rate-distortion function [18] or do not use a semantic information measure. Data compression or clustering related to perception has been studied in [19,20]; however, discrimination functions (like the truth function) and the sensory information measure have not been adopted. For semantic compression, we need a proper information measure to measure semantic information and sensory information. We also want a function, like the rate-distortion function, related to labels’ semantic meanings.
To measure semantic information, researchers have proposed many semantic information measures [21,22,23,24,25] or information measures related to semantics [26,27]. However, it is not easy to use them for machine learning or semantic compression. For a similar purpose, the author proposed the semantic information G theory, or simply the G theory, in the 1990s [28,29,30]. The letter “G” means the generalization of Shannon’s information theory. The semantic information measure, i.e., the G measure, is defined with log(truth_function/logical_probability) or its average. The truth function and the logical probability are similar to the NEF and the partition function, respectively. The truth function can be expressed not only by the NEF but also by the Logistic function and other functions between 0 and 1. The semantic information G measure can be used to measure semantic information conveyed not only by natural languages but also by sensory organs and measuring instruments, such as thermometers, scales, and GPS devices [31]. For sensory information, truth functions become discrimination functions or confusion probability functions [30].
The G measure measures labels’ semantic information only according to their extensions ascertained by truth functions, without considering their intensions. Therefore, we may call this kind of semantic information formal semantic information. For simplicity, this paper mainly considers the (formal) semantic information between label y and instance x. The semantic information between two labels or sentences is briefly discussed in Section 6.4.
The G theory uses the P-T probability framework [32], which consists of both statistical probability (denoted by P) and logical probability (by T). The P-T probability framework and the G theory have been applied to several areas, such as data compression related to visual discrimination [30], machine learning [31], Bayesian confirmation [33], and the philosophy of science [32].
To overcome the rate-distortion function’s disadvantages, this paper sets up a transformation relation between the truth function and the distortion function and uses the truth function to replace the distortion function in the constraint condition to obtain the rate-truth function. Since a truth function can also be explained as a membership function, similarity function, confusion probability function, or DCF [32], it can be used as a learning function. It is often expressed as an NEF or Logistic function. In this way, we can overcome the rate-distortion function’s three disadvantages because:
  • the truth function is a learning function; it can be obtained from a sample or sampling distribution [31] and hence is not subjectively defined;
  • using the transformation relation, we can indirectly express the distortion function between any instance x and any label y by the truth function that may come from machine learning;
  • truth functions indicate labels’ semantic meanings and, hence, can be used as the constraint condition for semantic compression.
Combining the author’s previous studies [31,32], we can explain that:
  • the NEF and the partition functions are the truth function and the logical probability, respectively;
  • the formula, such as Equation (2), for the distribution of Minimum Mutual Information (MMI) or maximum entropy is the semantic Bayes’ formula;
  • MMI R(D) can be expressed by the semantic mutual information formula;
  • maximum entropy is equal to extremely maximum entropy minus semantic mutual information.
The new explanations do not contradict the existing explanations but complement them. We can use the fuzzy truth criterion to replace the distortion criterion so that the rate-distortion function becomes the rate-truth function R(Θ), where Θ is a group of fuzzy sets or (fuzzy) truth functions. We can also use the semantic information criterion, which is compatible with the likelihood criterion, to replace the distortion criterion so that the rate-distortion function becomes the rate-verisimilitude function R(G), where G is the lower limit of semantic mutual information. Both functions can be used for semantic compression. The rate-verisimilitude function has been introduced before [30,31], whereas the rate-truth function is first provided in this paper.
This paper mainly aims to:
  • help readers understand rate-distortion functions and maximum entropy distributions from the new perspective;
  • combine machine learning (for the distortion function) and data compression;
  • show that the rate-distortion function can be extended to the rate-truth function and the rate-verisimilitude function for communication data’s semantic compression.
This paper provides an example to show how the Shannon channel matches the semantic channel to achieve MMI R(Θ) and R(G) for a given source P(x) and a group of truth functions. It provides another example to show that the rate-truth function can be used for data reduction (e.g., compression by decreasing data resolution). The results support the theoretical analyses.
The new explanations should more cohesively combine classical information theory, semantic information G theory, maximum entropy theory, the likelihood method, and fuzzy set theory. Moreover, the rate-distortion function’s extensions should be practical for semantic compression and helpful for explaining machine learning with NEFs and Softmax functions. In turn, the new explanations and extensions also support the P-T probability framework and the semantic information G theory.

2. Background

2.1. Shannon’s Entropies and Mutual Information

Definition 1.
  • Variable x denotes an instance; X denotes a discrete random variable taking a value x ∈ U = {x1, x2, …, xm}.
  • Variable y denotes a hypothesis or label; Y denotes a discrete random variable taking a value y ∈ V = {y1, y2, …, yn}.
Shannon calls P(X) the source, P(Y) the destination, P(Y|X) the channel, and P(yj|x) (with a certain yj and variable x) a Transition Probability Function (TPF) [34] (p. 18). A Shannon channel is a transition probability matrix or a group of TPFs:
P(y|x): P(yj|x), j = 1, 2, …, n.
Shannon’s mutual information is:
$$I(X;Y) = \sum_j \sum_i P(x_i, y_j)\log\frac{P(x_i|y_j)}{P(x_i)} = H(X) - H(X|Y) = \sum_j \sum_i P(x_i, y_j)\log\frac{P(y_j|x_i)}{P(y_j)} = H(Y) - H(Y|X),$$
where H(X) and H(Y) are Shannon’s entropies of X and Y; H(X|Y) and H(Y|X) are Shannon’s conditional entropies of X and Y.
When Y = yj is given, I(X; Y) becomes the Kullback–Leibler (KL) divergence:
$$I(X; y_j) = \sum_i P(x_i|y_j)\log\frac{P(x_i|y_j)}{P(x_i)}.$$
It is greater than or equal to 0, and it equals 0 when P(x|yj) = P(x).

2.2. Rate-Distortion Function R(D)

Shannon proposed the information rate-distortion function [1,2]. Since the rate-distortion function for an i.i.d. source P(X) and bounded function d(x, y) is equal to the associated information rate-distortion function, Cover and Thomas, in [14] (p. 307), do not distinguish the two functions in most cases. They call both functions the rate-distortion function and use R(D) to denote them. We follow their example. The following is the definition of the (information) rate-distortion function.
Let the distortion function be d(x, y) or dij = d(xi, yj), i = 1, 2, …; j = 1, 2, …; let $\bar{d}$ be the average of d(x, y); and let D be the upper limit of $\bar{d}$. An often-used distortion function is d(x, y) = (x − y)². This function fits cases where x and y have the same universe (e.g., U = V), and distortion only depends on the distance between x and y.
The MMI for the given P(X) and D is defined as:
$$R(D) = \min_{P(y|x):\ \bar{d} \le D} I(X;Y).$$
We can obtain the parameter solution of the rate-distortion function by the variational method [2,3]. The constraint conditions are:
$$D = \sum_j \sum_i P(x_i) P(y_j|x_i) d_{ij},$$
$$\sum_j P(y_j|x_i) = 1, \quad i = 1, 2, \ldots, m,$$
$$\sum_j P(y_j) = 1.$$
The Lagrange function is therefore:
$$F = I(X;Y) - sD - \sum_i \mu_i \sum_j P(y_j|x_i) - \alpha \sum_j P(y_j).$$
Since P(y|x) and P(y) are interdependent, we need to fix one to optimize the other. To optimize P(y|x), we fix P(y) and set $\partial F/\partial P(y_j|x_i) = 0$. Then we derive the optimized TPFs or channel:
$$P(y_j|x_i) = P(y_j)\lambda_i \exp(s d_{ij}), \quad i = 1, 2, \ldots;\ j = 1, 2, \ldots, \qquad \lambda_i = 1\Big/\sum_k P(y_k)\exp(s d_{ik}),$$
where exp( ) is the inverse function of log( ), and λi is defined by λi ≡ exp(μi/P(xi)). To optimize P(y), we fix P(yj|xi) in F and set $\partial F/\partial P(y_j) = 0$. Hence, we derive α = 1 and the optimized P(y):
$$P(y_j) = \sum_i P(x_i) P(y_j|x_i).$$
Since P(y|x) and P(y) are interdependent, we first suppose P(y1) = P(y2) = … = 1/n and then repeat Equations (10) and (11) until P(y) is unchanged. We call this iteration the MMI iteration.
It is worth noting that our purpose is to find proper TPFs P(yj|x), j = 1, 2, …, but it is difficult to find them directly because of the constraint conditions in Equations (7) and (8). Therefore, we can only find the proper posterior distributions P(y|xi) of y, i = 1, 2, …, by the MMI iteration to obtain the TPFs indirectly.
The rate-distortion function R(D) with parameter s is [3] (p. 32):
$$D(s) = \sum_i \sum_j d_{ij} P(x_i) P(y_j)\exp(s d_{ij})/Z_i, \quad R(s) = sD(s) - \sum_i P(x_i)\log Z_i, \quad Z_i = 1/\lambda_i = \sum_k P(y_k)\exp(s d_{ik}),$$
where Zi is the partition function.
The parameter s = dR/dD is negative, and hence exp(sdij) is a negative exponential function. Thus, a larger |s| results in a narrower exp(sdij), a larger R, and a smaller D.
Shannon has proved that R(D) is the lower limit of the average codeword length for i.i.d. source P(X) with an average distortion limit, D. The rate-distortion theory is the basic theory of data compression for digital communication.
Since I(X; Y) = H(X) − H(X|Y) and P(X) is unchanged, minimizing R = I(X; Y) is equivalent to maximizing H(X|Y). Therefore, the MMI distribution P(y|x) also maximizes posterior entropy H(X|Y).
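To make the MMI iteration concrete, the following Python sketch alternates Equations (10) and (11) for a given source and distortion matrix; the squared-error distortion and the uniform toy source in it are made-up illustrations, not data used elsewhere in this paper.

```python
# A minimal sketch of the MMI iteration (Equations (10) and (11)) for R(D),
# assuming a made-up squared-error distortion and a uniform toy source.
import numpy as np

def mmi_iteration(Px, d, s, n_iter=200, tol=1e-10):
    """Alternate Equations (10) and (11) to find the MMI channel P(y|x).

    Px: shape (m,) source P(x); d: shape (m, n) distortion d(x_i, y_j);
    s: the negative parameter s = dR/dD.
    """
    m, n = d.shape
    Py = np.full(n, 1.0 / n)                    # start from a uniform P(y)
    for _ in range(n_iter):
        w = Py * np.exp(s * d)                  # P(y_j) exp(s d_ij)
        Pyx = w / w.sum(axis=1, keepdims=True)  # Equation (10): P(y_j|x_i)
        Py_new = Px @ Pyx                       # Equation (11): new P(y_j)
        if np.max(np.abs(Py_new - Py)) < tol:
            Py = Py_new
            break
        Py = Py_new
    Pxy = Px[:, None] * Pyx                     # joint P(x_i, y_j)
    R = np.sum(Pxy * np.log2(Pyx / Py))         # I(X; Y) in bits
    D = np.sum(Pxy * d)                         # average distortion
    return Pyx, Py, R, D

# Hypothetical example: 8 instances, 3 labels, d(x, y) = (x - y)^2.
x = np.arange(8.0); y = np.array([1.0, 3.5, 6.0])
Pyx, Py, R, D = mmi_iteration(np.full(8, 1 / 8), (x[:, None] - y[None, :])**2, s=-0.5)
print(R, D)
```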

2.3. The Maximum Entropy Method

Jaynes [4,5] first expounded the maximum entropy method and argued that entropy in statistical mechanics should simply be viewed as a particular application of entropy in information theory.
Suppose that we need to maximize the joint entropy H(X, Y) for a given source P(x); maximizing H(X, Y) is equivalent to maximizing H(Y|X):
$$H(Y|X) = -\sum_i \sum_j P(x_i) P(y_j|x_i)\log P(y_j|x_i).$$
Moreover, supposing there are feature functions fk(x, y), k = 1, 2, …, the constraint conditions are:
$$\sum_i \sum_j P(x_i, y_j) f_k(x_i, y_j) = F_k, \quad k = 1, 2, \ldots,$$
$$\sum_j P(y_j|x_i) = 1, \quad i = 1, 2, \ldots$$
The Lagrange function is therefore:
$$F = H(Y|X) - \sum_k \alpha_k \sum_i \sum_j P(x_i) P(y_j|x_i) f_k(x_i, y_j) - \sum_i \mu_i \sum_j P(y_j|x_i).$$
By setting $\partial F/\partial P(y|x_i) = 0$, we derive the optimized channel P(y|x):
$$P(y|x_i) = \exp\Big[\sum_k \alpha_k f_k(x_i, y)\Big]\Big/Z_i, \quad i = 1, 2, \ldots, \qquad Z_i = \sum_j \exp\Big[\sum_k \alpha_k f_k(x_i, y_j)\Big].$$
This P(y|x) can maximize H(X, Y).
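As a minimal numerical illustration of Equation (17), the following sketch computes the maximum-entropy channel from given feature functions; the single feature and the Lagrange multiplier value are made-up assumptions, since finding the multipliers that satisfy Equation (14) is a separate step not shown here.

```python
# A minimal sketch of the maximum-entropy channel in Equation (17), assuming
# the Lagrange multipliers alpha_k are already known (here they are made up).
import numpy as np

def max_entropy_channel(f, alpha):
    """f: shape (K, m, n) feature functions f_k(x_i, y_j); alpha: shape (K,)."""
    logits = np.tensordot(alpha, f, axes=1)     # sum_k alpha_k f_k(x_i, y_j)
    w = np.exp(logits)                          # the NEF in Equation (17)
    Z = w.sum(axis=1, keepdims=True)            # partition function Z_i
    return w / Z                                # P(y_j|x_i)

# Hypothetical example with one feature f_1(x_i, y_j) = -(x_i - y_j)^2.
x = np.arange(5.0); y = np.arange(3.0)
f1 = -(x[:, None] - y[None, :])**2
Pyx = max_entropy_channel(f1[None, :, :], np.array([0.8]))
print(Pyx.sum(axis=1))                          # each row sums to 1
```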

3. The Author’s Related Work

3.1. The P-T Probability Framework

The semantic information G theory is based on the P-T probability framework [29,30]. This framework includes two types of probabilities: the statistical probability denoted by P and the logical probability by T.
Definition 2.
(for the P-T probability framework):
  • The yj is a label or a hypothesis; yj(xi) is a proposition. The θj is a fuzzy subset of universe U, whose elements make yj true. We have yj(x) ≡ “x ∈ θj” ≡ “x belongs to θj” (“≡” means they are logically equivalent according to the definition). The θj may also be a model or a group of model parameters.
  • A probability that is defined with “=”, such as P(yj) ≡ P(Y = yj), is a statistical probability. A probability that is defined with “∈”, such as P(X ∈ θj), is a logical probability. To distinguish P(Y = yj) and P(X ∈ θj), we define T(yj) ≡ T(θj) ≡ P(X ∈ θj) as the logical probability of yj.
  • T(yj|x) ≡ T(θj|x) ≡ P(X ∈ θj|X = x) is the truth function of yj and the membership function of θj. It changes between 0 and 1, and the maximum of T(y|x) is 1.
A semantic channel consists of a group of truth functions:
T(y|x): T(θj|x), j = 1, 2, …, n.
According to the above definition, we have the logical probability:
$$T(y_j) \equiv T(\theta_j) \equiv P(X \in \theta_j) = \sum_i P(x_i) T(\theta_j|x_i).$$
Zadeh calls this probability the fuzzy event’s probability [35]. If θj is a crisp set, T(θj) becomes a cumulative probability or the difference between two cumulative probabilities [36].
Generally, T(y1) + T(y2) + … + T(yn) > 1. For example, the sum of the logical probabilities of the four labels in Figure 1 is greater than 1 since T(y1) + T(y3) = 1 already. Detailed discussions of the distinctions and relations between statistical probability and logical probability can be found in [32].
We can put T(θj|x) and P(x) into the Bayes’ formula to obtain a likelihood function:
$$P(x|\theta_j) = \frac{T(\theta_j|x)P(x)}{T(\theta_j)}, \quad T(\theta_j) = \sum_i T(\theta_j|x_i)P(x_i).$$
P(x|θj) is called the semantic Bayes’ prediction. It is often written as P(x|yj, θ) in popular methods. We call the above formula the semantic Bayes’ formula.
Since the maximum of T(y|x) is 1, from P(x) and P(x|θj), we can obtain:
$$T(\theta_j|x) = \frac{T(\theta_j)P(x|\theta_j)}{P(x)}, \quad T(\theta_j) = 1\big/\max[P(x|\theta)/P(x)],$$
where max[P(x|θ)/P(x)] means the maximum of P(x|θ)/P(x) over different x and y. In the author’s earlier articles [31], T(θj) = 1/max[P(x|θj)/P(x)]. The change in Equation (20) ensures that truth functions are symmetrical, e.g., T(x|y) = T(y|x), as distortion functions are. This change also allows comparing two truth functions T(yj|x) and T(yk|x) for classification according to the correlation between x and y (see Equation (23)). Since P(x|θj) in Equation (19) is unchanged when T(θj|x) is replaced with cT(θj|x) (where c is a positive constant), this change does not influence the other uses of T(θj|x).
Equations (19) and (20) form the third Bayes’ theorem [31], which can be used to convert the likelihood function and the truth function into each other.
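The following sketch illustrates this mutual conversion numerically; the Gaussian-like truth function and the prior below are made-up examples of the kind of curves shown in Figure 1, not the exact data of this paper.

```python
# A minimal sketch of the semantic Bayes' formula (Equation (19)) and its
# inverse (Equation (20)); the prior and the truth function are made up.
import numpy as np

x = np.arange(0, 100.0)                              # ages as instances
Px = np.exp(-x**2 / (2 * 37**2)); Px /= Px.sum()     # a prior P(x)

T_youth = np.exp(-(x - 22.0)**2 / (2 * 5.0**2))      # a truth function T(theta|x)

# Equation (19): P(x|theta) = T(theta|x) P(x) / T(theta)
T_logical = np.sum(T_youth * Px)                     # logical probability T(theta)
Px_theta = T_youth * Px / T_logical

# Equation (20): recover the truth function from P(x|theta) and P(x)
ratio = Px_theta / Px
T_back = ratio / ratio.max()
print(np.allclose(T_back, T_youth))                  # True: the conversion is reversible
```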

3.2. The Semantic Information G Measure

The author [28] defines the (amount of) semantic information conveyed by yj in relation to xi with log-normalized-likelihood:
$$I(x_i; \theta_j) = \log\frac{P(x_i|\theta_j)}{P(x_i)} = \log\frac{T(\theta_j|x_i)}{T(\theta_j)}.$$
The value I(xi; θj), or its average, is the semantic information G measure or, simply, the G measure. If T(θj|x) is always 1, the G measure becomes Carnap and Bar-Hillel’s semantic information measure [21].
The above formula is illustrated in Figure 2. Figure 2 indicates that the less the logical probability is (e.g., the lower the horizontal line is), the more information there is available; the larger the deviation is, the less information there is available; a wrong hypothesis conveys negative information. These conclusions accord with Popper’s thoughts [37] (p. 294). For this reason, I(xi; θj) is also explained as the verisimilitude between yj and xi [32].
We can also use the above formula to measure sensory information, for which T(θj|x) is the confusion probability function of xj with x or the discrimination function of xj [31].
By averaging I(xi; θj), we obtain generalized KL information:
$$I(X; \theta_j) = \sum_i P(x_i|y_j)\log\frac{P(x_i|\theta_j)}{P(x_i)} = \sum_i P(x_i|y_j)\log\frac{T(\theta_j|x_i)}{T(\theta_j)},$$
where P(xi|yj), i = 1, 2, …, is the sampling distribution, which may be unsmooth or discontinuous. It is easy to prove I(X; θj) ≤ I(X; yj) [31].
When the sample is enormous, so that P(x|yj) is smooth, we may let P(x|θj) = P(x|yj) or T(θj|x) ∝ P(yj|x) to obtain the optimized truth function:
$$T^*(\theta_j|x) = \frac{P^*(x|\theta_j)}{P(x)}\Big/\max\frac{P^*(x|\theta)}{P(x)} = \frac{P(x|y_j)}{P(x)}\Big/\max\frac{P(x|y)}{P(x)} = \frac{P(x, y_j)}{P(x)P(y_j)}\Big/\max\frac{P(x, y)}{P(x)P(y)}.$$
According to this equation, T*(θj|xi) = T*(θxi|yj) or T*(yj|xi) = T*(xi|yj), where θxi is a fuzzy subset of V. Furthermore, we have:
T*(θj|x) = cP(yj|x)
where c is a constant. The above formula means the optimized truth function T*(θj|x) is proportional to TPF P(yj|x). This formula accords with Wittgenstein’s thoughts: meaning lies in uses [38] (p. 80).
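The following sketch implements Equation (23) for a smooth sampling distribution; the joint distribution used here is a made-up illustration.

```python
# A minimal sketch of Equation (23): optimized truth functions from a smooth
# sampling distribution P(x, y); the joint distribution below is made up.
import numpy as np

def optimized_truth_functions(Pxy):
    """Pxy: shape (m, n) joint sampling distribution P(x_i, y_j)."""
    Px = Pxy.sum(axis=1, keepdims=True)
    Py = Pxy.sum(axis=0, keepdims=True)
    lift = Pxy / (Px * Py)                  # P(x, y) / (P(x) P(y))
    return lift / lift.max()                # normalize so the global maximum is 1

# Hypothetical joint distribution over 6 instances and 2 labels.
Pxy = np.array([[0.20, 0.01], [0.15, 0.02], [0.10, 0.05],
                [0.05, 0.10], [0.02, 0.15], [0.01, 0.14]])
T = optimized_truth_functions(Pxy)
print(T.max())                              # 1.0, as required of a truth function
```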
If P(x|yj) is unsmooth, we may achieve a smooth T*(θj|x) with parameters by:
$$T^*(\theta_j|x) = \arg\max_{T(\theta_j|x)} \sum_i P(x_i|y_j)\log\frac{T(\theta_j|x_i)}{T(\theta_j)}.$$
By averaging I(X; θj) for different y, we obtain semantic mutual information:
$$I(X; Y_\theta) = \sum_j P(y_j)\sum_i P(x_i|y_j)\log\frac{P(x_i|\theta_j)}{P(x_i)} = \sum_i \sum_j P(x_i)P(y_j|x_i)\log\frac{T(\theta_j|x_i)}{T(\theta_j)} = H(Y_\theta) - H(Y_\theta|X),$$
where:
$$H(Y_\theta) = -\sum_j P(y_j)\log T(\theta_j),$$
$$H(Y_\theta|X) = -\sum_j \sum_i P(x_i, y_j)\log T(\theta_j|x_i).$$
H(Yθ) is a cross-entropy. Since ∑j T(θj) ≥ 1, we also call H(Yθ) a generalized entropy or a semantic entropy. H(Yθ|X) is called a fuzzy entropy.
When we fix the Shannon channel P(y|x) and let P(x|θj) = P(x|yj) or T(θj|x) ∝ P(yj|x) for every j (Matching I), I(X; Yθ) reaches its maximum I(X; Y). If we use a group of truth functions or a semantic channel T(y|x) as the constraint function to seek MMI, we need to let P(x|yj) = P(x|θj) or P(yj|x) ∝ T(θj|x) as far as possible for every j (Matching II). Section 4.3 and Section 4.4 further discuss Matching II.
Letting T(θj|x) = exp[−(x − μj)²/(2σj²)], we have:
$$I(X; Y_\theta) = H(Y_\theta) - H(Y_\theta|X) = -\sum_j P(y_j)\log T(\theta_j) - \sum_j \sum_i P(x_i, y_j)(x_i - \mu_j)^2/(2\sigma_j^2).$$
It is easy to find that the above mutual information is like the Regularized Least Square (RLS). H(Yθ) is like the regularization term, and the next one is the relative squared error term. Therefore, we can treat the maximum semantic mutual information criterion as a particular RLS criterion.
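As a small numerical check of Equation (26), the following sketch computes I(X; Yθ) from a source, a Shannon channel, and a semantic channel; the three arrays are made-up toy inputs.

```python
# A minimal sketch of the semantic mutual information in Equation (26);
# the source, Shannon channel, and semantic channel below are made up.
import numpy as np

def semantic_mutual_information(Px, Pyx, T):
    """I(X; Y_theta) = sum_ij P(x_i, y_j) log2[T(theta_j|x_i)/T(theta_j)], in bits."""
    Pxy = Px[:, None] * Pyx                   # joint P(x_i, y_j)
    T_logical = Px @ T                        # T(theta_j) = sum_i P(x_i) T(theta_j|x_i)
    info = np.log2(T / T_logical)             # I(x_i; theta_j), Equation (21)
    return np.sum(Pxy * info)

Px  = np.array([0.5, 0.3, 0.2])
T   = np.array([[1.0, 0.1],                   # semantic channel T(theta_j|x_i)
                [0.6, 0.5],
                [0.1, 1.0]])
Pyx = np.array([[0.9, 0.1],                   # Shannon channel P(y_j|x_i)
                [0.5, 0.5],
                [0.1, 0.9]])
print(semantic_mutual_information(Px, Pyx, T))
```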

4. Theoretical Results

4.1. The New Explanations of the MMI Distribution and the Rate-Distortion Function

It is easy to regard exp[sd(xi, y)] in the rate-distortion function as a truth function and Zi as a logical probability. We can let θxi be a fuzzy subset of V (V = {y1, y2, …}) and T(xi|y) ≡ T(θxi|y) be the truth function of y(xi), and T(xi) = T(θxi) = ∑j P(yj)T(yj|xi) be the logical probability of xi. Hence, we can observe that the MMI distribution P(y|x) in the rate-distortion function is produced by the semantic Bayes’ formula:
P(y|xi) = P(y)exp[sd(xi, y)]/Zi = P(y)T(xi|y)/T(xi), i = 1, 2, …, m.
R(D) can be expressed by the semantic mutual information formula because:
$$I(Y; X_\theta) = \sum_j \sum_i P(x_i, y_j)\log\frac{T(x_i|y_j)}{T(x_i)} = \sum_j \sum_i P(x_i, y_j)\log\exp[s d(x_i, y_j)] - \sum_i P(x_i)\log Z_i = sD(s) - \sum_i P(x_i)\log Z_i = R(D).$$

4.2. Setting Up the Relation between the Truth Function and the Distortion Function

We can now improve the G theory by setting up the relation between the truth function and the distortion function.
The four truth functions in Figure 1 also reveal the distortion when we use yj to represent xi. If the truth value of yj(xi) is 1, the distortion d(xi, yj) should be 0. The distortion increases as the truth value decreases. Hence, we use the following definition:
Definition 3.
The transformation relation between the distortion function and the truth function is defined as:
d(x, y) ≡ log[1/T(y|x)].
According to this definition, we have T(y|x) = exp[−d(x, y)] and H(Yθ|X) = $\bar{d}$. Therefore, we can use H(Yθ|X) to replace $\bar{d}$ for the constraint condition.
Since T(y|x) = T(x|y), we also have d(x, y) = log[1/T(x|y)].
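The following sketch verifies this relation numerically on made-up toy arrays: converting the truth function into a distortion function by Definition 3 makes the fuzzy entropy H(Yθ|X) equal to the average distortion.

```python
# A minimal sketch of Definition 3 and the identity H(Y_theta|X) = d-bar;
# the truth functions and the Shannon channel below are made up (logs in nats).
import numpy as np

T = np.array([[1.0, 0.2],            # T(y_j|x_i) for two instances and two labels
              [0.3, 1.0]])
d = np.log(1.0 / T)                  # Definition 3: d(x, y) = log[1/T(y|x)]
assert np.allclose(T, np.exp(-d))    # and back: T(y|x) = exp[-d(x, y)]

Px  = np.array([0.6, 0.4])
Pyx = np.array([[0.9, 0.1],          # a toy Shannon channel P(y_j|x_i)
                [0.2, 0.8]])
Pxy = Px[:, None] * Pyx
avg_distortion = np.sum(Pxy * d)                # d-bar
fuzzy_entropy = -np.sum(Pxy * np.log(T))        # H(Y_theta|X)
print(np.isclose(avg_distortion, fuzzy_entropy))    # True
```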

4.3. Rate-Truth Function R(Θ)

The author previously proposed the rate-of-limiting-error function in [30], which was an immature study. That function is more akin to an extension of the complexity-distortion function [39], unlike the rate-truth function, which is an extension of the rate-distortion function.
In the following, we use almost the same method used for R(D) to obtain the rate-truth function R(Θ), where Θ means a group of truth functions or fuzzy sets. The constraint condition $\bar{d} \le D$ becomes:
$$H(Y_\Theta|X) = -\sum_i \sum_j P(x_i)P(y_j|x_i)\log T(y_j|x_i) \le D.$$
Following the rate-distortion function’s derivation (see Equation (10)), we obtain the optimized posterior distribution P(y|xi) of y:
$$P(y|x_i) = P(y)T(y|x_i)^{|s|}\Big/\sum_k P(y_k)T(y_k|x_i)^{|s|} = P(y)T(x_i|y)^{|s|}/T(x_i) = P(y|\theta_{x_i}), \quad i = 1, 2, \ldots,$$
where we replace −s with |s|, which is positive. The larger the |s| value is, the clearer the boundaries of the fuzzy sets are, and hence the larger the R is.
Next, we obtain the optimized P(y) (see Equation (11)).
We can obtain the Shannon channel P(y|x) that minimizes R by repeating Equations (11) and (34) until P(y) converges, e.g., by the MMI iteration.
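The following sketch implements this iteration (Equation (34) alternated with Equation (11)); the source P(x) and the two sigmoidal truth functions are made-up stand-ins for the kind of semantic channel shown in Figure 1.

```python
# A minimal sketch of the MMI iteration for the rate-truth function R(Theta);
# the source P(x) and the two truth functions below are made up.
import numpy as np

def rate_truth_iteration(Px, T, s_abs=1.0, n_iter=500, tol=1e-12):
    """Px: shape (m,); T: shape (m, n) semantic channel T(y_j|x_i); s_abs = |s|."""
    m, n = T.shape
    Py = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        w = Py * T**s_abs                         # P(y_j) T(y_j|x_i)^|s|
        Pyx = w / w.sum(axis=1, keepdims=True)    # Equation (34)
        Py_new = Px @ Pyx                         # Equation (11)
        if np.max(np.abs(Py_new - Py)) < tol:
            Py = Py_new
            break
        Py = Py_new
    Pxy = Px[:, None] * Pyx
    R = np.sum(Pxy * np.log2(Pyx / Py))           # MMI R(Theta), in bits
    return Pyx, Py, R

x = np.arange(0, 100.0)
Px = np.exp(-x**2 / (2 * 37**2)); Px /= Px.sum()
T = np.stack([1 / (1 + np.exp(1.5 * (x - 18))),   # a "non-adult"-like truth function
              1 / (1 + np.exp(-1.5 * (x - 18)))], # an "adult"-like truth function
             axis=1)
Pyx, Py, R = rate_truth_iteration(Px, T, s_abs=1.0)
print(Py, R)
```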
Bringing P(y|xi) in Equation (34) into the mutual information formula, we obtain the rate-truth function R(Θ):
$$R(\Theta) = \sum_i \sum_j P(x_i)P(y_j|x_i)\log[P(y_j|x_i)/P(y_j)] = \sum_i \sum_j P(x_i)P(y_j|x_i)\log[T(x_i|y_j)^{|s|}/T(x_i)] = H(X_\theta) - H(X_\theta|Y) = I(Y; X_\theta),$$
where:
$$H(X_\theta|Y) = -\sum_i \sum_j P(x_i, y_j)\log T(x_i|y_j)^{|s|}, \quad T(x_i) = \sum_j T(x_i|y_j)^{|s|} P(y_j), \quad H(X_\theta) = -\sum_i P(x_i)\log T(x_i).$$
We have R(Θ) = H(Xθ) − H(Xθ|Y) = I(Y; Xθ), which means that the MMI can be expressed by the semantic mutual information I(Y; Xθ). The constraint is tighter when |s| > 1 and looser when |s| < 1 than when |s| = 1. When the constraint condition is the original truth function without s or with s = −1, we have:
R(Θ) = I(X; Y) = I(Y; Xθ) ≥ I(X; Yθ).
The reason for I(Y; Xθ) ≥ I(X; Yθ) is that the iteration can only let P(yj|x) be approximately proportional to T(yj|x), and not exactly in general (see Equation (34) and Section 5.2).
According to Shannon’s lossy coding theorem of discrete memoryless sources [14] (p. 307), for the given sources P(X) and D, we can use block coding to achieve a minimum average codeword length, whose lower limit is R(D). Since H(Xθ|Y) can be understood as the average distortion, R(Θ) also means the lower limit of average codeword length.
For any distortion function d(x, y), we can always express the corresponding truth function as exp[−d(x, y)]. But for truth functions, such as those in Figure 1, we may not be able to express them by a distortion function. Therefore, the rate-distortion function may be regarded as the rate-truth function’s particular case, as the truth function is exp[−d(x, y)].

4.4. Rate-Verisimilitude Function R(G)

If we change the average distortion criterion into the semantic mutual information criterion, which is compatible with the likelihood criterion, then the rate-distortion function becomes the rate-verisimilitude function [31]. In this case, we replace dij = d(xi, yj) with Iij = I(xi; θj). The constraint condition $\bar{d} \le D$ becomes I(X; Yθ) ≥ G, where G denotes the lower limit of semantic mutual information. Following the rate-distortion function’s derivation, we can obtain:
$$G(s) = \sum_i \sum_j P(x_i)P(y_j|x_i) I_{ij} = \sum_i \sum_j I_{ij} P(x_i)P(y_j)\exp(s I_{ij})/Z_i, \quad R(s) = sG(s) - \sum_i P(x_i)\log Z_i,$$
where s is positive, and:
$$P(y_j|x_i) = P(y_j)\left[\frac{T(y_j|x_i)}{T(y_j)}\right]^s\Big/Z_i, \quad i = 1, 2, \ldots;\ j = 1, 2, \ldots, \qquad Z_i = \sum_k P(y_k)\left[\frac{T(y_k|x_i)}{T(y_k)}\right]^s.$$
We also need the MMI iteration to optimize P(y) and P(y|x). The function P(y|xi) is now like a Softmax function, in which the numerator P(y)[T(y|xi)/T(y)]^s may be greater than 1. In the E-step of the EM algorithm for mixture models, there is a similar formula, which is also used for MMI [40].
R(G) is more suitable than R(D) and R(Θ) when y is a prediction, such as a weather prediction, where information is more important than truth. More discussions and applications of the R(G) function can be found in [30,31].
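The following sketch implements the corresponding iteration (Equation (39) alternated with Equation (11)); it reuses the same made-up two-label setup as the R(Θ) sketch above and is only an illustration of the procedure.

```python
# A minimal sketch of the MMI iteration for the rate-verisimilitude function
# R(G); the source and truth functions are the same made-up ones as above.
import numpy as np

def rate_verisimilitude_iteration(Px, T, s=1.0, n_iter=500, tol=1e-12):
    m, n = T.shape
    T_logical = Px @ T                            # T(theta_j), fixed by P(x)
    lift = T / T_logical                          # T(y_j|x_i) / T(y_j)
    Py = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        w = Py * lift**s                          # numerator of Equation (39)
        Pyx = w / w.sum(axis=1, keepdims=True)
        Py_new = Px @ Pyx                         # Equation (11)
        if np.max(np.abs(Py_new - Py)) < tol:
            Py = Py_new
            break
        Py = Py_new
    Pxy = Px[:, None] * Pyx
    R = np.sum(Pxy * np.log2(Pyx / Py))           # MMI R, in bits
    G = np.sum(Pxy * np.log2(lift))               # semantic mutual information I(X; Y_theta)
    return Pyx, Py, R, G

x = np.arange(0, 100.0)
Px = np.exp(-x**2 / (2 * 37**2)); Px /= Px.sum()
T = np.stack([1 / (1 + np.exp(1.5 * (x - 18))),
              1 / (1 + np.exp(-1.5 * (x - 18)))], axis=1)
print(rate_verisimilitude_iteration(Px, T, s=1.0)[2:])
```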

4.5. The New Explanation of the Maximum Entropy Distribution and the Extension

We consider maximizing joint entropy H(X, Y) for the given P(x).
Consider Equation (17) for the maximum entropy distribution. We may assume that P(y) is a constant (1/n) so that H(Y) is the maximum. Then Equation (17) becomes:
$$P(y|x_i) = P(y)\exp\Big[\sum_k \alpha_k f_k(x_i, y)\Big]\Big/Z_i', \quad i = 1, 2, \ldots, \qquad Z_i' = \sum_j P(y)\exp\Big[\sum_k \alpha_k f_k(x_i, y_j)\Big].$$
The above exp(.) is an NEF. We can explain exp(.) as a truth function, Zi′ as a logical probability, and the above formula as a semantic Bayes’ formula.
Then, maximum joint entropy can be expressed as:
$$H(X, Y) = H(X) + H(Y|X) = H(X) - \sum_i P(x_i)\sum_j P(y_j|x_i)\log P(y_j|x_i) = H(X) - \sum_i \sum_j P(x_i, y_j)\log\Big[P(y_j)\exp\Big(\sum_k \alpha_k f_k(x_i, y_j)\Big)\Big/Z_i'\Big] = H(X) + H(Y) - I(Y; X_\theta),$$
where H(Y) is equal to log n, I(X; Yθ) is semantic mutual information, and H(X) + H(Y) is the extremely maximum value of H(X, Y). Therefore, we can explain that maximum entropy is equal to extremely maximum entropy minus semantic mutual information.
If we use a group of truth functions or membership functions, such as those in Figure 1, as the constraint condition to extend the maximum entropy method, the above distribution becomes:
$$P(y|x_i) = T(y|x_i)^{|s|}\Big/\sum_k T(y_k|x_i)^{|s|}, \quad i = 1, 2, \ldots$$
This distribution is the same as in Equation (34) with P(y) = 1/n.

4.6. The New Explanation of the Boltzmann Distribution and the Maximum Entropy Law

Using Stirling’s formula ln N! ≈ N ln N − N as N → ∞, Jaynes proved that Boltzmann’s entropy and Shannon’s entropy have a simple relation [4,5]:
$$S = k\ln W = k\ln\frac{N!}{\prod_i N_i!} = -kN\sum_i P(x_i|T)\ln P(x_i|T) = kN\,H(X|T),$$
where k is the Boltzmann constant, W is the microstate number, xi means state i, N is the particles’ number, and P(xi|T) denotes the probability of a particle (or the density of particles) in state i for the given absolute temperature T.
The Boltzmann distribution [41] is:
$$P(x_i|T) = \exp\Big(-\frac{e_i}{kT}\Big)\Big/Z, \quad Z = \sum_i \exp\Big(-\frac{e_i}{kT}\Big),$$
where Z is the partition function.
If xi means energy ei, Gi is the number of states with ei, and G is the total number of all states, then P(xi) = Gi/G is the prior probability of xi. Hence, Equations (43) and (44) become:
$$S = k\ln\frac{N!}{\prod_i (N_i!/G_i^{N_i})} = -kN\sum_i P(x_i|T)\ln\frac{P(x_i|T)}{G_i} = -kN\sum_i P(x_i|T)\ln\frac{P(x_i|T)}{P(x_i)} + kN\ln G,$$
$$P(x_i|T) = P(x_i)\exp\Big(-\frac{e_i}{kT}\Big)\Big/Z', \quad Z' = \sum_i P(x_i)\exp\Big(-\frac{e_i}{kT}\Big).$$
Now, we can explain exp[−ei/(kT)] as a truth function T(θj|x), Z′ as a logical probability T(θj), and Equation (46) as a semantic Bayes’ formula.
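The following sketch evaluates Equation (46) numerically; the energy levels, degeneracies, and temperature are made-up values used only to show that the formula behaves as a semantic Bayes' formula with the NEF as a truth function.

```python
# A minimal sketch of Equation (46): a Boltzmann distribution with the state
# degeneracies as a prior; all physical inputs below are made up.
import numpy as np

k_B = 1.380649e-23                                # Boltzmann constant (J/K)
e = np.array([0.0, 1.0, 2.0, 3.0]) * 1.602e-21    # hypothetical energy levels (J)
G = np.array([1.0, 3.0, 5.0, 7.0])                # hypothetical degeneracies G_i
Px = G / G.sum()                                  # prior P(x_i) = G_i / G
T_kelvin = 300.0

nef = np.exp(-e / (k_B * T_kelvin))               # the NEF exp[-e_i/(kT)], a truth function
Z = np.sum(Px * nef)                              # logical probability (partition function Z')
P_x_given_T = Px * nef / Z                        # Equation (46), a semantic Bayes' formula
print(P_x_given_T, P_x_given_T.sum())             # a distribution summing to 1
```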
Assume that a local equilibrium system’s different areas yj, j = 1, 2, …, have different temperatures Tj, j = 1, 2, …; the above G then becomes Gj. Then we can derive (see Appendix A for the details):
$$S/(kN) = \sum_j P(y_j)\ln G_j - I(X; Y_\theta),$$
which means that the thermodynamic entropy S is proportional to the extremely maximum entropy j P ( y j ) ln G j minus the semantic mutual information I(X; Yθ). This formula indicates that the maximum entropy law in physics can be equivalently stated as the MMI law; this MMI can be expressed as semantic mutual information I(X; Yθ).

5. Experimental Results

5.1. An Example Shows the Shannon Channel’s Changes in the MMI Iterations for R(Θ) and R(G)

The author experimented with Example 1 to test two theoretical results:
  • the MMI iteration lets the Shannon channel match the semantic channel, e.g., lets P(yj|x) ∝ T(yj|x), j = 1, 2, …, as far as possible for the functions R(D), R(Θ), and R(G);
  • the MMI iteration can reduce mutual information.
Example 1.
The four truth functions and the population age distribution P(x) are shown in Figure 1 (see Appendix B for the formulas producing these lines). The task is to use the above truth functions as the constraint condition to obtain P(y|x) for R(Θ) and P(y|x) for R(G).
First, the author uses the above truth functions as the constraint condition for the R(Θ) with |s| = 1. The iteration process for P(y|x) is shown in Figure 3 (see Supplementary Materials to find how the data for Figure 3 are produced).
The convergent P(y) is {P(y1), P(y2), P(y3), P(y4)} = {0.3499, 0.0022, 0.6367, 0}. The MMI is 0.845 bits. The iterative process not only lets every P(yj|x) be approximately proportional to T(yj|x) but also makes P(y4) closer to 0. The value of P(y4) became 0 because y4 implies y3 and hence can be replaced with y3; the latter has a larger logical probability. By replacing y4 with y3, we can reduce the average codeword length.
Figure 3c indicates that H(Y) also decreases with the iteration. Since H(Y) is much less than H(X), we can also simply replace the instance x with the label y to compress data.
To maximize the entropy H(X, Y) or H(Y|X) for a given P(x), we do not need the iteration for P(y|x). The function P(y|x) in Figure 3a also results in the maximum entropies H(Y|X) and H(X, Y), where P(y|x) matches T(y|x) only once.
Next, the author uses the above truth functions as the constraint condition for the R(G) with |s| = 1. For R(G), d(xi, yj) is replaced with I(xi; θj) = log[T(yj|xi)/T(yj)] for every i and j. Figure 4 shows the iterative results and the process (see Supplementary Materials to find how the data for Figure 4 are produced).
Figure 4a intuitively displays that P(y|x) accords with the four labels’ semantic meanings. The convergent distribution P(y) for R(G) is {P(y1), P(y2), P(y3), P(y4)} = {0.3619, 0.0200, 0.6120, 0.0057}. The MMI is 0.883 bits. This P(y4) is not 0. Both P(y2) and P(y4) are larger than those for R(Θ). The reason is that a label with less logical probability can convey more semantic information and, hence, should be more frequently selected if we use the semantic information criterion. For the same s = 1 and Θ, R(G) = 0.883 bits is greater than R(Θ) = 0.845 bits.

5.2. An Example about Greyscale Image Compression Shows the Convergent P(y|x) for the R(Θ)

This example explains how we replace the distortion function with the truth function to obtain R(Θ) and the corresponding channel P(y|x) for image compression by decreasing the resolution of pixels’ grey levels. The conclusion also applies to data reduction, where too high a resolution is unnecessary.
Example 2.
The goal is to compress an 8-bit greyscale image with 256 grey levels (denoted by xi, i = 0, 1, …, 255) into a 3-bit grey image with 8 grey levels (denoted by yj, j = 1, 2, …, 8) [42]. Considering that human visual discrimination changes with grey levels (the higher the grey level is, the lower the discrimination is), we use 8 truth functions, as shown in Figure 5a, to represent 8 fuzzy classes. Appendix C shows how these lines are produced. The task is to solve the MMI R and the corresponding channel P(y|x) with s = 1.
The author obtains the convergent P(y|x), as shown in Figure 5b. Figure 5c shows how R(Θ), I(X; Yθ), and H(Y) change in the iterative process (see Supplementary Materials to find how the data for Figure 5 are produced).
Figure 5 shows how the Shannon channel matches the semantic channel for MMI. Comparing Figure 5a,b, we find that it is easy to control P(y|x) with T(y|x). If we use the distortion function d(x, yj) instead of the truth function T(yj|x), j = 1, 2, …, 8, it is not easy to design d(x, yj). It is also difficult to predict the convergent P(y|x) according to d(x, y).
In this example, it is difficult for I(X; Y) = I(Y; Xθ) to approach I(X; Yθ) because P(yj|x) cannot be strictly proportional to T(yj|x) for j = 2, 3, …, 6.
The author has used different prior distributions and semantic channels to calculate Shannon channels for MMI using the above method and achieved similar results. Every convergent TPF P(yj|x) covers an area in U identical to the area covered by T(yj|x) when s = 1, and hence P(y|x) satisfies the constraint condition defined by T(y|x).

6. Discussions

6.1. Connecting Machine Learning and Data Compression by Truth Functions

Researchers have applied the rate-distortion theory to machine learning and achieved meaningful results [43,44,45]. However, we also need to obtain the distortion function from machine learning.
In the rate-distortion theory, the distortion function d(x, y) is subjectively defined and lacks an objective standard. When the instance x and label y have different universes U and V with |U| >> |V|, the definition of the distortion between x and y is problematic. Now we can obtain labels’ truth functions from sampling distributions by machine learning (see Equations (23)–(25)) and indirectly express distortion functions using truth functions (see Equation (32)). After the source P(x) is changed, the semantic channel T(y|x) is still functional [31]. We can use T(y|x) and the new P(x) to obtain the new Shannon channel P(y|x) for MMI R(Θ). We can also use T(y|x)^|s| with |s| > 1 to reduce the average distortion or with |s| < 1 to loosen the distortion limit. In this way, we can overcome the rate-distortion function’s three disadvantages mentioned in Section 1. In addition, a group of truth functions describe a group of labels’ extensions and semantic meanings and are therefore intuitive. They have stronger descriptive power than distortion functions because we can always use a truth function to replace a distortion function by T(yj|x) = exp[−d(x, yj)]. However, we may not be able to use a distortion function to replace a truth function, such as the truth function of y3 = “Adult” in Figure 1.
Two often-used learning functions are the truth function (or the similarity function, or the membership function) and the likelihood function. We can also replace the distortion function d(x, y) with a log-likelihood function logP(x|θ), or equivalently replace d ¯ D with I(X; Yθ) ≥ G to obtain the rate-verisimilitude function R(G). The function R(G) uses the likelihood criterion or the semantic information criterion. It can reduce the underreports of events with less logical probabilities. For example, the selected probabilities P(y2) and P(y4) in Figure 1 for R(G) (see Figure 4a) are higher than P(y2) and P(y4) for R(Θ) (see Figure 3c). The function R(G) is more suitable than R(D) and R(Θ) when y is a prediction where information is more important than truth.
Like R(D), R(Θ) and R(G) also mean the lower limit of average codeword length.

6.2. Viewing Rate-Distortion Functions and Maximum Entropy Distributions from the Perspective of Semantic Information G Theory

There are many similarities between the MMI distribution in the rate-distortion function and the maximum entropy distribution. According to the analyses in Section 4.1 and Section 4.5, we can regard NEFs as truth functions and partition functions as logical probabilities. These distributions or Shannon’s channels, such as P(y|x) in Equation (2), are semantic Bayes’ distributions produced by the semantic Bayes’ formula (see Equation (19)).
The main differences between the rate-distortion theory and the maximum entropy method are:
  • For the rate-distortion function R(D), we seek MMI I(X; Y) = H(X) − H(X|Y), which is equivalent to maximizing the posterior entropy H(X|Y) of X. For R(D), we use an iterative algorithm to find the proper P(y). However, in the maximum entropy method, we maximize H(X, Y) or H(Y|X) for given P(x) without the iteration for P(y);
  • The rate-distortion function can be expressed by the semantic information G measure (see Equation (31)). In contrast, the maximum entropy is equal or proportional to the extremely maximum entropy minus semantic mutual information (see Equations (41) and (47)).

6.3. How the Experimental Results Support the Explanation for the MMI Iterations

The theoretical analyses in Section 4.2, Section 4.3 and Section 4.4 indicate that the MMI iterations for R(D), R(Θ), and R(G) impel the Shannon channel to match the semantic channel, e.g., impel P(yj|x) ∝ T(yj|x), j = 1, 2, …, as far as possible. Example 1 in Section 5.1 shows that the iterative processes for R(Θ) and R(G) not only find the proper probability distribution P(y) (which means a label yj with larger logical probability will be selected more frequently) but also modify every TPF P(yj|x) so that P(yj|x) is approximately proportional to T(yj|x). Therefore, the results in Figure 3 support the above theoretical analyses.
Figure 4 indicates that the Shannon channel P(y|x) for R(G) is a little different from that for R(Θ). For example, two labels y2 and y4, with less logical probabilities, have larger P(y2) and P(y4) than those for R(Θ). This result also accords with the theoretical analysis, which indicates that the semantic information criterion can reduce the underreports of events with less logical probabilities.
Figure 5 shows an example of data reduction. For this example, it is hard to define distortion function d(x, y), but it is easy to use truth functions to represent fuzzy classes. The results indicate that we can easily control the convergent Shannon channel P(y|x) with the semantic channel T(y|x).

6.4. Considering Other Kinds of Semantic Information

Wyner and Debowski have independently defined an information measure (see Equation (1) in [27]), which is meaningful for measuring the semantic information between two sentences or labels. Combining this with the G theory, we may develop a different measure from the above for the semantic information between two labels yj and yk:
$$I(\theta_j; \theta_k) = \sum_i P(x_i|y_j, y_k)\log\frac{T(\theta_j \cap \theta_k|x_i)}{T(\theta_j)T(\theta_k)}.$$
Unlike Equation (1) in [27], the above formula includes both statistical probabilities (P(x) and P(x|yj, yk)) and logical probabilities, and it does not require that subsets θ1, θ2, …, θn form a partition of U. We use an example to explain the reason for using P(x). If P(x) is a population age distribution, we can use two labels y1 = “Person in his 50s” and y2 = “Elder person” for a person at age x. The average life span (denoted by t) in different areas and eras might change from 50 years to 80 years. The above formula can ensure that the semantic information between y1 and y2 for t = 50 is more than that for t = 80.
Averaging I(θj; θk), we have semantic mutual information:
$$I(Y_{\theta 1}; Y_{\theta 2}) = \sum_j \sum_k \sum_i P(x_i, y_j, y_k)\log\frac{T(\theta_j \cap \theta_k|x_i)}{T(\theta_j)T(\theta_k)}.$$
This formula can ensure that wrong translations or replacements may convey negative semantic information.
The distortion function between the two labels becomes:
$$d(y_j, y_k) = \sum_i P(x_i|y_j, y_k)\log[1/T(\theta_j \cap \theta_k|x_i)].$$
We can use this function for the data compression of a sequence of labels or sentences.
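The following sketch evaluates Equations (48) and (50) for one pair of labels. The joint truth function is taken as an input here and formed by a simple product, which is only one possible conjunction and is an assumption (compound propositions' truth functions are discussed in [32]); all numerical inputs are made-up illustrations.

```python
# A minimal sketch of Equations (48) and (50); the product used for the joint
# truth function and all numerical inputs are assumptions for illustration.
import numpy as np

def label_pair_information(Px_jk, Px, T_joint, T_j, T_k):
    """Equation (48): semantic information between labels y_j and y_k, in bits."""
    Tj = np.sum(Px * T_j)                         # logical probability T(theta_j)
    Tk = np.sum(Px * T_k)                         # logical probability T(theta_k)
    return np.sum(Px_jk * np.log2(T_joint / (Tj * Tk)))

def label_pair_distortion(Px_jk, T_joint):
    """Equation (50): distortion d(y_j, y_k) between the two labels."""
    return np.sum(Px_jk * np.log2(1.0 / T_joint))

x = np.arange(0, 100.0)
Px = np.exp(-x**2 / (2 * 37**2)); Px /= Px.sum()
T_j = np.exp(-(x - 55)**2 / (2 * 5**2))           # a "person in his 50s"-like truth function
T_k = 1 / (1 + np.exp(-0.4 * (x - 60)))           # an "elder"-like truth function
T_joint = T_j * T_k                               # assumed conjunction T(theta_j AND theta_k|x)
Px_jk = T_joint * Px / np.sum(T_joint * Px)       # an illustrative P(x|y_j, y_k)
print(label_pair_information(Px_jk, Px, T_joint, T_j, T_k))
print(label_pair_distortion(Px_jk, T_joint))
```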
Considering the semantic information of designs, decisions, and control plans, the G measure and the R(G) function are also useful. In these cases, the truth function T(θj|x) becomes the DCF, P(x|y) becomes the realized distribution, information means control complexity or control amount, G means the effective control amount, and R means the control cost. To increase G = I(X; Yθ), we need to fix T(θj|x) and optimize control to improve P(x|y). The R(G) function tells us that we can improve P(y|x) with the MMI iteration and increase G by amplifying s. In [32], the author improperly uses a different information formula (Equation (24) in [32]) for the above purpose. That formula seemingly only fits cases where the control results are continuous distributions. The author will further discuss this topic in another paper.
The G measure also has its limitations, which include:
  • The G measure is only related to labels’ extensions, not to labels’ intensions. For example, “Old” means senility and closeness to death, whereas the G measure is only related to T(“Old”|x), which represents the extension of “Old”;
  • The G measure is not enough for measuring the semantic information from fuzzy reasoning according to the context, although the author has discussed how to calculate compound propositions’ truth functions and fuzzy reasoning [32].
Therefore, we need more studies to measure more kinds of semantic information.

6.5. About the Importance of Semantic Information’s Studies

Shannon does not consider semantic information. He writes [34] (p. 3):
“These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages. The system must be designed to operate for each possible selection, not just the one which will actually be chosen since this is unknown at the time of design.”
However, the author of this paper does not think that Shannon opposed researching semantic information because, in the book co-written by Shannon and Weaver, Weaver [34] (pp. 95–97) initiated the study of semantic information. If we extend Shannon’s information theory for machine learning, which frequently deals with the selection of messages, we must consider semantic information.
Section 4.2 reveals that distortion is related to truth, truth is related to semantic meaning, and hence, the rate-distortion function is related to semantic information. Equation (31) indicates that the rate-distortion function R(D) can be expressed as semantic mutual information I(Y; Xθ).
With machine learning’s developments, semantic information theory is becoming more important. From the perspective of the semantic information G theory, the EM algorithm for mixture models, where the E-step uses a formula like Equation (39), can be explained by the mutual matching of the semantic channel and the Shannon channel [40], so that two kinds of information are approximately equal. In the Restricted Boltzmann Machine (RBM), the Softmax function with the NEF and the partition function is used as the learning function [7]. Some researchers have discussed the similarity between the RBM and the EM algorithm [46]. It seems that we can also explain the RBM by the two channels’ mutual matching. We need more studies for the G theory’s applications to machine learning with neural networks.

7. Conclusions

In the introduction section, we raised three questions:
  • Why does the Bayes-like formula (see Equation (2)) with NEFs and partition functions widely exist in the rate-distortion theory, statistical mechanics, the maximum entropy method, and machine learning?
  • Can we combine machine learning (for the distortion function) and data compression (with the rate-distortion function)?
  • Can we use the rate-distortion function or similar functions for semantic compression?
Using the semantic information G measure based on the P-T probability framework, we have explained that the NEFs are truth functions, the partition functions are logical probabilities, and the Bayes-like formula is a semantic Bayes’ formula. We have also explained that the semantic mutual information formula can express MMI R (see Equation (35)); maximum entropy (in statistical mechanics or the maximum entropy method) is equal or proportional to extremely maximum entropy minus semantic mutual information (see Equations (41) and (47)).
We have shown that we can obtain truth functions from sampling distributions by machine learning (see Equations (23) and (25)). Furthermore, after setting the relationship between the truth function and the distortion function (see Section 4.2), we can replace the distortion function with log(1/truth_function). Therefore, we can combine machine learning (for the distortion function) and data compression.
Since truth functions represent labels’ semantic meanings [15], we can extend the rate-distortion function R(D) to the rate-truth function R(Θ) by replacing the average distortion $\bar{d}$ with the fuzzy entropy H(Yθ|X) (see Section 4.3). We can also extend R(D) to the rate-verisimilitude function R(G) by replacing the upper limit D of average distortion with the lower limit G of semantic mutual information (see Section 4.4). We have used two examples to show how the iteration algorithm lets the Shannon channel P(y|x) match the semantic channel T(y|x) under the semantic constraints to achieve MMI R. Example 1 reveals that the iteration algorithm can use the logical implication between labels to reduce mutual information. Example 2 indicates that it is easy to control the Shannon channel P(y|x) with the semantic channel T(y|x) for data reduction.
However, we also need to compress and recover text data (according to the context), image data, and speech data (according to the results of pattern recognition or species recognition). The above extension is useful, but it is far from sufficient, and so further studies are necessary.

Supplementary Materials

Excel files that produced Figures S3–S5 are available online at http://survivor99.com/lcg/cm/mmi-iteration.zip (accessed on 8 August 2021).

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Acknowledgments

This article’s initial version was reviewed by two reviewers; the later version was reviewed by three reviewers. Their comments resulted in many meaningful revisions. Especially, the first two reviewers’ criticisms reminded me to stress the combination of machine learning and data compression. I thank five reviewers for their revision opinions, criticisms, and support. I also thank the guest editor and the assistant editor for encouraging me to resubmit the improved manuscript.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. The Derivation of Equation (47)

From Equation (45), we derive the mutual information between X and Y:
$$I(X;Y) = \sum_j P(y_j)\sum_i P(x_i|y_j)\ln\frac{P(x_i|y_j)}{P(x_i)} = \sum_j P(y_j)\ln G_j - S/(kN).$$
Further, according to Equation (46), we have MMI:
$$I(X;Y) = \sum_j \sum_i P(x_i, y_j)\ln\frac{\exp[-e_i/(kT_j)]}{Z'} = \sum_j \sum_i P(x_i, y_j)\ln\frac{T(\theta_j|x_i)}{T(\theta_j)} = H(Y_\theta) - H(Y_\theta|X) = I(X; Y_\theta).$$
From the above two equations, we obtain:
$$S/(kN) = \sum_j P(y_j)\ln G_j - I(X; Y_\theta).$$

Appendix B

Formulas that Produce Data for Figure 1 in Section 1 and Example 1 in Section 5.1
$$T(y_3|x) = T(\text{adult}|x) = \frac{1}{1 + \exp[-1.5(x - 18)]}, \quad T(y_1|x) = T(\text{non-adult}|x) = 1 - T(y_3|x),$$
$$T(y_2|x) = T(\text{youth}|x) = 1 - \Big[1 - \exp\Big(-\frac{(x - 22)^2}{2\times 5^2}\Big)\Big]^2, \quad T(y_4|x) = T(\text{elder}|x) = \frac{1}{1 + \exp[-(x - 60)]},$$
$$P(x) = \exp\Big(-\frac{x^2}{2\times 37^2}\Big)\Big/\sum_{x\ge 0}\exp\Big(-\frac{x^2}{2\times 37^2}\Big), \quad x \ge 0.$$
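For convenience, the following Python transcription of the above formulas can be used to reproduce the curves of Figure 1; the 0–100 age range is an assumption about the plotting interval.

```python
# A direct transcription of the Appendix B formulas (the 0-100 age range is assumed).
import numpy as np

x = np.arange(0, 101.0)                                        # ages

T_adult     = 1 / (1 + np.exp(-1.5 * (x - 18)))                # T(y3|x)
T_non_adult = 1 - T_adult                                      # T(y1|x)
T_youth     = 1 - (1 - np.exp(-(x - 22)**2 / (2 * 5**2)))**2   # T(y2|x)
T_elder     = 1 / (1 + np.exp(-(x - 60)))                      # T(y4|x)

Px = np.exp(-x**2 / (2 * 37**2))
Px /= Px.sum()                                                 # P(x), x >= 0
```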

Appendix C

Formulas and Parameters for Figure 5a in Section 5.2
$$T(y_0|x) = 1 - \frac{1}{1 + \exp[-(x - 2)]}, \quad T(y_j|x) = 1 - \Big[1 - \exp\Big(-\frac{(x - \mu_j)^2}{2\sigma_j^2}\Big)\Big]^3,\ j = 2, \ldots, 6, \quad T(y_7|x) = \frac{1}{1 + \exp[-0.2(x - 200)]},$$
$$P(x) = \exp\Big(-\frac{x^2}{2\times 37^2}\Big)\Big/\sum_{x\ge 0}\exp\Big(-\frac{x^2}{2\times 37^2}\Big), \quad x \ge 0.$$
Table A1. Six pairs of parameters for six truth functions T(yj|x), j = 2, 3, …, 7.
        T(y2|x)  T(y3|x)  T(y4|x)  T(y5|x)  T(y6|x)  T(y7|x)
μj      14       30       52       80       120      170
σj²     16       24       50       80       160      240
Among the above formulas, only the formula for T(yj|x), j = 2, 3, …, 6, has not been used by others. This formula represents a group of trapezoidal curves with rounded corners. Such curves are often used as membership functions and are usually constructed with piecewise functions; with this formula, piecewise functions are not necessary.

References

1. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–429+623–656.
2. Shannon, C.E. Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv. Rec. 1959, 4, 142–163.
3. Berger, T. Rate Distortion Theory; Prentice-Hall: Englewood Cliffs, NJ, USA, 1971.
4. Jaynes, E.T. Information Theory and Statistical Mechanics. Phys. Rev. 1957, 106, 620.
5. Jaynes, E.T. Information Theory and Statistical Mechanics II. Phys. Rev. 1957, 108, 171.
6. Smolensky, P. Chapter 6: Information Processing in Dynamical Systems: Foundations of Harmony Theory. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations; Rumelhart, D.E., McClelland, J.L., Eds.; MIT Press: Cambridge, MA, USA, 1986; pp. 194–281.
7. Hinton, G.E. A Practical Guide to Training Restricted Boltzmann Machines. In Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science; Montavon, G., Orr, G.B., Müller, K.R., Eds.; Springer: Berlin, Germany, 2012; Volume 7700, pp. 599–619.
8. Salakhutdinov, R.; Hinton, G.E. Replicated softmax: An undirected topic model. Neural Inf. Process. Syst. 2009, 22, 1607–1614.
9. Goodfellow, I.; Bengio, Y.; Courville, A. Softmax Units for Multinoulli Output Distributions. In Deep Learning; MIT Press: Cambridge, MA, USA, 2016; pp. 180–184.
10. Zadeh, L.A. Fuzzy Sets. Inf. Control 1965, 8, 338–353.
11. Harremoës, P. Maximum Entropy on Compact Groups. Entropy 2009, 11, 222–237.
12. Berger, T.; Gibson, J.D. Lossy Source Coding. IEEE Trans. Inf. Theory 1998, 44, 2693–2723.
13. Gibson, J. Special Issue on Rate Distortion Theory and Information Theory. Entropy 2018, 20, 825.
14. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: New York, NY, USA, 2006.
15. Davidson, D. Truth and meaning. Synthese 1967, 17, 304–323.
16. Willems, F.M.J.; Kalker, T. Semantic compaction, transmission, and compression codes. In Proceedings of the International Symposium on Information Theory (ISIT 2005), Adelaide, Australia, 4–9 September 2005; pp. 214–218.
17. Babu, S.; Garofalakis, M.; Rastogi, R. SPARTAN: A Model-Based Semantic Compression System for Massive Data Tables. ACM SIGMOD Rec. 2001, 30, 283–294.
18. Ceglarek, D.; Haniewicz, K.; Rutkowski, W. Semantic Compression for Specialised Information Retrieval Systems. Adv. Intell. Inf. Database Syst. 2010, 283, 111–121.
19. Blau, Y.; Michaeli, T. The perception-distortion tradeoff. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18 June 2018; pp. 6228–6237.
20. Bardera, A.; Bramon, R.; Ruiz, M.; Boada, I. Rate-Distortion Theory for Clustering in the Perceptual Space. Entropy 2017, 19, 438.
21. Carnap, R.; Bar-Hillel, Y. An Outline of a Theory of Semantic Information. Available online: http://dspace.mit.edu/bitstream/handle/1721.1/4821/RLE-TR-247-03150899.pdf;sequence=1 (accessed on 1 July 2021).
22. Klir, G. Generalized information theory. Fuzzy Sets Syst. 1991, 40, 127–142.
23. Floridi, L. Outline of a theory of strongly semantic information. Minds Mach. 2004, 14, 197–221.
24. Zhong, Y.X. A theory of semantic information. China Commun. 2017, 14, 1–17.
25. D’Alfonso, S. On Quantifying Semantic Information. Information 2011, 2, 61–101.
26. Bhandari, D.; Pal, N.R. Some new information measures of fuzzy sets. Inf. Sci. 1993, 67, 209–228.
27. Dębowski, Ł. Approximating Information Measures for Fields. Entropy 2020, 22, 79.
28. Lu, C. A Generalized Information Theory; China Science and Technology University Press: Hefei, China, 1993; ISBN 7-312-00501-2. (In Chinese)
29. Lu, C. Meanings of generalized entropy and generalized mutual information for coding. J. China Inst. Commun. 1994, 15, 37–44. (In Chinese)
30. Lu, C. A generalization of Shannon’s information theory. Int. J. Gen. Syst. 1999, 28, 453–490.
31. Lu, C. Semantic information G theory and logical Bayesian inference for machine learning. Information 2019, 10, 261.
32. Lu, C. The P–T probability framework for semantic communication, falsification, confirmation, and Bayesian reasoning. Philosophies 2020, 5, 25.
33. Lu, C. Channels’ Confirmation and Predictions’ Confirmation: From the Medical Test to the Raven Paradox. Entropy 2020, 22, 384.
34. Shannon, C.E.; Weaver, W. The Mathematical Theory of Communication; The University of Illinois Press: Urbana, IL, USA, 1963.
35. Zadeh, L.A. Probability measures of fuzzy events. J. Math. Anal. Appl. 1968, 23, 421–427.
36. Cumulative Distribution Function. Available online: https://en.wikipedia.org/wiki/Cumulative_distribution_function (accessed on 10 April 2021).
37. Popper, K. Conjectures and Refutations; Routledge: London, UK; New York, NY, USA, 2002.
38. Wittgenstein, L. Philosophical Investigations; Basil Blackwell Ltd.: Oxford, UK, 1958.
39. Sow, D.M.; Eleftheriadis, A. Complexity distortion theory. IEEE Trans. Inf. Theory 2003, 49, 604–608.
40. Lu, C. Understanding and Accelerating EM Algorithm’s Convergence by Fair Competition Principle and Rate-Verisimilitude Function. arXiv 2021, arXiv:2104.12592.
41. Boltzmann Distribution. Available online: https://en.wikipedia.org/wiki/Boltzmann_distribution (accessed on 10 April 2021).
42. Binary Images. Available online: https://www.cis.rit.edu/people/faculty/pelz/courses/SIMG203/res.pdf (accessed on 25 June 2021).
43. Kutyniok, G. A Rate-Distortion Framework for Explaining Deep Learning. Available online: https://maths-of-data.github.io/Talk_Edinburgh_2020.pdf (accessed on 30 June 2021).
44. Nokleby, M.; Beirami, A.; Calderbank, R. A rate-distortion framework for supervised learning. In Proceedings of the 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), Boston, MA, USA, 17–20 September 2015; pp. 1–6.
45. John, S.; Gadde, A.; Adsumilli, B. Rate Distortion Optimization Over Large Scale Video Corpus with Machine Learning. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 1286–1290.
46. Song, J.; Yuan, C. Learning Boltzmann Machine with EM-like Method. arXiv 2016, arXiv:1609.01840.
Figure 1. Four labels’ (fuzzy) truth functions about people’s ages. T(yj|x), j = 1, 2, 3, 4, denotes the truth function of the proposition function yj(x), used as the constraint condition.
Figure 2. The semantic information conveyed by yj about xi.
Figure 3. P(y|x) and I(X; Y) change in the iterative process for R(Θ). (a) P(y|x) after the first iteration (this P(y|x) also results in the maximum entropies H(Y|X) and H(X, Y)); (b) P(y|x) after the second iteration; (c) P(y|x) after eight iterations; (d) I(X; Y), I(Y; Xθ), and H(Y) change in the iterative process.
Figure 4. The iterative results for the R(G) function. (a) The convergent Shannon channel P(y|x); (b) I(X; Yθ) and I(X; Y) = I(Y; Xθ) change in the iterative process.
Figure 5. The results of Example 2. (a) Eight truth functions, i.e., the semantic channel T(y|x); (b) the convergent Shannon channel P(y|x); (c) I(X; Yθ), I(X; Y), and H(Y) change in the iterative process.