# Reviewing Evolution of Learning Functions and Semantic Information Measures for Understanding Deep Learning


## Abstract


## 1. Introduction

…P(x|y_j)/P(x) for expressing EMI. MINE and InfoNCE achieve distinct successes and encourage others. In 2019, Hjelm et al. [4,5] proposed Deep InfoMax (DIM) based on MINE. They combine DIM and InfoNCE (see Equation (5) in [4]) to achieve better results (they believe that InfoNCE was put forward independently of MINE). In 2020, Chen et al. proposed SimCLR [6], He et al. proposed MoCo [7], and Grill et al. presented BYOL [8], all of which show strong learning ability. All of them use similarity functions to construct the EMI or a loss function similar to that used for InfoNCE.

…y_j (a constant) about an instance x (a variable), we need the (fuzzy) truth function (where x makes y_j true). Generally, x and y_j belong to different domains. If x and y_j belong to the same domain, y_j becomes an estimation, i.e., y_j = $\widehat{x}_j$ = “x is about x_j”. In this case, semantic information becomes estimated information. For example, a GPS pointer means an estimate y_j = $\widehat{x}_j$ (see Section 3.1 for details); it conveys estimated information. Since it may be wrong, this information is also semantic information. Therefore, we can say that estimated information is a special case of semantic information; this paper also regards estimated information as semantic information and the learning method of using EMI as the semantic ITL method.

- How are EMI and ShMI related in supervised, semi-supervised, and unsupervised learning?
- Is similarity a probability? If it is, why is it not normalized? If it is not, why can we bring it into Bayes’ formula (see Equation (9))?
- Can we get similarity functions, distortion functions, truth functions, or membership functions directly from samples or sampling distributions?

- Reviewing the evolutionary histories of semantic information measures and learning functions;
- Clarifying the relationship between SeMI and ShMI;
- Promoting the integration and development of the G theory and deep learning.

- Reviewing the evolution of semantic information measures;
- Reviewing the evolution of learning functions;
- Introducing the G theory and its applications to machine learning;
- Discussing some questions related to SeMI and ShMI maximization and deep learning;
- Discussing some new methods worth exploring and the limitation of the G theory;
- Conclusions with opportunities and challenges.

## 2. The Evolution of Semantic Information Measures

#### 2.1. Using the Truth or Similarity Function to Approximate Shannon’s Mutual Information

…x_i is an instance, y_j is a label, and X and Y are two random variables.

…y_j is a constant. Deep learning researchers [1,2] found that we could use parameters to construct similarity functions and then use similarity functions to construct EMI that approximates ShMI. The main characteristic of similarity functions is:

…m(x, y_j), an important function. However, the author has never seen its name. We now call m(x, y_j) the relatedness function. We can also construct the distortion function d(x, y_j) with parameters and use d(x, y_j) to express the similarity function so that

…m(x, y_l) is added in the partition function. Nevertheless, Equations (4) and (6) are equivalent. Compared with the likelihood function P(x|θ_j) (where θ_j represents y_j and related parameters), the similarity function is independent of the source P(x). After the source and the channel change, the similarity function still serves as a proper predictive model. We can use the similarity function and the new source P’(x) to make a new probability prediction or produce a likelihood function:
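The equivalence between similarity-based EMI and ShMI can be checked numerically. The sketch below (the joint distribution and all variable names are illustrative assumptions, not from the paper) builds a similarity function proportional to the relatedness function m(x, y_j) and shows that the resulting EMI equals ShMI:

```python
import numpy as np

# Toy joint distribution P(x, y): rows = instances x, columns = labels y.
P_xy = np.array([[0.20, 0.05],
                 [0.10, 0.15],
                 [0.05, 0.45]])
P_x = P_xy.sum(axis=1)            # source P(x)
P_y = P_xy.sum(axis=0)            # destination P(y)

# Relatedness function m(x, y) = P(x, y) / [P(x) P(y)].
m = P_xy / np.outer(P_x, P_y)

# Similarity S(x, y_j) = m(x, y_j) / max_x m(x, y_j); its maximum is 1.
S = m / m.max(axis=0, keepdims=True)

# Partition function Z_j = sum_x P(x) S(x, y_j): the average of S over the source.
Z = P_x @ S

# EMI built from the similarity function: the average of log[S(x, y_j)/Z_j].
emi = (P_xy * np.log(S / Z)).sum()

# Shannon mutual information for comparison.
shmi = (P_xy * np.log(m)).sum()

print(emi, shmi)   # identical when S is exactly proportional to m
```

Because sum_x P(x)m(x, y_j) = 1, the column maximum cancels in S/Z, so the two quantities agree exactly.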

#### 2.2. Historical Events Related to Semantic Information Measures

…y_j, whereas ShMI can also be seen as the KL information between two distributions: P(x, y) and P(x)P(y).

Carnap and Bar-Hillel’s semantic information formula is I_CB = log[1/m_p], where m_p is a proposition’s logical probability. This formula partly reflects Popper’s idea that the smaller the logical probability, the greater the amount of semantic information. However, I_CB does not indicate whether the hypothesis can stand the test. Therefore, this formula is not practical. In addition, the logical probability m_p is independent of the prior probability distribution of the instance, which is also unreasonable.

…d_ij = d(x_i, y_j), s ≤ 0, and Z_i is the partition function. In the author’s opinion, exp(s·d_ij) is a truth function, and this MinMI can be regarded as SeMI [18].

…I_Theil means predictive information.

The membership grade of x_i in a fuzzy set A_j is denoted as M_Aj(x_i). The author of this paper explained in 1993 [15] that the membership function M_Aj(x) is also the truth function of the proposition function y_j = y_j(x) = “x is in A_j”. If we assume that there is a typical x_j (that is, Plato’s idea, which may not be in A_j) that makes y_j true, i.e., M_Aj(x_j) = 1, then the membership function M_Aj(x) is the similarity function between x and x_j.

I_DV was later called the Donsker–Varadhan representation in [1]; we also call it DV-KL information. They proposed this formula perhaps because they were inspired by the information rate-distortion function or the Gibbs (or Boltzmann–Gibbs) distribution. To understand this formula, we replace P(x) with P(x|y_j) and Q(x) with P(x). Then the KL information becomes:

…T(x_i) so that exp[T(x_i)] ∝ P(y_j|x). DV-KL information was later used for MINE [1]. However, exponential or negative-exponential functions are generally symmetrical, while P(y_j|x) is usually asymmetrical. Thus, it is not easy to make the two information quantities equal.

…{(x_{t1}, x_{t2}, …; y_t) | t = 1, 2, …, N}, the probability of x_i in examples with label y_j is the membership grade M_Aj(x_i). The author later [17] proved that the membership function M_Aj(x) is a truth function and can be obtained from a regular sample, where an example only includes one instance.

…A_j is a fuzzy set, y_j = “x belongs to A_j” is a hypothesis, and Q(A_j|x) is the membership function of A_j and the (fuzzy) truth function of y_j. Q(A_j) is the logical probability of y_j, i.e., the probability of a fuzzy event as defined by Zadeh [53]. If there is always Q(A_j|x) = 1, the above semantic information formula becomes Carnap and Bar-Hillel’s semantic information formula.

…y_j = $\widehat{x}_j$, and employed the Gaussian function as the discrimination or similarity function. Later, he found that we could also measure natural language information by replacing the similarity function with the truth function. Since the statistical probability is used for averaging, this information measure can ensure that wrong hypotheses or estimates reduce semantic information.

…I_Lu(x; y_j), we obtain generalized Kullback–Leibler (KL) information (if we use P(x|A_j)), as proposed by Theil, and semantic KL information (if we use Q(A_j|x)), namely:

If Q(A_j|x) is expressed as exp[−d(x, y_j)], semantic KL information becomes DV-KL information:

…y_j = “Elderly” is asymmetric and can be described by a Logistic function rather than an exponential function. Therefore, DV-KL information is only a particular case of semantic KL information.
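To illustrate this flexibility, the following sketch computes semantic KL information for an asymmetric Logistic truth function of a label like “Elderly”; every distribution and parameter here is an assumption for illustration:

```python
import numpy as np

# Ages 0..99 and an assumed Logistic truth function for y_j = "Elderly".
x = np.arange(100)
T = 1.0 / (1.0 + np.exp(-0.4 * (x - 65)))        # T(theta_j | x), asymmetric

# Assumed prior over ages, and sampling distribution P(x | y_j) of people
# actually called "elderly" (both illustrative Gaussians).
P_x = np.exp(-0.5 * ((x - 45) / 18.0) ** 2); P_x /= P_x.sum()
P_x_given_yj = np.exp(-0.5 * ((x - 72) / 8.0) ** 2); P_x_given_yj /= P_x_given_yj.sum()

# Logical probability T(theta_j) = sum_x P(x) T(theta_j | x) (fuzzy-event probability).
T_j = (P_x * T).sum()

# Semantic KL information: I(X; theta_j) = sum_x P(x|y_j) log[T(theta_j|x)/T(theta_j)].
I = (P_x_given_yj * np.log(T / T_j)).sum()
print(I)   # positive: the label is mostly used where its truth value is high
```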

…I_Lu(x; y), to obtain the information rate-fidelity function R(G), which means the minimum ShMI for a given SeMI. The R(G) function reveals the matching relationship between ShMI and SeMI. Meanwhile, he defined the truth-value matrix as the generalized channel and proposed the idea of the mutual matching of two channels. He also studied how to compress image data according to the visual discrimination of colors and the R(G) function [15,16].

…y_j with θ_j. In 2020, he discussed how the P-T probability framework is related to Popper’s theory [17]. The θ represents not only a fuzzy set but also a set of model parameters. With the P-T probability framework, we can conveniently use the G measure for machine learning. The relationship between SeMI and several generalized entropies is:

…H(X|Y_θ) can be called the prediction entropy; it is also a cross-entropy. For the other two generalized entropies, the author suggests that we call H(Y_θ|X) the fuzzy entropy and H(Y_θ) the coverage entropy (see Section 5.4 for details).

…H(Y_θ|X) as the fuzzy entropy is more general. The reason is that if n labels become two complementary labels, y_1 and y_0, and P(y_1|x) and P(y_0|x) equal two membership functions T(θ_1|x) and T(θ_0|x), the fuzzy entropy H(Y_θ|X) degenerates into the DT fuzzy entropy.

…d(x, y_j) with −log T(θ_j|x) and the average distortion with H(Y_θ|X) [18]. In this way, the constraint is more flexible.
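The entropy decomposition stated above, I(X; Y_θ) = H(Y_θ) − H(Y_θ|X), can be verified on a toy example; the joint distribution and truth values below are illustrative assumptions:

```python
import numpy as np

# Illustrative discrete setup: instances x (rows) and labels y (columns).
P_xy = np.array([[0.30, 0.10],
                 [0.10, 0.20],
                 [0.05, 0.25]])
P_x = P_xy.sum(axis=1)
P_y = P_xy.sum(axis=0)

# An assumed semantic channel: truth values T(theta_j | x) in [0, 1].
T = np.array([[1.0, 0.2],
              [0.5, 0.7],
              [0.1, 1.0]])

T_j = P_x @ T                               # logical probabilities T(theta_j)

H_cov = -(P_y * np.log(T_j)).sum()          # coverage entropy H(Y_theta)
H_fuzzy = -(P_xy * np.log(T)).sum()         # fuzzy entropy H(Y_theta | X)
semi = (P_xy * np.log(T / T_j)).sum()       # SeMI I(X; Y_theta)

# The decomposition holds term by term because averaging log T(theta_j)
# over P(x, y_j) yields P(y_j) log T(theta_j).
assert abs(semi - (H_cov - H_fuzzy)) < 1e-12
```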

## 3. The Evolution of Learning Functions

#### 3.1. From Likelihood Functions to Similarity and Truth Functions

- P(x) is the prior distribution of instance x, representing the source. We use P_θ(x) to approximate it.
- P(y) is the prior distribution of label y, representing the destination. We use P_θ(y_j) to approximate it.
- P(x|y_j) is the posterior distribution of x. We use the likelihood function P(x|θ_j) = P(x|y_j, θ) to approximate it.
- P(y|x_i) is the posterior distribution of y. Since P(y|x_i) = P(y)P(x_i|y)/P(x_i), Bayesian Inference uses P(θ)P(x|y, θ)/P_θ(x) (the Bayesian posterior) [57] to approximate it.
- P(x, y) is the joint probability distribution. We use P(x, y|θ) to approximate it.
- m(x, y_j) = P(x, y_j)/[P(x)P(y_j)] is the relatedness function. What approximates it is m_θ(x, y_j). We call m_θ(x, y_j) the truthlikeness function, which varies between 0 and ∞.
- m(x, y_j)/max[m(x, y_j)] = P(y_j|x)/max[P(y_j|x)] is the relative relatedness function. We use the truth function T(θ_j|x) or the similarity function S(x, y_j) to approximate it.

…P(x|y_j) to make a probability prediction because it may be unsmooth or even intermittent. For this reason, Fisher [57] proposed using the smooth likelihood function P(x|θ_j) with parameters to approximate P(x|y_j) under the maximum likelihood criterion. Then we can use P(x|θ_j) to make a probability prediction.

…P(x|θ_j) after P(x) is changed. In addition, P(x|y_j) is often irregular and difficult to approximate by a function. Therefore, the inverse probability function P(θ_j|x) is used. With this function, when P(x) becomes P’(x), we can obtain a new likelihood function P’(x|θ_j) by using Bayes’ formula. In addition, we can also use P(θ_j|x) for classification with the maximum accuracy criterion.

…y_1 and y_0, we can use a pair of Logistic functions with parameters to approximate P(y_1|x) and P(y_0|x). However, this method also has two disadvantages:

- When the number of labels is n > 2, it is difficult to construct a set of inverse probability functions because P(θ_j|x) should be normalized for every x_i: $$\sum_{j} P(\theta_j|x_i) = 1, \quad \text{for } i = 1, 2, \dots$$
- Although P(y_j|x_i; i = j) indicates the accuracy for binary communication, when n > 2, P(y_j|x_i; i = j) may not, especially for semantic communication. For example, x represents an age, and y denotes one of three labels: y_0 = “Non-adult”, y_1 = “Adult”, and y_2 = “Youth”. If y_2 is rarely used, both P(y_2) and P(y_2|x) are tiny. However, the accuracy of using y_2 for x = 20 should be 1.

Like P(x|θ_j), P(x, y|θ) is not suitable for irregular or changeable P(x); the Bayesian posterior is also unsuitable. We need a learning function that is proportional to P(y_j|x) and independent of P(x) and P(y). The truth function and the similarity function are such functions.

…T(θ_j|x), instead of P(x|θ_j) or P(x, y|θ), as the learning function. A GPS pointer indicates an estimate y_j = $\widehat{x}_j$, whereas the real position x may be a little different. In this case, the similarity between x and $\widehat{x}_j$ is the truth value of $\widehat{x}_j$ as x happens. This explains why this similarity is often called semantic similarity [22].

…y_j does not point at the railway because it is slightly inaccurate.

$$T(\theta_j|x) = S(x, x_j) = \exp\left[-\frac{(x - x_j)^2}{2\sigma^2}\right],$$

where x_j is the position pointed to by y_j, and σ is the root-mean-square (RMS) error. For simplicity, we assume x is one-dimensional in the above equation. According to Equation (9), we can predict that the star represents the most probable position.
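A minimal sketch of this GPS prediction, assuming an illustrative road prior and pointer position (the grid, prior, and numbers are not from the paper):

```python
import numpy as np

# One-dimensional positions and an assumed prior P(x): the car is known
# to be on a road segment, so P(x) = 0 off that segment (illustrative).
x = np.arange(0, 100)
P_x = np.where((x >= 30) & (x <= 60), 1.0, 0.0)
P_x /= P_x.sum()

# Gaussian truth function of the GPS estimate: pointed position 62, RMS 5.
x_j, sigma = 62.0, 5.0
T = np.exp(-(x - x_j) ** 2 / (2 * sigma ** 2))

# Semantic Bayes prediction: P(x|theta_j) = P(x)T(theta_j|x) / sum_x P(x)T(theta_j|x).
P_post = P_x * T
P_post /= P_post.sum()

# The most probable position lies on the road even though the pointer (62) is off it.
print(x[np.argmax(P_post)])   # -> 60
```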

If we use P(x|θ_j) or P(x, y|θ_j) as the learning function and the predictive model, then after P(x) is changed, this function cannot make use of the new P(x), and hence the learning is not transferable. In addition, using the truth function, we can learn the system deviation and the precision of a GPS device from a sample [13]. However, using P(x|θ_j) or P(x, y|θ_j), we cannot do this because the deviation and precision are independent of P(x) and P(y). Consider a car with a GPS map on a road, where P(x) and P(x|θ_j) may be more complex and changeable. However, T(θ_j|x) can be an invariant Gaussian function.

…m(x, y_j)/max[m(x, y_j)].

#### 3.2. The Definitions of Semantic Similarity

…I(x_i; y_j) (see Equation (2)), as a similarity measure. Since PMI varies between −∞ and ∞, an improved PMI similarity between 0 and 1 was proposed [65]. We can also see that semantic similarity is defined with semantic distance. More semantic similarity and semantic distance measures can be found in [66]. Some researchers believe semantic similarity includes relatedness, while others believe the two are different [67]. For example, hamburgers and hot dogs are similar, while hamburgers and French fries are related. However, the author believes that because the relatedness of two objects reflects the similarity between their positions in space or time, it is practical and reasonable to regard two related objects as similar.

The truth function T(θ_j|x) proposed by the author can also be understood as the similarity function (between x and x_j). We can define it with distance or distortion and optimize it with a sampling distribution. The author uses T(θ_j|x) to approximate m(x, y_j)/max[m(x, y_j)]. When the sample is large enough, they are equal.

…f_d(x_i, y_j). Assuming that the maximum possible fidelity is f_dmax, we may define the distortion function d(x, y_j) = k[f_dmax − f_d(x, y_j)] and then use exp[−d(x, y_j)] as the similarity function. Nevertheless, with or without this conversion, the value of the SoftMax function or the truthlikeness function is unchanged.
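This invariance is easy to verify: the constant k·f_dmax cancels in the SoftMax normalization. The fidelity values below are assumed for illustration:

```python
import numpy as np

# Assumed fidelity values f_d(x_i, y_j) for one label over four instances.
f = np.array([0.9, 0.4, 0.7, 0.1])
k, f_max = 2.0, 1.0

# Distortion d = k (f_max - f), and the converted similarity exp(-d).
S_from_d = np.exp(-k * (f_max - f))

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# SoftMax over k*f agrees with the normalized converted similarity:
# exp(-k(f_max - f)) = exp(-k f_max) exp(k f), and the constant cancels.
assert np.allclose(softmax(k * f), S_from_d / S_from_d.sum())
```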

#### 3.3. Similarity Functions and Semantic Information Measures Used for Deep Learning

…P(θ_j|x) for a given P(x|y_j), we need P(x) because P(y_j|x)/P(y_j) = P(x|y_j)/P(x). Without P(x), we can create counterexamples by noise and get P(x); then we can optimize P(θ_j|x). Like Binary Relevance, this method converts a multi-label learning task into n single-label learning tasks.

When T(θ_j|x) ∝ P(x|y_j)/P(x) = m(x, y) or T(θ_j|x) ∝ P(y_j|x), SeMI reaches its maximum and equals ShMI. T(θ_j|x) is the longitudinal normalization of the parameterized m(x, y). When the sample is large enough, we can obtain the optimized truth function from the sampling distribution:
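A sketch of such an estimate (the counts below are illustrative); it also checks that longitudinally normalizing m(x, y_j) and normalizing P(y_j|x) give the same truth function:

```python
import numpy as np

# Illustrative labeled sample: counts N(x_i, y_j).
N_xy = np.array([[40.0,  2.0],
                 [25.0, 15.0],
                 [ 5.0, 38.0]])
P_xy = N_xy / N_xy.sum()
P_x = P_xy.sum(axis=1)
P_y = P_xy.sum(axis=0)

# Transition probabilities P(y_j | x) and relatedness m(x, y_j).
P_y_given_x = P_xy / P_x[:, None]
m = P_xy / np.outer(P_x, P_y)

# Optimized truth function: the longitudinal (per-label) normalization,
# T*(theta_j | x) = m(x, y_j)/max_x m(x, y_j) = P(y_j | x)/max_x P(y_j | x).
T1 = m / m.max(axis=0, keepdims=True)
T2 = P_y_given_x / P_y_given_x.max(axis=0, keepdims=True)

assert np.allclose(T1, T2)       # P(y_j) cancels in the column-wise division
print(T1.max(axis=0))            # every column's maximum is 1
```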

…T(θ_j|x), j = 1, 2, …. In [12,13], he developed a group of Channels Matching algorithms for solving multi-label classification, the MaxMI classification of unseen instances, and mixture models. He also proved the convergence of mixture models [73] and derived new Bayesian confirmation and causal confirmation measures [74,75]. However, the author has not yet provided applications of the semantic ITL method to neural networks.

Since f_w(x, y_j) is non-negative, it can be understood as a fidelity function.

…P(x|y_j)/P(x). The expression in their paper is:

…c_t is the feature vector obtained from previous data, x_{t+k} is the predicted vector, and f_k(x_{t+k}, c_t) is a similarity function (between the predicted x_{t+k} and the real x_{t+k}) used to construct EMI. The n pairs of Logistic functions in noise-contrastive learning become n SoftMax functions, which can be directly used for multi-label learning. However, f_k(x_{t+k}, c_t) is not limited to exponential or negative-exponential functions, unlike the learning function in MINE. Therefore, the author believes that it is more flexible to use a function such as a membership function as the similarity function.

- A function proportional to m(x, y_j) is used as the learning function (denoted as S(x, y_j)); its maximum is generally 1, and its average is the partition function Z_j.
- The semantic or estimated information between x and y_j is log[S(x, y_j)/Z_j].
- The statistical probability distribution P(x, y) is used for averaging.
- The semantic or estimated mutual information can be expressed as the coverage entropy minus the fuzzy entropy, and the fuzzy entropy equals the average distortion.

## 4. The Semantic Information G Theory and Its Applications to Machine Learning

#### 4.1. The P-T Probability Framework and the Semantic Information G Measure

Let X be a random variable representing an instance; it takes a value x ∈ U = {x_1, x_2, …}. Let Y be a random variable representing a label or hypothesis; it takes a value y ∈ V = {y_1, y_2, …}. A set of transition probability functions P(y_j|x) (j = 1, 2, …) represents a Shannon channel, whereas a set of truth functions T(y_j|x) (j = 1, 2, …) denotes a semantic channel.

Suppose that the x values making y_j true form a fuzzy subset θ_j (i.e., y_j = “x is in θ_j”). Then the membership function of x in θ_j, denoted as T(θ_j|x), is the truth function T(y_j|x) of the proposition function y_j(x); that is, T(θ_j|x) = T(y_j|x). The logical probability of y_j is the probability of a fuzzy event defined by Zadeh [45] as:

If y_j is true, the predicted probability of x is

$$P(x|\theta_j) = \frac{P(x)T(\theta_j|x)}{T(\theta_j)}.$$

The θ_j can also be regarded as the model parameters so that P(x|θ_j) is a likelihood function.

Let y_j = $\widehat{x}_j$ = “x is about x_j.” Then T(θ_j|x) can be understood as the confusion probability or similarity between x and x_j. For given P(x) and P(x|θ_j), supposing the maximum of T(θ_j|x) is 1, we can derive [13]:

…the maximum of T(θ_j|x) is 1 (over different x), and T(θ_1) + T(θ_2) + … > 1. In contrast, P(y_0|x) + P(y_1|x) + … = 1 for every x, and P(y_1) + P(y_2) + … = 1.

…y_j about x_i is:

$$I(x_i; \theta_j) = \log\frac{T(\theta_j|x_i)}{T(\theta_j)}.$$

Averaging I(x_i; θ_j) over different x, we obtain the semantic KL information I(X; θ_j) (see Equation (19)). Averaging I(X; θ_j) over different y, we get the SeMI I(X; Y_θ) (see Equation (21)).

#### 4.2. Optimizing Truth Functions and Making Probability Predictions

A set of truth functions, T(θ_j|x) (j = 1, 2, …), constitutes a semantic channel, just as a set of transition probability functions, P(y_j|x) (j = 1, 2, …), forms a Shannon channel. When the semantic channel matches the Shannon channel, that is, T(θ_j|x) ∝ P(y_j|x) ∝ P(x|y_j)/P(x), or P(x|θ_j) = P(x|y_j), the semantic KL information and the SeMI reach their maxima. If the sample is large enough, we have:

$$T^*(\theta_j|x) = \frac{m(x, y_j)}{mm_j} = \frac{P(y_j|x)}{\max[P(y_j|x)]},$$

where mm_j is the maximum of the function m(x, y_j) (over different x). The author has proved in [17] that the above formula is compatible with Wang’s Random Set Falling Shadow theory [51]. We can think that T(θ_j|x) in Equation (34) comes from a Random Point Falling Shadow.

We call m(x, y_j) the relatedness function, which varies between 0 and ∞. Note that relatedness functions are symmetric, whereas truth functions or membership functions are generally asymmetric, i.e., T(θ_xi|y_j) ≠ T(θ_j|x_i) (θ_xi means x_i and related parameters). The reason is that mm_i = max[m(x_i, y)] is not necessarily equal to mm_j = max[m(x, y_j)]. If we replace mm_j and mm_i with the maximum of the matrix m(x, y), the truth function is also symmetrical, like the distortion function. In that case, the maximum of a truth function may be less than 1, so it is not convenient to use a negative exponential function to express a similarity function. Nevertheless, the similarity function S(x, $\widehat{x}_j$) between different instances should be symmetrical, i.e., S(x_j, $\widehat{x}_i$) = S(x_i, $\widehat{x}_j$). The truth function expressed as exp[−d(x, y)] should also be symmetrical if d(x, y) is symmetrical.

…m(x, y_j) or m(x, y) is an essential function. From the perspective of calculation, P(x, y) exists before m(x, y); but from a philosophical standpoint, m(x, y) exists before P(x, y). Therefore, we use m_θ(x, y) to approximate m(x, y) and call m_θ(x, y) the truthlikeness function.

…m_θ(x, y) is similar. Unfortunately, it is difficult for the human brain to remember truthlikeness functions, whereas it is easier to remember truth functions. Therefore, we need T(y_j|x) or S(x, y_j). With the truth or similarity function, we can also make probability predictions when P(x) is changed (see Equation (31)).

#### 4.3. The Information Rate-Fidelity Function R(G)

…d(x, y_j) with the semantic information I(x; θ_j) and replace the upper limit D of the average distortion $\overline{d}$ with the lower limit G of the SeMI. Then R(D) becomes the information rate-fidelity function R(G) [13,18] (see Figure 4). Finally, following the deduction for R(D), we obtain the R(G) function with parameter s:

#### 4.4. Channels Matching Algorithms for Machine Learning

#### 4.4.1. For Multi-Label Learning

…T(θ_j|x) = P(y_j|x)/max[P(y_j|x)], j = 1, 2, …; otherwise, we can use the semantic KL information formula to optimize T(θ_j|x) (see Equation (35)).

#### 4.4.2. For the MaxMI Classification of Unseen Instances

Let C_j be a subset of C and y_j = f(z|z ∈ C_j); hence S = {C_1, C_2, …} is a partition of C. Our task is to find the optimized S, which is

**Matching I**: Let the semantic channel match the Shannon channel and set the reward function. First, for a given S, we obtain the Shannon channel:

…P(x|θ_j) (or m_θ(x, y) = m(x, y)). Then we have I(x_i; θ_j). For a given z, we have the conditional information as the reward function:

**Matching II:** Let the Shannon channel match the semantic channel by the classifier:

We repeat **Matching I** and **Matching II** until S does not change. Then, the convergent S is the S* we seek. The author has explained the convergence with the R(G) function (see Section 3.3 in [13]).

#### 4.4.3. Explaining and Improving the EM Algorithm for Mixture Models

…P(x) = ∑_j P(y_j)P(x|y_j). For a given sampling distribution P(x), we use the mixture model P_θ(x) = ∑_j P(y_j)P(x|θ_j) to approximate it, making the relative entropy H(P‖P_θ) close to 0. After setting the initial P(x|θ_j) and P(y_j), j = 1, 2, …, we do the following iterations, each of which includes two matching steps (for details, see [13,73]):

**Matching 1:** Let the Shannon channel P(y|x) match the semantic channel by repeating the following two formulas n times:

…P(y_j) increases or decreases R; the other steps only reduce R.

**Matching 2**: Let the semantic channel match the Shannon channel to maximize G by letting

…H(P‖P_θ) cannot be improved.

…H(P^{+1}(y)‖P(y)), H(P‖P_θ) can approach 0.

…P^{+1}(y) ≈ P(y). The EnM algorithm can perform better than the EM algorithm in most cases. Moreover, the convergence proof can help us avoid blind improvements.

…(µ_1, µ_2, σ_1, σ_2, P(y_1)) = (100, 125, 10, 10, 0.7).

…(µ_1, µ_2, σ_1, σ_2, P(y_1)) = (105, 120, 5, 5, 0.5), according to the fair competition principle [73]. In that case, the EM algorithm needs about four iterations, whereas the E3M algorithm needs about three iterations, on average. Ref. [73] provides an initialization map, which can tell whether a pair of initial means (µ_1, µ_2) is good.
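The two matching steps can be sketched as EM run directly on the sampling distribution P(x) rather than on raw samples, which makes the iteration deterministic; the grid range and iteration count below are implementation assumptions, and the parameters follow the example above:

```python
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Sampling distribution P(x) produced by the true parameters (100, 125, 10, 10, 0.7).
x = np.arange(40, 186, 0.5)
P_x = 0.7 * gauss(x, 100, 10) + 0.3 * gauss(x, 125, 10)
P_x /= P_x.sum()

# Initial parameters (105, 120, 5, 5, 0.5).
mu = np.array([105.0, 120.0]); sigma = np.array([5.0, 5.0]); Py = np.array([0.5, 0.5])

for _ in range(500):
    # Matching 1 (E-step): Shannon channel P(y|x) from the current likelihoods.
    lik = np.stack([Py[j] * gauss(x, mu[j], sigma[j]) for j in range(2)])
    r = lik / lik.sum(axis=0, keepdims=True)      # responsibilities P(y_j | x)
    # Matching 2 (M-step): refit P(y_j), mu_j, sigma_j against P(x).
    w = r * P_x
    Py = w.sum(axis=1)
    mu = (w * x).sum(axis=1) / Py
    sigma = np.sqrt((w * (x - mu[:, None]) ** 2).sum(axis=1) / Py)

print(mu, sigma, Py)   # approaches (100, 125), (10, 10), (0.7, 0.3)
```

Because the model family contains the target density, the true parameters are recovered up to grid discretization.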

## 5. Discussion 1: Clarifying Some Questions

#### 5.1. Is Mutual Information Maximization a Good Objective?

…m_θ(x, y_j), j = 1, 2, …, to maximize ShMI and SeMI simultaneously. According to the R(G) function, the classification makes s → ∞, so that P(y_j|x) = 1 or 0 (see Equation (38)), and both R and G reach the highest point on the right side of the R(G) curve (see Figure 4). When s increases from 1, the information efficiency G/R decreases. Sometimes, we have to balance maximizing G against maximizing G/R.

…y_1 = “children”, y_2 = “youth”, y_3 = “adult”, and y_4 = “elderly”, according to their ages x. ShMI reaches its maximum when the classification makes P(y_1) = P(y_2) = P(y_3) = P(y_4) = 1/4. However, this is not our purpose. In addition, even if we exchange the labels of the two age groups {children} and {elderly}, ShMI does not change; however, if labels are misused, the SeMI will be greatly reduced or become negative. Therefore, it is problematic to maximize ShMI alone. In contrast, SeMI maximization can ensure less distortion and more predictive information; it is equivalent to likelihood maximization and compatible with the RLS criterion. In addition, ShMI maximization is not a good objective because using Shannon’s posterior entropy H(Y|X) as the loss function is improper (see Section 5.4).

#### 5.2. Interpreting DNNs: The R(G) Function vs. the Information Bottleneck

…I(X; Y_θ). Then we decode Y into $\widehat{X}$ and fine-tune the network parameters between X and $\widehat{X}$ to minimize the loss with the RLS criterion, like maximizing the EMI I(X; $\widehat{X}_θ$).

…G_max by reducing the distortion between X and $\widehat{X}$.

If T(θ_j|x) = exp[−s·d(x, y_j)] (s > 0), increasing s narrows the coverage of the truth function (or the membership function of the fuzzy set θ_j). If both x and x_j are elements of the fuzzy set θ_j, the average distortion between them will also be reduced. Whether increasing s is enough needs to be tested.

…X → T_1 → … → T_m → Y, such as a DBN, we need to minimize I(T_{i−1}; T_i) − βI(Y; T_{i−1}|T_i) (i = 1, 2, …, m), like solving an R(D) function. This idea is very inspiring. However, the author believes that every latent layer T_i needs its own SeMI maximization and ShMI minimization. The reasons are:

- DNNs often need pre-training and fine-tuning. In the pre-training stage, the RBM is used for every latent layer.

#### 5.3. Understanding Gibbs Distributions, Partition Functions, MinMI Matching, and RBMs

…exp[−E_i/(kT)] as the truth function. For machine learning, to predict the posterior probability of x_i, we have:

…T(θ_j) is the partition function Z. If we put T(θ_j|x) = m(x, y_j)/mm_j into T(θ_j), there is

…P(x|θ_j) = P(x|y_j). We can see that the summation for T(θ_j) only gives T(θ_j) = 1/mm_j, so that the 1/mm_j in the numerator and the denominator of the Gibbs distribution cancel simultaneously. Then the distribution P(x|θ_j) approximates P(x)m(x, y_j) = P(x|y_j). It does not matter how big mm_j is.
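The cancellation of mm_j can be checked directly; the distributions below are illustrative:

```python
import numpy as np

# Illustrative prior and conditional distribution for one label y_j.
P_x = np.array([0.5, 0.3, 0.2])
P_x_given_yj = np.array([0.1, 0.3, 0.6])

m = P_x_given_yj / P_x                  # relatedness m(x, y_j)

def semantic_bayes(prior, T):
    """P(x|theta_j) = P(x) T(theta_j|x) / sum_x P(x) T(theta_j|x)."""
    p = prior * T
    return p / p.sum()

# Whatever constant we divide m by (mm_j or anything else),
# the normalization in the Gibbs-style prediction cancels it,
# and the prediction recovers P(x | y_j).
for c in (m.max(), 1.0, 7.3):
    assert np.allclose(semantic_bayes(P_x, m / c), P_x_given_yj)
```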

…P(x|θ_j) and the probability prediction, such as P(x, y|θ_j) = exp[−d(x, y_j)] and

…m_θ(x, y_j) in the semantic information method, but they are different. In the semantic information method,

…P(y_k).

…P(y_j|x) = kT(y_j|x) for MinMI matching and then obtain P(y_j) = kT(y_j), with k = 1/∑_j T(y_j). Hence,

…T(θ_j|x) = m(x, y_j)/mm_j, and then use Equation (53) to get P(y).

…{a_i, w_ij, b_j | i = 1, 2, …, n; j = 1, 2, …, m}. The parameters a_i, w_ij, and b_j are associated with P(v_i), m_θ(v_i, h_j), and P(h_j), respectively. Optimizing {w_ij} (the weights) improves m_θ(v, h) and maximizes the SeMI, and optimizing {b_j} improves P(h) and minimizes the ShMI. Alternate optimization makes the SeMI and the ShMI close to each other.

#### 5.4. Understanding Fuzzy Entropy, Coverage Entropy, Distortion, and Loss

…H(Y_θ). We take age x and the related classification label y as an example to explain the coverage entropy.

…y_1 = “Children”, y_2 = “Youth”, y_3 = “Middle-aged people”, and y_4 = “Elderly”. The four subsets constitute a partition of U. We divide U again into two subsets according to whether x ≥ 18 and add the labels y_5 = “Adult” and y_6 = “Non-adult”. Then the six subsets constitute a coverage of U.

H(Y_θ) represents the MinMI of decoding y to generate P(x) for given P(y). The constraint condition is that P(x|y_j) = 0 for x ∉ θ_j. Some researchers call this MinMI the complexity distortion [79]. A simple proof method is to make use of the R(D) function. We define the distortion function:

…P(x|y_j) ≤ P(x|θ_j) for T(θ_j|x) < 1; then the minimum ShMI equals the coverage entropy minus the fuzzy entropy, that is, R = I(X; Y) = H(Y_θ) − H(Y_θ|X) [18].

…T(θ_j|x) = exp[−d(x, y_j)] and the logical probability T(θ_j) = Z_j. Then, letting P(y_j|x) = kT(θ_j|x), we can get P(y_j|x) = exp[−d(x, y_j)]/∑_j T(θ_j). Furthermore, we have I(X; Y) = I(X; Y_θ) = H(Y_θ) − $\overline{d}$, which means that the minimum ShMI for a given distortion limit can be expressed as the SeMI, and the SeMI decreases as the average distortion increases.

…−log P(y|x_i) as the distortion function. This usage shows a larger distortion when P(y|x_i) is much less than 1. However, from the perspective of semantic communication, a smaller P(y|x_i) does not mean a larger distortion. For example, with ages, when the above six labels are used on some occasions, y_6 = “Non-adult” is rarely used, so P(y_6|x < 18) is very small, and hence −log P(y_6|x < 18) is very large. However, there is no distortion because T(θ_6|x < 18) = 1. Therefore, using H(Y|X) to express distortion or loss is often unreasonable. In addition, it is easy to understand that for a given x_i, the distortion of a label has nothing to do with the frequency with which the label is selected, whereas P(y|x_i) is related to P(y).

Using D_KL to represent distortion or loss [27] has a similar problem, whereas using semantic KL information to represent the negative loss is appropriate.
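A one-line arithmetic check of this point (the label frequency is an assumed illustration):

```python
import math

# For x < 18, y_6 = "Non-adult" is always true but rarely chosen.
P_y6_given_x = 0.01       # P(y_6 | x < 18): small because the label is rarely used
T_theta6_given_x = 1.0    # T(theta_6 | x < 18): the label is perfectly true

shannon_loss = -math.log(P_y6_given_x)       # about 4.6: rarity mistaken for distortion
semantic_loss = -math.log(T_theta6_given_x)  # 0.0: no semantic distortion

print(shannon_loss, semantic_loss)
```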

#### 5.5. Evaluating Learning Methods: The Information Criterion or the Accuracy Criterion?

## 6. Discussion 2: Exploring New Methods for Machine Learning

#### 6.1. Optimizing Gaussian Semantic Channels with Shannon’s Channels

…P(y_j|x) is proportional to a Gaussian function. Then, P(y_j|x)/∑_k P(y_j|x_k) is a Gaussian distribution. We can assume P(x) = 1/|U| (|U| is the number of elements in U) and then use P(y_j|x) to optimize T(θ_j|x) by maximizing the semantic KL information:

I(X; θ_j) reaches its maximum as T(θ_j|x) ∝ P(y_j|x), which means that we can use the expectation μ_j and the standard deviation σ_j of P(y_j|x) as those of T(θ_j|x).

For given P(x|y_j) and P(x), we can replace P(y_j|x) with m(x, y_j) = P(x|y_j)/P(x) and then optimize the Gaussian truth function. However, this method requires that no x makes P(x) = 0. For this reason, we need to replace P(x) with an uninterrupted (zero-free) distribution close to P(x).
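A sketch of this moment-matching optimization on a grid (μ, σ, the grid, and the proportionality constant are illustrative assumptions):

```python
import numpy as np

# Grid of instances and an assumed Shannon-channel column P(y_j | x)
# proportional to a Gaussian with mu = 30 and sigma = 6.
x = np.arange(0, 61, 0.1)
P_yj_given_x = 0.8 * np.exp(-(x - 30) ** 2 / (2 * 6.0 ** 2))

# With P(x) = 1/|U|, normalizing P(y_j | x) over x gives a Gaussian distribution;
# its expectation and standard deviation serve as those of T(theta_j | x).
w = P_yj_given_x / P_yj_given_x.sum()
mu_j = (w * x).sum()
sigma_j = np.sqrt((w * (x - mu_j) ** 2).sum())

# Optimized Gaussian truth function (maximum 1 at x = mu_j).
T = np.exp(-(x - mu_j) ** 2 / (2 * sigma_j ** 2))

print(round(mu_j, 2), round(sigma_j, 2))   # -> 30.0 6.0 (up to grid truncation)
```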

#### 6.2. The Gaussian Channel Mixture Model and the Channel Mixture Model Machine

…**Matching 1** and **Matching 2** become:

**Matching 1:** Let the Shannon channel match the semantic channel by using P(x|θ_j) = P(x)T(θ_j|x)/T(θ_j) and repeating Equation (44) n times.

**Matching 2**: Let the semantic channel match the Shannon channel by letting:

…of P(y_j|x) or P(x)T(θ_j|x)/P_θ(x) as those of T(θ_j^{+1}|x).

…T(θ_j|x_i) is used as the weight w_ji. The input may be x_i, a vector **x**, or a distribution P(**x**), for which we use different methods to obtain the same T(θ_j).

If b_j = P(y_j) − T(θ_j) ≤ 0 and f is a ReLU function, the two neurons will be equivalent.

Note that replacing T(θ_j|x) with cT(θ_j|x) does not change P(x|θ_j).

#### 6.3. Calculating the Similarity between any Two Words with Sampling Distributions

- It is simple without needing the semantic structure such as that in WordNet;
- This similarity is similar to the improved PMI similarity [65], which varies between 0 and 1;
- This similarity function is suitable for probability predictions.

Such a similarity function is like P_θ(x, y) and suitable for classification, whereas S(x, y_j) defined above is suitable for probability predictions. We can convert S(x, y_j) with P(x) into a truthlikeness function for classification.
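A hypothetical sketch in the spirit of the improved PMI similarity mentioned above: the exact definition of S(x, y_j) appears earlier in the paper, so the normalization used here (normalized PMI clipped to [0, 1]) is an assumption rather than the paper's formula.

```python
import numpy as np

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information log2 [P(x, y) / (P(x)P(y))]."""
    return np.log2(p_xy / (p_x * p_y))

def similarity(p_xy, p_x, p_y):
    """A similarity in [0, 1] from sampling distributions (an assumed variant).

    Normalized PMI equals 1 when the two words always co-occur and
    0 (after clipping) when they are independent or repel each other.
    """
    npmi = pmi(p_xy, p_x, p_y) / (-np.log2(p_xy))
    return max(0.0, npmi)

# perfectly co-occurring words: P(x, y) = P(x) = P(y) = 0.1
s_max = similarity(0.1, 0.1, 0.1)
# independent words: P(x, y) = P(x)P(y)
s_min = similarity(0.01, 0.1, 0.1)
```

The inputs are plain relative frequencies, so the measure can be read directly off a co-occurrence (sampling) distribution without any semantic structure.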

#### 6.4. Purposive Information and the Information Value as Reward Functions for Reinforcement Learning

The Shannon mutual information R (or the KL information I_KL) represents the necessary control complexity. Given distortion D, the smaller the mutual information R, the higher the control efficiency. When we use the G theory for constraint control and reinforcement learning, the SeMI measure can be used as the reward function; then, G/R represents the control efficiency. We can explain reinforcement learning as being like driving a car to a destination. We need to consider:

- Choosing an action a to reach the destination;
- Learning the system state’s change from P(x) to P(x|a);
- Setting the reward function, which is a function of the goal, P(x), and P(x|a).

- How to get P(x|a) or P(x|a, h) (h means the history)?
- How to choose an action a according to the system state and the reward function?
- How to achieve the goal economically, that is, to balance the reward maximization and the control-efficiency maximization?

Consider an example with the goal y_j = “The death ages of the population had better not be less than 60 years old”. Suppose that the control method is to improve medical conditions. In this case, the Shannon information represents the control cost, and the semantic information indicates the control effect. Although we can raise the average death age to 80 at a higher cost, it is uneconomical in terms of control efficiency. There is enough purposive information when the average death age reaches about 62. The author discusses this example in [80].

The truth function is T(θ_j|x) = 1/[1 + exp(−0.8(x − 60))] (see Figure 9). The prior distribution P(x) is normal (μ = 50 and σ = 10), and the control result P(x|a_j) of a medical condition a_j is also normal. The purposive information, or the reward function, of a_j is

I(X; θ_j|a_j) = ∑_i P(x_i|a_j) log [T(θ_j|x_i)/T(θ_j)].

We can make P(x|a_j) approximate to P(x|θ_j) by changing the μ_j and σ_j of P(x|a_j) to get the minimum KL information I(X; a_j), so that the information efficiency I(X; θ_j|a_j)/I(X; a_j) is close to 1. In addition, we can increase s in the following formula to increase both SeMI and ShMI:

P(x|θ_j, s) = P(x)T(θ_j|x)^s / ∑_k P(x_k)T(θ_j|x_k)^s.

When the control result is P(x|a_j) = P(x|θ_j, s) for every a_j, SeMI equals I(X; θ_j|a_j), and ShMI equals I(X; a_j). The results indicate that the highest information efficiency G/R is 0.95 when P(x|a_j) approximates to P(x|θ_j). When s increases from 1 to 20, the purposive information G increases from 2.08 bits to 3.13 bits, but the information efficiency G/R decreases from 0.95 to 0.8. To balance G and G/R, we should select s between 5 and 15.
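The death-age example can be reproduced numerically. The sketch below is an assumed discretization (the grid, its range, and the particular control distribution are my choices): it computes the purposive information G and approximates the control complexity R by the KL information, and G ≤ R always holds because R − G equals the KL divergence between P(x|a_j) and P(x|θ_j).

```python
import numpy as np

x = np.linspace(0.0, 120.0, 1201)          # discretized death ages

def normalize(p):
    return p / p.sum()

T = 1.0 / (1.0 + np.exp(-0.8 * (x - 60.0)))                  # truth function T(theta_j|x)
p_x = normalize(np.exp(-(x - 50.0) ** 2 / (2 * 10.0 ** 2)))  # prior P(x): normal(50, 10)
T_theta = np.sum(p_x * T)                                    # logical probability T(theta_j)

def purposive_information(p_x_given_a):
    """G = sum_i P(x_i|a_j) log2 [T(theta_j|x_i)/T(theta_j)] (bits)."""
    return np.sum(p_x_given_a * np.log2(T / T_theta))

def control_complexity(p_x_given_a):
    """R approximated by sum_i P(x_i|a_j) log2 [P(x_i|a_j)/P(x_i)] (bits)."""
    return np.sum(p_x_given_a * np.log2(p_x_given_a / p_x))

# a control result that moves the mean death age to about 62 (assumed sigma)
p_x_given_a = normalize(np.exp(-(x - 62.0) ** 2 / (2 * 3.0 ** 2)))
G = purposive_information(p_x_given_a)
R = control_complexity(p_x_given_a)
```

Varying the mean and standard deviation of P(x|a_j) here and comparing G/R reproduces the trade-off between reward maximization and control-efficiency maximization discussed above.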

We can let T(θ_j) = P(y_j), j = 1, 2, …, according to Equation (38).

H_v(**R**, **q**|θ_j) is the value-added entropy, where **q** is the vector of portfolio ratios, R_i(**q**) is the return of the portfolio with **q** when the i-th price vector **x**_i appears, and **R** is the return vector. V(**X**; θ_j) is the predictive information value, where **X** is a random variable taking **x** as its value, **q*** is the optimized vector of portfolio ratios according to the prior distribution P(**x**), and **q**** is that according to the prediction P(**x**|θ_j). V(**X**; θ_j) can be used as a reward function to optimize probability predictions and decisions. However, to calculate the actual information value (in the learning stage), we need to replace P(**x**|θ_j) with P(**x**|y_j) in Equation (63).

If a security has yield rate r_1 < 0 with probability P_1 and yield rate r_2 > 0 with probability P_2, the optimized investment ratio is

q* = E/|r_1 r_2|, where E = P_1 r_1 + P_2 r_2.

If r_1 = −1, we have q* = P_2 − P_1/r_2, which is the famous Kelly formula. The above formula assumes that the risk-free rate r_0 = 0; otherwise, there is

q* = E_0 R_0/|r_10 r_20|, where E_0 = E − r_0, r_10 = r_1 − r_0, and r_20 = r_2 − r_0.

V(**X**; θ_j) is only the information value in particular cases. Information values in other situations need to be further explored.
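A quick check, with assumed numbers, that the general ratio formula reduces to the Kelly formula when r_1 = −1, and that setting r_0 = 0 in the risk-free-rate version recovers the simpler one:

```python
def optimal_ratio(p1, r1, p2, r2):
    """q* = E/|r1*r2| with E = p1*r1 + p2*r2 (risk-free rate r0 = 0)."""
    expected_yield = p1 * r1 + p2 * r2
    return expected_yield / abs(r1 * r2)

def kelly(p1, p2, r2):
    """Kelly formula q* = p2 - p1/r2 for the special case r1 = -1."""
    return p2 - p1 / r2

def optimal_ratio_r0(p1, r1, p2, r2, r0):
    """General form q* = E0*R0/|r10*r20| with E0 = E - r0, R0 = 1 + r0."""
    e0 = p1 * r1 + p2 * r2 - r0
    return e0 * (1 + r0) / abs((r1 - r0) * (r2 - r0))

# example: lose everything with probability 0.4, double the stake otherwise
q1 = optimal_ratio(0.4, -1.0, 0.6, 2.0)
q2 = kelly(0.4, 0.6, 2.0)
q3 = optimal_ratio_r0(0.4, -1.0, 0.6, 2.0, 0.0)
```

The probabilities and yield rates above are illustrative only; the algebraic agreement of the three functions is the point.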

#### 6.5. The Limitations of the Semantic Information G Theory

## 7. Conclusions

The similarity functions are like m(x, y_j) = P(x|y_j)/P(x) = P(x, y_j)/[P(x)P(y_j)], and their maximum is one. Compared with the likelihood function and the anti-probability function (often expressed as the logistic function), the similarity function is independent of the prior probability distributions P(x) and P(y), has good transferability, and is more suitable for multi-label learning.

The G theory provides the SeMI measure (defined with P_θ(x, y)) and two generalized entropies (the fuzzy entropy H(Y_θ|X) and the coverage entropy H(Y_θ)). Moreover, it has interpreted the Gibbs distribution as the semantic Bayes' prediction, the partition function as the logical probability, and the energy function as the distortion function.


## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Abbreviations

| Abbreviation | Full name |
|---|---|
| BOYL | Bootstrap Your Own Latent |
| DIM | Deep InfoMax (Information Maximization) |
| DNN | Deep Neural Network |
| DT | De Luca–Termini |
| DV | Donsker–Varadhan |
| EM | Expectation–Maximization |
| EMI | Estimated Mutual Information |
| EnM | Expectation-n-Maximization |
| GCMM | Gaussian Channel Mixture Model |
| CMMM | Channel Mixture Model Machine |
| GPS | Global Positioning System |
| G theory | Semantic information G theory (G means generalization) |
| InfoNCE | Information Noise-Contrastive Estimation |
| ITL | Information-Theoretic Learning |
| KL | Kullback–Leibler |
| LSA | Latent Semantic Analysis |
| MaxMI | Maximum Mutual Information |
| MinMI | Minimum Mutual Information |
| MINE | Mutual Information Neural Estimation |
| MoCo | Momentum Contrast |
| PMI | Pointwise Mutual Information |
| SeMI | Semantic Mutual Information |
| ShMI | Shannon's Mutual Information |
| SimCLR | A simple framework for contrastive learning of visual representations |

## Appendix A. About Formal Semantic Meaning

## Appendix B. The Definitions of the Value-Added Entropy and the Information Value

Let **x** be the price vector of a group of securities, and let its i-th instance be **x**_i = (x_i1, x_i2, …), i = 1, 2, …. The current price vector is **x**_0 = (x_01, x_02, …), and the return vector is **R**_i = (R_0, x_i1/x_01, x_i2/x_02, …), where R_0 = 1 + r_0 and r_0 is the risk-free rate. The vector of portfolio ratios is **q** = (q_0, q_1, q_2, …). The return on investment is a function of **q**, namely

R_i(**q**) = **q**·**R**_i = q_0 R_0 + q_1 x_i1/x_01 + q_2 x_i2/x_02 + ….

Given P(**x**|θ_j), the value-added entropy (that is, the expected doubling rate if the log is log_2) is

H_v(**R**, **q**|θ_j) = ∑_i P(**x**_i|θ_j) log R_i(**q**).

**q** represents the decision; it is also a learning function. We can increase the value-added entropy by optimizing **q**. The predictive information value is

V(**X**; θ_j) = H_v(**R**, **q****|θ_j) − H_v(**R**, **q***|θ_j),

where **q*** is the optimized vector of portfolio ratios according to the prior distribution P(**x**), and **q**** is that according to the prediction P(**x**|θ_j). The predictive information value can be used as a reward function to optimize probability predictions and decisions. To calculate the actual information value (in the learning stage), we need to replace P(**x**|θ_j) with P(**x**|y_j).
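The definitions above can be sketched numerically. In the toy setup below, the two price states, the single risky security, and the grid search (a stand-in for analytic optimization, with no shorting or leverage) are all assumptions; V is the gain in value-added entropy from deciding with the prediction instead of the prior.

```python
import numpy as np

# return vectors R_i = (R0, x_i1/x_01) with r0 = 0, so R0 = 1
R = np.array([[1.0, 0.5],    # state 1: the security halves
              [1.0, 1.8]])   # state 2: the security gains 80%

def value_added_entropy(p, q):
    """H_v = sum_i P(x_i) log2 R_i(q), with R_i(q) = q . R_i."""
    return np.sum(p * np.log2(R @ q))

def best_q(p):
    """Grid search for q = (1 - q1, q1) maximizing H_v under distribution p."""
    grid = np.linspace(0.0, 1.0, 1001)
    vals = [value_added_entropy(p, np.array([1.0 - g, g])) for g in grid]
    g = grid[int(np.argmax(vals))]
    return np.array([1.0 - g, g])

prior = np.array([0.5, 0.5])            # P(x)
pred = np.array([0.2, 0.8])             # prediction P(x|theta_j)
q_star = best_q(prior)                  # q*  optimized with the prior
q_sstar = best_q(pred)                  # q** optimized with the prediction
# predictive information value V = H_v(R, q**|theta_j) - H_v(R, q*|theta_j)
V = value_added_entropy(pred, q_sstar) - value_added_entropy(pred, q_star)
```

With the prior, the optimal risky ratio is q* = E/|r_1 r_2| = (0.5·(−0.5) + 0.5·0.8)/0.4 = 0.375, matching the formula in Section 6.4's portfolio discussion; a sharper prediction shifts the portfolio and yields V > 0.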

## References

1. Belghazi, M.I.; Baratin, A.; Rajeswar, S.; Ozair, S.; Bengio, Y.; Courville, A.; Hjelm, R.D. MINE: Mutual information neural estimation. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1–44.
2. Oord, A.V.D.; Li, Y.; Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv **2018**, arXiv:1807.03748.
3. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. **1948**, 27, 379–429, 623–656.
4. Hjelm, R.D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Trischler, A.; Bengio, Y. Learning Deep Representations by Mutual Information Estimation and Maximization. arXiv **2018**, arXiv:1808.06670.
5. Bachman, P.; Hjelm, R.D.; Buchwalter, W. Learning Representations by Maximizing Mutual Information Across Views. arXiv **2019**, arXiv:1906.00910.
6. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, ICML, PMLR 119, Virtual Event, 13–18 July 2020; pp. 1575–1585.
7. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9726–9735.
8. Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Avila Pires, B.; Guo, Z.; Gheshlaghi Azar, M.; et al. Bootstrap your own latent: A new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. **2020**, 33, 21271–21284.
9. Shannon, C.E.; Weaver, W. The Mathematical Theory of Communication; The University of Illinois Press: Urbana, IL, USA, 1963.
10. Bao, J.; Basu, P.; Dean, M.; Partridge, C.; Swami, A.; Leland, W.; Hendler, J.A. Towards a theory of semantic communication. In Proceedings of the 2011 IEEE 1st International Network Science Workshop, West Point, NY, USA, 22–24 June 2011; pp. 110–117.
11. Strinati, E.C.; Barbarossa, S. 6G networks: Beyond Shannon towards semantic and goal-oriented communications. Comput. Netw. **2021**, 190, 107930.
12. Lu, C. Channels' matching algorithm for mixture models. In Intelligence Science I, Proceedings of the ICIS 2017, Beijing, China, 27 September 2017; Shi, Z.Z., Goertzel, B., Feng, J.L., Eds.; Springer: Cham, Switzerland, 2017; pp. 321–332.
13. Lu, C. Semantic information G theory and logical Bayesian inference for machine learning. Information **2019**, 10, 261.
14. Lu, C. Shannon equations reform and applications. BUSEFAL **1990**, 44, 45–52. Available online: https://www.listic.univ-smb.fr/production-scientifique/revue-busefal/version-electronique/ebusefal-44/ (accessed on 5 March 2019).
15. Lu, C. A Generalized Information Theory; China Science and Technology University Press: Hefei, China, 1993; ISBN 7-312-00501-2. (In Chinese)
16. Lu, C. A generalization of Shannon's information theory. Int. J. Gen. Syst. **1999**, 28, 453–490.
17. Lu, C. The P–T probability framework for semantic communication, falsification, confirmation, and Bayesian reasoning. Philosophies **2020**, 5, 25.
18. Lu, C. Using the Semantic Information G Measure to Explain and Extend Rate-Distortion Functions and Maximum Entropy Distributions. Entropy **2021**, 23, 1050.
19. Floridi, L. Semantic conceptions of information. In Stanford Encyclopedia of Philosophy; Stanford University: Stanford, CA, USA, 2005. Available online: http://seop.illc.uva.nl/entries/information-semantic/ (accessed on 1 March 2023).
20. Tarski, A. The semantic conception of truth: And the foundations of semantics. Philos. Phenomenol. Res. **1944**, 4, 341–376.
21. Davidson, D. Truth and meaning. Synthese **1967**, 17, 304–323.
22. Semantic Similarity. In Wikipedia: The Free Encyclopedia. Available online: https://en.wikipedia.org/wiki/Semantic_similarity (accessed on 10 February 2023).
23. Resnik, P. Using information content to evaluate semantic similarity in a taxonomy. arXiv **1995**, arXiv:cmp-lg/9511007.
24. Poole, B.; Ozair, S.; Oord, A.V.D.; Alemi, A.; Tucker, G. On Variational Bounds of Mutual Information. arXiv **2019**, arXiv:1905.06922.
25. Tschannen, M.; Djolonga, J.; Rubenstein, P.K.; Gelly, S.; Lucic, M. On Mutual Information Maximization for Representation Learning. arXiv **2019**, arXiv:1907.13625.
26. Tishby, N.; Pereira, F.; Bialek, W. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, Monticello, IL, USA, 22–24 September 1999; pp. 368–377.
27. Tishby, N.; Zaslavsky, N. Deep learning and the information bottleneck principle. In Proceedings of the Information Theory Workshop (ITW), Jerusalem, Israel, 26 April–1 May 2015; pp. 1–5.
28. Hu, B.G. Information theory and its relation to machine learning. In Proceedings of the 2015 Chinese Intelligent Automation Conference; Lecture Notes in Electrical Engineering; Deng, Z., Li, H., Eds.; Springer: Berlin/Heidelberg, Germany, 2015; Volume 336.
29. Xu, X.; Huang, S.-L.; Zheng, L.; Wornell, G.W. An information-theoretic interpretation to deep neural networks. Entropy **2022**, 24, 135.
30. Rényi, A. On measures of information and entropy. Proc. Fourth Berkeley Symp. Math. Stat. Probab. **1960**, 4, 547–561.
31. Principe, J.C. Information-Theoretic Learning: Renyi's Entropy and Kernel Perspectives; Springer Publishing Company: New York, NY, USA, 2010.
32. Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. **1988**, 52, 479–487.
33. Irshad, M.R.; Maya, R.; Buono, F.; Longobardi, M. Kernel estimation of cumulative residual Tsallis entropy and its dynamic version under ρ-mixing dependent data. Entropy **2022**, 24, 9.
34. Liu, W.; Pokharel, P.P.; Principe, J.C. Correntropy: A localized similarity measure. In Proceedings of the 2006 IEEE International Joint Conference on Neural Networks, Vancouver, BC, Canada, 16–21 July 2006; IEEE: Piscataway, NJ, USA, 2006.
35. Yu, S.; Giraldo, L.S.; Principe, J. Information-Theoretic Methods in Deep Neural Networks: Recent Advances and Emerging Opportunities. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, Survey Track, Montreal, QC, Canada, 19–27 August 2021; pp. 4669–4678.
36. Oddie, G.T. Truthlikeness. In The Stanford Encyclopedia of Philosophy, Winter 2016 ed.; Zalta, E.N., Ed. Available online: https://plato.stanford.edu/archives/win2016/entries/truthlikeness/ (accessed on 18 May 2020).
37. Floridi, L. Outline of a theory of strongly semantic information. Minds Mach. **2004**, 14, 197–221.
38. Zhong, Y. A theory of semantic information. Proceedings **2017**, 1, 129.
39. Popper, K. Logik der Forschung: Zur Erkenntnistheorie der Modernen Naturwissenschaft; Springer: Vienna, Austria, 1935; English translation: The Logic of Scientific Discovery, 1st ed.; Hutchinson: London, UK, 1959.
40. Kullback, S.; Leibler, R. On information and sufficiency. Ann. Math. Stat. **1951**, 22, 79–86.
41. Carnap, R.; Bar-Hillel, Y. An Outline of a Theory of Semantic Information; Technical Report No. 247; Research Laboratory of Electronics, MIT: Cambridge, MA, USA, 1952.
42. Shepard, R.N. Stimulus and response generalization: A stochastic model relating generalization to distance in psychological space. Psychometrika **1957**, 22, 325–345.
43. Shannon, C.E. Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv. Rec. **1959**, 4, 142–163.
44. Theil, H. Economics and Information Theory; North-Holland Pub. Co.: Amsterdam, The Netherlands; Rand McNally: Chicago, IL, USA, 1967.
45. Zadeh, L.A. Fuzzy sets. Inf. Control **1965**, 8, 338–353.
46. De Luca, A.; Termini, S. A definition of a non-probabilistic entropy in the setting of fuzzy sets. Inf. Control **1972**, 20, 301–312.
47. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control **1974**, 19, 716–723.
48. Thomas, S.F. Possibilistic uncertainty and statistical inference. In Proceedings of the ORSA/TIMS Meeting, Houston, TX, USA, 12–14 October 1981.
49. Dubois, D.; Prade, H. Fuzzy sets and probability: Misunderstandings, bridges and gaps. In Proceedings of the 1993 Second IEEE International Conference on Fuzzy Systems, San Francisco, CA, USA, 28 March 1993.
50. Donsker, M.; Varadhan, S. Asymptotic evaluation of certain Markov process expectations for large time IV. Commun. Pure Appl. Math. **1983**, 36, 183–212.
51. Wang, P.Z. From the fuzzy statistics to the falling random subsets. In Advances in Fuzzy Sets, Possibility Theory and Applications; Wang, P.P., Ed.; Plenum Press: New York, NY, USA, 1983; pp. 81–96.
52. Aczel, J.; Forte, B. Generalized entropies and the maximum entropy principle. In Maximum Entropy and Bayesian Methods in Applied Statistics; Justice, J.H., Ed.; Cambridge University Press: Cambridge, UK, 1986; pp. 95–100.
53. Zadeh, L.A. Probability measures of fuzzy events. J. Math. Anal. Appl. **1968**, 23, 421–427.
54. Lu, C. Decoding model of color vision and verifications. Acta Opt. Sin. **1989**, 9, 158–163. (In Chinese)
55. Lu, C. Explaining color evolution, color blindness, and color recognition by the decoding model of color vision. In Proceedings of the 11th IFIP TC 12 International Conference, IIP 2020, Hangzhou, China; Shi, Z., Vadera, S., Chang, E., Eds.; Springer Nature: Cham, Switzerland, 2020; pp. 287–298. Available online: https://www.springer.com/gp/book/9783030469306 (accessed on 18 May 2020).
56. Ohlan, A.; Ohlan, R. Fundamentals of fuzzy information measures. In Generalizations of Fuzzy Information Measures; Springer: Cham, Switzerland, 2016.
57. Fisher, R.A. On the mathematical foundations of theoretical statistics. Philos. Trans. R. Soc. **1922**, 222, 309–368.
58. Fienberg, S.E. When did Bayesian inference become "Bayesian"? Bayesian Anal. **2006**, 1, 1–40.
59. Zhang, M.L.; Li, Y.K.; Liu, X.Y.; Geng, X. Binary relevance for multi-label learning: An overview. Front. Comput. Sci. **2018**, 12, 191–202.
60. Hinton, G.E. A practical guide to training Restricted Boltzmann Machines. In Neural Networks: Tricks of the Trade; Lecture Notes in Computer Science; Montavon, G., Orr, G.B., Müller, K.R., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; Volume 7700, pp. 599–619.
61. Ashby, F.G.; Perrin, N.A. Toward a unified theory of similarity and recognition. Psychol. Rev. **1988**, 95, 124–150.
62. Banu, A.; Fatima, S.S.; Khan, K.U.R. Information content based semantic similarity measure for concepts subsumed by multiple concepts. Int. J. Web Appl. **2015**, 7, 85–94.
63. Dumais, S.T. Latent semantic analysis. Annu. Rev. Inf. Sci. Technol. **2005**, 38, 188–230.
64. Church, K.W.; Hanks, P. Word association norms, mutual information, and lexicography. Comput. Linguist. **1990**, 16, 22–29.
65. Islam, A.; Inkpen, D. Semantic text similarity using corpus-based word similarity and string similarity. ACM Trans. Knowl. Discov. Data **2008**, 2, 1–25.
66. Chandrasekaran, D.; Mago, V. Evolution of semantic similarity—A survey. arXiv **2021**, arXiv:2004.13820.
67. Costa, T.; Leal, J.P. Semantic measures: How similar? How related? In Web Engineering, Proceedings of the ICWE 2016, Lugano, Switzerland, 6–9 June 2016; Bozzon, A., Cudre-Maroux, P., Pautasso, C., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2016; Volume 9671.
68. Ackley, D.H.; Hinton, G.E.; Sejnowski, T.J. A learning algorithm for Boltzmann machines. Cogn. Sci. **1985**, 9, 147–169.
69. Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science **2006**, 313, 504–507.
70. Hinton, G.E.; Osindero, S.; Teh, Y.W. A fast learning algorithm for deep belief nets. Neural Comput. **2006**, 18, 1527–1554.
71. Gutmann, M.U.; Hyvärinen, A. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J. Mach. Learn. Res. **2012**, 13, 307–361.
72. Sohn, K. Improved deep metric learning with multi-class n-pair loss objective. In Proceedings of the Advances in Neural Information Processing Systems 29 (NIPS 2016), Barcelona, Spain, 5–10 December 2016; pp. 1857–1865.
73. Lu, C. Understanding and Accelerating EM Algorithm's Convergence by Fair Competition Principle and Rate-Verisimilitude Function. arXiv **2021**, arXiv:2104.12592.
74. Lu, C. Channels' Confirmation and Predictions' Confirmation: From the Medical Test to the Raven Paradox. Entropy **2020**, 22, 384.
75. Lu, C. Causal Confirmation Measures: From Simpson's Paradox to COVID-19. Entropy **2023**, 25, 143.
76. Lu, C. Semantic channel and Shannon channel mutually match and iterate for tests and estimations with maximum mutual information and maximum likelihood. In Proceedings of the 2018 IEEE International Conference on Big Data and Smart Computing, Shanghai, China, 15 January 2018; IEEE Computer Society: Piscataway, NJ, USA, 2018; pp. 15–18.
77. Nair, V.; Hinton, G. Implicit mixtures of Restricted Boltzmann Machines. In Proceedings of the 21st International Conference on Neural Information Processing Systems (NIPS'08), Red Hook, NY, USA, 8–10 December 2008; pp. 1145–1152.
78. Song, J.; Yuan, C. Learning Boltzmann Machine with EM-like method. In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016.
79. Sow, D.M.; Eleftheriadis, A. Complexity distortion theory. IEEE Trans. Inf. Theory **2003**, 49, 604–608.
80. Lu, C. How Semantic Information G Measure Relates to Distortion, Freshness, Purposiveness, and Efficiency. arXiv **2023**, arXiv:2304.13502.
81. Still, S. Information-theoretic approach to interactive learning. Europhys. Lett. **2009**, 85, 28005.
82. Eysenbach, B.; Salakhutdinov, R.; Levine, S. The Information Geometry of Unsupervised Reinforcement Learning. arXiv **2021**, arXiv:2110.02719.
83. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: New York, NY, USA, 2006.
84. Lu, C. The Entropy Theory of Portfolio and Information Value: On the Risk Control of Stocks and Futures; Science and Technology University Press: Hefei, China, 1997; ISBN 7-312-00952-2/F.36. (In Chinese)

**Figure 2.** Illustrating a GPS device's positioning with a deviation. We predict the probability distribution of x according to y_j and the prior knowledge P(x). The red star represents the most probable position.

**Figure 3.** The semantic information conveyed by y_j about x_i decreases as the deviation or distortion increases. The larger the deviation is, the less information there is.

**Figure 4.** The information rate-fidelity function R(G) for binary communication. Any R(G) function is a bowl-like function. There is a point at which R(G) = G (s = 1). For given R, two anti-functions exist: G^−(R) and G^+(R).

**Figure 5.** Illustrating the medical test and the signal detection. We choose y_j according to z ∈ C_j. The task is to find the dividing point z′ that results in the MaxMI between X and Y.

**Figure 6.** The MaxMI classification with a very bad initial partition. The convergence is very fast and stable without considering gradients. (**a**) The very bad initial partition. (**b**) The partition after the first iteration. (**c**) The partition after the second iteration. (**d**) The mutual information changes with iterations.

**Figure 7.** Comparing the EM and E3M algorithms with an example that is hard to converge. The EM algorithm needs about 340 iterations, whereas the E3M algorithm needs about 240 iterations. In the convergent process, the complete-data log-likelihood Q is not monotonically increasing. H(P||P_θ) decreases with R − G. (**a**) Initial components with (µ_1, µ_2) = (80, 95). (**b**) Globally convergent two components. (**c**) Q, R, G, and H(P||P_θ) change with iterations (initialization: (µ_1, µ_2, σ_1, σ_2, P(y_1)) = (80, 95, 5, 5, 0.5)).

**Figure 8.** Comparing a typical neuron and a neuron in a CMMM. (**a**) A typical neuron in neural networks. (**b**) A neuron in the CMMM and its optimization.

**Figure 9.** Illustrating population death-age control for measuring purposive information. P(x|a_j) approximates to P(x|θ_j) = P(x|θ_j, s = 1) for information efficiency G/R = 1. G and R are close to their maxima as P(x|a_j) approximates to P(x|θ_j, s = 20).

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Lu, C.
Reviewing Evolution of Learning Functions and Semantic Information Measures for Understanding Deep Learning. *Entropy* **2023**, *25*, 802.
https://doi.org/10.3390/e25050802
