# Semantic Information G Theory and Logical Bayesian Inference for Machine Learning

## Abstract


## 1. Introduction

- Fisher’s likelihood method for hypothesis-testing [1];
- Carnap and Bar-Hillel’s semantic information formula with logical probability [9];
- Tarski’s truth theory for the definition of truth and logical probability [28];
- Davidson’s truth-conditional semantics [29];
- Kullback and Leibler’s KL divergence [5];
- Akaike’s proof [4] that the ML criterion is equal to the minimum KL divergence criterion;
- Theil’s generalized KL formula [37];
- the Donsker–Varadhan representation as a generalized KL formula with Gibbs density [38];
- Wittgenstein’s thought: meaning lies in uses (see [39], p. 80);

- making use of the prior knowledge of instances for probability predictions;
- multilabel learning, belonging to supervised learning;
- the Maximum Mutual Information (MMI) classifications of unseen instances, belonging to semi-supervised learning; and,
- mixture models, belonging to unsupervised learning.

## 2. Methods I: Background

#### 2.1. From Shannon Information Theory to Semantic Information G Theory

#### 2.1.1. From Shannon’s Mutual Information to Semantic Mutual Information

**Definition 1.**

- x: an instance or data point; X: a discrete random variable taking a value x ∈ U = {x_1, x_2, …, x_m}.
- y: a hypothesis or label; Y: a discrete random variable taking a value y ∈ V = {y_1, y_2, …, y_n}.
- P(y_j|x) (with fixed y_j and variable x): a Transition Probability Function (TPF) (named as such by Shannon [7]).

The TPF P(y_j|x) is not normalized, unlike the conditional probability function P(y|x_i), in which y is variable and x_i is constant. We will discuss how the TPF can be used for the traditional Bayes prediction in Section 2.2.1.

For a given y_j, the mutual information I(X; Y) becomes the Kullback–Leibler (KL) divergence. We can also define the information conveyed by y_j about a single x_i:

I(x_i; y_j) = log[P(x_i|y_j)/P(x_i)].

I(x_i; y_j) may be negative; however, Shannon did not use this formulation. Shannon explained that information is the reduced uncertainty or the saved average code-word length. The author believes that the above formula is meaningful, because negative information indicates that a bad prediction may increase the uncertainty or the code-word length.

Carnap and Bar-Hillel [9] measured the semantic information of y_j by I(y_j) = log[1/m_p], where m_p is its logical probability.

Another formula measures the semantic information of y_j by

I(y_j) = log 2 + [t_j log t_j + (1 − t_j) log(1 − t_j)],

where t_j is "the logical truth" of y_j. However, according to this formula, whenever t_j = 1 or t_j = 0, the information reaches its maximum of 1 bit. This result is not expected; therefore, this formula is unreasonable. The same problem is found in other semantic or fuzzy information theories that use De Luca and Termini's fuzzy entropy [14].

The logical probability T(θ_j) is equivalent to P("X ∈ θ_j" is true) = P(y_j is true). The truth function of y_j ascertains the semantic meaning of y_j, according to Davidson's truth-conditional semantics [29]. Following Tarski and Davidson, we define as follows:

**Definition 2.**

- θ_j is a fuzzy subset of U which is used to explain the semantic meaning of a predicate y_j(X) = "X ∈ θ_j". If θ_j is non-fuzzy, we may replace it with A_j. The θ_j is also treated as a model or a group of model parameters.
- A probability defined with "=", such as P(y_j) = P(Y = y_j), is a statistical probability; a probability defined with "∈", such as P(X ∈ θ_j), is a logical probability. To distinguish P(Y = y_j) and P(X ∈ θ_j), we define T(θ_j) = P(X ∈ θ_j) as the logical probability of y_j.
- T(θ_j|x) = P(x ∈ θ_j) = P(X ∈ θ_j|X = x) is the conditional logical probability function of y_j; this is also called the (fuzzy) truth function of y_j or the membership function of θ_j.

A group of TPFs P(y_j|x), j = 1, 2, …, n, form a Shannon channel, whereas a group of membership functions T(θ_j|x), j = 1, 2, …, n, form a semantic channel:

The logical probability T(θ_j) is different from the logical probability m_p that was defined by Carnap and Bar-Hillel [9]. The latter rests only with the denotation of a hypothesis. For example, y_1 is a hypothesis (such as "X is infected by the Human Immunodeficiency Virus (HIV)") or a label (such as "HIV-infected"). Its logical probability T(θ_1) is very small for normal people, because HIV-infected people are rare. However, m_p is unrelated to P(x); it may be 1/2.

If θ_0, θ_1, …, θ_n form a cover of U, we have P(y_0) + P(y_1) + … + P(y_n) = 1 and T(θ_0) + T(θ_1) + … + T(θ_n) ≥ 1.

For example, let A_1 = {adults} = {x | x ≥ 18}, A_0 = {juveniles} = {x | x < 18}, and A_2 = {young people} = {x | 15 ≤ x ≤ 35}. The three sets form a cover of U, and T(A_0) + T(A_1) = 1. If T(A_2) = 0.3, the sum of the three logical probabilities is 1.3 > 1. However, the sum of the three statistical probabilities P(y_0) + P(y_1) + P(y_2) must be less than or equal to 1. If y_2 is correctly used, P(y_2) will change from 0 to 0.3. If A_0, A_1, and A_2 become fuzzy sets, the conclusion is the same.
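This cover example can be checked numerically. The sketch below assumes a uniform age distribution on U = {0, …, 79}; the distribution is an illustration, not part of the example:

```python
import numpy as np

# Hypothetical uniform age distribution P(x) on ages 0..79.
ages = np.arange(80)
P_x = np.full(80, 1 / 80)

# Non-fuzzy sets expressed as 0/1 membership (truth) functions.
T_juvenile = (ages < 18).astype(float)                  # A_0 = {x | x < 18}
T_adult = (ages >= 18).astype(float)                    # A_1 = {x | x >= 18}
T_young = ((ages >= 15) & (ages <= 35)).astype(float)   # A_2 = {x | 15 <= x <= 35}

# Logical probabilities T(A_j) = sum_x P(x) T(A_j|x).
T0, T1, T2 = (np.sum(P_x * T) for T in (T_juvenile, T_adult, T_young))

assert abs((T0 + T1) - 1) < 1e-9   # A_0 and A_1 form a partition of U
assert T0 + T1 + T2 > 1            # a cover's logical probabilities may sum above 1
print(T0 + T1 + T2)
```

With this prior, T(A_2) = 21/80 ≈ 0.26, so the three logical probabilities sum above 1 while the statistical probabilities still sum to at most 1.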

We put T(θ_j|x) and P(x) into Bayes' formula to obtain a likelihood function [21]:

P(x|θ_j) = P(x)T(θ_j|x)/T(θ_j), with T(θ_j) = ∑_i P(x_i)T(θ_j|x_i).

P(x|θ_j) can be called the semantic Bayes prediction or the semantic likelihood function. According to Dubois and Prade [45], Thomas [46] and others have proposed similar formulae.

The maximum of T(θ_j|x) is 1. From P(x) and P(x|θ_j), we can obtain

T(θ_j|x) = [P(x|θ_j)/P(x)] / max[P(x|θ_j)/P(x)].

As an illustration, consider P(x|θ_j) and T(θ_j|x) for a given P(x), where x is an age, the label y_j = "Youth", and θ_j is a non-fuzzy set and, hence, becomes A_j.
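The semantic Bayes prediction and its inverse can be sketched numerically for such a "Youth" label; the triangular prior below is a made-up illustration:

```python
import numpy as np

ages = np.arange(80)
# Hypothetical prior P(x): a loose triangular age distribution.
P_x = (80 - np.abs(ages - 30)).astype(float)
P_x /= P_x.sum()

# Truth function of y_j = "Youth" with the non-fuzzy set A_j = {x | 15 <= x <= 35}.
T_given_x = ((ages >= 15) & (ages <= 35)).astype(float)

# Logical probability T(theta_j) = sum_x P(x) T(theta_j|x).
T_j = np.sum(P_x * T_given_x)

# Semantic Bayes prediction: P(x|theta_j) = P(x) T(theta_j|x) / T(theta_j).
P_x_given_theta = P_x * T_given_x / T_j

# Inverse direction: recover the truth function, normalized to a maximum of 1.
ratio = P_x_given_theta / P_x
T_recovered = ratio / ratio.max()

assert abs(P_x_given_theta.sum() - 1) < 1e-9   # a proper likelihood function
print(np.allclose(T_recovered, T_given_x))     # True: the two conversions invert each other
```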

**Example 1.**

T(θ_j|x) = exp[−(x − x_j)²/(2σ²)],

where x_j is the position pointed to by y_j and σ is the Root Mean Square (RMS). For simplicity, we assume that x is one-dimensional.

Consider hypotheses y_j = "X is about x_j"; we can also express their truth functions by Equation (15).

We measure the semantic information conveyed by y_j about x_i with the log-normalized-likelihood:

I(x_i; θ_j) = log[T(θ_j|x_i)/T(θ_j)] = log[P(x_i|θ_j)/P(x_i)].

Averaging I(x_i; θ_j) over the sampling distribution P(x_i|y_j) (i = 1, 2, …), which may be unsmooth or discontinuous, we have the generalized KL information (Equation (18)), in which T(θ_j) is constant. If T(θ_j|x) is an exponential function with e as the base, then Equation (18) becomes the Donsker–Varadhan representation [19,38].
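A numeric sketch of the pointwise measure and its average (the prior, the truth function, and the sampling distribution below are made-up illustrations):

```python
import numpy as np

ages = np.arange(80).astype(float)
P_x = np.full(80, 1 / 80)                      # hypothetical flat prior

# Gaussian truth function of y_j = "X is about 20" (Equation-(15)-style).
x_j, sigma = 20.0, 5.0
T_given_x = np.exp(-(ages - x_j) ** 2 / (2 * sigma ** 2))
T_j = np.sum(P_x * T_given_x)                  # logical probability T(theta_j)

# Pointwise semantic information in bits: positive near x_j, negative far away.
I = np.log2(T_given_x / T_j)
print(I[20] > 0, I[70] < 0)

# Generalized KL information: average I over a sampling distribution P(x|y_j).
P_x_given_y = np.exp(-(ages - 22) ** 2 / (2 * 6.0 ** 2))
P_x_given_y /= P_x_given_y.sum()
I_avg = np.sum(P_x_given_y * I)
print(I_avg)   # positive: the fuzzy prediction carries information
```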

**Definition 3.**

**D** is a sample with labels {(x(t), y(t)) | t = 1 to N; x(t) ∈ U; y(t) ∈ V}, which includes n different sub-samples or conditional samples **X**_j, j = 1, 2, …, n. Every sub-sample **X**_j includes data points x(1), x(2), …, x(N_j) ∈ U with label y_j. If **X**_j is large enough, we can obtain the distribution P(x|y_j) from **X**_j. If y_j in **X**_j is unknown, we replace **X**_j with **X** and P(x|y_j) with P(x|.).

Assume that there are N_j data points in **X**_j, among which N_ji data points are x_i. When the N_j data points in **X**_j come from Independent and Identically Distributed (IID) random variables, we have the likelihood

I(X; θ_j) and log P(**X**_j|θ_j) reach their maxima at the same time as θ_j changes and, hence, the two criteria are equivalent. It is easy to prove that, when P(x|θ_j) = P(x|y_j), I(X; θ_j) and log P(**X**_j|θ_j) reach their maxima.

When **X**_j is very large, letting P(x|θ_j) = P(x|y_j), we can obtain the optimized truth function:

T*(θ_j|x) = [P*(x|θ_j)/P(x)] / max[P*(x|θ_j)/P(x)] = [P(x|y_j)/P(x)] / max[P(x|y_j)/P(x)].

Equivalently,

T*(θ_j|x) = P*(θ_j|x) / max[P*(θ_j|x)] = P(y_j|x) / max[P(y_j|x)].
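This optimization can be verified numerically; the prior and sampling distribution below are synthetic assumptions:

```python
import numpy as np

x = np.arange(100).astype(float)

# Synthetic prior P(x) and conditional sampling distribution P(x|y_j).
P_x = np.exp(-(x - 40) ** 2 / (2 * 20.0 ** 2))
P_x /= P_x.sum()
P_x_given_y = np.exp(-(x - 25) ** 2 / (2 * 8.0 ** 2))
P_x_given_y /= P_x_given_y.sum()

# Optimized truth function: the likelihood ratio normalized to a maximum of 1.
ratio = P_x_given_y / P_x
T_star = ratio / ratio.max()
assert T_star.max() == 1.0 and np.all(T_star >= 0)

# The semantic Bayes prediction with T* reproduces P(x|y_j) exactly.
T_theta = np.sum(P_x * T_star)
P_pred = P_x * T_star / T_theta
print(np.allclose(P_pred, P_x_given_y))   # True
```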

To average I(X; θ_j) over different y, we use the Semantic Mutual Information (SMI) formula. When P(x|θ_j) = P(x|y_j) or T(θ_j|x) ∝ P(y_j|x) for every y_j, the SMI is equal to the Shannon Mutual Information (SHMI).

#### 2.1.2. From the Rate-Distortion Function R(D) to the Rate-Verisimilitude Function R(G)

Let the distortion d_ij be replaced with I_ij = I(x_i; θ_j) = log[T(θ_j|x_i)/T(θ_j)] = log[P(x_i|θ_j)/P(x_i)], and let G be the lower limit of I(X; θ). The information rate for given G and source P(X) is defined as

I(x_i; θ_j) can be a good measure of the verisimilitude of y_j reflecting x_i; therefore, we call R(G) the rate-verisimilitude function.

where m_ij = T(θ_j|x_i)/T(θ_j) = P(x_i|θ_j)/P(x_i) is the normalized likelihood and λ_i = ∑_j P(y_j) m_ij^s. The shape of any R(G) function is a bowl-like curve with second derivative > 0, as shown in Figure 5.

As s increases, R and G approach R_max and G_max, the TPFs P(y_j|x), j = 1, 2, …, n, become steeper, and the Shannon channel has less noise. Hence, R and G will increase. This property of R(G) can be used to prove the convergence of the CM iteration algorithm for the MMI classification of unseen instances.

G has a maximum value G^+ and a minimum value G^−; G^− is negative, which means that we also need certain objective information R to bring a certain information loss |G| to enemies. When R = 0, G is negative, which means that if we listen to someone who randomly predicts, the information that we already have will be reduced.

#### 2.2. From Traditional Bayes Prediction to Logical Bayesian Inference

#### 2.2.1. Traditional Bayes Prediction, Likelihood Inference (LI), and Bayesian Inference (BI)

We call the Bayes prediction that uses the TPF P(y_j|x) the Traditional Bayes Prediction (TBP). For given P(x) and P(y_j|x), we can make a probability prediction:

P(x|y_j) = P(x)P(y_j|x)/P(y_j).

If P(y_j|x) is replaced with kP(y_j|x), where k is a constant, P(x|y_j) is the same, because k cancels in the numerator and the denominator. From P(y_j), P(x|y_j), and P(x), we can obtain the predictive model

P(y_j|x) = P(y_j)P(x|y_j)/P(x).

We can use P(y_j|x) to make a new probability prediction in most cases where the Shannon channels are stable.

**Definition 4.**

Z is a discrete random variable taking a value z ∈ C = {z_1, z_2, …}. For unseen instance classification, x denotes a true class or true label.

For an infected subject x_1, the conditional probability P(y_1|x_1) of y_1 = positive is called the sensitivity, which means the true positive rate. For an uninfected subject x_0, the conditional probability P(y_0|x_0) of y_0 = negative is called the specificity, which means the true negative rate [49]. The sensitivity and specificity ascertain a Shannon channel, as shown in Table 1.

**Example 2.**

Calculate P(x_1|y_1) using P(y_1|x) in Table 1 for P(x_1) = 0.0001, 0.002 (for normal people), and 0.1 (for a high-risk crowd).

**Solution.** For P(x_1) = 0.0001, 0.002, and 0.1, we have P(x_1|y_1) = 0.084, 0.65, and 0.99, respectively.
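Since Table 1 is not reproduced here, the sketch below assumes sensitivity P(y_1|x_1) = 0.917 and P(y_1|x_0) = 0.001; these two numbers are inferred so that they reproduce the reported posteriors, not taken from the table:

```python
# Traditional Bayes prediction for Example 2.
# Assumed Shannon channel (Table 1 is not shown here).
sens = 0.917   # P(y1|x1), sensitivity
fpr = 0.001    # P(y1|x0) = 1 - specificity

def posterior(p_x1: float) -> float:
    """P(x1|y1) = P(x1)P(y1|x1) / [P(x1)P(y1|x1) + P(x0)P(y1|x0)]."""
    return p_x1 * sens / (p_x1 * sens + (1 - p_x1) * fpr)

for p in (0.0001, 0.002, 0.1):
    print(round(posterior(p), 3))   # 0.084, 0.648, 0.99
```

The same small prior P(x_1) = 0.0001 drags the posterior down to 0.084 even with a very reliable test, which is the point of the example.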

A small sample produces an unsmooth P(x|y_j). This is why we use LI, which uses parameters to construct smooth likelihood functions. Using Maximum Likelihood Estimation (MLE), we can use a sample **X** to train a likelihood function to obtain the best θ_j:

where P(x_i|.) indicates that y_j is unknown. The main defect of LI is that it cannot make use of prior knowledge and that the optimized likelihood function becomes invalid when P(x) is changed.

where Z_θ(**X**) is the normalizing constant related to θ, and P(θ|**X**) is the posterior distribution of θ, or the Bayesian posterior. Using P(θ|**X**), we can derive the Maximum A Posteriori estimation, in which Z_θ(**X**) is neglected.

- it is especially suitable to cases where Y is a random variable for a frequency generator, such as a die;
- as the sample size increases, the distribution P(θ|**X**) will gradually shrink to some θ_j* coming from the MLE; and,
- BI can make use of prior knowledge better than LI.

- the probability prediction from BI [3] is not compatible with traditional Bayes prediction;
- P(θ) is subjectively selected; and,
- BI cannot make use of the prior of X.

To make use of the prior of X, we still want a parameterized TPF P(θ_j|x).

#### 2.2.2. From Fisher’s Inverse Probability Function P(θ_j|x) to Logical Bayesian Inference (LBI)

Early statisticians called P(θ_j|x) the "inverse probability", with respect to Laplace's method of probability [2]. The corresponding direct probability is P(x|y_j). Later, Fisher called the likelihood function P(x|θ_j) the direct probability and the parameterized TPF P(θ_j|x) the inverse probability [2]. We use θ_j (instead of θ) and x (instead of x_i) to emphasize that θ_j is a constant and x is a variable; hence, P(θ_j|x) is a function. In the following, we call P(θ_j|x) the Inverse Probability Function (IPF). According to Bayes' theorem,

P(θ_j|x) = P(θ_j)P(x|θ_j)/P(x),
P(x|θ_j) = P(x)P(θ_j|x)/P(θ_j).

The IPF P(θ_j|x) can make good use of the prior knowledge P(x). When P(x) is changed into P′(x), we can still obtain P′(x|θ_j) from P′(x) and P(θ_j|x).

For binary classification, we can construct P(θ_j|x), j = 1, 2, with parameters. For instance, we can use a pair of Logistic (or Sigmoid) functions as the IPFs. Unfortunately, when n > 2, it is hard to construct P(θ_j|x), j = 1, 2, …, n, because there is a normalization limitation ∑_j P(θ_j|x) = 1 for every x. This is why a multiclass or multilabel classification is often converted into several binary classifications [51,52].

The Softmax function can be used to construct P(θ_j|x) for n > 2. However, this function is not compatible with P(y_j|x); especially when two or more classes are not exclusive, the Softmax function does not work.

Using P(θ_j|x) or P(y_j|x) as the predictive model also has another disadvantage: in many cases, we only know P(x) and P(x|y_j) without knowing P(θ_j) or P(y_j), such that we cannot obtain P(y_j|x) or P(θ_j|x). Nevertheless, we can obtain a truth function T(θ_j|x) in these cases. In LBI, there is no normalization limitation and, hence, it is easy to construct a group of truth functions and train them with P(x) and P(x|y_j), j = 1, 2, …, n, without P(y_j) or P(θ_j). This is an important reason why we use LBI.

When **X** is very large, we can directly obtain T*(θ_j|x) from Equation (21). For a size-limited sample, we can use the generalized KL information formula to obtain

- we can use an optimized truth function T*(θ_j|x) to make probability predictions for different P(x), just as we would use P(y_j|x) or P(θ_j|x);
- we can train a truth function with parameters on a small sample, as we would train a likelihood function;
- the truth function indicates the semantic meaning of a hypothesis and, hence, is easy for us to understand;
- it is also the membership function, which indicates the denotation of a label or the range of a class and, hence, is suitable for classification;
- to train a truth function, we only need P(x) and P(x|y_j), without needing P(y_j) or P(θ_j); and,
- letting T(θ_j|x) ∝ P(y_j|x), we construct a bridge between statistics and logic.
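Training a parameterized truth function by maximizing the generalized KL information can be sketched as follows; the label, the flat prior, the logistic family, and the parameter grid are assumptions for illustration:

```python
import numpy as np

x = np.arange(80).astype(float)
P_x = np.full(80, 1 / 80)                        # assumed flat prior

def logistic(x, a, k):
    return 1 / (1 + np.exp(-k * (x - a)))

# Synthetic sample for y1 = "Adult": P(x|y1) proportional to a logistic
# truth function with true threshold a* = 18 and steepness k* = 1.
P_x_given_y = logistic(x, 18, 1)
P_x_given_y /= P_x_given_y.sum()

def gkl(a, k):
    """Generalized KL information: sum_i P(x_i|y1) log[T(theta1|x_i)/T(theta1)]."""
    T = logistic(x, a, k)
    return np.sum(P_x_given_y * np.log(T / np.sum(P_x * T)))

# Crude grid search over the two parameters (a sketch, not a real optimizer).
grid = [(a, k) for a in (14, 16, 18, 20, 22) for k in (0.5, 1, 2)]
a_best, k_best = max(grid, key=lambda p: gkl(*p))
print(a_best, k_best)   # 18 1: the true parameters are recovered
```

The maximum of the generalized KL information is reached exactly when T(θ_1|x) ∝ P(x|y_1)/P(x), which is why the grid search recovers the generating parameters.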

## 3. Methods II: The Channel Matching (CM) Algorithms

#### 3.1. CM1: To Resolve the Multilabel-Learning-for-New-P(x) Problem

#### 3.1.1. Optimizing Truth Functions or Membership Functions

Suppose that x is an age, y_j is the label "Youth", and θ_j is a fuzzy set {x | x is a youth}. From population statistics, we can obtain a population age distribution P(x) and a posterior distribution P(x|y_j).

We can obtain T*(θ_j|x) without parameters if the sample is very large and, hence, the distributions P(x) and P(x|y_j) are smooth. If P(x) and P(x|y_j) are not smooth, we can use Equation (36) to obtain T*(θ_j|x) with parameters. Without needing P(y_j), in CM1, every label's learning for T*(θ_j|x) is independent.

If P(x) is unknown when we optimize T(θ_j|x), we may assume that P(x) is flat. Subsequently, Equation (36) becomes

If P(y_j|x) is smooth, we can use Equation (22) to obtain T*(θ_j|x) without parameters. For multilabel learning, we can directly obtain a group of truth functions from a Shannon channel P(Y|X) or a sample with distribution P(x, y). However, when using popular multilabel learning methods, such as Binary Relevance, we have to prepare several samples for several Logistic functions.

The optimized truth function T*(θ_j|x) is still useful for making semantic Bayes predictions.

#### 3.1.2. For the Confirmation Measure of Major Premises

For non-fuzzy hypotheses, T(y_1|x_1) = T(y_0|x_0) = 1 and T(y_1|x_0) = T(y_0|x_1) = 0. Two truth functions for corresponding fuzzy hypotheses are

T(θ_1|x) = b_1′ + b_1 T(y_1|x),
T(θ_0|x) = b_0′ + b_0 T(y_0|x),

where b_1 = b(y_1→x_1) is the degree of belief of the major premise MP_1 = y_1→x_1 = "if Y = y_1 then X = x_1", and b_1′ = 1 − |b_1| means the degree of disbelief of MP_1 and the ratio of a tautology in y_1. Likewise, b_0 = b(y_0→x_0) and b_0′ = 1 − b_0.

b_1′* = P(y_1|x_0)/P(y_1|x_1) = [P(x_0|y_1)/P(x_0)] / [P(x_1|y_1)/P(x_1)],
b_0′* = P(y_0|x_1)/P(y_0|x_0) = [P(x_1|y_0)/P(x_1)] / [P(x_0|y_0)/P(x_0)].

For y_1, we can use b_1′* and different P(x) to make the semantic Bayes prediction:

P(x_1|θ_1) = P(x_1) / [P(x_1) + b_1′* P(x_0)],
P(x_0|θ_1) = b_1′* P(x_0) / [P(x_1) + b_1′* P(x_0)].

This prediction does not require the TPF P(y_j|x). We can still make the prediction even if we only know P(x|y_1) and P(x), without knowing P(y_1). It is easy to verify that, when using Equation (43) to solve Example 2, the results are the same as those obtained from the traditional Bayes prediction.

To remember P(y_1|x), we need to remember two numbers, whereas to remember T*(θ_1|x), we only need to remember one number, b_1′*.
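The one-number prediction can be checked against the traditional Bayes prediction with the same assumed channel as in Example 2 (sensitivity 0.917 and P(y_1|x_0) = 0.001 are inferred assumptions, since Table 1 is not reproduced):

```python
# Semantic Bayes prediction with the single optimized number b1'*.
sens, fpr = 0.917, 0.001       # assumed P(y1|x1) and P(y1|x0)

b1_prime = fpr / sens          # b1'* = P(y1|x0) / P(y1|x1)

def semantic_pred(p_x1: float) -> float:
    """P(x1|theta1) = P(x1) / [P(x1) + b1'* P(x0)] (Equation-(43)-style)."""
    return p_x1 / (p_x1 + b1_prime * (1 - p_x1))

def traditional_pred(p_x1: float) -> float:
    """Traditional Bayes prediction with the full TPF."""
    return p_x1 * sens / (p_x1 * sens + (1 - p_x1) * fpr)

# The two predictions coincide for any prior P(x1).
for p in (0.0001, 0.002, 0.1, 0.5):
    assert abs(semantic_pred(p) - traditional_pred(p)) < 1e-9
print("equal")
```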

where CL_1 = P(y_1|x_1) / [P(y_1|x_0) + P(y_1|x_1)] is the confidence level of MP_1 and CL_1′ = 1 − CL_1. As CL_1 changes from 0 to 1, b_1* changes from −1 to 1, as shown in Figure 7.

#### 3.1.3. Rectifying the Parameters of a GPS Device

T(θ_j|x) = exp[−|x − (x_j + Δx)|²/(2σ²)],

where Δx is the deviation of the position x_j pointed to by y_j. From many relative deviations, we can obtain a sampling distribution P(x′|y_j). As we are driving on a big square, P(x) should be flat. Afterwards, we can use the generalized KL information formula to obtain the optimized parameters Δx* and σ*. Subsequently, we replace y_j with y_k = "X is about x_k", where x_k = x_j + Δx*.
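A sketch of this rectification, with a synthetic deviation distribution and a crude grid search standing in for a proper optimizer (Δx* = 8 and σ* = 4 are assumed true values):

```python
import numpy as np

x = np.linspace(-50, 50, 1001)        # positions around the true point x_j = 0
P_x = np.full(x.size, 1 / x.size)     # flat prior on the square

# Synthetic sampling distribution of reported positions: the GPS points to
# x_j + dx_true with RMS sigma_true (both are assumptions for illustration).
dx_true, sigma_true = 8.0, 4.0
P_sample = np.exp(-(x - dx_true) ** 2 / (2 * sigma_true ** 2))
P_sample /= P_sample.sum()

def gkl(dx, sigma):
    """Generalized KL information of the truth function exp[-(x-dx)^2/(2 sigma^2)]."""
    T = np.maximum(np.exp(-(x - dx) ** 2 / (2 * sigma ** 2)), 1e-300)
    return np.sum(P_sample * np.log(T / np.sum(P_x * T)))

# Grid search for the optimized deviation and RMS.
grid = [(dx, s) for dx in np.arange(0, 12.1, 0.5) for s in np.arange(2, 8.1, 0.5)]
dx_star, sigma_star = max(grid, key=lambda p: gkl(*p))
print(dx_star, sigma_star)   # 8.0 4.0: the pointed position is rectified
```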

We may also use a truth function with a degree-of-belief parameter b:

T(θ_j|x) = b·exp[−|x − (x_j + Δx)|²/(2σ²)] + 1 − b.

If researchers use the IPF P(θ_j|x) or the Bayesian posterior P(θ|**X**) for the above task and probability prediction (see Figure 3), they will find that it is difficult to do, because they only have the prior knowledge P(x) from a GPS map, without the prior knowledge P(y) or P(θ).

#### 3.2. CM2: The Semantic Channel and the Shannon Channel Mutually Match for Multilabel Classifications

**Matching I:** Let the semantic channel match the Shannon channel, or use CM1, for multilabel learning; and,
**Matching II:** Let the Shannon channel match the semantic channel by using the Maximum Semantic Information (MSI) classifier.

We may learn a label y_j with membership function T(θ_j|x) and its negative label y_j′ with membership function 1 − T(θ_j|x) at the same time, as in the popular method of [51,52], where θ_j^c is the complementary set of θ_j. The obtained T*(θ_j|x) may be a Logistic function, which will cover a larger area of U in comparison with T*(θ_j|x) from Equation (36) or Equation (38).

With the semantic information criterion, we can overcome the class-imbalance problem [50] and reduce the rate of failure to report smaller-probability events. If T(θ_j|x) ∈ {0, 1}, the semantic information measure becomes Carnap and Bar-Hillel's semantic information measure, and the classifier becomes the minimum logical probability classifier:

The popular method uses two IPFs, P(θ_j|x) and P(θ_k|x), such as two Logistic functions, to select the label with the greater IPF. This method is not compatible with the information criterion or the likelihood criterion.

#### 3.3. CM3: The CM Iteration Algorithm for MMI Classification of Unseen Instances

Let C_j be a subset of C and y_j = f(z | z ∈ C_j); hence, S = {C_1, C_2, …} is a partition of C. Our aim is, for given P(x, z) from **D**, to find the optimized S, as given by

**Matching I**: Let the semantic channel match the Shannon channel.

We first obtain I(x_i; θ_j). Subsequently, for given z, we obtain the information gain functions. In this case, I(x_i; θ_j) = I(x_i; y_j) = log[P(y_j|x)/P(y_j)]. However, with the notion of the semantic channel, we can understand this algorithm and better prove its convergence.

**Matching II:**Let the Shannon channel match the semantic channel by the classifier

Repeating Matching I and Matching II, R and G will converge to R_max and G_max (see Figure 8). The iteration converges to (G_max, R_max) because every Matching I procedure increases G, every Matching II procedure increases G and R, and the maxima of G and R are finite. The classifier checks every z to see which of the I(X; y_j|z), j = 1, 2, …, is the maximum. We repeat Matching I and Matching II until (G_max, R_max) is reached.

#### 3.4. CM4: The CM-EM Algorithm for Mixture Models

The convergence theory of the EM algorithm explains that we can maximize L_X(θ) by maximizing the complete-data log-likelihood Q, whereas the convergence theory of the CM-EM algorithm explains that we can maximize L_X(θ) by maximizing the information efficiency G/R.

Assume that a sampling distribution P_θ(x) comes from the mixture of n likelihood functions; we call P_θ(x) a mixture model. If every predictive model P(x|θ_j) is a Gaussian function, then P_θ(x) is a Gaussian mixture. In the following, we use n = 2 to discuss the algorithms for mixture models.

Assume that a sample is produced by two likelihood functions, P(x|θ_1*) and P(x|θ_2*), with ratios P*(y_1) and P*(y_2) = 1 − P*(y_1); that is,

P*(x) = P*(y_1)P(x|θ_1*) + P*(y_2)P(x|θ_2*).

The guessed mixture model is

P_θ(x) = P(y_1)P(x|θ_1) + P(y_2)P(x|θ_2).

If P*(x) and P_θ(x) are close to each other, such that the relative entropy is close to 0 (for example, less than 0.001 bit for a huge sample or 0.01 bit for a sample of size 1000), then we may say that our guess is right. Therefore, our task is to change θ and P(y) to maximize the likelihood L_X(θ) = log P(**X**|θ), or to minimize the relative entropy H(P‖P_θ).

where P(y_j|x) is from Equation (63). There exists L_X(θ) = Q + H, so the EM algorithm tries to maximize L_X(θ) by increasing Q.

**E-step:** Write the conditional probability functions (i.e., the Shannon channel):

**M-step**: Improve P(y) and θ to maximize Q. If Q cannot be further improved, then end the iteration process; otherwise, go to the E-step.
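The E-step and M-step can be sketched for a two-component Gaussian mixture; the sample and the initial parameters below are synthetic illustrations, not those of the paper's examples:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic sample from a true two-component Gaussian mixture.
n = 5000
labels = rng.random(n) < 0.7
x = np.where(labels, rng.normal(100, 10, n), rng.normal(140, 10, n))

# Initial guesses for theta = (mu, sigma) and P(y).
mu = np.array([90.0, 150.0])
sigma = np.array([15.0, 15.0])
P_y = np.array([0.5, 0.5])

def normal_pdf(x, mu, sigma):
    return np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

for _ in range(200):
    # E-step: Shannon channel P(y_j|x) from the current mixture.
    joint = P_y * normal_pdf(x, mu, sigma)            # shape (n, 2)
    resp = joint / joint.sum(axis=1, keepdims=True)   # responsibilities

    # M-step: re-estimate P(y) and theta to maximize Q.
    Nj = resp.sum(axis=0)
    P_y = Nj / n
    mu = (resp * x[:, None]).sum(axis=0) / Nj
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nj)

print(np.round(mu), np.round(P_y, 2))
```

With well-separated components, the fitted means and mixing ratio approach the generating values (100, 140) and (0.7, 0.3).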

Many researchers believe that Q and L_X(θ) are positively correlated and that the E-step does not decrease Q; nevertheless, this is not true. The author found that Q may decrease in some E-steps, and that Q and F should decrease in some cases [42].

**E1-step:** Construct the Shannon channel. This step is the same as the E-step of the EM algorithm.

**E2-step:** Repeat the following three equations until P^{+1}(y) converges to P(y):

If H(P‖P_θ) is less than a small value, then end the iteration.

**MG-step:** Optimize the parameters θ_j^{+1} of the likelihood function in log(.) to maximize G.

Because T(θ_j^{+1}|x)/T(θ_j^{+1}) = P(x|θ_j)/P_θ(x), the new likelihood function is

P(x|θ_j^{+1}) = P(x)P(x|θ_j)/P_θ(x).

The P(x|θ_j^{+1}) above is, in general, not normalized [57]. For Gaussian mixtures, we can easily obtain the new parameters:

- R(G) − G is close to the relative entropy H(P‖P_θ).

Hence, proving that P_θ(X) converges to P(X) is equivalent to proving that H(P‖P_θ) converges to 0. As the E2-step forces R = R″ and H(Y^{+1}‖Y) = 0, we only need to prove that every step minimizes R − G. It is evident that the MG-step minimizes R − G, because this step maximizes G without changing R. The remaining problem is how to prove that R − G is minimized in the E1- and E2-steps. Learning from the variational and iterative methods that Shannon [30] and others [48] have used for analyzing the rate-distortion function R(D), we can optimize P(y|x) and P(y), respectively, to minimize R − G = I(X; Y) − I(X; θ). As P(Y|X) and P(Y) are interdependent, we can only fix one to optimize the other; the E2-step serves exactly this purpose. For the detailed convergence proof, see [57].

## 4. Results

#### 4.1. The Results of CM2 for Multilabel Learning and Classification

We used P(x) and P(x|y_j) to optimize a truth function in order to obtain T*(θ_j|x), as shown in Figure 9.

To produce P(x) and P(x|y_j), we first used a Gaussian random number generator to produce two samples, S_1 and S_2. Both sample sizes were 100,000. The data with distribution P(x) were a part of S_1. We have

S_2 had distribution P_2(x). P(x|y_j) was produced from P_2(x) and the following truth function:

P(x|y_j) = P_2(x|θ_2) = P_2(x)T(θ_2|x)/T(θ_2).

We then optimized T(θ_j|x) from P(x) and P(x|y_j). If we directly used the formula in Equation (21), T*(θ_j|x) would not be smooth. We therefore set a truth function with parameters and trained it to obtain a smooth T*(θ_j|x). If S_2 = S_1, then T*(θ_j|x) = [P(x|y_j)/P(x)] / max[P(x|y_j)/P(x)] = T(θ_2|x).

The five labels were (y_1, y_2, y_3, y_4, y_5) = ("Adult", "Child", "Youth", "Middle age", "Old"). Figure 10a shows the truth functions of the five labels.

T(θ_3|x) and T(θ_4|x) were each constructed from two logistic functions; each of the others was a single logistic function. The Python 3.6 source files with parameters for Figures 9–15 can be found in Appendix A.

For comparison, we may use the Logistic functions P(θ_j|x) that were obtained from the popular methods, and then use the Bayesian classifier or the Maximum Posterior Probability criterion to classify instances.

We also used the labels y_0 = y_1′, y_6 = y_3∧y_1′, and y_7 = y_3∧y_1.

Using labels with the maximum logical probabilities, we would select y_0 = "Non-adult" or y_1 = "Adult" for most ages. However, using the MSI criterion, we selected y_2, y_6, y_7, y_4, and y_5, in turn, as the age x increased. The MSI criterion encouraged us to use more labels with smaller logical probabilities. For example, if x was between 11.2 and 16.6, we should use the label y_6 = y_3∧y_1′ = "Youth" and "Non-adult". However, for most x, CM2 did not use redundant labels, as Binary Relevance [52] does. For example, using the MSI criterion, we did not add the label "Non-youth" to x = 60, which already had the label "Old".

#### 4.2. The Results of CM3 for the MMI Classifications of Unseen Instances

**Example 3.**

The likelihood functions P(z|x_0) and P(z|x_1) had parameters μ_0 = 30, μ_1 = 70, σ_0 = 15, and σ_1 = 10; P(x_0) = 0.8 and P(x_1) = 0.2. The initial partitioning point z′ was 50.

**The iterative process**: Matching II-1 obtained z’ = 53; Matching II-2 obtained z’ = 54; Matching II-3 obtained z* = 54.

**Example 4.**

The left distribution was a mixture of two Gaussian distributions, P(z|x_0) and P(z|x_1), and the right one was a mixture of two Gaussian distributions, P(z|x_21) and P(z|x_22). The sample size was 1000. See Table 3 for the parameters of the four Gaussian distributions.

#### 4.3. The Results of CM4 for Mixture Models

**Example 5.**

The true parameters were (µ_1*, µ_2*, σ_1*, σ_2*, P*(y_1)) = (100, 125, 10, 10, 0.7). The invalid convergence is centered on the µ_1–µ_2 plane at (µ_1, µ_2) = (115, 95), where Q reaches its local maximum.

We used the initial parameters (µ_1, µ_2, P(y_1), σ_1, σ_2) = (115, 95, 0.5, 10, 10) to test the CM-EM algorithm, in order to see whether (µ_1, µ_2) could converge to (µ_1*, µ_2*) = (100, 125). Figure 13 shows the result, which indicates that L_X(θ) converged to its global maximum under the CM-EM algorithm.

**Example 6.**

The relative entropy reached H(P‖P_θ) = 0.00072 bit after nine E1- and E2-steps and eight MG-steps.

**Example 7.**

## 5. Discussion

#### 5.1. Discussing Confirmation Measure b*

Most existing confirmation measures emphasize that a bigger P(y_1|x_1) (more positive examples) is important, whereas b_1* = b*(y_1→x_1) emphasizes that a smaller P(y_1|x_0) (fewer negative examples) is important. For example, when the sensitivity P(y_1|x_1) is 0.1 and the specificity P(y_0|x_0) is 1, both b_1* and CL_1 are 1, which is reasonable. However, using the existing confirmation formulae [59], the degrees of confirmation of MP_1 are very small.

In another case, the degree of confirmation of MP_1, b_1*, is 0.1. However, using the existing confirmation formulae, the degrees of confirmation are much bigger than 0.1. A bigger degree is unreasonable, as the ratio of negative examples is 0.9/1.9 ≈ 0.47 ≈ 0.5, which means that MP_1 is almost unbelievable.

There exists b*(y_1→x_1) = −b*(y_1→not x_1) = −b*(y_1→x_0). We can prove that the confirmation measure b* has Hypothesis Symmetry:

#### 5.2. Discussing CM2 for the Multilabel Classification

As the population age distribution P(x) changes, the label "Old" will have a new logical probability T(θ_5) and a new truth function T(θ_5|x); the new truth function will cause the boundary to move further to the right. The truth function, or the semantic meaning of "Old", should evolve with the human average lifespan in this way.

#### 5.3. Discussing CM3 for the MMI Classification of Unseen Instances

Matching I provides reward functions derived from the TPFs P(y_j|z) (j = 0, 1, …). For given reward functions, Matching II is used to let the Shannon channel match the semantic channel to obtain new neural network parameters. Repeating these two steps will cause I(X; θ) to converge to the MMI. Matching I and Matching II are similar to the tasks of the generative and discriminative models in a Generative Adversarial Network. We should be able to improve the MMI classification in high-dimensional feature spaces by combining CM3 with popular deep learning methods [33].

#### 5.4. Discussing CM4 for Mixture Models

Q and L_X(θ) are not always positively correlated, as most researchers believe. In some cases, Q may (and should) decrease, as Q may be greater than Q* = Q(θ*), which is the true model's Q. In Example 5, assuming the true model's parameters σ_1* = σ_2* = σ* and P*(y_1) = P*(y_2) = 0.5, we could prove that P(y_1|x) and P(y_2|x) were a pair of logistic functions that became steeper as σ decreased. Hence, H increases as σ increases. We can prove that the partial derivative ∂H/∂σ is greater than 0. Hence, when θ = θ*,

## 6. Conclusions

Using an optimized truth function T*(θ_j|x), we can make probability predictions with a different prior P(x), just as we use the Transition Probability Function (TPF) P(y_j|x) or the Inverse Probability Function (IPF) P(θ_j|x). However, it is much easier to obtain optimized truth functions from samples than to obtain the optimized IPF, as P(y_j) or P(θ_j) is not necessary for optimizing the truth functions. Importantly, the truth function can represent the semantic meaning of a hypothesis or a label and connect statistics and logic better. A windfall is that the optimization of the truth function brings a seemingly reasonable confirmation measure b* for induction.

## Funding

## Acknowledgments

## Conflicts of Interest

## Abbreviations

Abbreviation | Meaning
---|---
BI | Bayesian Inference
CM | Channel Matching
CM-EM | Channel Matching Expectation-Maximization
EM | Expectation-Maximization
G theory | Semantic information G theory
GPS | Global Positioning System
HIV | Human Immunodeficiency Virus
IPF | Inverse Probability Function
KL | Kullback-Leibler
LBI | Logical Bayesian Inference
LI | Likelihood Inference
MLE | Maximum Likelihood Estimation
MM | Maximization-Maximization
MMI | Maximum Mutual Information
MPP | Maximum Posterior Probability
MSI | Maximum Semantic Information
MSIE | Maximum Semantic Information Estimation
SMI | Semantic Mutual Information
SHMI | Shannon Mutual Information
TBP | Traditional Bayes Prediction
TPF | Transition Probability Function

## Appendix A

Program Name | Task
---|---
Bayes Theorem III 2.py | For Figure 9: to show label learning.
Ages-MI-classification.py | For Figure 10: to show the classification of people by age, using the maximum semantic information criterion for given membership functions and P(x).
MMI-v.py | For Figure 11: to show the Channels Matching (CM) algorithm for the maximum mutual information classification of unseen instances. One can modify the parameters or the initial partition in the program for different results.
MMI-H.py | For Figure 12.
LocationTrap3lines.py | For Figure 13: to show how the CM-EM algorithm for mixture models avoids local convergence caused by a local maximum of Q.
Folder ForEx6 (with Excel file and Word readme file) | For Figure 14: to show the effect of every step of the CM-EM algorithm for mixture models.
MixModels6-2valid.py | For Figure 15: to show the CM-EM algorithm for a two-dimensional mixture model with seriously overlapped components.

## References

1. Fisher, R.A. On the mathematical foundations of theoretical statistics. *Philos. Trans. R. Soc.* **1922**, 222, 309–368.
2. Fienberg, S.E. When Did Bayesian Inference Become “Bayesian”? *Bayesian Anal.* **2006**, 1, 1–40.
3. Bayesian Inference. In *Wikipedia: The Free Encyclopedia*. Available online: https://en.wikipedia.org/wiki/Bayesian_inference (accessed on 3 March 2019).
4. Akaike, H. A new look at the statistical model identification. *IEEE Trans. Autom. Control* **1974**, 19, 716–723.
5. Kullback, S.; Leibler, R. On information and sufficiency. *Ann. Math. Stat.* **1951**, 22, 79–86.
6. Cover, T.M.; Thomas, J.A. *Elements of Information Theory*; John Wiley & Sons: New York, NY, USA, 2006.
7. Shannon, C.E. A mathematical theory of communication. *Bell Syst. Tech. J.* **1948**, 27, 379–429, 623–656.
8. Weaver, W. Recent contributions to the mathematical theory of communication. In *The Mathematical Theory of Communication*, 1st ed.; Shannon, C.E., Weaver, W., Eds.; The University of Illinois Press: Urbana, IL, USA, 1963; pp. 93–117.
9. Carnap, R.; Bar-Hillel, Y. *An Outline of a Theory of Semantic Information*; Tech. Rep. No. 247; Research Laboratory of Electronics, MIT: Cambridge, MA, USA, 1952.
10. Bonnevie, E. Dretske’s semantic information theory and metatheories in library and information science. *J. Doc.* **2001**, 57, 519–534.
11. Floridi, L. Outline of a theory of strongly semantic information. *Minds Mach.* **2004**, 14, 197–221.
12. Zhong, Y.X. A theory of semantic information. *China Commun.* **2017**, 14, 1–17.
13. D’Alfonso, S. On quantifying semantic information. *Information* **2011**, 2, 61–101.
14. De Luca, A.; Termini, S. A definition of a non-probabilistic entropy in the setting of fuzzy sets. *Inf. Control* **1972**, 20, 301–312.
15. Bhandari, D.; Pal, N.R. Some new information measures of fuzzy sets. *Inf. Sci.* **1993**, 67, 209–228.
16. Kumar, T.; Bajaj, R.K.; Gupta, B. On some parametric generalized measures of fuzzy information, directed divergence and information improvement. *Int. J. Comput. Appl.* **2011**, 30, 5–10.
17. Klir, G. Generalized information theory. *Fuzzy Sets Syst.* **1991**, 40, 127–142.
18. Wang, Y. Generalized information theory: A review and outlook. *Inf. Technol. J.* **2011**, 10, 461–469.
19. Belghazi, I.; Rajeswar, S.; Baratin, A.; Hjelm, R.D.; Courville, A. MINE: Mutual information neural estimation. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2018. Available online: https://arxiv.org/abs/1801.04062 (accessed on 1 January 2019).
20. Hjelm, R.D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Trischler, A.; Bengio, Y. Learning deep representations by mutual information estimation and maximization. Available online: https://arxiv.org/abs/1808.06670 (accessed on 22 February 2019).
21. Lu, C. Shannon equations reform and applications. *BUSEFAL* **1990**, 44, 45–52. Available online: https://www.listic.univ-smb.fr/production-scientifique/revue-busefal/version-electronique/ebusefal-44/ (accessed on 5 March 2019).
22. Lu, C. B-fuzzy quasi-Boolean algebra and a generalized mutual entropy formula. *Fuzzy Syst. Math.* **1991**, 5, 76–80. (In Chinese)
23. Lu, C. *A Generalized Information Theory*; China Science and Technology University Press: Hefei, China, 1993; ISBN 7-312-00501-2. (In Chinese)
24. Lu, C. Meanings of generalized entropy and generalized mutual information for coding. *J. China Inst. Commun.* **1994**, 15, 37–44. (In Chinese)
25. Lu, C. A generalization of Shannon’s information theory. *Int. J. Gen. Syst.* **1999**, 28, 453–490.
26. Lu, C. GPS information and rate-tolerance and its relationships with rate distortion and complexity distortions. *J. Chengdu Univ. Inf. Technol.* **2012**, 6, 27–32. (In Chinese)
27. Zadeh, L.A. Fuzzy sets. *Inf. Control* **1965**, 8, 338–353.
28. Tarski, A. The semantic conception of truth: and the foundations of semantics. *Philos. Phenomenol. Res.* **1944**, 4, 341–376.
29. Davidson, D. Truth and meaning. *Synthese* **1967**, 17, 304–323.
30. Shannon, C.E. Coding theorems for a discrete source with a fidelity criterion. *IRE Nat. Conv. Rec.* **1959**, 4, 142–163.
31. Popper, K. *The Logic of Scientific Discovery*, 1st ed.; Routledge: London, UK, 1959.
32. Popper, K. *Conjectures and Refutations*, 1st ed.; Routledge: London, UK, 2002.
33. Goodfellow, I.; Bengio, Y. *Deep Learning*, 1st ed.; The MIT Press: Cambridge, MA, USA, 2016.
34. Carnap, R. *Logical Foundations of Probability*, 1st ed.; University of Chicago Press: Chicago, IL, USA, 1950.
35. Zadeh, L.A. Probability measures of fuzzy events. *J. Math. Anal. Appl.* **1968**, 23, 421–427.
36. Floridi, L. Semantic conceptions of information. In *Stanford Encyclopedia of Philosophy*; Stanford University: Stanford, CA, USA, 2005. Available online: http://seop.illc.uva.nl/entries/information-semantic/ (accessed on 1 July 2019).
37. Theil, H. *Economics and Information Theory*; North-Holland Pub. Co.: Amsterdam, The Netherlands; Rand McNally: Chicago, IL, USA, 1967.
38. Donsker, M.; Varadhan, S. Asymptotic evaluation of certain Markov process expectations for large time IV. *Commun. Pure Appl. Math.* **1983**, 36, 183–212.
39. Wittgenstein, L. *Philosophical Investigations*; Basil Blackwell Ltd.: Oxford, UK, 1958.
40. Bayes, T.; Price, R. An essay towards solving a problem in the doctrine of chances. *Philos. Trans. R. Soc. Lond.* **1763**, 53, 370–418.
41. Lu, C. From Bayesian inference to logical Bayesian inference: A new mathematical frame for semantic communication and machine learning. In *Intelligence Science II*, Proceedings of ICIS 2018, Beijing, China, 2 October 2018; Shi, Z.Z., Ed.; Springer International Publishing: Cham, Switzerland, 2018; pp. 11–23.
42. Lu, C. Channels’ matching algorithm for mixture models. In *Intelligence Science I*, Proceedings of ICIS 2017, Beijing, China, 27 September 2017; Shi, Z.Z., Goertzel, B., Feng, J.L., Eds.; Springer International Publishing: Cham, Switzerland, 2017; pp. 321–332.
43. Lu, C. Semantic channel and Shannon channel mutually match and iterate for tests and estimations with maximum mutual information and maximum likelihood. In Proceedings of the 2018 IEEE International Conference on Big Data and Smart Computing, Shanghai, China, 15 January 2018; IEEE Computer Society: Washington, DC, USA, 2018; pp. 15–18.
44. Lu, C. Semantic channel and Shannon channel mutually match for multi-label classification. In *Intelligence Science II*, Proceedings of ICIS 2018, Beijing, China, 2 October 2018; Shi, Z.Z., Ed.; Springer International Publishing: Cham, Switzerland, 2018; pp. 37–48.
45. Dubois, D.; Prade, H. Fuzzy sets and probability: Misunderstandings, bridges and gaps. In Proceedings of the 1993 Second IEEE International Conference on Fuzzy Systems, San Francisco, CA, USA, 28 March 1993.
46. Thomas, S.F. Possibilistic uncertainty and statistical inference. In Proceedings of the ORSA/TIMS Meeting, Houston, TX, USA, 11–14 October 1981.
47. Wang, P.Z. From the fuzzy statistics to the falling random subsets. In *Advances in Fuzzy Sets, Possibility Theory and Applications*; Wang, P.P., Ed.; Plenum Press: New York, NY, USA, 1983; pp. 81–96.
48. Berger, T. *Rate Distortion Theory*; Prentice-Hall: Englewood Cliffs, NJ, USA, 1971.
49. Thornbury, J.R.; Fryback, D.G.; Edwards, W. Likelihood ratios as a measure of the diagnostic usefulness of excretory urogram information. *Radiology* **1975**, 114, 561–565.
50. OraQuick. Available online: http://www.oraquick.com/Home (accessed on 31 December 2016).
51. Zhang, M.L.; Zhou, Z.H. A review on multi-label learning algorithms. *IEEE Trans. Knowl. Data Eng.* **2014**, 26, 1819–1837.
52. Zhang, M.L.; Li, Y.K.; Liu, X.Y.; Geng, X. Binary relevance for multi-label learning: An overview. *Front. Comput. Sci.* **2018**, 12, 191–202.
53. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. *J. R. Stat. Soc. Ser. B* **1977**, 39, 1–38.
54. Ueda, N.; Nakano, R. Deterministic annealing EM algorithm. *Neural Netw.* **1998**, 11, 271–282.
55. Marin, J.-M.; Mengersen, K.; Robert, C.P. Bayesian modelling and inference on mixtures of distributions. In *Handbook of Statistics: Bayesian Thinking, Modeling and Computation*; Dey, D., Rao, C.R., Eds.; Elsevier: Amsterdam, The Netherlands, 2011; pp. 459–507.
56. Neal, R.; Hinton, G. A view of the EM algorithm that justifies incremental, sparse, and other variants. In *Learning in Graphical Models*; Jordan, M.I., Ed.; MIT Press: Cambridge, MA, USA, 1999; pp. 355–368.
57. Lu, C. From the EM algorithm to the CM-EM algorithm for global convergence of mixture models. Available online: https://arxiv.org/abs/18 (accessed on 26 October 2018).
58. Hawthorne, J. Inductive logic. In *The Stanford Encyclopedia of Philosophy*; Zalta, E.N., Ed.; Stanford University Press: Palo Alto, CA, USA, 2018. Available online: https://plato.stanford.edu/archives/spr2018/entries/logic-inductive/ (accessed on 19 March 2018).
59. Tentori, K.; Crupi, V.; Bonini, N.; Osherson, D. Comparison of confirmation measures. *Cognition* **2007**, 103, 107–119.
60. Eells, E.; Fitelson, B. Measuring confirmation and evidence. *J. Philos.* **2000**, 97, 663–672.
61. Huang, W.H.; Chen, Y.G. The multiset EM algorithm. *Stat. Probab. Lett.* **2017**, 126, 41–48.
62. Ueda, N.; Nakano, R.; Ghahramani, Z.; Hinton, G.E. SMEM algorithm for mixture models. *Neural Comput.* **2000**, 12, 2109–2128.
63. Zhang, B.; Zhang, C.; Yi, X. Competitive EM algorithm for finite mixture models. *Pattern Recognit.* **2004**, 37, 131–144.

**Figure 1.** The Shannon channel and the semantic channel. The semantic meaning of y_{j} is ascertained by the membership relation between x and θ_{j}. A fuzzy set θ_{j} may overlap with, or be included in, another.

**Figure 3.** Illustrating the positioning of a GPS device with deviation. The round point is the position indicated by the device (with deviation); the starred point is the most probable actual position.

**Figure 5.**The rate-verisimilitude function R(G) for binary communication. For any R(G) function, there is a point where R(G)=G.

**Figure 6.** Illustrating the medical test and signal detection. We choose y_{j} according to z∈C_{j}. {C_{0}, C_{1}} is a partition of C.

**Figure 7.** Relationship between the degree of confirmation b* and the confidence level CL. As CL changes from 0 to 1, b* changes from −1 to 1.

**Figure 8.** Illustrating the iterative convergence of the MMI classification of unseen instances. In the iterative process, (G, R) moves gradually from the start point through a, b, c, d, e, … to f.

**Figure 9.** Using prior and posterior distributions P(x) and P(x|y_{j}) to obtain the optimized truth function T*(θ_{j}|x).

**Figure 11.**The Maximum Mutual Information (MMI) classification of unseen instances. The classifier is y = f(z). The mutual information is I(X; Y). X is a true class and Y is a selected label.

**Figure 13.** The iterative process from the local maximum of Q to the global maximum of L_{X}(θ). The stopping condition is that the deviation of every parameter is smaller than 1%.

**Figure 14.**The iterative process of the CM-EM algorithm for Example 6. It can be seen that some E2-steps decrease Q. The relative entropy is smaller than 0.001 bit after nine iterations.

**Figure 15.**CM4 for a two-dimensional mixture model. There are six components with Gaussian distributions.

| Y | Infected Subject x_{1} | Uninfected Subject x_{0} |
|---|---|---|
| Positive y_{1} | P(y_{1}\|x_{1}) = sensitivity = 0.917 | P(y_{1}\|x_{0}) = 1 − specificity = 0.001 |
| Negative y_{0} | P(y_{0}\|x_{1}) = 1 − sensitivity = 0.083 | P(y_{0}\|x_{0}) = specificity = 0.999 |
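The transition probabilities in the table above can be plugged directly into the traditional Bayes prediction. A minimal sketch in plain Python; the prior P(x_{1}) used here is an illustrative assumption, not a value from the paper:

```python
# Traditional Bayes prediction from a test's sensitivity and specificity.
# The prior below (0.001) is an assumed illustrative infection rate.

def bayes_posterior(prior, sensitivity, specificity):
    """P(x1|y1): probability of infection given a positive result."""
    p_pos_infected = sensitivity            # P(y1|x1)
    p_pos_uninfected = 1.0 - specificity    # P(y1|x0)
    evidence = prior * p_pos_infected + (1.0 - prior) * p_pos_uninfected
    return prior * p_pos_infected / evidence

# With sensitivity = 0.917 and specificity = 0.999 from the table,
# a rare condition (prior = 0.001) still yields a posterior near 0.48,
# because the false-positive rate (0.001) is comparable to the prior.
post = bayes_posterior(0.001, 0.917, 0.999)
```

This illustrates why a highly specific test can still be inconclusive for a very rare condition: the posterior depends on the prior as much as on the TPF.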

| Y | Infected x_{1} | Uninfected x_{0} |
|---|---|---|
| Positive y_{1} | T(θ_{1}\|x_{1}) = 1 | T(θ_{1}\|x_{0}) = b_{1}′ |
| Negative y_{0} | T(θ_{0}\|x_{1}) = b_{0}′ | T(θ_{0}\|x_{0}) = 1 |
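Given truth-function values like those in the table above, a semantic Bayes prediction can be sketched by normalizing the product of the prior P(x) and the truth function T(θ_{j}|x), in the role the likelihood plays in the ordinary Bayes formula. The numerical values below (the prior and b_{1}′) are illustrative assumptions, not values from the paper:

```python
# Hedged sketch: semantic Bayes prediction P(x|theta_j) from a prior P(x)
# and a truth (membership) function T(theta_j|x).

def semantic_bayes(prior, truth):
    """Normalize P(x) * T(theta_j|x) over all x to get P(x|theta_j)."""
    joint = [p * t for p, t in zip(prior, truth)]
    total = sum(joint)
    return [j / total for j in joint]

prior = [0.99, 0.01]        # [P(x0), P(x1)] -- assumed illustrative prior
truth_pos = [0.002, 1.0]    # [T(theta1|x0) = b1', T(theta1|x1) = 1] -- assumed
posterior = semantic_bayes(prior, truth_pos)   # P(x|theta1)
```

Unlike a conditional probability function, the truth function need not sum to 1 over x; the normalization step above supplies the missing constant.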

| | μ_{z1} | μ_{z2} | σ_{z1} | σ_{z2} | ρ | P(x_{i}) |
|---|---|---|---|---|---|---|
| P(z\|x_{0}) | 50 | 50 | 75 | 200 | 50 | 0.2 |
| P(z\|x_{1}) | 75 | 90 | 200 | 75 | −50 | 0.5 |
| P(z\|x_{21}) | 100 | 50 | 125 | 125 | 75 | 0.2 |
| P(z\|x_{22}) | 120 | 80 | 75 | 125 | 0 | 0.1 |

The starting parameters give a relative entropy of H(P‖P_{θ}) = 0.68 bit; after nine E2-steps, H(P‖P_{θ}) = 0.00072 bit.

| | Real μ* | Real σ* | Real P*(Y) | Starting μ | Starting σ | Starting P(Y) | Final μ | Final σ | Final P(Y) |
|---|---|---|---|---|---|---|---|---|---|
| y_{1} | 46 | 2 | 0.7 | 30 | 20 | 0.5 | 46.001 | 2.032 | 0.6990 |
| y_{2} | 50 | 20 | 0.3 | 70 | 20 | 0.5 | 50.08 | 19.17 | 0.3010 |

| Algorithm | Sample Size | Iteration Number | μ_{1} | μ_{2} | σ_{1} | σ_{2} | P(y_{1}) |
|---|---|---|---|---|---|---|---|
| EM | 1000 | about 36 | 46.14 | 49.68 | 1.90 | 19.18 | 0.731 |
| MM | 1000 | about 18 | 46.14 | 49.68 | 1.90 | 19.18 | 0.731 |
| CM-EM | 1000 | 8 | 46.01 | 49.53 | 2.08 | 21.13 | 0.705 |
| Real parameters | — | — | 46 | 50 | 2 | 20 | 0.7 |
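For context on the iteration counts above, a plain-Python EM loop for the same one-dimensional two-Gaussian mixture can be sketched as follows. This is the standard EM algorithm, not the paper's CM-EM; the data-generating parameters come from the "Real parameters" row, while the starting values and iteration cap are illustrative assumptions:

```python
import math, random

# Standard EM for a 1-D two-component Gaussian mixture.
random.seed(0)

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Sample 1000 points from the true mixture (mu = 46, 50; sigma = 2, 20; P(y1) = 0.7).
data = [random.gauss(46, 2) if random.random() < 0.7 else random.gauss(50, 20)
        for _ in range(1000)]

# Start from deliberately poor parameters, as in the experiment above.
mu, sigma, w = [30.0, 70.0], [20.0, 20.0], [0.5, 0.5]

for _ in range(200):
    # E-step: responsibilities P(y_j|x_i) for each data point.
    resp = []
    for x in data:
        p = [w[j] * gauss_pdf(x, mu[j], sigma[j]) for j in range(2)]
        s = sum(p)
        resp.append([pj / s for pj in p])
    # M-step: responsibility-weighted maximum likelihood updates.
    for j in range(2):
        nj = sum(r[j] for r in resp)
        mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, data)) / nj
        sigma[j] = math.sqrt(var)
        w[j] = nj / len(data)
```

With a fixed iteration cap instead of the convergence tests compared in the table, this sketch only shows the shape of the E-step/M-step alternation; the table's point is that CM-EM reaches comparable parameters in far fewer iterations.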

| About | Gradient Descent | CM3 |
|---|---|---|
| Models for different classes | Optimized together | Optimized separately |
| Boundaries are expressed by | Functions with parameters | Numerical values |
| For complicated boundaries | Not easy | Easy |
| Gradients and search | Necessary | Unnecessary |
| Convergence | Not easy | Easy |
| Computation | Complicated | Simple |
| Iterations needed | Many | 2–3 |
| Samples required | Not necessarily big | Big enough |

© 2019 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Lu, C.
Semantic Information G Theory and Logical Bayesian Inference for Machine Learning. *Information* **2019**, *10*, 261.
https://doi.org/10.3390/info10080261
