Semantic Information G Theory and Logical Bayesian Inference for Machine Learning

An important problem in machine learning is that, when the number of labels n>2, it is difficult to construct and optimize a group of learning functions, and we wish the optimized learning functions to remain useful when the prior distribution P(x) (where x is an instance) changes. To resolve this problem, the semantic information G theory, Logical Bayesian Inference (LBI), and a group of Channel Matching (CM) algorithms together form a systematic solution. A semantic channel in the G theory consists of a group of truth functions or membership functions. In comparison with the likelihood functions, Bayesian posteriors, and Logistic functions used by popular methods, membership functions can be used as learning functions more conveniently and without the above problem. In LBI, every label's learning is independent. For multilabel learning, we can directly obtain a group of optimized membership functions from a large enough sample with labels, without preparing different samples for different labels. A group of Channel Matching (CM) algorithms is developed for machine learning. For the Maximum Mutual Information (MMI) classification of three classes with Gaussian distributions in a two-dimensional feature space, 2-3 iterations can make the mutual information between the three classes and three labels surpass 99% of the MMI for most initial partitions. For mixture models, the Expectation-Maximization (EM) algorithm is improved into the CM-EM algorithm, which can outperform the EM algorithm when mixture ratios are imbalanced or local convergence exists. The CM iteration algorithm needs to be combined with neural networks for MMI classifications in high-dimensional feature spaces. LBI needs further studies for the unification of statistics and logic.


Introduction
Machine learning needs learning functions and classifiers. In 1922, Fisher [1] proposed Likelihood Inference (LI) (see Appendix A for all abbreviations used in this paper), which uses likelihood functions as learning functions and uses the Maximum Likelihood (ML) criterion to optimize learning functions and classifiers. However, when the prior distribution P(x) (where x is an instance) is changed, the optimized likelihood function becomes invalid. Since LI cannot make use of prior knowledge, during the 1950s, Bayesians proposed Bayesian Inference (BI) [2,3], which uses Bayesian posteriors as learning functions. However, we often only have prior knowledge of instances rather than of labels or model parameters; BI is still not good in these cases. A pair of Logistic (or Sigmoid) functions is often used as learning functions for binary classifications. With a Logistic function and Bayes' Theorem, we can make use of a new P(x) to make a new probability prediction for the ML classifier. However, when the number of labels n>2, we cannot find proper learning functions that are similar to Logistic functions for multilabel learning. We call this problem the "Multilabel-Learning-for-New-P(x) Problem".
Machine learning is to acquire and convey information, so the information criterion should be a good criterion. In 1974, Akaike [4] proved that the ML criterion is equivalent to the minimum Kullback-Leibler (KL) divergence criterion, where the KL divergence [5] is also called "KL information". Since then, the information criterion, especially an information criterion that is compatible with the likelihood criterion, has been attracting more researchers' attention [6]. However, the KL divergence decreases as the likelihood increases, and hence the least KL divergence criterion is not ideal as an information criterion. Can we use Shannon's mutual information or another information measure as the information criterion?
In 1948, Shannon [7] initiated the classical information theory. In 1949, Weaver [8] proposed the three levels of communication: the technical problem resolved by Shannon, the semantic problem relating to meaning and truth, and the effectiveness problem concerning information values. In 1952, Carnap and Bar-Hillel [9] proposed an outline of a semantic information theory. Now we can find several different semantic information theories [10-13]. There are also some fuzzy information theories [14-16] and generalized information theories [17,18] related to semantic information theories. Recently, some researchers have used the Shannon mutual information measure with parameters to optimize neural networks [19,20].
However, before the author, no one brought a learning function with parameters into Shannon's mutual information formula to optimize the learning function and the classifier by a sampling distribution. Therefore, the author tried to develop a semantic information theory that can combine Shannon's information theory and Fisher's likelihood method. The semantic information G theory, or the G theory, serves this purpose. This theory has been developed mainly by the author over the recent three decades [21-26]. It uses the membership functions of fuzzy sets proposed by Zadeh [27] as learning functions and treats a membership function as the truth function of a hypothesis. According to Tarski's truth theory [28] and Davidson's truth-conditional semantics [29], the truth function can represent the semantic meaning of a hypothesis.
"The G theory" is so named because, in this theory, the Semantic Mutual Information (SMI) is a natural generalization of Shannon's Mutual Information (SHMI) ("G" means generalization), so that the SHMI is the upper limit of the SMI. G also denotes the SMI, just as D denotes the average distortion in Shannon's information rate-distortion theory [30]. Replacing D with G, the author reformed the rate-distortion function R(D) into the rate-verisimilitude function R(G) [24,25], not only for data compression but also for machine learning.
The G theory has two headstreams: Shannon's information theory and Popper's hypothesis-testing theory [31] (p. 96, 269) [32] (p. 294), which emphasizes that a hypothesis with a smaller logical probability can convey more information if it can survive empirical tests, and hence is more preferable.
The semantic information formula proposed by Carnap and Bar-Hillel [9] contains Popper's thought. It is I(p) = log[1/mp], where p is a hypothesis and mp is its logical probability. However, this formula does not deal with whether the hypothesis can survive empirical tests. Therefore, the author of this paper improved it into Eq. (2.15) in Section 2.1.1. Further, bringing likelihood functions and truth functions into Shannon's mutual information formula, the author obtained the SMI formula. Cross-entropy is becoming popular in machine learning [33]. The G theory uses not only cross-entropy but also mutual cross-entropy [22,25]; the SMI in the G theory is a mutual cross-entropy.
To resolve the Multilabel-Learning-for-New-P(x) problem, the author found a new inference method, Logical Bayesian Inference (LBI). Bayesians include subjective Bayesians and logical Bayesians. BI was developed by subjective Bayesians, who use subjective probability as the tool for statistical inference. Logical Bayesians, such as Keynes and Carnap [34], use logical probability, including truth functions, as the tool for inductive logic. The name "Logical Bayesian Inference" is used because the (fuzzy) truth function, instead of the Bayesian posterior, is used as the main inference tool. In LBI, both statistical probability and logical probability are used for inference between statistics and logic. BI fits cases with a given prior distribution of the predictive model θ, whereas LBI fits cases with a given prior distribution of instances. Besides Shannon's information theory and Popper's hypothesis-testing theory, the G theory also inherits, absorbs, or is compatible with:


• Akaike's proof that the Maximum Likelihood criterion is equivalent to the Minimum KL divergence criterion [37];
• the Donsker-Varadhan representation as a generalized KL formula with the Gibbs density [38];


• Wittgenstein's thought that meaning lies in uses [39] (p. 80);
• Bayes' Theorem [40], which can be extended to link likelihood functions and membership functions [41].

Based on the G theory and LBI, the author developed a group of algorithms, the Channel Matching (CM) algorithms [41-44], for machine learning. In the CM algorithms, the semantic channel and the Shannon channel mutually match to achieve maximum information (for classifications) or maximum information efficiency G/R (for mixture models)¹.
These algorithms are used mainly for:
• making probability predictions with prior knowledge;
• multilabel learning, which belongs to supervised learning;
• the Maximum Mutual Information (MMI) classification of unseen instances, which belongs to semi-supervised learning;
• mixture models, which belong to unsupervised learning.
Each of these tasks is difficult and not well resolved. The purpose of this paper is to introduce the G theory, LBI, and the CM algorithms completely, with background knowledge and applications, so that readers can fully understand them, especially how to use them to resolve the Multilabel-Learning-for-New-P(x) Problem.
Partial contents of this paper have been introduced in several short papers published in conference proceedings [41-44]. Some contents introduced before are improved in this paper; for example, the one-dimensional examples for the MMI classifications and mixture models now become two-dimensional examples, and the two formulas for the confirmation measure now become one formula.
To the author's knowledge, no one else has used a semantic information measure to optimize membership functions or truth functions with parameters by sampling distributions; no one else has distinguished the statistical probability and the logical probability of a hypothesis and used both in the same formula; no one else has proposed the semantic channel with its mathematical representation; and no other confirmation measure is compatible with Popper's falsification theory.
This paper also compares the CM algorithms with some popular methods to show their efficiencies.

Methods I: Background

2.1. From the Shannon Information Theory to the Semantic Information G Theory

2.1.1. From Shannon's Mutual Information to Semantic Mutual Information

Definition 1:
• x: an instance or data point; X: a discrete random variable taking a value x∈U={x1, x2, …, xm}.
• y: a hypothesis or label; Y: a discrete random variable taking a value y∈V={y1, y2, …, yn}.
• P(yj|x) (with certain yj and variable x): a Transition Probability Function (TPF), the term used by Shannon [7]. Shannon calls P(X) the source, P(Y) the destination, and P(Y|X) the channel.

A Shannon channel is a transition probability matrix or a group of transition probability functions:

P(Y|X) ⇔ {P(yj|x)}, j=1, 2, …, n,
where  means equivalence.Note TPF P(yj|x) is not normalized unlike conditional probability function P(y|xi), in which y is variable, and xi is constant.We will discuss that the TPF can be used for traditional Bayes' prediction in Section 2.2.1.Shannon's entropies of X and Y are: Shannon's posterior entropies of X and Y are: Shannon's mutual information are If Y=yj, mutual information I(X; Y) will become Kullback-Leibler (KL) divergence: Some researchers use the following formula to measure the information between xi and yj: Since I(xi; yj) may be negative, Shannon did not use it.Shannon explains that information is the reduced uncertainty or the saved average code word length.The author thinks that the above formula is meaningful because negative information means that a bad prediction may increase the uncertainty or the code word length.Since Shannon's information theory cannot measure semantic information, Carnap and Bar-Hillel proposed a semantic information formula: I(p)=log [1/mp].Since I(p) is irrelative to whether the prediction is correct or not, this formula is unpractical.
Zhong [12] makes use of DeLuca and Termini's fuzzy entropy [14] to define a semantic information measure (2.9), where tj is "the logic truth" of yj. However, according to this formula, whether tj=1 or tj=0, the information reaches its maximum of 1 bit. This result is not expected; therefore, this formula is unreasonable. The same problem is met by other semantic or fuzzy information theories that use DeLuca and Termini's fuzzy entropy [14]. Floridi's semantic information formula [11,36] is a little complicated. It can ensure that the information conveyed by a tautology or a contradiction reaches its minimum of 0. However, according to common sense, a wrong prediction or a lie is worse than a tautology. As to how the semantic information is related to the deviation, and how the amount of semantic information of a correct prediction differs from that of a wrong prediction, we cannot get clear answers from his formula.
The author proposed an improved semantic information measure in 1990 [21] and developed the G theory later.
According to Tarski's truth theory [28], P(X∈θj) is equivalent to P("X∈θj" is true)=P(yj is true). According to Davidson's truth-conditional semantics [29], the truth function of yj ascertains the semantic meaning of yj. Following Tarski and Davidson, we define as follows:

Definition 2:
• θj is a fuzzy subset of U and is used to explain the semantic meaning of a predicate yj(X)="X∈θj". The θj is also a model or a group of model parameters. If θj is non-fuzzy, we may replace it with Aj.
• A probability defined with "=", such as P(yj)=P(Y=yj), is a statistical probability; a probability defined with "∈", such as P(X∈θj), is a logical probability. To distinguish P(Y=yj) and P(X∈θj), we define T(θj)=P(X∈θj) as the logical probability of yj.
• T(θj|x)=P(x∈θj)=P(X∈θj|X=x) is the conditional logical probability function of yj; it is also called the (fuzzy) truth function of yj or the membership function of θj.

A group of TPFs P(yj|x), j=1,2,…,n, forms a Shannon channel, whereas a group of membership functions T(θj|x), j=1,2,…,n, forms a semantic channel:

T(θ|X) ⇔ {T(θj|x)}, j=1, 2, …, n.

The Shannon channel P(Y|X) and the semantic channel T(θ|X) are illustrated in Figure 1.
The Shannon channel indicates the correlation between X and Y, whereas the semantic channel indicates the fuzzy denotations of a set of labels. For the weather forecasts between an observatory and its audience, the Shannon channel indicates the rule by which the observatory selects labels or forecasts, whereas the semantic channel indicates the semantic meanings of these forecasts as understood by the audience.
The expectation of the truth function is the logical probability

T(θj) = ∑i P(xi) T(θj|xi),

which was proposed by Zadeh [35] as the probability of a fuzzy event. This logical probability is a little different from mp defined by Carnap and Bar-Hillel [9]. The latter rests only on a hypothesis' denotation. For example, let yj be the hypothesis "X is infected by HIV" (HIV: Human Immunodeficiency Virus) or the label "HIV-infected". Its logical probability T(θj) is very small for normal people because HIV-infected people are rare. However, mp is irrelevant to P(x); it may be 1/2. Note that the statistical probability is normalized whereas the logical probability is not. When θ0, θ1, …, θn form a cover of U, there are P(y0)+P(y1)+…+P(yn)=1 and T(θ0)+T(θ1)+…+T(θn)≥1.
For example, U is a set of different ages. There are three non-fuzzy subsets of U (see Figure 2): A1={adults}={x|x≥18}, A0={juveniles}={x|x<18}=A1^c (the superscript c means a complementary set), and A2={young people}={x|15≤x≤35}. The three sets form a cover of U. There is T(A0)+T(A1)=1. If T(A2)=0.3, the sum of the three logical probabilities is 1.3>1. However, the sum of the three statistical probabilities P(y0)+P(y1)+P(y2) must be less than or equal to 1. If y2 is correctly used, P(y2) will change from 0 to 0.3. If A0, A1, and A2 become fuzzy sets, the conclusion is the same. Consider a tautology, "There will be rain or will not be rain tomorrow". Its logical probability is 1, whereas its statistical probability is close to 0 because it is rarely selected.
We can put T(θj|x) and P(x) into the Bayes' formula to obtain a likelihood function [21]:

P(x|θj) = P(x)T(θj|x)/T(θj), T(θj) = ∑i P(xi)T(θj|xi).    (2.12)

P(x|θj) can be called the semantic Bayes' prediction or the semantic likelihood function. According to Dubois and Prade's paper [45], Thomas [46] and others proposed similar formulas earlier.
Assume that the maximum of T(θj|x) is 1. From P(x) and P(x|θj), we can obtain

T(θj|x) = [P(x|θj)/P(x)] / max[P(x|θj)/P(x)],    (2.13)

where max[.] means the maximum of the function in [.] over different x. The author [41] proposed the third type of Bayes' theorem, which consists of the above two formulas. This theorem can convert the likelihood function and the membership function (or the truth function) from one to another when P(x) is given. Eq. (2.13) is compatible with Wang's fuzzy-sets falling-shadow theory [41,47]. Figure 2 illustrates the relationship between P(x|θj) and T(θj|x) for given P(x), where x is an age, the label yj="Youth", and θj is a non-fuzzy set and hence becomes Aj. We use the Global Positioning System (GPS) as an example to show the semantic Bayes' prediction.
Example 2.1. A GPS device is used in a train, and hence P(x) is uniformly distributed on a line (see Figure 3). The GPS pointer has a deviation. Try to find the most possible position of the GPS device. The semantic meaning of the GPS pointer can be expressed by

T(θj|x) = exp[-(x-xj)²/(2σ²)],    (2.14)

where xj is the position pointed to by yj, and σ is the Root Mean Square (RMS) deviation. For simplicity, here we assume x is one-dimensional. According to Eq. (2.12), we can predict that the star is the most possible position. Most people can make the same prediction without using any mathematical formula. It seems that human brains automatically use a similar method: predicting according to the fuzzy denotation of yj.
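As a concrete illustration of the third type of Bayes' theorem (Eqs. (2.12) and (2.13)) with a Gaussian truth function like Eq. (2.14), here is a minimal numerical sketch; the prior and the truth-function parameters are made up, not from the paper's experiments.

```python
import numpy as np

x = np.arange(0, 100)                       # a discretized one-dimensional axis
P_x = np.exp(-0.5 * ((x - 30) / 15) ** 2)   # an illustrative prior P(x)
P_x /= P_x.sum()

# An illustrative Gaussian truth function T(theta_j|x), as in Eq. (2.14)
T_x = np.exp(-0.5 * ((x - 22) / 6) ** 2)

# Eq. (2.12): semantic Bayes' prediction P(x|theta_j) = P(x)T(theta_j|x)/T(theta_j)
T_theta = np.sum(P_x * T_x)                 # logical probability T(theta_j)
P_x_given_theta = P_x * T_x / T_theta       # a normalized likelihood function

# Eq. (2.13): recover the truth function from P(x) and P(x|theta_j)
ratio = P_x_given_theta / P_x
T_recovered = ratio / ratio.max()

print(np.allclose(T_recovered, T_x / T_x.max()))   # True: the two conversions are inverse
```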
In semantic communication, we often see hypotheses or predictions such as "The temperature is about ten ˚C", "Time is about seven O'clock", and "The stock index will go up about 10% next month".Every one of them may be represented by yj="X is about xj".We can also express their truth functions by Eq. (2.14).
The author defines the (amount of) semantic information conveyed by yj about xi with the log normalized likelihood:

I(xi; θj) = log [T(θj|xi)/T(θj)] = log [P(xi|θj)/P(xi)].    (2.15)

Bringing Eq. (2.14) into this formula, we have

I(xi; θj) = log [1/T(θj)] - (xi-xj)²/(2σ²),

by which we can explain that this information is equal to the Carnap-Bar-Hillel information minus the relative deviation squared. This formula is illustrated in Figure 4. It indicates that the less the logical probability is, the more information there is; the larger the deviation is, the less information there is; and a wrong hypothesis conveys negative information. These conclusions accord with Popper's thought [32] (p. 294).
Averaging I(xi; θj) over xi, we have the generalized KL information

I(X; θj) = ∑i P(xi|yj) log [T(θj|xi)/T(θj)],

where P(xi|yj) (i=1,2,…) is the sampling distribution, which may be unsmooth or discontinuous.
Akaike [4] proved that the Least KL divergence criterion is equivalent to the Maximum Likelihood (ML) criterion. Following Akaike, we can prove that the Maximum Semantic Information (MSI) criterion, i.e., the maximum generalized KL information criterion, is also equivalent to the ML criterion.
Definition 3: D is a sample with labels: {(x(t), y(t))|t=1 to N; x(t)∈U; y(t)∈V}, which includes n different sub-samples or conditional samples Xj, j=1, 2, ….Every sub-sample includes data points x(1), x(2), …, x(Nj)∈U with label yj.If Xj is large enough, we can obtain distribution P(x|yj) from Xj.If yj is unknown in Xj, we replace Xj with X and P(x|yj) with P(x|.).
Assume that there are Nj data points in Xj, and Nji of them are xi. When the Nj data points in Xj come from Independent and Identically Distributed (IID) random variables, we have the likelihood

P(Xj|θj) = ∏i P(xi|θj)^Nji, i.e., log P(Xj|θj) = Nj ∑i P(xi|yj) log P(xi|θj).

I(X; θj) and log P(Xj|θj) reach their maxima at the same time as θj changes, and hence the two criteria are equivalent. It is easy to prove that, when P(x|θj)=P(x|yj), I(X; θj) and log P(Xj|θj) reach their maxima.
When the sample Xj is huge, letting P(x|θj)=P(x|yj), we can obtain the optimized truth function

T*(θj|x) = [P(x|yj)/P(x)] / max[P(x|yj)/P(x)].    (2.20)

Since P(x|yj)/P(x)=P(yj|x)/P(yj), Eq. (2.20) is equivalent to

T*(θj|x) = P(yj|x) / max[P(yj|x)].    (2.21)

This formula clearly indicates how the semantic channel matches the Shannon channel. It is compatible with Wittgenstein's thought that meaning lies in uses [39] (p. 80).
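A minimal sketch of Eqs. (2.20) and (2.21), using a small made-up joint distribution P(x, y) to show that the two routes give the same optimized truth function; the numbers are illustrative only.

```python
import numpy as np

# An illustrative joint sample distribution P(x, y) over 5 instance values and 2 labels
P_xy = np.array([[0.30, 0.02],
                 [0.25, 0.05],
                 [0.10, 0.10],
                 [0.03, 0.10],
                 [0.01, 0.04]])
P_x = P_xy.sum(axis=1)            # P(x)
P_y = P_xy.sum(axis=0)            # P(y)
P_x_given_y = P_xy / P_y          # columns: P(x|yj)
P_y_given_x = (P_xy.T / P_x).T    # rows: P(yj|x), the Shannon channel

j = 1
# Eq. (2.20): T*(theta_j|x) from P(x|yj) and P(x)
ratio = P_x_given_y[:, j] / P_x
T1 = ratio / ratio.max()
# Eq. (2.21): the same truth function obtained from the TPF P(yj|x)
T2 = P_y_given_x[:, j] / P_y_given_x[:, j].max()

print(np.allclose(T1, T2))        # True: the two routes agree
```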
Averaging I(X; θj) over different yj, we have the Semantic Mutual Information (SMI) formula

I(X; θ) = ∑j ∑i P(xi)P(yj|xi) log [T(θj|xi)/T(θj)] = ∑j ∑i P(xi, yj) log [P(xi|θj)/P(xi)].    (2.22)
If P(x|θj)=P(x|yj), or T(θj|x)∝P(yj|x), for every yj, which means that the semantic channel matches the Shannon channel, the SMI is equal to Shannon's Mutual Information (SHMI).
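A minimal sketch of Eq. (2.22) with a made-up source and Shannon channel; it also checks that the SMI equals the SHMI when the semantic channel matches the Shannon channel, and is smaller otherwise. Everything here is illustrative.

```python
import numpy as np

def smi(P_x, P_y_given_x, T):
    """Semantic mutual information I(X; theta), Eq. (2.22).
    P_x: shape (m,); P_y_given_x: Shannon channel, shape (m, n); T: truth functions, shape (m, n)."""
    T_theta = P_x @ T                      # logical probabilities T(theta_j)
    P_xy = P_y_given_x * P_x[:, None]      # joint distribution P(x, y)
    return np.sum(P_xy * np.log2(T / T_theta))

P_x = np.array([0.2, 0.5, 0.3])
P_y_given_x = np.array([[0.9, 0.1],
                        [0.5, 0.5],
                        [0.2, 0.8]])
P_y = P_x @ P_y_given_x

# Matched semantic channel: T(theta_j|x) proportional to P(yj|x), as in Eq. (2.21)
T_matched = P_y_given_x / P_y_given_x.max(axis=0)
shmi = np.sum(P_y_given_x * P_x[:, None] * np.log2(P_y_given_x / P_y))
print(round(smi(P_x, P_y_given_x, T_matched), 6), round(shmi, 6))   # equal when matched

# A mismatched semantic channel conveys less information than the SHMI
T_bad = np.array([[1.0, 0.6], [0.8, 0.8], [0.6, 1.0]])
print(round(smi(P_x, P_y_given_x, T_bad), 6))
```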
Bringing Eq. (2.13) into the above formula, we have

I(X; θ) = H(θ) - H(θ|X),
H(θ) = -∑j P(yj) log T(θj),  H(θ|X) = -∑j ∑i P(xi, yj) log T(θj|xi).

It is easy to find that the maximum SMI criterion is a special Regularized Least Squares (RLS) criterion [33]: H(θ|X) plays the role of the mean squared error, and H(θ) plays the role of the negative regularization term. The difference is that H(θ) does not penalize every parameter; it only penalizes parameters for relative deviations. What is important is that the maximum SMI criterion is also compatible with the ML criterion.

2.1.2. From the Rate-Distortion Function R(D) to the Rate-Verisimilitude Function R(G)
The R(G) function will be used to explain the convergence of iterative algorithms for the MMI classifications and mixture models.
Shannon proposed the rate-distortion function R(D) [30]. R(G) [25] is an extension of R(D). In R(D), R is the information rate and D is the upper limit of the average distortion. For given D, R(D) is the minimum of the SHMI I(X; Y).
The rate-distortion function with parameter s [48] (p. 32) includes two formulas:

D(s) = ∑i ∑j P(xi)P(yj) exp(s dij) dij / λi,
R(s) = sD(s) - ∑i P(xi) log λi,

where λi = ∑j P(yj) exp(s dij) is the partition function.
Let dij be replaced with Iij = I(xi; θj) = log[T(θj|xi)/T(θj)] = log[P(xi|θj)/P(xi)], and let G be the lower limit of I(X; θ). The information rate for given G and source P(X) is defined as

R(G) = min over P(Y|X) with I(X; θ)≥G of I(X; Y).    (2.25)

Popper [32] proposed to use verisimilitude instead of correctness to evaluate a hypothesis.
Verisimilitude includes both correctness and precision. Hence, I(xi; θj) can be a good measure of the verisimilitude of yj reflecting xi, and we call R(G) the rate-verisimilitude function.
Following the derivation of R(D) [48] (p. 31), we can obtain

G(s) = ∑i ∑j P(xi)P(yj) mij^s Iij / λi,
R(s) = sG(s) - ∑i P(xi) log λi,    (2.26)

where mij = T(θj|xi)/T(θj) = P(xi|θj)/P(xi) is the normalized likelihood and λi = ∑j P(yj) mij^s. The shape of any R(G) function is a bowl-like curve with second derivative > 0, as shown in Figure 5. In Figure 5, s = dR/dG. When s=1, R is equal to G, which means that the semantic channel matches the Shannon channel. G/R indicates the efficiency of the semantic communication. In Section 3.4, we will see that solving a mixture model amounts to finding a parameter set θ that maximizes G/R so that G/R is close to 1, i.e., G≈R.
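As an illustration of Eq. (2.26), the following sketch evaluates a point (G(s), R(s)) for a toy source and a toy semantic channel at a chosen s. The numbers are made up, and the output distribution P(y) is obtained by a simple fixed-point iteration in the style of the Blahut-Arimoto algorithm; this iterative detail is an assumption for illustration, not the paper's stated procedure.

```python
import numpy as np

P_x = np.array([0.4, 0.3, 0.3])                # source P(x)
T = np.array([[0.9, 0.1],                      # illustrative truth functions T(theta_j|x)
              [0.5, 0.6],
              [0.1, 1.0]])

def rg_point(P_x, T, s, iters=200):
    T_theta = P_x @ T                          # logical probabilities T(theta_j)
    m = T / T_theta                            # m_ij = T(theta_j|x_i)/T(theta_j)
    I = np.log2(m)                             # I_ij = log m_ij
    P_y = np.full(T.shape[1], 1.0 / T.shape[1])
    for _ in range(iters):                     # alternate channel and output updates
        w = P_y * m ** s
        Q = w / w.sum(axis=1, keepdims=True)   # parametric channel Q(y_j|x_i)
        P_y = P_x @ Q
    P_y_marg = P_x @ Q
    G = np.sum(P_x[:, None] * Q * I)                              # semantic information
    R = np.sum(P_x[:, None] * Q * np.log2(Q / P_y_marg))          # Shannon information
    return G, R

for s in (0.5, 1.0, 2.0):
    G, R = rg_point(P_x, T, s)
    print(f"s={s}: G={G:.4f} bit, R={R:.4f} bit, G/R={G/R:.3f}")
```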
When s→∞, both R and G reach their maxima Rmax and Gmax. As s increases, the TPFs P(yj|x), j=1,2,…,n, become sharper, and the Shannon channel has less noise; hence, R and G both increase. This property of R(G) can be used to prove the convergence of the CM iteration algorithm for the MMI classifications of unseen instances.
The R(G) function is different from the R(D) function. For a given R, we have the maximum value G+ and the minimum value G−. G− is negative and means that, to bring a certain information loss |G−| to enemies, we also need a certain amount of objective information R. When R=0, G is negative, which means that if we listen to someone who randomly predicts, the information that we already have will be reduced.
The R(G) function was developed mainly for image compression according to visual discrimination [25].Now it can be used for the convergence proofs of MMI classifications and mixture models.

Traditional Bayes' Prediction, Likelihood Inference(LI), and Bayesian Inference (BI)
To understand LBI better, we first review the Traditional Bayes' Prediction (TBP), LI, and BI. We call the inference with the TPF P(yj|x) the Traditional Bayes' Prediction (TBP). Using TBP, for given P(x) and P(yj|x), we can make the probability prediction

P(x|yj) = P(x)P(yj|x)/P(yj).    (2.27)

When P(yj|x) is replaced with kP(yj|x), where k is a constant, P(x|yj) is unchanged because

P(x)kP(yj|x) / ∑i P(xi)kP(yj|xi) = P(x)P(yj|x) / ∑i P(xi)P(yj|xi) = P(x|yj).    (2.28)

Using this formula, we can easily explain that a truth function that is proportional to a TPF can be used for the same probability prediction.
For given P(yj), P(x|yj), and P(x), we can obtain the predictive model, i.e., the TPF

P(yj|x) = P(yj)P(x|yj)/P(x).    (2.29)

We use the medical test (or signal detection) as an example to explain how the TPFs and the Shannon channel serve as a predictive model.
Definition 4: Let z be an observed feature of an unseen instance (see Figure 6); Z is a random variable taking a value z∈C={z1, z2, …}. For unseen instance classifications, x denotes a true class or true label.
Assume that we classify every unseen instance, whose true label x is unseen, according to its observed feature z∈C. That is, we need a classifier y=f(z) to select a label y (see Figure 6). We use the HIV test to explain that the TPF can be used for probability predictions with different P(x). For an infected testee x1, the conditional probability P(y1|x1) of y1=positive is called the sensitivity, i.e., the true positive rate. For an uninfected testee x0, the conditional probability P(y0|x0) of y0=negative is called the specificity, i.e., the true negative rate [49]. The sensitivity and specificity ascertain a Shannon channel, as shown in Table 1.
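Example 2.2 below carries out this calculation. For reference, here is a minimal sketch of the traditional Bayes' prediction (Eq. (2.27)) for such a test; the sensitivity and specificity are assumed values chosen to be plausible for a rapid HIV test, not copied from Table 1, and with these assumptions the outputs happen to be close to the numbers quoted in Example 2.2.

```python
# Assumed test parameters (illustrative, not the paper's Table 1 values)
sensitivity = 0.917   # P(y1|x1): positive result given infected
specificity = 0.999   # P(y0|x0): negative result given uninfected

def posterior_infected(prior):
    """P(x1|y1) from P(x|y1) = P(x)P(y1|x)/P(y1)."""
    p_positive = prior * sensitivity + (1 - prior) * (1 - specificity)
    return prior * sensitivity / p_positive

for prior in (0.0001, 0.002, 0.1):   # the priors used in Example 2.2
    print(f"P(x1)={prior}: P(x1|y1)={posterior_infected(prior):.3f}")
```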
Example 2.2. Calculate P(x1|y1) using P(y1|x) in Table 1 for P(x1)=0.0001, 0.002 (for normal people), and 0.1 (for a high-risk crowd). Solution: Using Eq. (2.28), for P(x1)=0.0001, 0.002, and 0.1, we have P(x1|y1)=0.084, 0.65, and 0.99, respectively.
Using Likelihood Inference (LI), it is not easy to solve this example. Nevertheless, when x takes one of many different values and the TPFs are not smooth, we may need LI. If samples are small, so that the TPFs are not smooth or even not continuous, we cannot use a TPF to obtain a continuous P(x|yj). That is why we use LI, which uses parameters to construct smooth likelihood functions. Using the MLE, we can train a likelihood function with a sample Xj to obtain the best θj:

θj* = arg max over θj of ∑i P(xi|yj) log P(xi|θj), or θj* = arg max over θj of ∑i P(xi|.) log P(xi|θj),

where P(xi|.) means that yj is unknown. The main defect of LI is that it cannot make use of prior knowledge, and when P(x) is changed, the optimized likelihood function becomes invalid.
To make use of prior knowledge, subjective Bayesians developed Bayesian Inference (BI) [2,3]. They brought the prior distribution P(θ) of θ into the Bayes' Theorem to obtain the Bayesian posterior

P(θ|X) = P(X|θ)P(θ)/Pθ(X),

where Pθ(X) is the normalizing constant related to θ, and P(θ|X) is the posterior distribution of θ, i.e., the Bayesian posterior. Using P(θ|X), we can derive the Maximum A Posteriori (MAP) estimation

θ* = arg max over θ of P(X|θ)P(θ),

where Pθ(X) is neglected. BI has some advantages:
• It is especially suitable to cases where Y is a random variable produced by a frequency generator, such as a die.
• As the sample size increases, the distribution P(θ|X) gradually shrinks to the θj* that comes from the MLE.
• BI can make use of prior knowledge better than LI.
However, BI also has some disadvantages:
• The probability prediction according to BI [3] is not compatible with the traditional Bayes' prediction.
• P(θ) is subjectively selected.
• BI cannot make use of the prior of x.
If we try to use BI to solve Example 2.1 and Example 2.2, we will find that the Bayesian posterior is not as good as the TPF P(yj|x). Therefore, to make use of the prior of x, we still want a parameterized TPF P(θj|x).

From Fisher's Inverse Probability Function P(θj|x) to Logical Bayesian Inference (LBI)
De Morgan first used the term "inverse probability" for the TPF P(yj|x) in connection with Laplace's method of probability [2]. The corresponding direct probability is P(x|yj). Later, Fisher called the likelihood function P(x|θj) the direct probability and the parameterized TPF P(θj|x) the inverse probability [2]. We use θj instead of θ and x instead of xi to emphasize that θj is a constant and x is a variable, and hence P(θj|x) is a function. In the following, we call P(θj|x) the Inverse Probability Function (IPF). According to Bayes' theorem, there is

P(x|θj) = P(x)P(θj|x)/P(θj).    (2.34)

The IPF P(θj|x) can make use of the prior knowledge P(x) well. When P(x) becomes P'(x), we can still obtain P'(x|θj) from P'(x) and P(θj|x).
When n=2, we can easily construct P(θj|x), j=1,2, with parameters. For instance, we can use a pair of Logistic (Sigmoid) functions as the IPFs. Unfortunately, when n>2, it is hard to construct P(θj|x), j=1,2,…,n, because of the normalization limitation ∑j P(θj|x)=1 for every x. That is why a multi-class or multilabel classification is often converted into several binary classifications [51,52]. P(θj|x) and P(yj|x) as predictive models also have a serious disadvantage: in many cases, we only know P(x) and P(x|yj) without knowing P(θj) or P(yj), so that we cannot obtain P(yj|x) or P(θj|x). Nevertheless, we can still obtain the truth function T(θj|x) in these cases. For LBI, there is no normalization limitation, and hence it is easy to construct a group of truth functions and train them with P(x) and P(x|yj), j=1,2,…,n, without P(yj) or P(θj). That is an important reason why we need LBI.
When a sample Xj is huge, we can directly obtain T*(θj|x) from Eq. (2.20). For a size-limited sample, we can use the generalized KL information to obtain

T*(θj|x) = arg max over T(θj|x) of ∑i P(xi|yj) log [T(θj|xi)/T(θj)].    (2.35)

This formula is the main formula used in LBI. LBI provides the Maximum Semantic Information Estimation (MSIE)

θj* = arg max over θj of ∑i P(xi|yj) log [T(θj|xi)/T(θj)],

which is compatible with the MLE. If samples are big enough, the MSIE, the MLE, and the MAP are equivalent. We suggest using the truth function as the predictive model or the inference tool in LBI in some cases because the truth function has the following advantages:
• We can use an optimized truth function T*(θj|x) to make probability predictions for different P(x), as well as we can with P(yj|x) or P(θj|x).
• We can train a truth function with parameters by a sample with small size, as well as we can train a likelihood function.
• The truth function can indicate the semantic meaning of a hypothesis or the denotation of a label.
• It is also the membership function, which is suitable for classification.
• To train a truth function T(θj|x), we only need P(x) and P(x|yj), without needing P(yj) or P(θj).
• Letting T(θj|x)∝P(yj|x), we can set a bridge between statistics and logic.
The CM algorithms further reveal these advantages. We use CM1 to denote the basic matching algorithm, in which the semantic channel matches the Shannon channel and membership functions or truth functions are used as learning functions.
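A minimal sketch of Eq. (2.35), assuming a Gaussian-shaped truth function with two free parameters (a center and a width) optimized by a simple grid search; the prior and the sampling distribution are synthetic, and the parameter grid is an illustrative choice.

```python
import numpy as np

x = np.arange(0, 100)
P_x = np.exp(-0.5 * ((x - 40) / 20) ** 2); P_x /= P_x.sum()          # prior P(x), illustrative
rng = np.random.default_rng(0)
# A noisy sampling distribution P(x|yj) for one label, illustrative
P_x_given_y = np.exp(-0.5 * ((x - 25) / 8) ** 2) * (1 + 0.2 * rng.random(x.size))
P_x_given_y /= P_x_given_y.sum()

def gkl(c, s):
    """Generalized KL information I(X; theta_j) for a Gaussian truth function."""
    T = np.maximum(np.exp(-0.5 * ((x - c) / s) ** 2), 1e-300)
    T_theta = np.sum(P_x * T)
    return np.sum(P_x_given_y * np.log2(T / T_theta))

# Grid search over the truth function's center c and width s, maximizing Eq. (2.35)
cs, ss = np.arange(10, 60, 0.5), np.arange(2, 20, 0.5)
best = max((gkl(c, s), c, s) for c in cs for s in ss)
print(f"optimized center={best[1]}, width={best[2]}, I(X;theta_j)={best[0]:.3f} bit")
```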

Methods II: The Channel Matching (CM) Algorithms
Assume that x is an age; yj is a label "Youth"; θj is a fuzzy set {x|x is a youth}.From population statistics, we can obtain population age distribution P(x) and posterior distribution P(x|yj).
If the sample is huge, and hence the distributions P(x) and P(x|yj) are smooth, we can directly use Eq. (2.20) to obtain the optimized membership function T*(θj|x) without parameters. If P(x) and P(x|yj) are not smooth, we can use Eq. (2.35) to obtain T*(θj|x) with parameters. Since P(yj) is not needed, every label's learning in CM1 is independent.
If the given sampling distribution is the transition probability function P(yj|x), we may assume that P(x) is flat. Then Eq. (2.35) becomes

T*(θj|x) = arg max over T(θj|x) of ∑i P(yj|xi) log [T(θj|xi)/T(θj)].    (3.1)

If P(yj|x) is smooth, we can use Eq. (2.21) to obtain T*(θj|x) without parameters. For multilabel learning, we can directly obtain a group of truth functions from a Shannon channel P(Y|X) or a sample with distribution P(x, y). In contrast, using Binary Relevance, we have to prepare several samples for several Logistic functions.
When P(x) is changed, T*(θj|x) is still useful for making semantic Bayes' predictions.
3.1.2. For the Confirmation Measure of Major Premises

We use "degree of confirmation" to denote the "degree of belief" supported by evidence or samples.
Bayesians use "degree of belief" to explain the subjective probability of a hypothesis. This degree of belief is between 0 and 1. However, researchers of induction use "degree of belief" to evaluate if-then statements or major premises. This degree of belief should be between -1 and 1. In this paper, we use the "degree of belief" between -1 and 1 for the subjective evaluation of if-then statements. We know that the correlation coefficient between two random variables is also between -1 and 1. The difference is that if-then statements are asymmetric; there is more than one major premise, and hence more than one degree of belief, between an instance X and a hypothesis Y. Now we use the medical test as an example to explain how to use truth functions to replace TPFs, i.e., how to use the semantic channel to replace the Shannon channel, for probability predictions.
From the Shannon channel in Table 1, we can derive the semantic channel shown in Table 2. For non-fuzzy hypotheses, assume T(θ1|x1)=T(θ0|x0)=1 and T(θ1|x0)=T(θ0|x1)=0. The two truth functions for the corresponding fuzzy hypotheses are

T(θ1|x1)=1, T(θ1|x0)=b1' and T(θ0|x0)=1, T(θ0|x1)=b0',

where b1=b(y1→x1) is the degree of belief of the major premise MP1 = y1→x1 = "if Y=y1 then X=x1", and b1'=1-|b1| is the degree of disbelief of MP1 and the ratio of a tautology in y1. Likewise, b0=b(y0→x0) and b0'=1-|b0|. The prediction with this semantic channel is equivalent to the traditional Bayes' prediction with the TPF P(yj|x). Even if we only know P(x|y1) and P(x) without knowing P(y1), we can still make the probability prediction. It is easy to verify that, using Eq. (3.6) to solve Example 2.2, the results are the same as those from the traditional Bayes' prediction.
If we try to use LI or BI to solve Example 3.1, we will find it is not easy for the model to fit different P(x).
In comparison with the Shannon channel in Table 1, the semantic channel in Table 2 is easier to understand and remember. To remember P(yj|x), we need to remember two numbers, whereas to remember T*(θ1|x), we only need to remember one number, b1'*.
In reference [41], the author provides two formulas for the positive and negative degrees of confirmation respectively. Now we can merge the two formulas into one:

b1* = [P(y1|x1) - P(y1|x0)] / max[P(y1|x1), P(y1|x0)].

For Rectifying the Parameters of a GPS Device
If we do not know the real parameters of a GPS device, or we suspect the parameters claimed by the producer, we can assume

T(θj|x) = exp[-||x - xj - △x||²/(2σ²)],    (3.9)

where x is a two-dimensional vector. Then we can use a sample to find the parameters △x, which is the systematic deviation, and σ. We may obtain the sample by driving a car with the GPS device on a big square and recording the relative positions x'=x-xj. From many relative deviations, we can obtain the sampling distribution P(x'|yj). Since we drive on a big square, P(x) should be flat. Then we can use the generalized KL information formula to obtain the optimized parameters △x* and σ*. We can use △x* to adjust yj by replacing yj with yk=yj+△x*. If the GPS device is often faulty, we can also use

T(θj|x) = b' + (1-b') exp[-||x - xj - △x||²/(2σ²)]

as the learning function to obtain the confirmation measure b* of the GPS device.
If one tries to use the inverse probability function P(θj|x) or the Bayesian posterior P(θ|X) for the same task and probability prediction (see Figure 3), one will find it very difficult, because we only have the prior knowledge P(x) from a GPS map, without the prior knowledge P(y) or P(θ).

3.2. CM2: The Semantic Channel and the Shannon Channel Mutually Match for Multilabel Classifications

We use CM2 to denote two steps:
• Matching I: Let the semantic channel match the Shannon channel, i.e., use CM1 for multilabel learning.
• Matching II: Let the Shannon channel match the semantic channel by using the Maximum Semantic Information (MSI) classifier.
Both steps use the MMI or ML criterion.
For multilabel learning, we may train every label by Eq. (2.35) or Eq. (3.1). We may also learn from the popular method [52] and use both P(x|yj) and P(x|yj'), where yj' is the negation of yj, to train T(θj|x) by

T*(θj|x) = arg max of {∑i P(xi|yj) log [T(θj|xi)/T(θj)] + ∑i P(xi|yj') log [T(θj^c|xi)/T(θj^c)]},

where θj^c is the complementary set of θj and T(θj^c|x)=1-T(θj|x). This T*(θj|x) may be a Logistic function, which covers a larger area of U than the T*(θj|x) from Eq. (2.35) or Eq. (3.1). If there are examples with one instance and several labels, or with several instances and one label, we can split such an example into several single-instance, single-label examples, as the popular method does [51]. Then we can obtain the Shannon channel P(Y|X) for multilabel learning.

For seen instance classifications, x is given. For Matching II, the MSI classifier is

yj* = f(x) = arg max over j of I(x; θj) = arg max over j of log [T(θj|x)/T(θj)].    (3.12)

Using T(θj), we can overcome the class-imbalance problem [50] and reduce the rate of failing to report smaller-probability events. If T(θj|x)∈{0,1}, the semantic information measure becomes Carnap and Bar-Hillel's semantic information measure, and the classifier becomes the minimum logical probability classifier:

yj* = f(x) = arg max over j with T(θj|x)=1 of log [1/T(θj)] = arg min over j with T(θj|x)=1 of log T(θj).    (3.13)

This criterion encourages us to select a compound label with a smaller denotation. For unseen instance classifications, or uncertain x, we only know P(x|z), and the MSI classifier becomes

yj* = f(z) = arg max over j of ∑i P(xi|z) log [T(θj|xi)/T(θj)].
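To make the MSI classifier for seen instances concrete, here is a minimal sketch; the three truth functions, the labels, and the prior are made up for illustration, not the ones learned in the experiments of Section 4.

```python
import numpy as np

x = np.arange(0, 100)                                    # ages
P_x = np.exp(-0.5 * ((x - 35) / 20) ** 2); P_x /= P_x.sum()

# Illustrative truth functions for three labels
labels = ["Adult", "Youth", "Old"]
T = np.stack([1 / (1 + np.exp(-(x - 18))),               # "Adult"
              np.exp(-0.5 * ((x - 22) / 8) ** 2),        # "Youth"
              1 / (1 + np.exp(-(x - 58) / 3))], axis=1)  # "Old"

T_theta = P_x @ T                                        # logical probabilities T(theta_j)
I = np.log2(T / T_theta)                                 # I(x; theta_j) for every x and j

# MSI classifier (Eq. (3.12)): for each x, choose the label with maximum semantic information
chosen = np.argmax(I, axis=1)
for age in (16, 25, 45, 70):
    print(age, labels[chosen[age]])
```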
To simplify multilabel learning, we may train fewer atomic labels and then use them, together with a fuzzy logic that is compatible with Boolean algebra [22], to produce the membership functions of compound labels for multilabel classifications [44].
In the popular method for multilabel classifications with the Bayes classifier or the MPP criterion, for different x, the classifier compares two IPFs P(θj|x) and P(θk|x), such as two Logistic functions, and selects the label with the greater IPF. This method is not compatible with the information criterion or the likelihood criterion.

3.3. CM3: The CM Iteration Algorithm for the MMI Classifications of Unseen Instances

We use CM3 to denote the CM iteration algorithm, which repeats the two steps Matching I and Matching II. CM2 is not an iterative algorithm, whereas CM3 is. This algorithm is used for the MMI classifications, for which the popular method is Gradient Descent.
We use the medical test as shown in Figure 6 as an example to explain the problem with the MMI classifications of unseen instances.
We need to optimize z' for the MMI. The problem is that, without the classifier f(z), we cannot express the mutual information I(X; Y), whereas, without an expression for the mutual information, we cannot optimize the classifier f(z). The same problem is met by the MLE for uncertain Shannon channels. To resolve this problem, researchers generally use parameters to construct partition boundaries and then use Gradient Descent or Newton's method to search for the best parameters for the MMI. The CM iteration algorithm for the MMI classifications is different: it uses numerical values to express boundaries and information gain (i.e., reward) functions, and it repeatedly updates the information gain functions and the boundaries to achieve the MMI.
Let Cj be a subset of C and yj=f(z|z∈Cj); hence S={C1, C2, …} is a partition of C. Our aim is, for P(x, z) from D, to find the optimized partition

S* = arg max over S of I(X; Y).    (3.15)
Matching I: Let the semantic channel match the Shannon channel and set the reward functions. First, we obtain the Shannon channel for the given S:

P(yj|x) = ∑ over z∈Cj of P(z|x), j=1, 2, …, n.

From this Shannon channel, we can obtain the semantic channel T(θ|X) and the semantic information I(xi; θj). Then, for a given z, we have the conditional information, or reward, functions

I(X; θj|z) = ∑i P(xi|z) I(xi; θj) = ∑i P(xi|z) log [T(θj|xi)/T(θj)], j=1, 2, …, n,

which are curved surfaces over the feature space when the feature is two-dimensional. We may directly let I(xi; θj)=I(xi; yj)=log[P(yj|xi)/P(yj)]. However, with the notion of the semantic channel, we can understand this algorithm and prove its convergence better.
Matching II: Let the Shannon channel match the semantic channel by the classifier

yj* = f(z) = arg max over j of I(X; θj|z).

Repeat Matching I and Matching II until S no longer changes. The convergent S is the S* we seek.
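The following is a minimal sketch of the CM iteration on a one-dimensional, discretized feature z with two true classes; the joint distribution P(x, z), the deliberately bad initial partition, and the stopping rule are illustrative simplifications rather than the paper's experimental setup.

```python
import numpy as np

# Synthetic P(x, z): two classes with Gaussian features on a discretized z axis
z = np.linspace(-6, 6, 121)
P_x = np.array([0.5, 0.5])
P_z_given_x = np.stack([np.exp(-0.5 * (z + 1.5) ** 2),
                        np.exp(-0.5 * (z - 1.5) ** 2)])
P_z_given_x /= P_z_given_x.sum(axis=1, keepdims=True)
P_xz = P_x[:, None] * P_z_given_x
P_z = P_xz.sum(axis=0)
P_x_given_z = P_xz / P_z                          # P(x|z)

labels = (z > 0.7).astype(int)                    # a deliberately bad initial partition
for it in range(10):
    # Matching I: Shannon channel for the current partition, then reward functions
    P_y_given_x = np.stack([P_z_given_x[:, labels == j].sum(axis=1) for j in (0, 1)], axis=1)
    P_y = P_x @ P_y_given_x
    I_xy = np.log2(np.maximum(P_y_given_x, 1e-12) / np.maximum(P_y, 1e-12))
    reward = P_x_given_z.T @ I_xy                 # I(X; theta_j|z) for every z and j
    # Matching II: relabel every z by the maximum reward
    new_labels = np.argmax(reward, axis=1)
    if np.array_equal(new_labels, labels):
        break
    labels = new_labels

P_y_given_x = np.stack([P_z_given_x[:, labels == j].sum(axis=1) for j in (0, 1)], axis=1)
P_y = P_x @ P_y_given_x
MI = np.sum(P_x[:, None] * P_y_given_x *
            np.log2(np.maximum(P_y_given_x, 1e-12) / np.maximum(P_y, 1e-12)))
print(f"stopped after {it} iterations, boundary near z={z[labels.argmax()]:.2f}, I(X;Y)={MI:.3f} bit")
```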
Matching II for the optimization of the Shannon channel can reduce noise.We can understand two matching steps in this way: Matching I is for the reward function; Matching II is for Bayes' decision.
For a given source P(X), a semantic channel ascertains an R(G) function. An improved R(G) function has a higher matching point where R(G)=G. The CM algorithm is to find the matching point that is also the point with Rmax and Gmax (see Figure 8). Matching II can always find the best partition for given I(X; θj|z), j=1,2,…, because it checks every z to see which of the I(X; θj|z) is maximal. We can understand the CM algorithm in the following way: an R(G) function is like a ladder, and the coordinate (G, R) is like a climber. In Matching I, (G, R) creates a ladder and moves onto it. In Matching II, it climbs to the top of the ladder. Then it creates a new ladder, and so on, until it reaches (Gmax, Rmax).

CM4: the CM-EM Algorithm for Mixture Models
We use CM4 to denote the CM-EM algorithm, an improved EM algorithm for mixture models. In CM3, Matching II is for maximum R, whereas in CM4, Matching II is for maximum information efficiency G/R, or minimum R-G.
CM4 is based on a totally different convergence theory for mixture models. The popular convergence theory of the EM algorithm explains that we can maximize the incomplete-data log-likelihood LX(θ) by maximizing the complete-data log-likelihood Q, whereas the convergence theory underlying the CM-EM algorithm explains that we can maximize LX(θ) by maximizing the information efficiency G/R.
The EM algorithm [53] for mixture models often results in slow or invalid convergence [54,55]. We can improve the EM algorithm by letting the semantic channel and the Shannon channel mutually match. The difference from CM3 is that Matching II is for the minimum of Shannon's mutual information R.
If a probability distribution Pθ(x) comes from the mixture of n likelihood functions, i.e.,

Pθ(x) = ∑j P(yj)P(x|θj),

then we call Pθ(x) a mixture model. If every predictive model P(x|θj) is a Gaussian function, then Pθ(x) is a Gaussian mixture. In the following, we use n=2 to discuss the algorithms for mixture models.
Assume that P(x) comes from the mixture of two true models P(x|θ1*) and P(x|θ2*) with ratios P*(y1) and P*(y2)=1-P*(y1). That is,

P(x) = P*(y1)P(x|θ1*) + P*(y2)P(x|θ2*).    (3.20)

We only know P(x) and n=2. We can use guessed parameters θ1 and θ2 and guessed mixture ratios P(y1) and P(y2) to obtain

Pθ(x) = P(y1)P(x|θ1) + P(y2)P(x|θ2).

Then we have the log-likelihood

LX(θ) = log P(X|θ) = N ∑i P(xi) log Pθ(xi),

and the relative entropy, or KL divergence,

H(P||Pθ) = ∑i P(xi) log [P(xi)/Pθ(xi)].

If the two distributions P(x) and Pθ(x) are so close to each other that the relative entropy is close to 0, such as less than 0.001 bit, then we may say that our guess is right. Therefore, our task is to change θ and P(y) to maximize the likelihood LX(θ)=log P(X|θ) or to minimize the relative entropy H(P||Pθ).
The main formula of the EM algorithm for mixture models can be described as

θ* = arg max over θ of Q, Q = ∑j ∑i N P(xi)P(yj|xi) log [P(yj)P(xi|θj)],

where Q = -N H(X, Y|θ) is called the complete-data log-likelihood, and P(yj|x) is from Eq. (3.26). There is

Q = LX(θ) + H, H = -N H(Y|X, θ),

where H(Y|X, θ) is a Shannon conditional entropy.
The steps of the EM algorithm are:
E-step: Write the conditional probability functions (i.e., the Shannon channel):

P(yj|x) = P(yj)P(x|θj)/Pθ(x), Pθ(x) = ∑j P(yj)P(x|θj).

M-step: Improve P(y) and θ to maximize Q. If Q cannot be improved further, end the iteration; otherwise, go to the E-step.
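For comparison with the CM-EM steps below, here is a minimal sketch of the standard EM algorithm for a one-dimensional, two-component Gaussian mixture; the sample and the initial guesses are synthetic, not those of the paper's examples.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic sample from a two-component Gaussian mixture
X = np.concatenate([rng.normal(100, 10, 300), rng.normal(125, 10, 700)])

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Guessed parameters
mu, sigma, p = np.array([90.0, 140.0]), np.array([15.0, 15.0]), np.array([0.5, 0.5])
for step in range(200):
    # E-step: posterior P(yj|x) from the current parameters
    lik = np.stack([p[j] * gauss(X, mu[j], sigma[j]) for j in (0, 1)], axis=1)
    post = lik / lik.sum(axis=1, keepdims=True)
    # M-step: update mixture ratios, means, and standard deviations
    Nj = post.sum(axis=0)
    p = Nj / X.size
    mu = (post * X[:, None]).sum(axis=0) / Nj
    sigma = np.sqrt((post * (X[:, None] - mu) ** 2).sum(axis=0) / Nj)

print("ratios", p.round(3), "means", mu.round(2), "sigmas", sigma.round(2))
```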
Neal and Hinton [56] proposed an improved EM algorithm, the Maximization-Maximization (MM) algorithm, in which Q is replaced with F=Q+H(Y).Both steps maximize F.
Almost all the EM algorithm researchers believe that Q and logLX(θ) are positively correlated and the E-step does not decrease Q; nevertheless, it is not true.The author found that Q may decrease in some E-steps; Q and F should decrease in some cases [42].
Using the CM algorithm to improve the EM algorithm, we can have an algorithm, the CM-EM algorithm, for better convergence.
The CM-EM algorithm includes three steps:
E1-step: Construct the Shannon channel. This step is the same as the E-step of the EM algorithm.
E2-step: Repeat the following three equations until P+1(y) converges to P(y):

Pθ(x) = ∑j P(yj)P(x|θj),
P(yj|x) = P(yj)P(x|θj)/Pθ(x),
P+1(yj) = ∑i P(xi)P(yj|xi), and then let P(yj)=P+1(yj).

If H(P||Pθ) is less than a small value, end the iteration.
MG-step: Optimize the parameters θj+1 of the likelihood functions in log(.) to maximize G:

G = I(X; θ) = ∑j ∑i P(xi)P(yj|xi) log [P(xi|θj+1)/P(xi)].

Then go to the E1-step.
Since G reaches its maximum when P(x|θj+1)/P(x) = P(x|θj)/Pθ(x), the new likelihood function is

P(x|θj+1) = P(x)P(x|θj)/Pθ(x).

Without the E2-step, the P(x|θj+1) above is in general not normalized [61]. For Gaussian mixtures, we can easily obtain the new parameters:

P(x|yj) = P(x)P(yj|x)/P+1(yj),
μj+1 = ∑i P(xi|yj)xi, (σj+1)² = ∑i P(xi|yj)(xi-μj+1)².

If the likelihood functions are not Gaussian distributions, we can find the optimized parameters by searching the parameter space, for example with Gradient Descent.
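The following is a minimal sketch of the CM-EM steps on a discretized x axis, assuming the sample distribution P(x) is known exactly (here it is taken directly from a true two-component Gaussian mixture); all numbers, grids, and inner iteration counts are illustrative choices.

```python
import numpy as np

x = np.linspace(50, 180, 651)
def gauss(x, mu, sigma):
    g = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return g / g.sum()                                    # normalized on the grid

# P(x): the sample distribution, here taken directly from the true mixture
P_x = 0.3 * gauss(x, 100, 10) + 0.7 * gauss(x, 125, 10)

mu, sigma, P_y = np.array([90.0, 140.0]), np.array([15.0, 15.0]), np.array([0.5, 0.5])
for step in range(100):
    lik = np.stack([gauss(x, mu[j], sigma[j]) for j in (0, 1)], axis=1)   # P(x|theta_j)
    # E1-step and E2-step: iterate P(y|x) and P(y) until P(y) is stable
    for _ in range(50):
        P_theta_x = lik @ P_y
        P_y_given_x = lik * P_y / P_theta_x[:, None]
        P_y = P_x @ P_y_given_x
    H_P_Ptheta = np.sum(P_x * np.log2(P_x / P_theta_x))
    if H_P_Ptheta < 0.001:
        break
    # MG-step: maximize G; for Gaussian components this is a weighted mean and variance
    P_x_given_y = P_x[:, None] * P_y_given_x / P_y
    mu = x @ P_x_given_y
    sigma = np.sqrt(((x[:, None] - mu) ** 2 * P_x_given_y).sum(axis=0))

print(f"{step} iterations, H(P||Ptheta)={H_P_Ptheta:.5f} bit")
print("ratios", P_y.round(3), "means", mu.round(2), "sigmas", sigma.round(2))
```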
To prove the convergence of the CM-EM algorithm, we can make use of the properties of the R(G) function:
• The R(G) function is bowl-like (its second derivative is positive), and R(G)-G has the unique minimum 0 where R(G)=G [25];
• R(G)-G is close to the relative entropy H(P||Pθ).
After the E1-step, Shannon's mutual information I(X; Y) becomes

R' = ∑j ∑i P(xi)P(yj|xi) log [P(yj|xi)/P+1(yj)], P+1(yj) = ∑i P(xi)P(yj|xi).

We define

R″ = ∑j ∑i P(xi)P(yj|xi) log [P(yj|xi)/P(yj)] = R' + H(Y+1||Y).

It is easy to prove that R″-G = H(P||Pθ). Hence

H(P||Pθ) = R' - G + H(Y+1||Y), where H(Y+1||Y) = ∑j P+1(yj) log [P+1(yj)/P(yj)].

Proving that Pθ(X) converges to P(X) is equivalent to proving that H(P||Pθ) converges to 0. Since the E2-step makes R=R″ and H(Y+1||Y)=0, we only need to prove that every step minimizes R-G. It is evident that the MG-step minimizes R-G because this step maximizes G without changing R. The remaining problem is to prove that the E1-step and the E2-step minimize R-G. Learning from the variational and iterative methods that Shannon [30] and others [48] used for analyzing the rate-distortion function R(D), we can optimize P(y|x) and P(y) respectively to minimize R-G=I(X; Y)-I(X; θ). Since P(Y|X) and P(Y) are interdependent, we can only fix one to optimize the other; the E2-step is for this purpose. For the detailed convergence proof, see [57].

The Results of CM2 for Multilabel Learning and Classification
We use a prior distribution P(x) and a posterior distribution P(x|yj) to optimize a truth function to obtain T*(θj|x) as shown in Figure 9.
For P(x) and P(x|yj), we first use a Gaussian random number generator to produce two samples, S1 and S2, each of size 100,000. The data with distribution P(x) are a part of S1, and P(x) is defined with a normalizing constant k. S2 has distribution P2(x). P(x|yj) is produced from P2(x) and a preset truth function T(θ2|x); that is, P(x|yj)=P2(x|θ2)=P2(x)T(θ2|x)/T(θ2). Then we try to obtain T*(θj|x) from P(x) and P(x|yj). If we directly use Eq. (2.20), T*(θj|x) will be unsmooth. Instead, we can set a truth function with parameters and use the generalized KL information formula, Eq. (2.35), to optimize it and obtain a smooth T*(θj|x). If S2=S1, then T*(θj|x)=[P(x|yj)/P(x)]/max[P(x|yj)/P(x)]=T(θ2|x).
Figure 10 shows the MSI classification of ages for a given prior distribution P(x) and the truth functions of five labels: (y1, y2, y3, y4, y5)=("Adult", "Child", "Youth", "Middle age", "Old"). Figure 10(a) shows the truth functions of the five labels. Among these truth functions, each of T(θ3|x) and T(θ4|x) is constructed from two logistic functions; each of the others is a single logistic function.
We can also treat these truth functions as learning functions P(θj|x) obtained from the popular method, and then use the Bayes classifier or the Maximum Posterior Probability criterion to classify.
Figure 10(b) shows the effect of the MSI classifier (see Eq. (3.12)). Figure 10 indicates that the Maximum Posterior Probability (MPP) criterion and the MSI criterion result in different classifications. Using the MPP criterion, we would only select y0="Non-adult" or y1="Adult" for most ages. However, using the MSI criterion, we select y2, y6, y7, y4, and y5 in turn as the age x increases. The MSI criterion encourages us to use more labels with smaller logical probabilities. For example, if x is between 11.2 and 16.6, we should use y6=y3∧y1'="Youth" and "Non-adult". However, for most x, CM2 does not use redundant labels as Binary Relevance [52] does. For example, using the MSI criterion, we do not add the label "Non-youth" to x=60, which already has the label "Old".
The following is a two-dimensional example.
Example 4.2.2 (see Figure 11). There are three classes. The left two classes are two Gaussian distributions, P(z|x0) and P(z|x1); the right one is the mixture of two Gaussian distributions, P(z|x21) and P(z|x22). The sample size is 1000. See Table 3 for the parameters of the four Gaussian distributions. Two vertical lines form the initial partition.
The iterative process is shown in Figure 11. After two iterations, the mutual information I(X; Y) is 1.0434 bits, while the convergent MMI is 1.0435 bits; two iterations make the mutual information reach 99.99% of the convergent MMI.
To test the reliability of CM3, the author also used a very bad initial partition; the convergence is still very fast (see Figure 12). The author has run the above example with different parameters and different initial partitions to test CM3. All iterative processes are fast and valid. In most cases, 2-3 iterations make the mutual information surpass 99% of the MMI.

The Results of CM4 for Mixture Models
The following three examples show that the CM-EM algorithm can outperform the EM algorithm and the MM algorithm.
Ueda and Nakano [54] proposed an example to show that, because some initial parameters result in a local maximum of Q, local or invalid convergence is inevitable in the EM algorithm. This invalid convergence was also verified by Marin et al. [55]. We use this example (Example 4.3.1) with initial parameters (μ1, μ2, P(y1), σ1, σ2) = (115, 95, 0.5, 10, 10) to test whether the CM-EM algorithm can make (μ1, μ2) converge to (μ1*, μ2*)=(100, 125). Figure 13 shows the result. The stop condition is that the deviation of every parameter is less than 1%.
The following example is to compare the iteration numbers of different algorithms.Neal and Hinton [56] used this example to compare their Maximization-Maximization (MM) algorithm with the EM algorithm.Now we use the same example to compare the CM-EM algorithm with the EM algorithm and the MM algorithm.
Example 4.3.2. The real and initial parameters, including the mixture ratios, are shown in Table 4. The transformation formula is x=20(x'-50), where x' is an original data point and x is a data point used for Table 4. Assume that P(x) comes from the two Gaussian functions with the real parameters. Using the CM-EM algorithm, we obtain H(P||Pθ)=0.00072 bit after 9 E1- and E2-steps and 8 MG-steps. The iterative process is shown in Figure 14. The author also used a sample of size 1000 to produce P(x) to test the CM-EM algorithm. Table 5 shows the iteration numbers and the final parameters for the three different algorithms. These data show that the CM-EM algorithm needs fewer than half the iterations that the EM or the MM algorithm needs.
Example 4.3.3. There are six components in a two-dimensional feature space, as shown in Figure 15. The sample size is 1000. The true and initial parameters can be found through Appendix B. This example tests whether CM4 can validly converge for seriously overlapping components. The upper two pairs of components converge quickly, whereas the lower pair converges only slowly. The convergence condition is that the horizontal deviation is smaller than 1.

Discussing Confirmation Measure b*
In modern times, the induction problem has become the confirmation problem [58].There have existed many confirmation measures [59].
When sensitivity is 1, if specificity is small, such as 0.1, the degree of confirmation of MP1, b1 *, is also 0.1.However, using the existing confirmation formulas, the degrees of confirmation are much bigger than 0.1.This is unreasonable because the ratio of negative examples is 0.9/1.9≈0.47≈0.5, which means that MP1 is almost unbelievable.
From the above two examples, we can find that confirmation measure b* emphasizes that no negative examples (for nonfuzzy hypotheses) or fewer negative examples (for fuzzy hypotheses) are more important than more positive examples, and hence, it is compatible with Popper's falsification thought [31,32].This measure b* is compatible with confidence level and hence is also supported by medical practices.

Discussing CM2 for the Multilabel Classification
In comparison with popular methods for multilabel learning, such as Binary Relevance [52], CM2 does not need n samples for n pairs of labels. It can directly obtain the semantic channel, which consists of a group of truth functions, from the Shannon channel P(Y|X) or the sampling distribution P(x, y).
In comparison with the MPP criterion, the MSI criterion can reduce the rate of failing to report smaller probability events.When information is more important than correctness, the MSI criterion is better.
Note that the boundary for "Old" in Figure 10(b) is not 60 but 58.1. This is because "Old" has a smaller logical probability than "Middle age". If people's average lifetime becomes longer, the boundary for "Old" will move right. We can imagine that the new partitioning boundary will result in a new sampling distribution P(x|y5) and a new truth function T(θ5|x); the new truth function will make the boundary move right further, and so on. The truth function, or the semantic meaning, of "Old" should have been evolving with the human average lifetime in this way.

Discussing CM3 for the MMI Classifications of Unseen Instances
Solving the MMI is a difficult problem not only in machine learning [19,20] but also in the classical information theory. Shannon and many researchers [7,6] use the least average distortion criterion instead of the MMI criterion to optimize detections and estimations. If we use the MMI criterion, coding the residual errors will need a smaller average code length. Why do they not use the MMI criterion? The reason is that it is hard to optimize partition boundaries for the MMI. Now, using CM3, we can resolve this problem, at least for low-dimensional feature spaces.
Popular methods for MMI classifications or estimations use parameters to construct transition probability functions or likelihood functions and then optimize these parameters by Gradient Descent or Newton's method; the optimized parameters ascertain the partition boundaries. In contrast, CM3, the CM iteration algorithm, separately constructs n likelihood functions with parameters for the n different classes and then optimizes the labels for different z. It provides numerical solutions of the partition boundaries. We compare CM3 and Gradient Descent in Table 6. The CM iteration algorithm also has two disadvantages. One is that it requires every sub-sample for every class to be big enough so that we can construct n likelihood functions for n classes. The other is that, for high-dimensional feature spaces, it is not feasible to label every z. We need to combine the CM iteration algorithm with neural networks for the MMI classifications of high-dimensional feature spaces.
A neural network is a classifier y=f(z). For a given neural network, Matching I lets the semantic channel match the Shannon channel to obtain the reward functions I(X; θj|z), j=0, 1, …. For given reward functions, Matching II lets the Shannon channel match the semantic channel to obtain new neural network parameters. Repeating the two steps can make I(X; θ) converge to the MMI. Matching I and Matching II are similar to the tasks of the generative and discriminative models in the Generative Adversarial Network. Combining CM3 with popular deep learning methods [33], we should be able to improve MMI classifications for high-dimensional feature spaces.

Discussing CM4 for Mixture Models
The results in Section 4.3 indicate that the complete-data log-likelihood Q and the incomplete-data log-likelihood LX(θ) are not always positively correlated, as most researchers believe. In some cases, Q may and should decrease, because Q may be greater than Q*=Q(θ*), which is the true model's Q. In Example 4.3.1, assuming the true model's parameters σ1*=σ2*=σ* and P*(y1)=P*(y2)=0.5, we can prove that P(y1|x) and P(y2|x) are a pair of logistic functions that become sharp as σ decreases. Hence, H increases as σ increases, and we can prove that the partial derivative ∂H/∂σ > 0 at σ*. Hence, when θ=θ*,

∂Q/∂σ = ∂LX(θ)/∂σ + ∂H/∂σ = 0 + ∂H/∂σ ≠ 0,

which means that Q does not reach its maximum at the true parameters θ*.
The new convergence theory underlying CM4 explains that CM4 converges because the iteration maximizes G/R, or minimizes R-G.
We have used different examples to test the CM-EM algorithm. The experiments show that the CM-EM algorithm can reduce the slow and invalid convergence that the EM algorithm encounters when the mixture ratios are imbalanced or a local maximum of Q exists. In most cases, the CM-EM algorithm needs no more than ten iterations for the convergence of Gaussian mixture models. Its convergence speed is faster than or similar to that of other improved EM algorithms, such as the MM algorithm [56] and the multiset EM algorithm [61].
The CM-EM algorithm can be used not only for Gaussian mixtures but also for other mixtures.For other mixtures, the MG step is a little difficult.The convergence proof should be the same.
CM4 and CM3 together can be used for unsupervised learning.For CM4, we can obtain a group of model parameters from a sample with distribution P(x).Using CM3, we can make the MMI classification for the sample.
The CM-EM algorithm cannot prevent θ from converging to the boundary of the parameter space. We need to combine it with some existing algorithms, such as the Split-and-Merge EM algorithm [62] and the Competitive EM algorithm [63], for better global convergence of mixture models.

Conclusions
The semantic information G theory combines the thoughts of Shannon, Popper, Fisher, Zadeh, Carnap, and others. The semantic information measure, the G measure, increases as the logical probability decreases, like Carnap and Bar-Hillel's semantic information measure; however, the G measure also decreases as the relative deviation increases, and hence it can be used for hypothesis testing.
Logical Bayesian Inference (LBI) uses the truth function instead of the Bayesian posterior as the inference tool. Using the truth function T(θj|x), we can make probability predictions with a different prior P(x), just as we can with the Transition Probability Function (TPF) P(yj|x) or the Inverse Probability Function (IPF) P(θj|x). However, it is much easier to obtain the optimized truth function from samples than to obtain the optimized IPF, because P(yj) or P(θj) is not necessary for optimizing truth functions. More importantly, the truth function can represent the semantic meaning of a hypothesis or a label and connect statistics and logic better. A windfall is that the optimization of the truth function brings a seemingly reasonable confirmation measure for induction.
A group of Channel Matching (CM) algorithms, CM1, CM2, CM3, and CM4, is used to improve machine learning, especially to resolve the Multilabel-Learning-for-New-P(x) problem. CM1 can be used to improve label learning and confirmation; CM2 can be used to improve multilabel classifications; CM3 can be used to improve maximum mutual information classifications of unseen instances in low-dimensional feature spaces; CM4 can be used to improve mixture models. The G theory and LBI should have survived the tests posed by their applications to machine learning.
For further applications of the G theory and LBI to machine learning, we need to combine the CM algorithms with neural networks in the future. LBI may be further developed for the unification of logic and statistics.

¹ The Python 3.6 source files for these algorithms, which produce Figures 9-15, can be found through Appendix B.

Figure 1 .
(a) The Shannon channel. (b) The semantic channel.
Figure 1. The Shannon channel and the semantic channel. The semantic meaning of yj is ascertained by the membership relation between x and θj. A fuzzy set θj may overlap with or be included in another.

Figure 2
Figure 2 Relationship between T(Aj|x) and P(x|Aj) for given P(x).

Figure 3
Figure 3. Illustrating a GPS device's positioning with a deviation. The round point is the pointed position with a deviation, and the star is the most possible position.

Figure 4 .
Figure 4.The semantic information conveyed by yj about xi.


Figure 5 .
Figure 5.The rate-verisimilitude function R(G) of a binary communication.For any R(G) function, there is a point where R(G)=G.

Figure 6 .
Figure 6. Illustrating the medical test and signal detection. We choose yj according to z∈Cj. {C0, C1} is a partition of C.

Figure 7 .
Figure 7. Relationship between the degree of confirmation b* and the confidence level CL. As CL changes from 0 to 1, b* changes from -1 to 1.


Figure 8 .
Figure 8. Illustrating the iterative convergence of the MMI classifications of unseen instances. In the iterative process, the coordinate (G, R) moves from the start point through a, b, c, d, e, …, to f gradually, alternating Matching I and Matching II.

Figure 9 .
Figure 9. Using the prior and posterior distributions P(x) and P(x|yj) to obtain the optimized truth function T*(θj|x). Figures 9-15 are produced by Python 3.6 files that can be found through Appendix B.

(a) The truth functions of five labels for ages and the prior distribution P(x) of the population. (b) Labeling x according to which of Ij=I(x; θj) (j=0,1,…,7) is maximum.

Figure 10 .
Figure 10.The maximum semantic information classification of ages.

Figure 11 .
Figure 11. The MMI classification of unseen instances. The classifier is y=f(z). The mutual information is I(X; Y). X is a true class and Y is a selected label.
(a) The very bad initial partition. (b) The partition after the first iteration. (c) The partition after the second iteration. (d) The mutual information changes with the iterations.
Figure 12. The MMI classification with a very bad initial partition.
(a) Q is close to local maximum at beginning (b) LX(θ) converges to the global maximum after 63 iterations (c) Q decreases after the first E2-step and then increases as H(P||Pθ) decreases in the CM-EM algorithm.

Figure 13 .
Figure 13.The iterative process from the local maximum of Q to the global minimum of H(P||Pθ).The stop condition is that the deviation of every parameter is less than 1%.

Figure 14 .
Figure 14. The iterative process of the CM-EM algorithm for Example 4.3.2. Some E2-steps decrease Q. The relative entropy is less than 0.001 bit after 8 iterations.

(a) Iterative start.(b) The mixture model converges after 30 iterations.

Figure 15 .
Figure 15.CM4 for a two-dimensional mixture model.There are six components with Gaussian distributions.

Table 1 .
The sensitivity and specificity of a medical test ascertain a Shannon channel P(Y|X). *Data are from OraQuick HIV tests [50].

Table 2 .
Two degrees of disbelief of a medical test form a semantic channel T(θ|x)

Table 3 .
The parameters of four Gaussian distributions

Table 4 .
True and guessed model parameters and the iterative results of Example 4.3.2.

Table 5 .
The iteration numbers and final parameters for different algorithms.

Table 6 .
Comparing the CM algorithm and the Gradient Descent for low-dimensional feature spaces