The P-T Probability Framework for Semantic Communication, Falsification, Confirmation, and Bayesian Reasoning

Many researchers want to unify probability and logic by defining logical probability or probabilistic logic reasonably. This paper tries to unify statistics and logic so that we can use both statistical probability and logical probability at the same time. For this purpose, this paper proposes the P-T probability framework, which assembles Shannon's statistical probability framework for communication, Kolmogorov's probability axioms for logical probability, and Zadeh's membership functions used as truth functions. The two kinds of probabilities are connected by an extended Bayes' theorem, with which we can convert a likelihood function and a truth function into each other. Hence, we can train truth functions (in logic) with sampling distributions (in statistics). This probability framework was developed in the author's long-term studies on semantic information, statistical learning, and color vision. This paper first proposes the P-T probability framework and explains the different probabilities in it by its applications to semantic information theory. Then, this framework and the semantic information methods are applied to statistical learning, statistical mechanics, hypothesis evaluation (including falsification), confirmation, and Bayesian reasoning. Theoretical applications illustrate the reasonability and practicability of this framework. This framework is helpful for interpretable AI, although interpreting neural networks with it requires further study.

Example 1. There are five labels: y1 = "child", y2 = "youth", y3 = "middle aged", y4 = "elder", and y5 = "adult". Notice that some youths and all middle-aged people and elders are also adults. Suppose that ten thousand people go through a door. For everyone, denoted by x, entrance guards judge whether x is an adult, i.e., whether "x is adult" is true. If 7000 people are judged to be adults, then the logical probability of y5 = "x is adult" is 7000/10,000 = 0.7. If, instead, the task of the entrance guards is to select one of the five labels for every person, there may be only 1000 people who are labeled "adult". The statistical probability of "adult" is then 1000/10,000 = 0.1.
Why is the selected probability of y5 less than its logical probability? The reason is that the other 6000 adults are labeled "youth", "middle aged", or "elder". In other words, a person can make only one of the five labels' selected probabilities increase by 1/10,000, whereas a person can make two or more labels' logical probabilities increase by 1/10,000. For example, a 20-year-old man can make the logical probabilities of both "youth" and "adult" increase by 1/10,000.
An extreme example is that the logical probability of a tautology, such as "x is adult or not adult", is 1. In contrast, its statistical probability is almost 0 in general, because a tautology is rarely selected.
From this example, we can find:
1. A hypothesis or label yj has two probabilities: a logical probability and a statistical (or selected) probability. If we use P(yj) to represent its statistical probability, we cannot also use P(yj) for its logical probability.
2. Statistical probabilities are normalized (their sum is 1), whereas logical probabilities are not. The logical probability of a hypothesis is, in general, greater than its statistical probability.
3. For a given age x, such as x = 30, the sum of the truth values of the five labels may be greater than 1.
4. The logical probability of "adult" is related to the population age distribution P(x), which is a statistical probability distribution. Clearly, the logical probabilities of "adult" obtained at the door of a school and at the door of a hospital must be different.
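The counting in Example 1 can be sketched in code. The miniature data below are assumptions for illustration (10 people standing in for 10,000, with the same 0.7 vs. 0.1 proportions): each person is judged true for a set of labels, but selects only one label.

```python
from collections import Counter

# Hypothetical judgments: each person's set of labels judged true.
judged_true = [
    {"child"}, {"child"}, {"child"},
    {"youth", "adult"}, {"youth", "adult"},
    {"middle aged", "adult"}, {"middle aged", "adult"},
    {"elder", "adult"}, {"elder", "adult"},
    {"adult"},
]
# Each person selects exactly one label.
selected = ["child", "child", "child", "youth", "youth",
            "middle aged", "middle aged", "elder", "elder", "adult"]

n = len(judged_true)
T_adult = sum("adult" in s for s in judged_true) / n   # logical probability
P_adult = Counter(selected)["adult"] / n               # selected probability
print(T_adult, P_adult)  # prints 0.7 0.1
```

The logical probabilities of all five labels can sum to more than 1, while the selected probabilities always sum to exactly 1.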
The logical probability of yj calculated above accords with Reichenbach's frequentist definition of logical probability [9]. However, Reichenbach did not distinguish logical probability from statistical probability.
In the popular probability systems, such as the axiomatic system defined by Kolmogorov [10], all probabilities are represented by "P" so that we cannot use two kinds of different probabilities at the same time. Nor can we distinguish whether "P" represents a statistical probability or a logical probability, such as in popular confirmation measures [11].
Popper complained about Kolmogorov's probability system, saying: "he assumes that, in 'P(a, b)' …a and b are sets; thereby excluding, among others, the logical interpretation according to which a and b are statements (or 'propositions', if you like)." ([7], pp. 330-331; P(a, b) is a conditional probability, which would now be written P(a|b).) Popper hence proposed his own axiomatic system in the New Appendices of Conjectures and Refutations ([7], pp. 329-355). In Popper's probability system, a and b in P(a|b) may be sets, predicates, propositions, and so on. However, Popper also uses only "P" to denote probability, and hence his axiomatic system does not use statistical and logical probabilities at the same time. For this reason, Popper's probability system is not more practical than Kolmogorov's.
To distinguish statistical probability and logical probability and use both at the same time, I [12] propose the use of two different symbols, "P" and "T", to identify statistical probabilities (including likelihood functions) and logical probabilities (including truth functions). In my earlier papers [13][14][15], I used "P" for objective probabilities and "Q" for both subjective probability and logical probability. As subjective probability is also normalized and defined with "=" instead of "∈" (belongs to) (see Section 2.1), I now use "P" for subjective probability as well.

A Problem with the Extension of Logic: Can a Probability Function Be a Truth Function?
Jaynes ([16], pp. 651-660) made an effort similar to Popper's to improve Kolmogorov's probability system. He is a famous Bayesian, but he admits that we also need frequentist probability, such as the probability for sample distributions. He proved that there is an equivalence between Boltzmann's entropy and Shannon's entropy, in which the probability is frequentist or objectively statistical. He summarizes that probability theory is the extension of logic. In his book [16], he uses many examples to show that much scientific reasoning is probabilistic. However, his probability system lacks truth functions that are multivalued or fuzzy as reasoning tools.
A proposition has a truth value; all truth values of propositions with the same predicate and different subjects form a truth function. In classical (binary) logic, the characteristic function of a set is also the truth function of a hypothesis. Suppose that there is a predicate (or propositional function) yj(x) = "x has property aj", and set Aj includes all x that have property aj. Then, the characteristic function of Aj is the truth function of yj(x). For example, x represents an age or an x-year-old person, and people with age ≥ 18 are defined as adults. Then, the set that includes all x with age ≥ 18 is Aj = [18, ∞), and its characteristic function is also the truth function of the predicate "x is adult" (see Figure 1).

Figure 1. Solving the truth functions of "adult" and "elder" using prior and posterior probability distributions. The human brain can guess T("adult"|x) (cyan dashed line) or the extension of "adult" from P(x|"adult") (dark green line) and P(x) (dark blue dash-dotted line), and estimate T("elder"|x) (purple thin dashed line) or the extension of "elder" from P(x|"elder") (grey thin line) and P(x). Equations (12) and (22) are new formulas for calculating T("adult"|x) and T("elder"|x).
If we extend binary logic to continuous-valued logic, the truth function should vary between 0 and 1. Suppose that the number of people in a group of 60-year-olds is N60, and N60* people among them are judged to be elderly. Then, the truth value of the proposition "a 60-year-old person is elderly" is N60*/N60. This truth value is about 0.8. If x = 80, the truth value should be 1.
According to the logical interpretation of probability, a probability should be a continuous logical value, and a conditional probability function P(yj|x) with variable x as the condition should be a truth function (e.g., fuzzy truth function). Shannon calls P(yj|x) the Transition Probability Function (TPF) ( [3], p. 11). In the following, we use an example in natural language to explain why TPFs are not truth functions and how the human brain thinks using conceptual extensions (or denotations), which can be represented by truth functions.
Example 2. (refer to Figure 1). Suppose we know the population age prior distribution P(x) and the posterior distributions P(x|"adult") and P(x|"elder"), where the extension of "adult" is crisp and that of "elder" is fuzzy. Please solve the truth functions or the extensions of labels "adult" and "elder".
According to the existing probability theories, including Jaynes' probability theory, we cannot obtain the truth function from P(x) and P(x|yj). We cannot obtain the TPF P(yj|x) either: since P(yj|x) = P(x|yj)P(yj)/P(x) by Bayes' theorem, we cannot obtain P(yj|x) without P(yj). However, the human brain can estimate the extension of yj from P(x) and P(x|yj) without P(yj) (see Figure 1). With the extension of a label, even if P(x) is changed, the human brain can still predict the posterior distribution and classify people with different labels.
This example shows:
1. The existing probability theories lack the methods used by the human brain (1) to find the extension of a label (an inductive method) and (2) to use the extension as the condition for reasoning or prediction (a deductive method).
2. The TPF and the truth function are different, because a TPF is related to how many labels are used, whereas a truth function is not.

Distinguishing (Fuzzy) Truth Values and Logical Probabilities Using Zadeh's Fuzzy Set Theory
Dubois and Prade [17] pointed out a confusing tradition: researchers often confuse the degree of belief (logical probability) in a hypothesis with the truth value of a proposition, although we can only do logical operations with the truth values of propositions, not with logical probabilities. I agree with them. Section 7.2 will discuss why we cannot directly do logical operations with logical probabilities. Now, we further distinguish the logical probability and the (continuous or fuzzy) truth value.
The sentence "x is elderly" (where x is one of a group of people) is a propositional function; "this man is elderly" is a proposition. The former has a logical probability, whereas the latter has a truth value. As a propositional function consists of a predicate, such as "x is elderly", together with the universe X, we also call a propositional function a predicate. Therefore, we need to distinguish the logical probability of a predicate from the truth value of a proposition.
However, in the probability systems mentioned above, either we cannot distinguish the logical probability of a predicate and the truth value of a proposition (such as in Kolmogorov's probability system), or the truth value is binary (such as in the probability systems of Carnap [18] and Reichenbach [9]).
In Carnap's probability system [18], the logical probability of a predicate is irrelevant to the prior probability distribution P(x). From Example 1, we can find that it is relevant.
Reichenbach [9] uses the statistical result of binary logic to define the logical probability of the logical expression of two proposition sequences (e.g., two predicates) without using continuous truth functions. Additionally, logical probability and statistical probability are not distinguished.
Therefore, the probability systems mentioned above are not satisfactory for the extension of logic. Fortunately, Zadeh's fuzzy set theory [19,20] provides more of what we need than the abovementioned probability systems. As the membership grade of an instance xi in fuzzy set θj is the truth value of the proposition yj(xi) = "xi is in θj", the membership function of a fuzzy set is equivalent to the truth function of a predicate yj(x). The probability of a fuzzy event proposed by Zadeh [20] is just the logical probability of the predicate "x is in θj". Zadeh [21] holds that fuzzy set theory and probability theory are complementary rather than competitive; fuzzy sets can enrich our concepts of probability.
However, it is still difficult for a machine to obtain truth functions or membership functions.

Can We Use Sampling Distributions to Optimize Truth Functions or Membership Functions?
Although the fuzzy set theory has made significant achievements in knowledge representation, fuzzy reasoning, and fuzzy control, it is not easy to use the fuzzy set theory for statistical learning, because membership functions are usually defined by experts rather than obtained from statistics.
A sample includes some examples. For instance, (x: age, y: label) is an example; many examples, such as (5, "child"), (20, "youth"), (70, "elder") …, form a sample. The posterior distribution P(x|yj) represents a sampling distribution. The core method of statistical learning is to optimize a likelihood function P(x|θj) (θj is a model or a set of parameters) with a sampling distribution P(x|yj). We also want to use sampling distributions to optimize truth functions or membership functions so that we can connect statistics and logic.
A significant advance in this direction is that Wang [22,23] developed the Random Set Falling Shadow Theory. According to this theory, a fuzzy set is produced by a random set; a set value taken by a random set is a range, such as ages 15-30 or 18-28 with label "youth". With many different ranges, we can calculate the proportion in which x belongs to the random set. The limit of this proportion is the membership grade of x in the corresponding fuzzy set.
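A minimal sketch of the Random Set Falling Shadow statistic for "youth" (the ranges below are assumed data): the membership grade of age x is the proportion of set values (ranges) taken by the random set that cover x.

```python
# Hypothetical ranges given by different people for the label "youth".
ranges = [(15, 30), (18, 28), (16, 32), (15, 25), (18, 30)]

def membership(x, ranges):
    """Proportion of ranges that cover x: the falling-shadow membership grade."""
    covered = sum(lo <= x <= hi for lo, hi in ranges)
    return covered / len(ranges)

print(membership(20, ranges))  # 1.0: every range covers age 20
print(membership(29, ranges))  # 0.6: 3 of the 5 ranges cover age 29
```

As the number of ranges grows, this proportion converges to the membership grade of x in the corresponding fuzzy set.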
However, it is still difficult to obtain membership functions or truth functions from the statistics of random sets. The reason is that real life provides a great number of labeled examples with a single instance, such as (70, "elder"), (60, "elder"), and (25, "youth"), but only very few examples with labeled sets or ranges, such as (15-30, "youth"). Therefore, the statistical method of random sets is not practical.
Due to the above reasons, fuzzy mathematics can perform well in expert systems, but not in statistical learning; many statisticians refuse to use fuzzy sets.

Purpose, Methods, and Structure of This Paper
Many researchers want to unify probability and logic. The natural idea is to define a logical probability system, as Keynes [24], Reichenbach [9], Popper [7], and Carnap [18] did. Some researchers use probabilities as logical variables to set up probability logic [25], such as two types of probability logic proposed by Reichenbach [9] and Adams [26]. Additionally, many researchers want to unify probabilities and fuzzy sets [27][28][29] following Zadeh. Many valuable efforts for the probabilification of the classical logic can be found in [25].
My effort is a little different. I also follow Zadeh to probabilify (i.e., fuzzify) the classical logic, but I want to use both statistical probabilities and logical probabilities at the same time and make them compatible, which means that the consequences of logical reasoning equal those of statistical reasoning when samples are enormous. In short, I want to unify statistics and logic, not only probability and logic.
A probability framework is a mathematical model that is constructed by the probabilities of some random variables taking values from some universes. For example, Shannon's communication model is a probability framework. The P-T probability framework includes statistical probabilities (represented by "P") and logical probability (represented by "T"). We may call Shannon's mathematical model for communication the P probability framework.
The purposes of this paper are to propose the P-T probability framework and test this framework through its uses in semantic communication, statistical learning, statistical mechanics, hypothesis evaluation (including falsification), confirmation, and Bayesian reasoning.
The main methods are:
1. using the statistical probability framework (i.e., the P probability framework) adopted by Shannon for electrocommunication as the foundation, and adding Kolmogorov's axioms (for logical probability) and Zadeh's membership functions (as truth functions) to the framework;
2. setting up the relationship between truth functions and likelihood functions by a new Bayes' theorem, called Bayes' Theorem III, so that we can optimize truth functions (in logic) with sampling distributions (in statistics); and
3. using the P-T probability framework and the semantic information formulas (1) to express verisimilitude and testing severity and (2) to derive two practical confirmation measures and several new formulas to enrich Bayesian reasoning (including fuzzy syllogisms).
The P-T probability framework was gradually developed in my previous studies on the semantic information G theory (the G theory for short; "G" means the generalization of Shannon's information theory) [12,14,15] and statistical learning [12]. This paper is the first to name it the P-T probability framework and to provide it with philosophical background and applications.
The rest of this paper is organized as follows. Section 2 defines the P-T probability framework. Section 3 explains this framework by its uses in semantic communication, statistical learning, and statistical mechanics. Section 4 shows how this framework and the semantic information measure support Popper's thought about scientific progress and hypothesis evaluation. Section 5 introduces two practical confirmation measures related to the medical test and the Raven Paradox, respectively. Section 6 summarizes various formulas for Bayesian reasoning and introduces a fuzzy logic that is compatible with Boolean Algebra. Section 7 discusses some questions, including how the theoretical applications exhibit the reasonability and practicality of the P-T probability framework. Section 8 ends with conclusions.

The Probability Framework Adopted by Shannon for Electrocommunication
Shannon uses the frequentist probability system [30] for the P probability framework, which includes two random variables and two universes.

Definition 1 (for the P probability framework adopted by Shannon):
- X is a discrete random variable taking a value x ∈ X, where X is the universe {x1, x2, …, xm}; P(xi) = P(X = xi) is the limit of the relative frequency of event X = xi. In the following applications, x represents an instance or a sample point.
- Y is a discrete random variable taking a value y ∈ Y = {y1, y2, …, yn}; P(yj) = P(Y = yj). In the following applications, y represents a label, hypothesis, or predicate.
- P(yj|x) = P(Y = yj|X = x) is a Transition Probability Function (TPF) (named by Shannon [3]).
Shannon names P(X) the source, P(Y) the destination, and P(Y|X) the channel (see Figure 2a). A Shannon channel is a transition probability matrix or a group of TPFs:

P(Y|X) ⇔
| P(y1|x1) P(y1|x2) … P(y1|xm) |
| P(y2|x1) P(y2|x2) … P(y2|xm) |
| …        …        …  …       |
| P(yn|x1) P(yn|x2) … P(yn|xm) |
⇔ [P(y1|x), P(y2|x), …, P(yn|x)],

where ⇔ means equivalence. A Shannon channel or a TPF can be treated as a predictive model to produce the posterior distributions of x: P(x|yj), j = 1, 2, …, n (see Equation (6)).
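As a sketch, a Shannon channel can be stored as a matrix of TPFs and turned into posterior distributions by Bayes' Theorem II. The 3x3 channel and source below are assumed numbers:

```python
import numpy as np

# Toy Shannon channel: rows are labels y_j, columns are instances x_i;
# each column of P(y|x) sums to 1 over the labels.
P_x = np.array([0.3, 0.4, 0.3])          # source P(x)
P_y_given_x = np.array([                 # group of TPFs P(y_j|x)
    [0.9, 0.2, 0.0],
    [0.1, 0.8, 0.3],
    [0.0, 0.0, 0.7],
])

# Bayes' Theorem II turns the channel into posterior distributions:
# P(x|y_j) = P(y_j|x) P(x) / P(y_j), with P(y_j) = sum_i P(y_j|x_i) P(x_i).
P_y = P_y_given_x @ P_x
P_x_given_y = P_y_given_x * P_x / P_y[:, None]
assert np.allclose(P_x_given_y.sum(axis=1), 1.0)   # each posterior sums to 1
```

Each row of `P_x_given_y` is the predictive distribution P(x|yj) produced by the channel.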
Figure 2. The Shannon channel and the semantic channel. (a) The Shannon channel described by the P probability framework; (b) the semantic channel described by the P-T probability framework. A membership function ascertains the semantic meaning of yj. A fuzzy set θj may overlap with or be included in another.

The P-T Probability Framework for Semantic Communication
Definition 2 (for the P-T probability framework):
- The yj is a label or a hypothesis, yj(xi) is a proposition, and yj(x) is a propositional function. We also call yj or yj(x) a predicate. The θj is a fuzzy subset of universe X, which is used to explain the semantic meaning of the propositional function yj(x) = "x ∈ θj" = "x belongs to θj" = "x is in θj". The θj is also treated as a model or a set of model parameters.
- A probability defined with "=", such as P(yj) = P(Y = yj), is a statistical probability. A probability defined with "∈", such as P(X ∈ θj), is a logical probability. To distinguish P(Y = yj) and P(X ∈ θj), we define T(yj) = T(θj) = P(X ∈ θj) as the logical probability of yj.
- T(yj|x) = T(θj|x) = P(x ∈ θj) = P(X ∈ θj|X = x) is the truth function of yj and the membership function of θj. It changes between 0 and 1, and its maximum is 1.
According to Tarski's truth theory [31], P(X ∈ θj) is equivalent to P("X ∈ θj" is true) = P(yj is true). According to Davidson's truth-condition semantics [32], the truth function of yj ascertains the (formal) semantic meaning of yj. A group of truth functions, T(θj|x), j = 1, 2, …, n, forms a semantic channel. The Shannon channel and the semantic channel are illustrated in Figure 2.

Now, we explain the relationship between the above logical probability and Kolmogorov's probability. In Kolmogorov's probability system, a random variable takes a set as its value. Let 2^X be the power set (Borel set) including all possible subsets of X, and let Θ denote the random variable taking a value θ ∈ {θ1, θ2, …, θ|2^X|}. Then, Kolmogorov's probability P(θj) is the probability of event Θ = θj. As Θ = θj is equivalent to X ∈ θj, we have P(Θ = θj) = P(X ∈ θj) = T(θj). If all sets in 2^X are crisp, then T(θj) becomes Kolmogorov's probability.
According to the above definition, we have the logical probability of yj:

T(θj) = P(X ∈ θj) = Σi P(xi) T(θj|xi).

The logical probability is the average truth value, and the truth value is the posterior logical probability. Notice that a proposition generally has only a truth value, not a logical probability. For example, the predicate "x is white", where x is a swan, has a logical probability, whereas the proposition "this swan is white" only has a truth value. Statistical probability P(yj) is equal to logical probability T(θj) for every j only when the following two conditions are tenable:


- The universe of Θ only contains some subsets of 2^X that form a partition of X, which means that any two subsets in the universe are disjoint.
- The yj is always correctly selected.
For example, the universe of Y is Y = {"Non-adult", "Adult"} or {"Child", "Youth", "Middle age", "Elder"}; Y ascertains a partition of X. Further, if these labels are always correctly used, then P(yj) = T(θj) for every j. Many researchers do not distinguish statistical probability and logical probability, because they suppose the above two conditions are always tenable. However, in real life, the two conditions are not tenable in general. When we use Kolmogorov's probability system for many applications without distinguishing statistical and logical probabilities, the two conditions are necessary. However, the latter condition is rarely mentioned.
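The definition T(θj) = Σi P(xi)T(θj|xi) is easy to compute. In the sketch below, the uniform age distribution and the fuzzy truth function for "elder" are assumptions for illustration:

```python
import numpy as np

# Logical probability as the average truth value: T(θj) = Σ_i P(x_i) T(θj|x_i).
x = np.arange(0, 101)                                # ages 0..100
P_x = np.full(len(x), 1 / len(x))                    # assumed uniform prior P(x)
T_elder_given_x = 1 / (1 + np.exp(-0.3 * (x - 65)))  # assumed fuzzy truth function

T_elder = float(P_x @ T_elder_given_x)               # average truth value T(θ_elder)
# A logical probability lies in (0, 1) but is not normalized across labels.
assert 0.0 < T_elder < 1.0
```

With a different prior P(x) (say, the age distribution at a hospital door), the same truth function yields a different logical probability, as noted in Example 1.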
The logical probability defined by Carnap and Bar-Hillel [33] is different from T(θj). Suppose there are three atomic propositions a, b, and c. The logical probability of a minimum term, such as a∧b∧¬c (a and b and not c), is 1/8; the logical probability of b∧c = (a∧b∧c) ∨ (¬a∧b∧c) is 1/4. This logical probability defined by Carnap and Bar-Hillel is irrelevant to the prior probability distribution P(x). However, the logical probability T(θj) is related to P(x). A smaller logical probability needs not only a smaller extension, but also rarer instances. For example, "x is 20 years old" has a smaller logical probability than "x is young", but "x is over 100 years old", despite its larger extension, has a still smaller logical probability than "x is 20 years old", because people over 100 are rare.
In mathematical logic, the truth value of a universal proposition is equal to the logical multiplication of the truth values of all propositions with the predicate. Hence, Popper and Carnap concluded that the logical probability of a universal predicate is zero. This conclusion is bewildering. However, T(θj) is the logical probability of the predicate yj(x) itself, without the quantifier (such as ∀x); it is the average truth value. Why do we define the logical probability in this way? One reason is that this mathematical definition conforms to the literal definition that the logical probability is the probability with which a hypothesis is judged true. Another reason is that this definition is useful for extending Bayes' theorem.
Bayes' Theorem I. It was proposed by Bayes and can be expressed with Kolmogorov's probabilities:

P(A|B) = P(B|A)P(A)/P(B),

where A and B are two sets in 2^X. Notice that there are one random variable, one universe, and two logical probabilities.
Bayes' Theorem II. It is expressed by frequentist probabilities and used by Shannon. There are two symmetrical formulas:

P(yj|x) = P(x|yj)P(yj)/P(x),  P(x|yj) = P(yj|x)P(x)/P(yj).  (7)

Notice that there are two random variables, two universes, and two statistical probabilities.
Bayes' Theorem III. It is used in the P-T probability framework. There are two asymmetrical formulas:

P(x|θj) = P(x)T(θj|x)/T(θj),  with T(θj) = Σi P(xi)T(θj|xi);  (8)

T(θj|x) = T(θj)P(x|θj)/P(x),  with T(θj) = 1/max[P(x|θj)/P(x)].  (9)

The two formulas are asymmetrical because there are a statistical probability and a logical probability on their left sides. Their proofs can be found in Appendix A.
T(θj) in Equation (8) is the horizontal normalizing constant (which makes P(x|θj) sum to 1), whereas T(θj) in Equation (9) is the longitudinal normalizing constant (which makes the maximum of T(θj|x) equal 1).
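A numerical round trip through Bayes' Theorem III shows the two normalizations at work. The prior and the truth function below are assumptions:

```python
import numpy as np

x = np.arange(0, 101)
P_x = np.exp(-x / 40.0); P_x /= P_x.sum()        # assumed prior P(x)
T = 1 / (1 + np.exp(-0.5 * (x - 60)))            # assumed truth function T(θj|x)

# Equation (8): P(x|θj) = P(x) T(θj|x) / T(θj), T(θj) = Σ_i P(x_i) T(θj|x_i)
T_theta = P_x @ T
P_x_given_theta = P_x * T / T_theta
assert np.isclose(P_x_given_theta.sum(), 1.0)    # horizontally normalized

# Equation (9): T(θj|x) = T(θj) P(x|θj) / P(x), T(θj) = 1 / max[P(x|θj)/P(x)]
T_theta2 = 1 / np.max(P_x_given_theta / P_x)
T_back = T_theta2 * P_x_given_theta / P_x
assert np.isclose(T_back.max(), 1.0)             # longitudinally normalized
assert np.allclose(T_back, T / T.max())          # the truth function is recovered
```

Going from the truth function to the likelihood function and back recovers the truth function up to its maximum being scaled to 1, which is exactly what the longitudinal normalization demands.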

The Matching Relation between Statistical Probabilities and Logical Probabilities
How can we understand a group of labels' extensions? We can learn from a sample. Let D be a sample {(x(t), y(t)) | t = 1 to N; x(t) ∈ X; y(t) ∈ Y}, where (x(t), y(t)) is an example, and the use of each label is almost reasonable. All examples with label yj in D form a sub-sample denoted by Dj.
From the sampling distributions and Bayes' Theorem III, we obtain the optimized truth function (Equation (12)):

T*(θj|x) = [P(x|yj)/P(x)]/max[P(x|yj)/P(x)], j = 1, 2, …, n.

Using Bayes' Theorem II, we further obtain Equation (13):

T*(θj|x) = P(yj|x)/max[P(yj|x)], j = 1, 2, …, n,

which means that the semantic channel matches the Shannon channel. A TPF P(yj|x) indicates the rule for using the label yj; hence, the above Formulas (12) and (13) make the optimized truth functions accord with how labels are used. In my mind, the "-" in "P-T" means Bayes' Theorem III and the above two formulas, which serve as a bridge between statistical probabilities and logical probabilities and hence as a bridge between statistics and logic.

Now, we can resolve the problem in Example 2 (see Figure 1). According to Equation (12), we have two optimized truth functions:

T*("adult"|x) = [P(x|"adult")/P(x)]/max[P(x|"adult")/P(x)],
T*("elder"|x) = [P(x|"elder")/P(x)]/max[P(x|"elder")/P(x)],

no matter whether the extensions are fuzzy or not. After we obtain T*(θj|x), we can make new probability predictions P(x|θj) using Bayes' Theorem III when P(x) is changed.
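Equation (12) can be sketched as follows for a crisp version of Example 2. The population prior and the sampling distribution for "adult" are assumptions, not the paper's data; the learned truth function is then reused under a changed prior:

```python
import numpy as np

x = np.arange(0, 101)

def normalize(w):
    return w / w.sum()

P_x = normalize(np.exp(-((x - 35) / 25.0) ** 2))          # assumed prior P(x)
P_x_given_adult = normalize(np.where(x >= 18, P_x, 0.0))  # assumed P(x|"adult")

# Equation (12): T*(θj|x) = [P(x|yj)/P(x)] / max[P(x|yj)/P(x)]
ratio = P_x_given_adult / P_x
T_adult = ratio / ratio.max()
assert T_adult[10] == 0.0 and np.isclose(T_adult[30], 1.0)  # extension [18, ∞)

# The learned extension survives a change of the prior: a semantic Bayes
# prediction with a new P(x), using Bayes' Theorem III (Equation (8)).
P_x_new = normalize(np.exp(-((x - 50) / 20.0) ** 2))
P_pred = normalize(P_x_new * T_adult)
assert np.isclose(P_pred.sum(), 1.0)
```

The same two lines compute T*("elder"|x) from a fuzzy P(x|"elder"); nothing in Equation (12) assumes a crisp extension.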
Next, we prove that Equation (13) is compatible with the statistical method of random sets for membership functions proposed by Wang [23] (refer to Figure 3). Suppose that xj* is the instance that makes P(yj|x) reach its maximum, that the examples (x, yj) in Dj have different x, and that the number of examples (xj*, yj) is Nj*. Then, we divide all examples with yj into Nj* rows, and every row can be treated as a set Sk. It is easy to prove [37] that the truth function obtained from Equation (13) is the same as the membership function obtained from the statistics of a random set.
Therefore, the truth functions so obtained accord with Reichenbach's frequency interpretation of logical probability [9]. However, we should distinguish the two kinds of probabilities that come from two different statistical methods. Popper was aware that the two kinds of probability were different and that there should be some relation between them ([7], pp. 252-258). Equation (13) should be what Popper wanted.
However, the above conclusions provided in Equations (12) and (13) need Dj to be big enough. Otherwise, P(yj|x) is not smooth, and hence P(x|θj) is meaningless. In these cases, we need to use the maximum likelihood criterion or the maximum semantic information criterion to optimize truth functions (see Section 3.2).

The Logical Probability and the Truth Function of a GPS Pointer or a Color Sense
The semantic information is conveyed not only by natural languages, but also by clocks, scales, thermometers, GPS pointers, stock indexes, and so on, as pointed out by Floridi [38]. We can use a Gaussian truth function (a Gaussian function without the normalizing coefficient),

T(θj|x) = exp[−(x − xj)²/(2σ²)],

as the truth function of yj = "x is about xj", where xj is a reading, x is the actual value, and σ is the standard deviation. For a GPS device, xj is the position (a vector) pointed to by yj, x is the actual position, and σ is the Root Mean Square (RMS) error, which denotes the accuracy of the GPS device.

Example 3.
A GPS device is used in a train, and P(x) is uniformly distributed on a line (see Figure 4). The GPS pointer has a deviation. Try to find the most possible position of the GPS device according to yj. Using the statistical probability method, one might think that the GPS pointer and the circle for the RMS tell us a likelihood function P(x|θj) or a TPF P(yj|x). It is not a likelihood function, because the most possible position is not the xj pointed to by yj. It is not a TPF P(yj|x) either, because we cannot know its maximum. It is reasonable to think that the GPS pointer provides a truth function.
Using the truth function, we can use Equation (8) to make a semantic Bayes prediction P(x|θj), according to which the position with the star is the most possible position. Most people can make the same prediction without using any mathematical formula. It seems that human brains automatically use a similar method: predicting according to the fuzzy extension of yj and the prior knowledge P(x).
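The prediction in Example 3 can be sketched as follows. The track geometry, the pointed position, and the RMS value are assumptions; the prior P(x) is uniform on the line and zero elsewhere:

```python
import numpy as np

xs = np.linspace(0, 10, 1001)
track = np.stack([xs, np.zeros_like(xs)], axis=1)   # the railway line y = 0
P_x = np.full(len(xs), 1 / len(xs))                 # uniform prior on the line

x_j = np.array([6.0, 1.5])                          # pointed position, off the line
sigma = 1.0                                         # assumed RMS accuracy
d2 = ((track - x_j) ** 2).sum(axis=1)
T = np.exp(-d2 / (2 * sigma ** 2))                  # Gaussian truth function

P_post = P_x * T / (P_x @ T)                        # semantic Bayes prediction, Eq. (8)
best = track[np.argmax(P_post)]
print(best)   # the most probable position: the point on the line nearest x_j
```

The maximum of P(x|θj) lands on the track at the point closest to the pointed position, not at the pointed position itself, matching the intuitive prediction described above.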
A color sense can also be treated as a reading of a device, which conveys semantic information [15]. In this case, the truth function of a color sense yj is also the similarity function or the confusion probability function between x and xj; logical probability T(θj) is the confusion probability of other colors that are confused with xj by our eyes.
We can also explain the logical probability of a hypothesis yj with the confusion probability. Suppose that there exists a Plato's idea xj for every fuzzy set θj. Then, the membership function of θj or the truth function of yj is also the confusion probability function between x and xj [14].
Like the semantic uncertainty of a GPS pointer, the semantic uncertainty of a microphysical quantity may also be expressed by a truth function instead of a probability function.

From Shannon's Information Measure to the Semantic Information Measure
Shannon's mutual information is defined as

I(X; Y) = Σj Σi P(xi, yj) log[P(xi|yj)/P(xi)],  (15)

where the base of the log is 2. If Y = yj, I(X; Y) becomes the Kullback-Leibler (KL) divergence

I(X; yj) = Σi P(xi|yj) log[P(xi|yj)/P(xi)].

If X = xi, I(X; yj) becomes

I(xi; yj) = log[P(xi|yj)/P(xi)].

Using the likelihood function P(x|θj) to replace the posterior distribution P(x|yj), we have (the amount of) semantic information conveyed by yj about xi:

I(xi; θj) = log[P(xi|θj)/P(xi)] = log[T(θj|xi)/T(θj)].

The above formula makes use of Bayes' Theorem III. Its philosophical meaning will be explained in Section 4. If the truth value of proposition yj(xi) is always 1, the above formula becomes Carnap and Bar-Hillel's semantic information formula [33]. The above semantic information formula can be used to measure not only the semantic information of natural languages, but also that of a GPS pointer, a thermometer, or a color sense. In the latter cases, the truth function also serves as the similarity function or the confusion probability function.
To average I(xi; θj) over different xi, we have the average semantic information

I(X; θj) = Σi P(xi|yj) log[T(θj|xi)/T(θj)],

where P(xi|yj) (i = 1, 2, …) is the sampling distribution. This formula can be used to optimize truth functions.
To average I(X; θj) over different yj, we have the semantic mutual information

I(X; Θ) = Σj P(yj) Σi P(xi|yj) log[T(θj|xi)/T(θj)].

Both I(X; θj) and I(X; Θ) can be used as the criterion of classifications. We also call I(X; θj) the generalized Kullback-Leibler (KL) information and I(X; Θ) the generalized mutual information.
If the truth function is a Gaussian truth function, I(X; Θ) is equal to the generalized entropy minus the mean relative squared error (taking natural logs, so that log T(θj|xi) = −(xi − xj)²/(2σj²)):

I(X; Θ) = −Σj P(yj) log T(θj) − Σj Σi P(xi, yj)(xi − xj)²/(2σj²).

Therefore, the maximum semantic mutual information criterion is similar to the Regularized Least Squares (RLS) criterion that is becoming popular in machine learning. Shannon proposed the fidelity evaluation function for data compression in his famous paper [3]. A specific fidelity evaluation function is the rate-distortion function R(D) [39], where D is the upper limit of the average distortion d(xi, yj) between xi and yj, and R is the minimum mutual information for given D. R(D) will be further introduced in relation to the control of random events in Section 3.3.
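For Gaussian truth functions, the identity above can be checked numerically. All distributions below are assumptions, and natural logs are used so that the information is in nats:

```python
import numpy as np

x = np.linspace(0, 100, 201)
centers = np.array([20.0, 50.0, 80.0])     # assumed x_j for three labels
sigmas = np.array([8.0, 10.0, 12.0])

P_x = np.exp(-x / 50.0); P_x /= P_x.sum()  # assumed prior P(x)
P_y = np.array([0.3, 0.4, 0.3])            # assumed P(y_j)
P_x_given_y = np.exp(-(x[None, :] - centers[:, None] + 5.0) ** 2
                     / (2 * sigmas[:, None] ** 2))
P_x_given_y /= P_x_given_y.sum(axis=1, keepdims=True)  # assumed sampling dists

T = np.exp(-(x[None, :] - centers[:, None]) ** 2 / (2 * sigmas[:, None] ** 2))
T_theta = T @ P_x                          # logical probabilities T(θj)

I_direct = sum(P_y[j] * (P_x_given_y[j] @ np.log(T[j] / T_theta[j]))
               for j in range(3))          # semantic mutual information
H_theta = -(P_y * np.log(T_theta)).sum()   # generalized entropy
mse = sum(P_y[j] * (P_x_given_y[j] @ ((x - centers[j]) ** 2 / (2 * sigmas[j] ** 2)))
          for j in range(3))               # mean squared-error term
assert np.isclose(I_direct, H_theta - mse)
```

The identity is exact because log T(θj|x) reduces to the negative squared-error term for a Gaussian truth function, regardless of the sampling distributions.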
Replacing d(xi, yj) with I(xi; θj), I developed another fidelity evaluation function R(G) [12,15], where G is the lower limit of semantic mutual information I(X; Θ), and R(G) is the minimum Shannon mutual information for given G. G/R(G) indicates the communication efficiency, whose upper limit is 1 when P(x|θj) = P(x|yj) for all j. R(G) is called the rate-verisimilitude function (this verisimilitude will be further discussed in Section 4.3). The rate-verisimilitude function is useful for data compression according to visual discrimination [15] and the convergence proofs of mixture models and maximum mutual information classifications [12].

Optimizing Truth Functions and Classifications for Natural Language
An essential task of statistical learning is to optimize predictive models (such as likelihood functions and logistic functions) and classifications. Since we can produce a likelihood function by using a truth function T(θj|x) and the prior probability distribution P(x), a truth function can also be treated as a predictive model. Additionally, a truth function used as a predictive model has the advantage that it still works when P(x) is changed. Now, we consider Example 2 again. When sample Dj is not big enough, and hence P(x|yj) is unsmooth, we cannot use Equation (12) or (13) to obtain a smooth truth function. In this case, we can use the generalized KL formula to get an optimized continuous truth function. For example, the optimized truth function for "elder" is

T*(θelder|x) = arg max_{θelder} ∑i P(xi|yelder)log[T(θelder|xi)/T(θelder)],

where "arg max ∑" means to maximize the sum by selecting the parameters θelder of the truth function.
In this case, we can assume that T(θelder|x) is a logistic function: T(θelder|x) = 1/{1 + exp[−u(x − v)]}, where u and v are two parameters to be optimized. If we know P(yj|x) without knowing P(x), we may assume that P(x) is constant to obtain the sampling distribution P(x|yj) and the logical probability T(θelder) [12]. It is called Logical Bayesian Inference [12] to (1) obtain the optimized truth function T*(θj|x) from the sampling distributions P(x|yj) and P(x) and (2) make the semantic Bayes prediction P(x|θj) using T*(θj|x) and P(x). Logical Bayesian Inference is different from Bayesian Inference [5]: the former uses the prior P(x), whereas the latter uses the prior P(θ).
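A minimal sketch of this optimization under the definitions above: the parameters u and v of a logistic truth function are chosen to maximize the generalized KL information, here by a crude grid search over a hypothetical sampling distribution (a gradient method would normally replace the grid search):

```python
import numpy as np

x = np.arange(0, 100)
px = np.ones(100) / 100                       # prior P(x), uniform for illustration

# A hypothetical sampling distribution P(x|"elder").
p_x_elder = np.exp(-(x - 72) ** 2 / (2 * 8.0 ** 2))
p_x_elder /= p_x_elder.sum()

def info(u, v):
    """Generalized KL information for the logistic truth function 1/(1+exp(-u(x-v)))."""
    T = 1.0 / (1.0 + np.exp(np.clip(-u * (x - v), -50, 50)))  # T(theta_elder|x)
    Tj = np.sum(px * T)                                        # logical probability
    return np.sum(p_x_elder * np.log2(T / Tj))

# Crude grid search over the two parameters.
best = max(((u, v) for u in np.linspace(0.1, 2, 20)
                   for v in np.linspace(40, 90, 51)),
           key=lambda p: info(*p))
print(best)   # optimized (u, v); v approximates the lower edge of "elder"
```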
In statistical learning, it is called label learning to obtain smooth truth functions or membership functions with parameters. From the philosophical perspective, it is the induction of labels' extensions.
In popular statistical learning methods, the label learning of two complementary labels, such as "elder" and "non-elder", is easy, because we can use a pair of logistic functions as two TPFs P(θ1|x) and P(θ0|x) or two truth functions T(θ1|x) and T(θ0|x) with parameters. However, multi-label learning is difficult [40], because it is impossible to design n TPFs with parameters. Nevertheless, using the P-T probability framework, multi-label learning is also easy, because every label's learning is independent [12].
With optimized truth functions, we can classify different instances into different classes using a classifier yj = f(x). For instance, we can classify people of different ages into classes with labels "child", "youth", "adult", "middle aged", and "elder". Using the maximum semantic information criterion, the classifier is

yj = f(x) = arg max_j log[T(θj|x)/T(θj)].

This classifier changes with the population age distribution P(x). With the prolongation of the human life span, v in the truth function T*(θelder|x) and the division point of "elder" will automatically increase [12].
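The dependence of the classifier on P(x) can be sketched as follows; the two logistic truth functions and the two population distributions are hypothetical:

```python
import numpy as np

x = np.arange(0, 101)

def classify(px, truths):
    """Maximum-semantic-information classifier: f(x) = argmax_j log[T(theta_j|x)/T(theta_j)]."""
    Tj = np.array([np.sum(px * t) for t in truths])    # logical probabilities
    scores = np.log(np.array(truths) / Tj[:, None])
    return np.argmax(scores, axis=0)

# Hypothetical truth functions: "non-elder" falling, "elder" rising around 65.
t_non = 1 / (1 + np.exp(0.3 * (x - 65)))
t_eld = 1 / (1 + np.exp(-0.3 * (x - 65)))

young_pop = np.exp(-x / 30); young_pop /= young_pop.sum()     # younger population
old_pop = np.exp(-(100 - x) / 30); old_pop /= old_pop.sum()   # longer-lived population

div_young = np.argmax(classify(young_pop, [t_non, t_eld]))    # first age labeled "elder"
div_old = np.argmax(classify(old_pop, [t_non, t_eld]))
print(div_young, div_old)   # the division point rises as the population ages
```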
The maximum semantic information criterion is compatible with the maximum likelihood criterion and is different from the maximum correctness criterion. It encourages us to reduce the underreporting of small-probability events. For example, if we classify people over 60 into the elder class according to the maximum correctness criterion, then we may classify people over 58 into the elder class according to the maximum semantic information criterion, because elderly people are less numerous than non-elderly or middle-aged people. To predict earthquakes, if one uses the maximum correctness criterion, one may always predict "no earthquake will happen next week". Such predictions are meaningless. If one uses the maximum semantic information criterion, one will predict "an earthquake is possible next week" in some cases, even at the cost of more mistakes.
The P-T probability framework and the G theory can also be used for improving maximum mutual information classifications and mixture models [12].

Truth Functions Used as Distribution Constraint Functions for Random Events' Control
A truth function is also a membership function and can be treated as the constraint function of the probability distribution of a random event or the density distribution of random particles. We simply call this function the Distribution Constraint Function (DCF).
The KL divergence and the Shannon mutual information can also be used to measure the control amount of controlling a random event. Let X be a random event, and P(x) = P(X = x) be the prior probability distribution. X may be the energy of a gas molecule, the size of a machined part, the age of one of the people in a country, and so on. P(x) may also be a density function.
If P(x|yj) is the posterior distribution after a control action yj, then the KL divergence I(X; yj) is the control amount (in bits), which reflects the complexity of the control. If the ideal posterior distribution is P(x|θj), then the effective control amount is

Ic(X; θj) = ∑i P(xi|θj)log[P(xi|yj)/P(xi)].

Notice that P(xi|θj) is on the left of "log" instead of on the right. For the generalized KL information I(X; θj), when the prediction P(x|θj) approaches the fact P(x|yj), I(X; θj) approaches its maximum. In contrast, for the effective control amount Ic(X; θj), as the fact P(x|yj) approaches the ideal P(x|θj), Ic(X; θj) approaches its maximum. For an inadequate P(x|yj), Ic(X; θj) may be negative. P(x|yj) may also have parameters.
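A small numerical sketch of the effective control amount (all distributions below are hypothetical): the ideal distribution P(x|θj) weights the log, while the achieved distribution P(x|yj) sits inside it, so control that misses the target yields a negative amount:

```python
import numpy as np

def effective_control(p_ideal, p_actual, prior):
    """Ic(X; theta_j) = sum_i P(xi|theta_j) log2[P(xi|yj)/P(xi)]:
    the ideal distribution weights, the achieved one sits inside the log."""
    return np.sum(p_ideal * np.log2(p_actual / prior))

prior = np.ones(10) / 10                                              # P(x)
ideal = np.array([0, 0, 0, .2, .6, .2, 0, 0, 0, 0])                   # target P(x|theta_j)
good = np.array([.01, .01, .02, .2, .5, .2, .02, .02, .01, .01])      # close to ideal
bad = np.array([.3, .3, .2, .05, .05, .05, .02, .01, .01, .01])       # far from ideal

print(effective_control(ideal, good, prior))   # positive: control is effective
print(effective_control(ideal, bad, prior))    # negative: control misses the target
```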
A truth function T(θj|x) used as a DCF constrains the posterior distribution P(x|yj). If θj is a crisp set, this constraint means that x cannot be outside of θj. If θj is fuzzy, it means that the probability of x outside of θj should be limited. There are many distributions P(x|yj) that meet this constraint, but only one requires the minimum KL information I(X; yj). For example, assuming that xj makes T(θj|xj) = 1, if P(x|yj) = 1 for x = xj and P(x|yj) = 0 for x ≠ xj, then P(x|yj) meets the constraint. However, this P(x|yj) requires information I(X; yj) that is not minimal. I studied the rate-tolerance function R(Θ) [41], which is the extension of the complexity-distortion function R(C) [42]. Unlike the rate-distortion function, the constraint condition of the complexity-distortion function is that every distortion d(xi, yj) = (xi − yj)² is less than a given value C, which means the constraint sets possess the same magnitude. Unlike the constraint condition of R(C), the constraint condition of R(Θ) is that the constraint sets are fuzzy and possess different magnitudes. I have concluded [14,15]:


 For given DCFs T(θj|x) (j = 1, 2, …, n) and P(x), when P(x|yj) = P(x|θj) = P(x)T(θj|x)/T(θj), the KL divergence I(X; yj) and Shannon's mutual information I(X; Y) reach their minima, and the effective control amount Ic(X; yj) reaches its maximum. If every set θj is crisp, I(X; yj) = −logT(θj) and I(X; Y) = −∑j P(yj)logT(θj).
 A rate-distortion function R(D) is equivalent to a rate-tolerance function R(Θ), and a semantic mutual information formula can express it with truth functions or DCFs (see Appendix B for details). However, an R(Θ) function may not be equivalent to an R(D) function; hence, R(D) is a special case of R(Θ).
Let dij = d(xi, yj) be the distortion or loss when we use yj to represent xi. D is the upper limit of the average distortion. For given P(x), we can obtain the minimum Shannon mutual information I(X; Y), i.e., R(D). The parameterization of R(D) ([43], p. 32) includes two formulas:

D(s) = ∑i ∑j P(xi)P(yj|xi)dij,  R(s) = sD(s) − ∑i P(xi)log λi,

where parameter s = dR/dD ≤ 0 reflects the slope of R(D). The posterior distribution P(y|xi) is

P(yj|xi) = P(yj)exp(sdij)/λi,  λi = ∑j P(yj)exp(sdij),

which makes I(X; Y) reach its minimum. As s ≤ 0, the maximum of exp(sdij) is 1, reached when s = 0. An often-used distortion function is d(xi, yj) = (yj − xi)². For this distortion function, exp(sdij) is a Gaussian function (without the coefficient). Therefore, exp(sdij) can be treated as a truth function or a DCF T(θxi|y), where θxi is a fuzzy set on Y instead of X; λi can be treated as the logical probability T(θxi) of xi. Now, we can find that Equation (27) is actually a semantic Bayes formula (in Bayes' Theorem III). An R(D) function can be expressed by the semantic mutual information formula with a truth function that is equal to an R(Θ) function (see Appendix B for details).
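The parametric solution can be sketched with a Blahut-style fixed-point iteration; the Gaussian-like source, the grids, and the slope values below are illustrative choices, and R is computed in nats:

```python
import numpy as np

# Discretized source and reproduction alphabets with squared distortion.
xs = np.linspace(-3, 3, 61)
ys = np.linspace(-3, 3, 61)
px = np.exp(-xs ** 2 / 2); px /= px.sum()
d = (ys[None, :] - xs[:, None]) ** 2           # distortion matrix d(xi, yj)

def rd_point(s, iters=300):
    """One (D, R) point of R(D) for slope parameter s <= 0."""
    py = np.ones(len(ys)) / len(ys)
    for _ in range(iters):                     # fixed-point iteration on P(y)
        lam = (py[None, :] * np.exp(s * d)).sum(axis=1)      # lambda_i
        p_y_given_x = py[None, :] * np.exp(s * d) / lam[:, None]
        py = px @ p_y_given_x
    lam = (py[None, :] * np.exp(s * d)).sum(axis=1)          # recompute at convergence
    p_y_given_x = py[None, :] * np.exp(s * d) / lam[:, None]
    D = np.sum(px[:, None] * p_y_given_x * d)
    R = s * D - np.sum(px * np.log(lam))       # R = sD - sum_i P(xi) log(lambda_i), in nats
    return D, R

D1, R1 = rd_point(-1.0)
D2, R2 = rd_point(-4.0)
print((D1, R1), (D2, R2))   # a steeper slope s gives smaller D and larger R
```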
In statistical mechanics, a similar distribution to the above P(y|xi) is the Boltzmann distribution [44]:

P(xi|T) = exp[−ei/(kT)]/Z,  Z = ∑i exp[−ei/(kT)],

where P(xi|T) is the probability of a particle in the ith state with energy ei, or the density of particles in the ith state with energy ei; T is the absolute temperature, k is the Boltzmann constant, and Z is the partition function. Suppose that ei is the ith energy level, Gi is the number of states with ei, and G is the total number of all states. Then, P(xi) = Gi/G is the prior distribution. Hence, the above formula becomes

P(xi|T) = P(xi)exp[−ei/(kT)]/Z′,  Z′ = ∑i P(xi)exp[−ei/(kT)]. (29)

Now, we can find that exp[−ei/(kT)] can be treated as a truth function or a DCF, Z′ as a logical probability, and Equation (29) as a semantic Bayes formula in Bayes' Theorem III.
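The identity between the level-wise Boltzmann distribution and its semantic Bayes form can be checked numerically; the energy levels and degeneracies below are hypothetical:

```python
import numpy as np

k = 1.380649e-23            # Boltzmann constant (J/K)
T = 300.0                   # absolute temperature (K)
e = np.array([0.0, 1.0, 2.0, 3.0]) * k * T   # hypothetical energy levels e_i
G = np.array([1, 3, 5, 7])                   # degeneracies G_i

# Standard Boltzmann distribution over energy levels.
Z = np.sum(G * np.exp(-e / (k * T)))
p_boltz = G * np.exp(-e / (k * T)) / Z

# The same distribution as a semantic Bayes formula: prior P(xi) = Gi/G_total,
# "truth function" exp(-ei/(kT)), "logical probability" Z'.
prior = G / G.sum()
Zp = np.sum(prior * np.exp(-e / (k * T)))    # Z' = Z / G_total
p_semantic = prior * np.exp(-e / (k * T)) / Zp

print(np.allclose(p_boltz, p_semantic))      # the two forms coincide
```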
For local equilibrium systems, different areas of a system have different temperatures. I ([14], pp. 102-103) derived the relationship between the minimum Shannon mutual information R(Θ) and the thermodynamic entropy S with the help of the truth function or the DCF (see Appendix C for details). This relationship indicates that the maximum entropy principle is equivalent to the minimum mutual information principle.

How Popper's Thought about Scientific Progress is Supported by the Semantic Information Measure
As early as 1935, in The Logic of Scientific Discovery [7], Popper put forward that the less the logical probability of a hypothesis, the easier it is to falsify, and hence the more empirical information there is. He says: "The amount of empirical information conveyed by a theory, or its empirical content, increases with its degree of falsifiability." (p. 96) "The logical probability of a statement is complementary to its degree of falsifiability: it increases with decreasing degree of falsifiability." (p. 102) However, Popper did not provide a proper semantic information formula. In 1948, Shannon proposed the famous information theory with statistical probability [3]. In 1950, Carnap and Bar-Hillel proposed a semantic information measure with logical probability:

inf(p) = log(1/mp) = −log mp,

where p is a proposition and mp is its logical probability. However, this formula is irrelevant to the instance that may or may not make p true. Therefore, it can only indicate how severe the test is, not how well p survives the test.
In 1963, Popper published his book Conjectures and Refutations [45]. In this book, he affirms more clearly that the significance of scientific theories is to convey information. He says: "It characterizes as preferable the theory which tells us more; that is to say, the theory which contains the greater amount of experimental information or content; which is logically stronger; which has greater explanatory and predictive power; and which can therefore be more severely tested by comparing predicted facts with observations. In short, we prefer an interesting, daring, and highly informative theory to a trivial one." ( [45], p. 294).
In this book, Popper proposes using ( [45], p. 526) P(e, hb)/P(e, b) or log[P(e, hb)/P(e, b)] to represent how well a theory survives a severe test, where e is interpreted as the supporting evidence of the theory h, and b is background knowledge. P(e, hb) and P(e, b) in [45] are conditional probabilities, whose modern forms are P(e|h, b) and P(e|b). Different from Carnap and Bar-Hillel's semantic information formula, the above formula can reflect how well evidence e supports theory h. However, P(e, hb) is the conditional probability of e. It is not easy to explain P(e, hb) as a logical probability. Hence, we cannot say that log[P(e, hb)/P(e, b)] represents the amount of semantic information.
There have been several semantic or generalized information measures [38,46,47]. My semantic information measure supports Popper's theory about hypothesis evaluation more than others. We use an example to show its properties.
Suppose that yj is a GPS pointer or a hypothesis "x is about xj" with a Gaussian truth function T(θj|x) = exp[−(x − xj)²/(2σ²)] (see Figure 5). Then, we have the amount of semantic information:

I(xi; θj) = log[T(θj|xi)/T(θj)]. (32)

Figure 5 shows that the truth function and the semantic information change with x. Small P(xi) means that xi is unexpected; large P(xi|θj) means that the prediction is correct; log[P(xi|θj)/P(xi)] indicates how severely and how well yj is tested by xi. Large T(θj|xi) means that yj is true or close to the truth; small T(θj) means that yj is precise. Hence, log[T(θj|xi)/T(θj)] indicates the verisimilitude of xj reflecting xi. Unexpectedness, correctness, testability, truth, precision, verisimilitude, and deviation are all reconciled in this formula. Figure 5 indicates that the less the logical probability is, the more information there is; the larger the deviation is, the less information there is; and a wrong hypothesis conveys negative information. According to this measure, the information conveyed by a tautology or a contradiction is 0. These conclusions accord with Popper's thought.
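These properties can be sketched numerically for a GPS pointer; the uniform prior and all position values are hypothetical. A precise pointer (small σ) conveys more information when nearly right, and a large deviation makes the information negative:

```python
import numpy as np

def semantic_info_bits(xi, xj, sigma, xs, px):
    """I(xi; theta_j) = log2[T(theta_j|xi)/T(theta_j)] for a Gaussian truth function."""
    truth = np.exp(-(xs - xj) ** 2 / (2 * sigma ** 2))
    Tj = np.sum(px * truth)                              # logical probability T(theta_j)
    t_xi = np.exp(-(xi - xj) ** 2 / (2 * sigma ** 2))    # truth value at the outcome
    return np.log2(t_xi / Tj)

xs = np.linspace(0, 100, 1001)
px = np.ones_like(xs) / len(xs)                          # uniform prior for illustration
xj = 50.0                                                # pointed position

print(semantic_info_bits(50, xj, 1.0, xs, px))   # precise and right: large positive
print(semantic_info_bits(50, xj, 10.0, xs, px))  # fuzzy and right: smaller positive
print(semantic_info_bits(53, xj, 1.0, xs, px))   # large deviation: negative
```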

How the Semantic Information Measure Supports Popper's Falsification Thought
Popper affirms that a hypothesis with less logical probability is more easily falsified. If it survives tests, it can convey more information. The semantic information formula for I(x; θj) reflects this thought. Popper affirms that a counterexample can falsify a universal hypothesis. The generalized KL information (Equation (19)) supports this point of view. The truth function of a universal hypothesis only takes value 0 or 1. If there is an instance xi that makes T(θj|xi) = 0, then I(xi; θj) is −∞. The average information I(X; θj) is also −∞, which falsifies the universal hypothesis.
Lakatos partly accepted Kuhn's thought about falsification. He pointed out [48] that in scientific practices, scientists do not readily give up a theory (or a hypothesis) when minor observed facts falsify it. They often add some auxiliary hypotheses to the theory or its hard core so that the theory with the auxiliary hypotheses accords with more observed facts. Lakatos hence describes his improved falsification as sophisticated falsification.
Lakatos is correct. For example, according to the hard core of science, distances from a GPS device to three satellites ascertain a GPS device's position. However, to provide accurate positioning, the manufacturers also need to consider additional factors, including satellite geometry, signal blockage, atmospheric conditions, and so on. However, Popper is not necessarily wrong. We can plead for Popper with the following reasons:


 Popper claims that scientific knowledge grows by repeating conjectures and refutations. Repeating conjectures should include adding auxiliary hypotheses.
 Falsification is not the aim. Falsifiability is only the demarcation criterion of scientific and nonscientific theories. The aim of science is to predict empirical facts with more information.
 Scientists hold a scientific theory depending on whether it can convey more information than other theories. Therefore, being falsified does not mean being given up.
However, there is still a problem: how can a falsified hypothesis convey more information than others? According to the average semantic information formula or the generalized KL formula, we can raise the average semantic information by
 increasing the fuzziness or decreasing the predictive precision of the hypothesis to a proper level, and
 reducing the degree of belief in a rule or a major premise.
We may change a universal hypothesis into a universal hypothesis that is not strict. For example, we may change "all swans are white" into "almost all swans are white," "all swans may be white," or "swans are white; the degree of confirmation is 0.9".
A GPS device is a quantitative example. A GPS device uses RMS or the like to indicate its accuracy (correctness and precision). A larger circle around the GPS pointer on a GPS map (see Figure 4) means a larger RMS and lower accuracy. If a GPS device shows a few more actual positions beyond the circle, but not too far, we may not give up the GPS device. We can assume that it has a slightly larger RMS and continue to use it. We choose the best hypothesis, just as we want a GPS device with the highest accuracy. If a GPS device may give wrong directions, we may reduce the degree of belief in it to decrease the average information loss [12].

For Verisimilitude: To Reconcile the Content Approach and the Likeness Approach
In Popper's earlier works, he emphasizes the importance of hypotheses with small logical probabilities, without stressing hypotheses' truth. In contrast, in Popper's later book Realism and the Aim of Science [49], he explains that science aims at better explanatory theories, which accord with facts and hence are true, even if we do not know or cannot prove that they are true. Lakatos [50], therefore, points out that Popper's game of science and his aim of science are contradictory. The game of science consists of bold conjectures and severe refutations; the conjectures need not be true. However, the aim of science consists in developing true or truth-like theories about the mind-independent world.

Lakatos' criticism is related to Popper's concept of verisimilitude. Popper proposed a verisimilitude measure Vs(a), where a is a proposition and 1 − P(a) means the information content. This formula means that the less the logical probability, the more information content there is, and hence, the higher the verisimilitude is. Popper wishes that Vs(a) changes from −1 to 1. However, according to this formula, Vs(a) cannot appear between −1 and 0. The more serious problem is that he only makes use of the logical probability P(a) without using the truth value of proposition a or the consequence that may be closer to the truth than another [51], so that it cannot express the likeness between the prediction and the consequence. Popper later admitted this mistake and emphasized that we need true explanatory theories that accord with the real world [49]. Now, researchers use three approaches to interpret verisimilitude [52]: the content approach, the consequence approach, and the likeness approach. As the latter two are related yet different from the content approach, theories of verisimilitude have routinely been classified into two rival camps: the content approach and the likeness approach [52].
The content approach emphasizes tests' severity, unlike the likeness approach, which emphasizes hypotheses' truth or closeness to the truth. Some researchers think that the content approach and the likeness approach are irreconcilable [53], just as Lakatos thinks that Popper's game of science and his aim of science are contradictory. There are also researchers, such as Oddie [52], who try to combine the content approach and the likeness approach. However, they admit that it is not easy to combine them.
This paper continues Oddie's effort. Using the semantic information measure as verisimilitude or truth-likeness, we can combine the content approach and the likeness approach easily.
In Equation (32) and Figure 5, the truth function T(θj|x) is also the confusion probability function; it reflects likeness between x and xj. The xi (or X = xi) is the consequence, and the distance between xi and xj in the feature space reflects the likeness. The log[1/T(θj)] represents the testing severity and potential information content. Using Equation (32), we can easily explain an often-mentioned example: why "the sun has 9 satellites" (8 is true) has higher verisimilitude than "the sun has 100 satellites" [52].
Another often-mentioned example is how to measure the verisimilitude of weather predictions [52]. Using the semantic information method, we can assess more practical weather predictions. We use a vector x = (h, r, w) to denote weather, where h is temperature, r is rainfall, and w is wind speed. Let xj = (hj, rj, wj) be the predicted weather and xi = (hi, ri, wi) be the actual weather (consequence). The prediction is yj = "x is about xj". For simplicity, we assume that h, r, and w are independent. The Gaussian truth function may be

T(θj|x) = exp[−(h − hj)²/(2σh²) − (r − rj)²/(2σr²) − (w − wj)²/(2σw²)].

This truth function accords with the core of the likeness approach [52]: that the likeness of a proposition depends on the distance between two instances (x and xj). If the consequence is xi, then the truth value T(θj|xi) of proposition yj(xi) is the likeness. Additionally, the information I(xi; θj) = log[T(θj|xi)/T(θj)] is the verisimilitude, which has almost all desirable properties for which the three approaches are used. This verisimilitude I(xi; θj) is also related to the prior probability distribution P(x). A prediction of unusual weather has much higher verisimilitude than a prediction of common weather if both predictions are right. Now, we can explain why the Regularized Least Squares criterion is becoming popular: it is similar to the maximum average verisimilitude criterion (refer to Equation (21)).
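The likeness part of such a vector truth function can be sketched directly; the predicted values, outcomes, and tolerances σh, σr, σw below are hypothetical (the full verisimilitude additionally divides by the logical probability T(θj), as in Equation (32)):

```python
import numpy as np

def truth_weather(x, xj, sigmas):
    """Vector Gaussian truth function for x = (h, r, w), assuming independence."""
    x, xj, sigmas = map(np.asarray, (x, xj, sigmas))
    return float(np.exp(-np.sum((x - xj) ** 2 / (2 * sigmas ** 2))))

sig = (2.0, 5.0, 3.0)               # tolerances for temperature, rainfall, wind
pred = (30.0, 20.0, 10.0)           # predicted weather xj
close = (31.0, 22.0, 11.0)          # actual weather near the prediction
far = (20.0, 5.0, 2.0)              # actual weather far from it

print(truth_weather(close, pred, sig))   # likeness near 1
print(truth_weather(far, pred, sig))     # likeness near 0
```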

The Purpose of Confirmation: Optimizing the Degrees of Belief in Major Premises for Uncertain Syllogisms
Confirmation is often explained as assessing the evidential impact on hypotheses (incremental confirmation) [11,54] or the evidential support for hypotheses (absolute confirmation) [55,56]. The degree of confirmation is the degree of belief that is supported by evidence or data [57]. Researchers only use confirmation measures to assess hypotheses. There have been various confirmation measures [11]. I also proposed two different confirmation measures [58]. In the following, I briefly introduce my study on confirmation related to scientific reasoning.
In my study on confirmation, the task, purpose, and method of confirmation are a little different from those in the popular confirmation theories.


The task of confirmation: Only major premises, such as "if the medical test is positive, then the tested person is infected" and "if x is a raven, then x is black", need confirmation. The degrees of confirmation are between −1 and 1. A proposition, such as "Tom is elderly", or a predicate, such as "x is elderly" (x is one of the given people), needs no confirmation. The truth function of the predicate reflects the semantic meaning of "elderly" and is determined by the definition or the idiomatic usage of "elderly". The degree of belief in a proposition is a truth value, and that in a predicate is a logical probability. The truth value and the logical probability are between 0 and 1 instead of −1 and 1.


The purpose of confirmation: The purpose of confirmation is not only for assessing hypotheses (major premises), but also for probability predictions or uncertain syllogisms. A syllogism needs a major premise. However, as pointed out by Hume and Popper, it is impossible to obtain an absolutely right major premise for an infinite universe by induction. However, it is possible to optimize the degree of belief in the major premise by the proportions of positive examples and counterexamples. The optimized degree of belief is the degree of confirmation. Using a degree of confirmation, we can make an uncertain or fuzzy syllogism. Therefore, confirmation is an important link in scientific reasoning according to experience.


The method of confirmation: I do not directly define a confirmation measure, as most researchers do. I derive the confirmation measures by optimizing the degree of belief in a major premise with the maximum semantic information criterion or the maximum likelihood criterion. This method is also the method of statistical learning, where the evidence is a sample.
From the perspective of statistical learning, a major premise comes from a classifier or a predictive rule. We use the medical test as an example to explain the relationship between classification and confirmation (see Figure 6). The binary signal detection and the classification with labels "elder" and "non-elder" are similar. Classifications for watermelons and junk emails are also similar. Figure 6. Illustrating the medical test and the binary classification for explaining confirmation. If x is in E1, we use e1 as the prediction "h = h1"; if x is in E0, we use e0 as the prediction "h = h0".
In Figure 6, h1 denotes an infected specimen (or person), h0 denotes an uninfected specimen, e1 is positive, and e0 is negative. We can treat e1 as a prediction "h is infected" and e0 as a prediction "h is uninfected". The x is the observed feature of h; E1 and E0 are two subsets of the universe of x. If x is in E1, we select e1; if x is in E0, we select e0. For the binary signal detection, we use "0" or "1" in the destination to predict 0 or 1 in the source according to the received analog signal x.
The two major premises to be confirmed are "if e1 then h1", denoted by e1→h1, and "if e0 then h0", denoted by e0→h0. A confirmation measure is denoted by c(e→h).
The dividing point x′ ascertains a classification. For a given classification or predictive rule, we can obtain a sample including four kinds of examples: (e1, h1), (e0, h1), (e1, h0), and (e0, h0). Then, we can use the numbers a, b, c, and d of the four kinds of examples (see Table 1) to construct confirmation measures.

Information measures and confirmation measures are used for different tasks. To compare two hypotheses and choose the better one, we use an information measure. Using the maximum semantic information criterion, we can find the best x′ for the maximum mutual information classifications [12]. To optimize the degree of belief in a major premise, we need a confirmation measure. An information measure is used for classifications, whereas a confirmation measure is used after classifications.
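As a sketch, the four counts in Table 1 can be produced from a simulated binary classification; the feature distributions, sample sizes, and the dividing point x′ below are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical test: infected specimens (h1) score higher on feature x than
# uninfected ones (h0); the dividing point x' defines E1 (x >= x') and E0.
x_h1 = rng.normal(2.0, 1.0, 1000)       # features of infected specimens
x_h0 = rng.normal(0.0, 1.0, 9000)       # features of uninfected specimens
x_prime = 1.0                           # classification boundary x'

a = np.sum(x_h1 >= x_prime)   # (e1, h1): true positives
b = np.sum(x_h1 < x_prime)    # (e0, h1): false negatives
c = np.sum(x_h0 >= x_prime)   # (e1, h0): false positives
d = np.sum(x_h0 < x_prime)    # (e0, h0): true negatives
print(a, b, c, d)             # these four counts feed the confirmation measures
```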

Channel Confirmation Measure b* for Assessing a Classification as a Channel
A binary classification ascertains a Shannon channel, which includes four conditional probabilities, as shown in Table 2.

We regard predicate e1(h) as the combination of believable and unbelievable parts (see Figure 7). The truth function of the believable part of e1 is T(E1|h) ∈ {0, 1}. There are T(E1|h1) = T(E0|h0) = 1 and T(E1|h0) = T(E0|h1) = 0. The unbelievable part is a tautology, whose truth function is always 1. Then, we have the truth functions of predicates e1(h) and e0(h):

T(θe1|h1) = 1, T(θe1|h0) = b1′;  T(θe0|h0) = 1, T(θe0|h1) = b0′.

Model parameter b1 is the proportion of the believable part, and b1′ = 1 − |b1| is the proportion of the unbelievable part and also the truth value of e1(h0), where h0 is a counter-instance. The b1′ may be regarded as the degree of disbelief in the major premise e1→h1. The four truth values form a semantic channel, as shown in Table 3. Table 3. The semantic channel ascertained by two degrees of disbelief b1′ and b0′.
In the medical test, the likelihood ratio measure LR = P(e1|h1)/P(e1|h0) is used for assessing how reliable a testing result (positive or negative) is. Measure F proposed by Kemeny and Oppenheim [55] is

F(e1→h1) = [P(e1|h1) − P(e1|h0)]/[P(e1|h1) + P(e1|h0)] = (LR − 1)/(LR + 1).

Like measure F, measure b* is also a function of LR, and hence both can be used for assessing the medical test. Compared with LR, b* and F can better indicate the distance between a test (any b*) and the best test (b* = 1) or the worst test (b* = −1). Compared with F, b* is better for probability predictions. For example, from b1* > 0 and P(h), we have

P(h1|θe1) = P(h1)/[P(h1) + (1 − b1*)P(h0)].

This formula is simpler than the classical Bayes formula (see Equation (6)). If b1* = 0, then P(h1|θe1) = P(h1). If b1* < 0, then we can make use of Consequent Symmetry to make the probability prediction [58]. So far, it is still problematic to use b*, F, or another measure to assess how good a probability prediction is or to clarify the Raven Paradox.
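These predictions can be sketched with hypothetical counts; the form b* = (LR − 1)/max(LR, 1) and the semantic Bayes prediction below are assumptions following [58] and the truth values T(θe1|h1) = 1, T(θe1|h0) = 1 − b1*:

```python
def b_star(a, b, c, d):
    """Channel confirmation measure b* as a function of the likelihood ratio;
    the form (LR - 1)/max(LR, 1) is assumed here following [58]."""
    lr = (a / (a + b)) / (c / (c + d))       # LR = P(e1|h1)/P(e1|h0)
    return (lr - 1) / max(lr, 1)

def predict_h1(b1_star, p_h1):
    """Semantic Bayes prediction from b1* > 0 and the prior P(h1),
    assuming T(theta_e1|h1) = 1 and T(theta_e1|h0) = 1 - b1*."""
    p_h0 = 1 - p_h1
    return p_h1 / (p_h1 + (1 - b1_star) * p_h0)

b1 = b_star(990, 10, 100, 8900)              # a sensitive, fairly specific test
print(b1)                                    # close to 1
print(predict_h1(b1, 0.01))                  # posterior for a rare disease
print(predict_h1(b1, 0.30))                  # same b1*, larger prior P(h1)
```

Note how the same degree of confirmation b1* yields different posteriors as the prior P(h1) changes, which is why b* rather than c* should be used when P(h) varies.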

Prediction Confirmation Measure c* for Clarifying the Raven Paradox
Statistics not only uses the likelihood ratio to indicate how reliable a testing method (as a channel) is, but also uses the correct rate to indicate how probable the predicted event is. Measures F and b*, like LR, cannot indicate the quality of a probability prediction. For example, b1* > 0 does not mean P(h1|θe1) > P(h0|θe1). Most other confirmation measures have similar problems [58].
We now treat the probability prediction P(h|θe1) as the combination of a believable part with proportion c1 and an unbelievable part with proportion c1′, as shown in Figure 8. We call c1 the degree of belief in rule e1→h1 as a prediction. When the prediction accords with the fact, e.g., P(h|θe1) = P(h|e1), c1 becomes c1*. Then, we derive the prediction confirmation measure

c*(e1→h1) = c1* = [P(h1|e1) − P(h0|e1)]/max(P(h1|e1), P(h0|e1)). (45)

In like manner, we obtain

c*(e0→h0) = c0* = [P(h0|e0) − P(h1|e0)]/max(P(h0|e0), P(h1|e0)).

It is easy to prove that c*(e1→h1) also possesses Consequent Symmetry. Making use of this symmetry, we can obtain c*(e1→h0) = −c*(e1→h1) and c*(e0→h1) = −c*(e0→h0).
When c1* > 0, according to Equation (45), we have the correct rate of rule e1→h1:

P(h1|e1) = 1/(2 − c1*).

When c*(e1→h1) < 0, we may make use of Consequent Symmetry to make the probability prediction. However, when P(h) is changed, we should still use b* with P(h) for probability predictions.
For the medical test, we need both confirmation measures. Measure b* tells the reliability of a test as a means or a channel in comparison with other tests, whereas measure c* tells the possibility that a person is infected. For scientific predictions, such as earthquake predictions, b* and c* have similar meanings.

Now, we can use measure c* to clarify the Raven Paradox. Hempel [59] proposed the confirmation paradox or the Raven Paradox. According to the Equivalence Condition in classical logic, "if x is a raven, then x is black" (Rule I) is equivalent to "if x is not black, then x is not a raven" (Rule II). A piece of white chalk supports Rule II; hence, it also supports Rule I. However, according to the Nicod Criterion [60], a black raven supports Rule I, a non-black raven undermines Rule I, and a non-raven thing, such as a black cat or a piece of white chalk, is irrelevant to Rule I. Hence, there exists a paradox between the Equivalence Condition and the Nicod Criterion.
To clarify the Raven Paradox, some researchers, including Hempel [59], affirm the Equivalence Condition and deny the Nicod Criterion; some researchers, such as Scheffler and Goodman [61], affirm the Nicod Criterion and deny the Equivalence Condition. Some researchers do not fully affirm or deny the Equivalence Condition or the Nicod Criterion. Figure 9 shows a sample for the major premise "if x is a raven, then x is black". First, we consider measure F to see if we can use it to eliminate the Raven Paradox. The difference between F(e1→h1) and F(h0→e0) is that their counterexamples are the same (c = 1), yet their positive examples are different. When d increases to d + Δd, F(e1→h1) = (ad − bc)/(ad + bc + 2ac) and F(h0→e0) = (ad − bc)/(ad + bc + 2dc) unequally increase. Therefore, though measure F denies the Equivalence Condition, it still affirms that Δd affects both F(e1→h1) and F(h0→e0); hence, measure F does not accord with the Nicod Criterion.
Among all confirmation measures, only measure c* ensures that the measure f(a, b, c, d) is affected only by a and c. For the sample in Figure 9, c*(e1→h1) = (6 − 1)/6 = 5/6. However, many researchers still think that the Nicod Criterion is incorrect and that this criterion accords with our intuition only because a confirmation measure c(e1→h1) can evidently increase with a and slightly increase with d. For example, Fitelson and Hawthorne [62] believe that measure LR may be used to explain that a black raven can confirm "ravens are black" more strongly than a non-black non-raven thing. Is it true?
For the above example, LR, F, and b* do ensure that the increment Δf caused by Δa = 1 is bigger than Δf caused by Δd = 1. However, when a = d = 20 and b = c = 10, except for c*, no measure can ensure Δf/Δa > Δf/Δd (see Table 13 in [58] for details), which means that no popular confirmation measure can be used to explain that a black raven confirms "ravens are black" more strongly than a piece of white chalk does.
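The increment comparison can be checked directly; F and c* follow the forms quoted in this section, while b* = (LR − 1)/max(LR, 1) is an assumed form based on [58]:

```python
# Checking the increment claim with the counts a = d = 20, b = c = 10.
def F(a, b, c, d):
    return (a * d - b * c) / (a * d + b * c + 2 * a * c)

def c_star(a, b, c, d):
    return (a - c) / max(a, c)

def LR(a, b, c, d):
    return (a / (a + b)) / (c / (c + d))

def b_star(a, b, c, d):
    lr = LR(a, b, c, d)
    return (lr - 1) / max(lr, 1)

base = (20, 10, 10, 20)
for f in (F, LR, b_star, c_star):
    da = f(21, 10, 10, 20) - f(*base)       # effect of one more black raven
    dd = f(20, 10, 10, 21) - f(*base)       # effect of one more non-black non-raven
    print(f.__name__, da > dd)              # only c_star satisfies da > dd here
```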
As c*(e1→h1) = (a − c)/max(a, c) and c*(h0→e0) = (d − c)/max(d, c), the Equivalence Condition does not hold, and measure c* accords with the Nicod Criterion very well. Therefore, the Raven Paradox does not exist anymore according to measure c*.
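The claims above can be checked numerically. The sketch below is a minimal illustration using only the formulas quoted in this section (F(e1→h1), F(h0→e0), and c*); the function names and test values are assumptions of this sketch, not part of the original derivation. It compares how F and c* respond to a new black raven (Δa = 1) and a new piece of white chalk (Δd = 1).

```python
# Confirmation measures from the text; a, b, c, d are the counts in the
# 2x2 contingency table for "if x is a raven (e1), then x is black (h1)":
#   a: black ravens (positive examples), c: non-black ravens (counterexamples),
#   b: black non-ravens, d: non-black non-ravens.

def F_rule(a, b, c, d):
    """F(e1->h1) = (ad - bc) / (ad + bc + 2ac)."""
    return (a * d - b * c) / (a * d + b * c + 2 * a * c)

def F_converse(a, b, c, d):
    """F(h0->e0) = (ad - bc) / (ad + bc + 2dc)."""
    return (a * d - b * c) / (a * d + b * c + 2 * d * c)

def c_star(a, c):
    """c*(e1->h1) = (a - c) / max(a, c): depends only on a and c."""
    return (a - c) / max(a, c)

a, b, c, d = 20, 10, 10, 20
base_F, base_c = F_rule(a, b, c, d), c_star(a, c)

# A new black raven (delta a = 1) raises c*; a piece of white chalk
# (delta d = 1) leaves c* unchanged, in accordance with the Nicod Criterion.
print(c_star(a + 1, c) > base_c)        # True
print(c_star(a, c) == base_c)           # True: d does not appear in c* at all

# Measure F, by contrast, also increases with d, so it does not
# accord with the Nicod Criterion.
print(F_rule(a, b, c, d + 1) > base_F)  # True
```

This makes the asymmetry concrete: c* responds only to positive examples and counterexamples, while both directions of F respond to irrelevant non-raven things.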

How Confirmation Measures F, b*, and c* are Compatible with Popper's Falsification Thought
Popper affirms that a counterexample can falsify a universal hypothesis or a major premise. However, for an uncertain major premise, how do counterexamples affect its degree of confirmation? Confirmation measures F, b*, and c* can reflect that having fewer counterexamples is more important than having more positive examples.
According to Popper's affirmation that a counterexample can falsify a universal hypothesis, we can explain that for the falsification of a strict universal hypothesis, it is important to have no counterexample. Now, for the confirmation of a major premise that is not a strict universal hypothesis, we can explain that it is important to have fewer counterexamples. Therefore, confirmation measures F, b*, and c* are compatible with Popper's falsification thought.
Scheffler and Goodman [61] proposed selective confirmation based on Popper's falsification thought. They believe that black ravens (whose number is a) support "ravens are black", because black ravens undermine "ravens are not black". Their reason why non-black ravens (whose number is c) support "ravens are not black" is that non-black ravens undermine the opposite hypothesis "ravens are black". Their explanation is significant. However, they did not provide the corresponding confirmation measure. Measure c*(e1→h1) is what they need. Now, we can find that it is confirmation that allows us not to give up a falsified major premise and to keep it with an optimized degree of belief for obtaining more information.
More discussions about confirmation can be found in [58].

Viewing Induction from the New Perspective
From the perspective of statistical learning, induction includes:
(1) induction for probability predictions: to optimize likelihood functions with sampling distributions;
(2) induction for the semantic meanings or the extensions of labels: to optimize truth functions with sampling distributions; and
(3) induction for the degrees of confirmation of major premises: to optimize the degrees of belief in major premises with the proportions of positive examples and counterexamples after classifications.
From the above perspective, induction includes confirmation, and confirmation is part of induction. Only rules or if-then statements with two predicates need confirmation.
The above inductive methods also include the selection or conjecture of predictive models (likelihood functions and truth functions) with the semantic information criterion. The iterative algorithms for mixture models and maximum mutual information classifications [12] reflect repeated conjectures and refutations. Therefore, these inductive methods are compatible with Popper's thought that scientific progress comes from conjectures and refutations.

The Different Forms of Bayesian Reasoning as Syllogisms
According to the above analyses, Bayesian reasoning has different forms, as shown in Table 4. The interpretations in blue words indicate contents newly added because of the use of the P-T probability framework. The table covers, among others, Bayes' Theorem I (between two sets, where θ' is the complement of θ), Bayes' Theorem II (Bayes' prediction), and reasoning between an instance and a set (or model). For example, the row for Bayes' prediction reads: from the minor premise X = xi, the major premise P(yj|x) (a conditional SP), and the prior P(x), the consequence is P(x|yj) = P(x)P(yj|x)/P(yj), where P(yj) = ∑i P(yj|xi)P(xi).

It is important that the reasoning with truth functions or degrees of confirmation is compatible with statistical reasoning (using Bayes' Theorem II). When T*(θj|x)∝P(yj|x) or T(θej|h)∝P(ej|h), the consequences are the same as those from classical statistics.
On the other hand, the fuzzy syllogisms are compatible with the classical syllogism, because when b1* = 1 or c1* = 1, for given e1, the consequence is P(h1) = 1 and P(h0) = 0. However, the above fuzzy syllogisms have their limitations. The syllogism with b*(e1→h1) is the generalization of a classical syllogism: from the major premise with degree of confirmation b*(e1→h1) = b1* and the minor premise e = e1, we obtain the consequence P(h1|θe1) (see [58] for the formula).
When b1* = 1, if the minor premise becomes "x is in E2, and E2 is included in E1", then this syllogism becomes Barbara (AAA-1) [63], and the consequence is P(h1|θe1) = 1. Hence, we can use the above fuzzy syllogism as the fuzzy Barbara.
As the Equivalence Condition holds for neither channel confirmation nor prediction confirmation, if the minor premise is h = h0, we can only use a converse confirmation measure b*(h0→e0) or c*(h0→e0) as the major premise to obtain the consequence (see [58] for details).

Fuzzy Logic: Expectations and Problems
Truth functions are also membership functions. Fuzzy logic can simplify the statistics needed to obtain membership functions and truth functions.
Suppose that the membership functions of three fuzzy sets A, B, and C are a(x), b(x), and c(x), respectively. Let the logical expression of A, B, and C be F(A, B, C) with the three operators ∩, ∪, and ᶜ, and let the corresponding expression of the three truth functions be f(a(x), b(x), c(x)) with the three operators ∧, ∨, and ¬. The operators ∩ and ∧ can be omitted. There are 2^8 = 256 different expressions with A, B, and C. To simplify statistics, we expect

T(F(A, B, C)|x) = f(a(x), b(x), c(x)).

If the above equation is tenable, we only need to optimize three truth functions a(x), b(x), and c(x) to calculate the truth functions of the various expressions F(A, B, C). Further, we expect that the fuzzy logic in f(...) is compatible with Boolean algebra and Kolmogorov's probability system. However, in general, the logical operations of membership functions do not follow Boolean algebra. Zadeh's fuzzy logic is defined by [19]:

a(x)∧b(x) = min(a(x), b(x)), a(x)∨b(x) = max(a(x), b(x)), ¬a(x) = 1 − a(x).

According to this definition, the law of complementarity does not hold (e.g., A∩Aᶜ ≠ ϕ and A∪Aᶜ ≠ X), since a(x)∧¬a(x) = min(a(x), 1 − a(x)) is not always 0, and a(x)∨¬a(x) = max(a(x), 1 − a(x)) is not always 1.

There are also other definitions. To consider the correlation between the two predicates "x is in A" and "x is in B", Wang et al. [64] define:

a(x)∧b(x) = min(a(x), b(x)) and a(x)∨b(x) = max(a(x), b(x)) if the two predicates are positively correlated;
a(x)∧b(x) = max(0, a(x) + b(x) − 1) and a(x)∨b(x) = min(1, a(x) + b(x)) if they are negatively correlated.

If two predicates are always positively correlated, the above operations become Zadeh's operations. According to this definition, since A and Aᶜ are negatively correlated, there are a(x)∧¬a(x) = max(0, a(x) + 1 − a(x) − 1) ≡ 0 and a(x)∨¬a(x) = min(1, a(x) + 1 − a(x)) ≡ 1.
Thus, the law of complementarity is tenable.
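The two definitions can be contrasted in a small sketch. The function names below are assumptions of this illustration; they implement Zadeh's min/max operators and the bounded operators that Wang et al. prescribe for negatively correlated predicates, then check the law of complementarity at a single point.

```python
# Zadeh's fuzzy operators.
def zadeh_and(u, v): return min(u, v)
def zadeh_or(u, v):  return max(u, v)
def f_not(u):        return 1.0 - u

# For negatively correlated predicates (such as "x is in A" and "x is in A^c"),
# Wang et al. use bounded (Lukasiewicz-style) operations instead.
def neg_and(u, v): return max(0.0, u + v - 1.0)
def neg_or(u, v):  return min(1.0, u + v)

u = 0.3  # a(x) = 0.3 at some x
# Zadeh: the law of complementarity fails.
print(zadeh_and(u, f_not(u)))  # 0.3, not 0
print(zadeh_or(u, f_not(u)))   # 0.7, not 1
# Correlation-aware operators: complementarity holds.
print(neg_and(u, f_not(u)))    # 0.0
print(neg_or(u, f_not(u)))     # 1.0
```

The design point is that A and Aᶜ are always negatively correlated, so the bounded operators restore a(x)∧¬a(x) ≡ 0 and a(x)∨¬a(x) ≡ 1 for every x.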
To build a symmetrical model of color vision in the 1980s, I [65,66] proposed the fuzzy quasi-Boolean algebra on X, which is defined with the Boolean algebra on the cross-set [0,1]×X. This model of color vision is explained as a fuzzy 3-8 decoder and hence is called the decoding model (see Appendix E). Its practicability is supported by the fact that the International Commission on Illumination (CIE) recommended a symmetrical model, which is almost the same as mine, for color transforms in 2006 [67]. Using the decoding model, we can easily explain color blindness and color evolution by splitting or merging three sensitivity curves [68].
We can use the fuzzy quasi-Boolean algebra to simplify statistics for the truth functions of natural languages. In these cases, we need to distinguish atomic labels and compound labels and assume that the random sets for the extensions of these atomic labels become wider or narrower at the same time.
However, the correlation between different labels or statements is often complicated. When we choose a set of operators of fuzzy logic, we need to balance reasonability and feasibility.

How the P-T Probability Framework has Been Tested by Its Applications to Theories
The P-T Probability Framework includes statistical probabilities and logical probabilities. Logical probabilities include truth functions, which are also membership functions and can reflect the extensions of hypotheses or labels. Using truth function T(θj|x) and prior distribution P(x), we can produce likelihood function P(x|θj) and train T(θj|x) and P(x|θj) with sampling distributions P(x|yj). Then, we can let a machine reason like the human brain with the extensions of concepts.
By the semantic information methods with this framework, we can use logical probabilities to express semantic information: I(xi; θj) = log[T(θj|xi)/T(θj)]. We can also use the likelihood function to express predictive information, because I(xi; θj) = log[T(θj|xi)/T(θj)] = log[P(xi|θj)/P(xi)]. When we calculate the average semantic information I(X; θj) and I(X; Θ), we also need statistical probabilities, such as P(x|yj), to express sampling distributions.
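The identity above can be verified numerically. The sketch below uses made-up values (an assumed prior over ages 0–99 and a Gaussian truth function for a label like "middle aged") together with the semantic Bayes formula P(x|θj) = P(x)T(θj|x)/T(θj), where T(θj) = ∑x P(x)T(θj|x); all parameter choices are assumptions of this sketch.

```python
import math

xs = list(range(100))                                # instance space: ages 0..99
w = [math.exp(-((x - 30) / 25.0) ** 2) for x in xs]
P = [wi / sum(w) for wi in w]                        # assumed prior P(x)

T = [math.exp(-((x - 45) / 10.0) ** 2) for x in xs]  # truth function T(theta|x)
T_theta = sum(P[i] * T[i] for i in xs)               # logical probability T(theta)
Px_theta = [P[i] * T[i] / T_theta for i in xs]       # semantic Bayes: P(x|theta)

xi = 50
I_logical = math.log2(T[xi] / T_theta)               # log[T(theta|xi)/T(theta)]
I_predictive = math.log2(Px_theta[xi] / P[xi])       # log[P(xi|theta)/P(xi)]
print(abs(I_logical - I_predictive) < 1e-9)          # the two forms coincide
```

Since P(x|θj)/P(x) = T(θj|x)/T(θj) by construction, the semantic information of a single instance equals the predictive information, which is the bridge between the logical and statistical readings of I(xi; θj).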
In the popular methods of statistical learning, only statistical probabilities, including subjective probabilities, are used. For binary classification, we can use a pair of logistic functions. However, for multi-label learning and classification, there is no simple method [40]. With (fuzzy) truth functions, we can now easily solve for the extensions of multiple labels for classification by Logical Bayesian Inference [12]. Using Shannon channels with statistical probabilities and semantic channels with logical probabilities, we can let the two channels mutually match to achieve maximum mutual information classifications and to speed up the convergence of mixture models with better convergence proofs [12]. Section 3.3 shows that the semantic Bayes formula with the truth function and the logical probability has been used in rate-distortion theory and statistical mechanics. Therefore, the P-T probability framework can be used not only for communication or epistemology but also for control or ontology.
For the evaluation of scientific theories and hypotheses, with logical probabilities, we can use semantic information I(x; θj) as verisimilitude and testing severity, both of which can be mutually converted. With statistical probabilities, we can express how sampling distributions test hypotheses.
For confirmation, using truth functions, we can conveniently express the degrees of belief in major premises. With statistical probabilities, we can derive the degrees of confirmation, e.g., the degrees of belief optimized by sampling distributions.
For Bayesian reasoning, with the P-T probability framework, we can have more forms of reasoning (see Table 4), such as:
(1) two types of reasoning with Bayes' Theorem III;
(2) Logical Bayesian Inference from sampling distributions to optimized truth functions; and
(3) fuzzy syllogisms with the degrees of confirmation of major premises.
In short, with the P-T probability framework, we can resolve many problems more conveniently.

How to Extend Logic to the Probability Theory?
Jaynes [16] concludes that probability theory is the extension of logic. This conclusion is a little different from the claim that probability is the extension of logic; the former includes the latter. The former emphasizes that probability is an important tool for scientific reasoning, which includes Bayesian reasoning, maximum likelihood estimation, the maximum entropy method, and so on. Although Jaynes interprets probabilities with average truth values ([16], p. 52), this interpretation is not the extension. When an urn contains N balls with only two colors, white and red, we may use truth value Ri to represent a red ball on the ith draw and its negation ¬Ri = Wi to represent a white ball on the ith draw. However, if the balls have more colors, this interpretation will not sound so good. In this case, the frequentist interpretation is simpler.
Zadeh's fuzzy set theory is an essential extension of classical logic toward probability theory by way of logical probability. Using (fuzzy) truth functions or membership functions, we can let a machine learn the human brain's reasoning with fuzzy extensions. This reasoning can be an essential supplement to the probabilistic reasoning that Jaynes summarized.
This task is difficult, because it is hard to avoid contradictions between probability logic and statistical reasoning. According to Kolmogorov's axiomatic system, we have

T(A∪B) = T(A) + T(B) − T(A∩B). (52)
However, we cannot obtain T(A∩B) from T(A) and T(B), because T(A∩B) is related to P(x) and to the correlation between the two predicates "x is in A" and "x is in B". The operators "∧" and "∨" can only be used for the operations of truth values instead of logical probabilities. For example, A is the set {youths}, and B is the set {adults}. Suppose that when the people include high school students and their parents, the prior distribution is P1(x); when the people include high school students and soldiers, the prior distribution is P2(x). T1(A) is obtained from P1(x) and T(A|x), T2(A) from P2(x) and T(A|x), and so on. T1(A∩B) should be much smaller than T2(A∩B), even if T1(A) = T2(A) and T1(B) = T2(B). A better method for T(A∩B) than Equation (52) is first to obtain the truth function, such as T(A∩B|x) = T(A|x)∧T(B|x). Then we have T(A∩B) = ∑x P(x)T(A∩B|x). Similarly, to obtain T(F(A, B, …)), we should first obtain the truth function f(T(A|x), T(B|x), …) using fuzzy logic operations.
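The youth/adult example can be sketched numerically. The crisp truth functions and the two priors below are assumptions of this illustration (they do not arrange T1(A) = T2(A) exactly); the point shown is that T(A∩B) = ∑x P(x)T(A∩B|x) depends on the prior, so it cannot be computed from T(A) and T(B) alone.

```python
# Ages 0..99; A = {youths}, B = {adults}; crisp truth functions for simplicity.
xs = range(100)
def T_A(x): return 1.0 if 15 <= x <= 25 else 0.0   # "youth" (assumed extension)
def T_B(x): return 1.0 if x >= 18 else 0.0          # "adult" (assumed extension)

def logical_prob(P, T):                # T(A) = sum_x P(x) T(A|x)
    return sum(P[x] * T(x) for x in xs)

def meet(x):                           # T(A∩B|x) = T(A|x) ∧ T(B|x)
    return min(T_A(x), T_B(x))

# P1: high school students and their parents (youths here are mostly minors).
P1 = [0.0] * 100
for x in range(15, 18): P1[x] += 0.5 / 3            # students aged 15-17
for x in range(40, 50): P1[x] += 0.5 / 10           # parents aged 40-49
# P2: high school students and soldiers (many youths are adults).
P2 = [0.0] * 100
for x in range(15, 18): P2[x] += 0.5 / 3            # students aged 15-17
for x in range(18, 26): P2[x] += 0.5 / 8            # soldiers aged 18-25

print(logical_prob(P1, meet))   # 0.0: nobody is both youth and adult under P1
print(logical_prob(P2, meet))   # 0.5: the soldiers are both under P2
```

Under P1, T(A∩B) = 0, while under P2 it is 0.5, even though the truth functions T(A|x) and T(B|x) are identical in both cases: the logical probability of a compound label is prior-dependent.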
Various forms of Bayesian reasoning in Table 4 for the extension of the classical logic are compatible with statistical reasoning. For example, Bayes' Theorem III is compatible with Bayes' Theorem II. If we use a probability logic for a fuzzy or uncertain syllogism, we had better check if the reasoning is compatible with statistical reasoning.
In the two types of probability logic proposed by Reichenbach [9] and Adams [26], there is P(p => q) = 1 − P(p) + P(pq) as the extension of p => q (p implies q) in mathematical logic, where p and q are two proposition sequences or predicates. My extension is different: I used confirmation measures b*(e1→h1) and c*(e1→h1), which may be negative, to represent the extended major premises. Most of the extended syllogisms in Table 4 are related to Bayes' formulas. The measure c*(p→q) is a function of P(q|p) (see Equation (44)); they are compatible. It should also be reasonable to use P(q|p) (where p and q are two predicates) as the measure for assessing a fuzzy major premise. However, P(q|p) and P(p => q) are different. We can prove

P(q|p) ≤ P(p => q) = 1 − P(p) + P(pq). (55)
Equation (55) holds because P(q|p) = P(pq)/[P(pq) + P(p¬q)] ≤ 1 − P(p¬q) = 1 − P(p) + P(pq), since P(pq) ≤ 1 − P(p¬q) and u/(u + v) increases with u. However, no matter how many examples support p => q, the degree of belief in "if p then q" cannot be 1, as pointed out by Hume and Popper [7]. When counterexamples exist, p => q becomes P(p => q) = 1 − P(p) + P(pq), which is also not suitable as the measure for assessing fuzzy major premises.
In addition, it is much simpler to obtain P(q) or P(pq) from P(q|p) and P(p) = 1 than from P(p => q) and P(p) = 1 [9,26]. If P(p) < 1, according to statistical reasoning, we also need P(q|¬p) to obtain P(q) = P(p)P(q|p) + [1 − P(p)]P(q|¬p). Using a probability logic, we should be careful about whether the results are compatible with statistical reasoning.
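Inequality (55) can be spot-checked by simulation. The sketch below, with assumed variable names, samples random joint distributions over the four cells of (p, q) and verifies P(q|p) ≤ 1 − P(p) + P(pq) in every trial.

```python
import random

random.seed(0)
for _ in range(1000):
    # Random joint distribution over the cells pq, p~q, ~pq, ~p~q.
    cells = [random.random() for _ in range(4)]
    s = sum(cells)
    p_q, p_nq, np_q, np_nq = [ci / s for ci in cells]

    P_p = p_q + p_nq                  # P(p)
    P_q_given_p = p_q / P_p           # P(q|p) = P(pq)/P(p)
    P_implies = 1.0 - P_p + p_q       # P(p => q) = 1 - P(p) + P(pq)
    assert P_q_given_p <= P_implies + 1e-12
print("inequality held in all 1000 random trials")
```

The check also makes the gap visible: P(p => q) is typically much larger than P(q|p) when P(p) is small, which is exactly why the material-implication probability is a poor measure for assessing a fuzzy major premise.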

Comparing the Truth Function with Fisher's Inverse Probability Function
In the P-T probability framework, the truth function plays an important role. It is called Likelihood Inference to use the likelihood function P(x|θj) as the inference tool (where θj is a constant). It is called Bayesian Inference to use Bayesian posterior P(θ|D) as the inference tool (where P(θ|D) means parameters' posterior distribution for given data or a sample). It is called the Logical Bayesian Inference to use the truth function T(θj|x) as the inference tool [12].
The inverse probability function (IPF) P(θj|x) can make good use of the prior knowledge P(x). When P(x) is changed, P(θj|x) can still be used for probability predictions. However, why did Fisher and other researchers give up P(θj|x) as the inference tool?
When n = 2, we can easily construct P(θj|x), j = 1, 2, with parameters. For instance, we can use a pair of logistic functions as the IPFs. Unfortunately, when n > 2, it is hard to construct P(θj|x), j = 1, 2, …, n, because there is the normalization limitation ∑j P(θj|x) = 1 for every x. That is why a multi-class or multi-label classification is often converted into several binary classifications [40]. P(θj|x) and P(yj|x) as predictive models also have a serious disadvantage: in many cases, we can only know P(x) and P(x|yj) without knowing P(θj) or P(yj), so that we cannot obtain P(yj|x) or P(θj|x). Nevertheless, we can obtain the truth function T*(θj|x) in these cases. Truth functions have no normalization limitation, and hence it is easy to construct a group of truth functions and train them with P(x) and P(x|yj), j = 1, 2, …, n, without P(yj) or P(θj).
We summarize that the truth function has the following advantages:
(1) We can use an optimized truth function T*(θj|x) to make probability predictions for different P(x), as well as we can with P(yj|x) or P(θj|x).
(2) We can train a truth function with parameters on a sample of small size, as well as we can train a likelihood function.
(3) With the truth function, we can indicate the semantic meaning of a hypothesis or the extension of a label. It is also the membership function, which is suitable for classification.
(4) To train a truth function T(θj|x), we only need P(x) and P(x|yj), without needing P(yj) or P(θj).
(5) Letting T*(θj|x)∝P(yj|x), we can bridge statistics and logic.
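These advantages can be illustrated with a tiny sketch of training a truth function from P(x) and P(x|yj) alone, in the spirit of the Logical Bayesian Inference of [12]: since P(yj|x) ∝ P(x|yj)/P(x), setting T*(θj|x) ∝ P(x|yj)/P(x) with maximum value 1 needs neither P(yj) nor any normalization over j. All numbers below are made up for illustration.

```python
xs = range(5)                                   # a tiny instance space
P_x    = [0.1, 0.2, 0.4, 0.2, 0.1]              # assumed prior P(x)
P_x_yj = [0.02, 0.08, 0.30, 0.35, 0.25]         # assumed sampling distribution P(x|yj)

# Train the truth function: T*(theta_j|x) = [P(x|yj)/P(x)] / max_x [P(x|yj)/P(x)].
ratio = [P_x_yj[x] / P_x[x] for x in xs]
T_star = [r / max(ratio) for r in ratio]        # truth function with maximum 1

# The optimized truth function ports to a NEW prior P'(x): the semantic
# Bayes formula yields a likelihood P(x|theta_j) without any retraining.
P_x_new = [0.3, 0.3, 0.2, 0.1, 0.1]
T_theta = sum(P_x_new[x] * T_star[x] for x in xs)      # logical probability
P_pred = [P_x_new[x] * T_star[x] / T_theta for x in xs]

print([round(t, 3) for t in T_star])            # e.g. [0.08, 0.16, 0.3, 0.7, 1.0]
print(round(sum(P_pred), 3))                    # 1.0: a proper distribution
```

Because each T*(θj|x) is trained independently (no ∑j constraint), a group of such truth functions for many labels can be built one at a time, which is exactly the property the IPF P(θj|x) lacks.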

Answers to Some Questions
If probability theory is the logic of science, may any scientific statement be expressed by a probabilistic or uncertain statement? The P-T probability framework supports Jaynes' point of view; the answer is yes. The Gaussian truth function can be used to express a physical quantity with a small uncertainty, just as it is used to express the semantic meaning of a GPS pointer. The semantic information measure can be used to evaluate the result predicted by a physical formula. Guo [69] studied the operations of fuzzy numbers and concluded that the expectation of a function of fuzzy numbers is equal to the function of the expectations of the fuzzy numbers; for example, (fuzzy 3)*(fuzzy 5) = (fuzzy 15) = (fuzzy 3*5). According to this conclusion, probabilistic or uncertain statements may be compatible with certain statements about physical laws.
Regarding the question of whether the existing probability theory, the P-T probability framework, and the G theory are scientific theories and whether they can be tested by empirical facts: my answer is that they are not scientific theories tested by empirical facts but tools for scientific theories and applications. They can be tested by their abilities to resolve problems within scientific theories and applications.
Regarding the question of whether there are identifiable logical constraints for assessing scientific theories: I believe that a scientific theory needs not only self-consistency, but also compatibility with other accepted scientific theories. Similarly, a unified theory of probability and logic as a scientific tool should also be compatible with other accepted mathematical theories, such as statistics. For this reason, in the P-T probability framework, an optimized truth function (or a semantic channel) is equivalent to a TPF (or a Shannon channel) when we use them for probability predictions. Additionally, the semantic mutual information has its upper limit: Shannon's mutual information; the fuzzy logic had better be compatible with Boolean algebra.

Some Issues that Need Further Studies
Many researchers are talking about interpretable AI, but the problem of meaning is still very much with us today [70]. The P-T probability framework should be helpful for interpretable AI. The reasons are:
(1) The human brain thinks using the extensions (or denotations) of concepts more than interdependencies. A truth function indicates the (fuzzy) extension of a label and reflects the semantic meaning of the label; Bayes' Theorem III expresses the reasoning with the extension.
(2) The new confirmation methods and the fuzzy syllogisms can express the induction and the reasoning with degrees of belief that the human brain uses, and this reasoning is compatible with statistical reasoning.
(3) The Boltzmann distribution has been applied to the Boltzmann machine [71] for machine learning. With the help of the semantic Bayes formula and the semantic information methods, we can better understand this distribution and the Regularized Least Squares criterion related to information.
However, it is still not easy to interpret neural networks with semantic information for machine learning. Can we interpret a neural network as a semantic channel that consists of a set of truth functions? Can we apply the Channels' Matching algorithm [12] to neural networks for maximum mutual information classifications? These issues need further studies.
This paper provides a bridge between statistics and logic. It should be helpful for the semantic dictionary based on statistics that AI needs. However, there is much work to do, such as to design and optimize the truth functions of terms in natural languages, such as "elder", "heavy rain", "normal body temperature", and so on. The difficulty is that the extensions of these terms change from area to area. For example, the extension of "elder" depends on the life span of people in the area; the extensions on rainfalls of "heavy rain" in coastal areas and in desert areas are different. These extensions are related to prior distribution P(x). For unifying logical methods and statistical methods for AI, the efforts of more people are needed.
When samples are not big enough, the degrees of confirmation we obtain are not reliable. In these cases, we need to combine hypothesis-testing theory to replace the degrees of confirmation with degree intervals of confirmation; this needs further study. This paper only extends a primary syllogism with the major premise "if e = e1 then h = h1" to some effective fuzzy syllogisms. It will be complicated to extend more syllogisms [63]; this extension also needs further study.
We may use truth functions as DCFs and the generalized information/entropy measures for random events' control and statistical mechanics (see Section 3.3). We need further studies for practical results.

Conclusions
As pointed out by Popper, a hypothesis has both a statistical probability and a logical probability. This paper has proposed the P-T probability framework, which uses "P" to denote statistical probability and "T" to denote logical probability. In this framework, the truth function of a hypothesis is equivalent to the membership function of a fuzzy set. Using the new Bayes theorem (Bayes' Theorem III), we can convert a likelihood function and a truth function from one to another, so that we can use sampling distributions to train truth functions. The maximum semantic information criterion used is equivalent to the maximum likelihood criterion. Statistics and logic are hence connected.
I have introduced how the P-T probability framework is used for the semantic information G theory or the G theory. The G theory is a natural generalization of Shannon's information theory. It can be used to improve statistical learning and explain the relationship between information and thermodynamic entropy, in which the minimum mutual information distribution is equivalent to the maximum entropy distribution.
I have shown how the P-T probability framework and the G theory support Popper's thought about scientific progress, hypothesis evaluation, and falsification. The semantic information measure can reflect Popper's testing severity and verisimilitude. The semantic information approach about verisimilitude can reconcile the content approach and the likeness approach [52].
I have shown how to use the semantic information measure to derive channel confirmation measure b* and prediction confirmation measure c*. Measure b* is compatible with the likelihood ratio and ranges between −1 and 1. It can be used to assess medical tests, signal detections, and classifications. Measure c* can be used to assess probability predictions and to clarify the Raven Paradox. Both confirmation measures are compatible with Popper's falsification thought.
I have provided several different forms of Bayesian reasoning, including fuzzy syllogisms with confirmation measures b* and c*. I have introduced a fuzzy logic that was used to set up a symmetrical model of color vision. This fuzzy logic is compatible with Boolean algebra, and hence compatible with the classical logic.
The above theoretical applications of the P-T probability framework illustrate its reasonability and practicability. We should be able to find the wider applications of the P-T probability framework. However, to combine the logical and statistical methods for AI, there is still much work to do. In order to apply the P-T probability framework and the G theory to deep learning with neural networks and to random events' control for practical results, we need further studies.
Funding: This research received no external funding.
Acknowledgments: I thank the editors of Philosophies and its special issue Science and Logic for giving me a chance to introduce this study systematically. I also thank the two reviewers for their comments, which vastly improve this paper. I should also thank people who influenced my destiny so that I have technical and theoretical research experience for this article.  where N is the number of particles. For a local equilibrium system, where Tj is the temperature of the jth area (yj), and Nj is the number of particles in the jth area. We now consider minimum mutual information R(Θ) for given distribution constraint functions T(θj|x) = exp[−ei/(kTj)] (j = 1, 2, …). The logical probability of yj is T(θj) = Zj/G, and the statistical probability is P(yj) = Nj/N. From Appendix B and the above equation for S, we derive where ej = Ej/Nj is the average energy of a particle in the j-th area.

Appendix E. Illustrating the Fuzzy Logic in the Decoding Model of Color Vision

From the cones' responses b, g, and r, we can use fuzzy decoding to produce the eight outputs as mental color signals, as shown in Figure A11.

Proof of Equation (55): because P(p) = P(pq) + P(p¬q) ≤ 1, we have P(pq) ≤ 1 − P(p¬q), and thus P(q|p) = P(pq)/[P(pq) + P(p¬q)] ≤ [1 − P(p¬q)]/[1 − P(p¬q) + P(p¬q)] = 1 − P(p¬q) = 1 − P(p) + P(pq) = P(p => q).