Causal Confirmation Measures: From Simpson's Paradox to COVID-19

When we compare the influences of two causes on an outcome, if the conclusion from every group contradicts the conclusion from the aggregated data, we say that Simpson's Paradox arises. The Existing Causal Inference Theory (ECIT) can make the overall conclusion consistent with the grouping conclusion by removing the confounder's influence, thereby eliminating the paradox. The ECIT uses the relative risk difference Pd = max(0, (R - 1)/R) (where R denotes the risk ratio) as the probability of causation. In contrast, the philosopher Fitelson uses the confirmation measure D (posterior probability minus prior probability) to measure the strength of causation. Fitelson concludes that, from the perspective of Bayesian confirmation, we should directly accept the overall conclusion without considering the paradox. The author previously proposed a Bayesian confirmation measure b* similar to Pd. To overcome the contradiction between the ECIT and Bayesian confirmation, the author uses the semantic information method with the minimum cross-entropy criterion to deduce the causal confirmation measure Cc = (R - 1)/max(R, 1). Cc is like Pd but possesses the normalizing property (it ranges between -1 and 1) and cause symmetry. It especially fits cases where a cause restrains an outcome, such as a COVID-19 vaccine controlling infection. Some examples (about kidney stone treatments and COVID-19) reveal that Pd and Cc are more reasonable than D, and that Cc is more useful than Pd.


Introduction
Causal confirmation is the expansion of Bayesian confirmation. It is also a task of causal inference. The Existing Causal Inference Theory (ECIT), including Rubin's potential outcomes model [1,2] and Pearl's causal graph [3,4], has achieved great success. But causal confirmation is rarely mentioned.
Bayesian confirmation theories are also called confirmation theories; they can be divided into the incremental and inductive schools. The incremental school affirms that confirmation measures the strength of support that evidence e gives to hypothesis h, as explained by Fitelson [5]. Following Carnap [6], the incremental school's researchers often use the increment of a hypothesis' probability or logical probability, P(h|e) - P(h), as a confirmation measure. Fitelson discussed causal confirmation with this measure and obtained some conclusions incompatible with the ECIT [5]. On the other hand, the inductive school [7,8] considers confirmation as induction's modern form, whose task is to measure a major premise's credibility supported by a sample or sampling distribution.
A confirmation measure is often denoted by C(e, h) or C(h, e). The author (of this paper) agrees with the inductive school and suggests using C(e→h) to express a confirmation measure so that the task is clear [8]. In this paper, we use "x=>y" to denote "Cause x leads to outcome y." Although the two schools understand confirmation differently, both use the sampling distribution P(e, h) to construct confirmation measures. There have been many confirmation measures [8,9]. Most researchers agree that an ideal confirmation measure should have the following two desired properties:
- the normalizing property [10,11], which means C(e, h) should range between -1 and 1 so that the distance between a rule e→h and the best or the worst rule is clear;
- hypothesis symmetry [12] or consequent symmetry [8], which means C(e1→h1) = -C(e1→h0); for example, C(raven→black) = -C(raven→non-black).
The author in [8] distinguished channels' confirmation and predictions' confirmation and provided channels' confirmation measure b*(e→ h) and predictions' confirmation measure c*(e→h). Both have the above two desired properties and can be used for the probability predictions of h according to e.
Bayesian confirmation confirms associated relationships, which are different from causal relationships. Association includes causality, but many associated relationships are not causal relationships. One reason is that the existence of association is symmetrical (if P(h|e) ≠ 0, then P(e|h) ≠ 0), whereas the existence of causality is asymmetrical. For example, in medical tests, P(positive|infected) reflects both association and causality. But inversely, P(infected|positive) only indicates association. Another reason is that two associated events, A and B, such as strong sales of electric fans and strong sales of air conditioners, may both be outcomes caused by a third event (hot weather). Neither P(A|B) nor P(B|A) indicates causality.
Causal inference only deals with uncertain causal relationships in nature and human society without considering those in mathematics, such as (x+1)(x-1) < x^2 because (x+1)(x-1) = x^2 - 1. We know that Kant distinguishes between analytic judgments and synthetic judgments. Although causal inference is a mathematical method, it is used for synthetic judgments to obtain uncertain rules in biology, psychology, economics, etc. In addition, causal confirmation only deals with binary causality.
Although causal confirmation was rarely mentioned in the ECIT, researchers in causal inference and epidemiology have provided many measures (without using the term "confirmation measure") to indicate the strength of causation. These measures include the Risk Difference:

RD = P(y1|x1) - P(y1|x0),    (1)

the Relative Risk (RR), i.e., the risk ratio (like the likelihood ratio for medical tests):

RR = P(y1|x1) / P(y1|x0),    (2)

and the probability of causation Pd (used by Rubin and Greenland [12]) or the probability of necessity PN (used by Pearl [3]):

Pd = max(0, (RR - 1)/RR).    (3)

Pd is also called the Relative Risk Reduction (RRR) [12]. In the above formula, max(0, ·) means the minimum is 0; this function makes Pd more like a probability. Measure b*, proposed by the author before [8], is like Pd, but b* ranges between -1 and 1. The above risk measures can measure not only risk or relative risk but also success or relative success raised by the cause. The risk measures in Equations (1)-(3) are significant; however, they do not possess the two desired properties and hence are improper as causal confirmation measures.
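The three risk measures above can be sketched in a few lines. This is an illustrative computation from a hypothetical 2x2 table of counts; the numbers are made up for demonstration.

```python
# Computing the risk measures RD (Eq. 1), RR (Eq. 2), and Pd (Eq. 3)
# from counts of outcome y1 under cause x1 and baseline x0.

def risk_measures(n_y1_x1, n_x1, n_y1_x0, n_x0):
    """Return (RD, RR, Pd) for exposure x1 vs. baseline x0."""
    p1 = n_y1_x1 / n_x1            # P(y1|x1)
    p0 = n_y1_x0 / n_x0            # P(y1|x0)
    rd = p1 - p0                   # risk difference
    rr = p1 / p0                   # risk ratio
    pd = max(0.0, (rr - 1) / rr)   # probability of causation
    return rd, rr, pd

# Hypothetical data: 30 of 100 exposed and 10 of 100 unexposed show y1.
rd, rr, pd = risk_measures(30, 100, 10, 100)
print(rd, rr, pd)
```

Note that Pd clips to 0 whenever RR < 1, which is exactly the limitation discussed later for causes that restrain an outcome.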
We will encounter Simpson's Paradox if we only use sampling distributions for the above measures. Simpson's paradox has been accompanying the study of causal inference, as the Raven Paradox has been going with the study of Bayesian confirmation. Simpson proposed the paradox [13] using the following example.
Example 1 [14]. The admission data of the graduate school of the University of California, Berkeley (UCB) for the fall of 1973 showed that 44% of male applicants were accepted, whereas only 35% of female applicants were accepted. There was probably a gender bias. However, in most departments, female applicants' acceptance rates were higher than male applicants' rates.
Was there a gender bias? Should we accept the overall conclusion or the grouping conclusion (i.e., the one from every department)? If we take the overall conclusion, we can think that the admission was biased against female applicants. On the other hand, if we accept the grouping conclusion, we can say that the female applicants were preferentially accepted. Therefore, we say there exists a paradox.
Example 1 is a little complicated and easily raises arguments. To simplify the problem, we use Example 2, often mentioned by causal inference researchers, to explain Simpson's Paradox quantitatively.
We use x1 to denote a new cause (or treatment) and x0 to denote a default cause or no cause. If we need to compare two causes, we may use x1 and x2, or xi and xj, to represent them. In these cases, we may assume that one is default like x0.
Example 2 [15,16]. Suppose there are two treatments, x1 and x2, for patients with kidney stones. Patients are divided into two groups according to their stones' sizes: Group g1 includes patients with small stones, and Group g2 those with large ones. Outcome y1 means the treatment succeeds. The success rates shown in Figure 1 are possible. In each group, the success rate of x2, P(y1|x2, g), is higher than that of x1, P(y1|x1, g); however, using the method of finding the center of gravity, we can see that the overall success rate of x2, P(y1|x2) = 0.65, is lower than that of x1, P(y1|x1) = 0.7.
According to Rubin's potential outcomes model [1], we should accept the grouping conclusion: x2 is better than x1. The reason is that the stones' size is a confounder, and the overall conclusion is affected by the confounder. We should eliminate this influence. The method is to imagine that the patients' numbers in each group are unchanged whether we use x1 or x2. Then we replace the weighting coefficients P(gi|x1) and P(gi|x2) with P(gi) (i = 1, 2) to obtain two new overall success rates. Rubin [1] expresses them as P(y1_x1) and P(y1_x2), whereas Pearl [3] expresses them as P(y1|do(x1)) and P(y1|do(x2)). Then, the overall conclusion is consistent with the grouping conclusion.
Should we always accept the grouping conclusion when the two conclusions are inconsistent? Not necessarily! Example 3 is a counterexample.
Example 3 (from [17]). Treatment x1 denotes taking a kind of antihypertensive drug, and treatment x0 means taking nothing. Outcome y1 denotes recovering health, and y0 means not. Patients are divided into group g1 (with high blood pressure) and group g0 (with low blood pressure). It is very possible that in each group g, P(y1|g, x1) < P(y1|g, x0) (which means x0 is better than x1); whereas overall result is P(y1|x1) > P(y1|x0) (which means x1 is better than x0).
The ECIT tells us that we should accept the overall conclusion that x1 is better than x0 because blood pressure is a mediator, which is also affected by x1. We expect that x1 can move a patient from g1 to g0; hence we need not change the weighting coefficients from P(g|x) to P(g). The grouping conclusion, P(y1|g, x1) < P(y1|g, x0), exists because the drug has a side effect.
There are also some examples where the grouping conclusion is acceptable from one perspective, and the overall conclusion is acceptable from another.
Example 4 [18]. The United States statistical data about COVID-19 in June 2020 show that COVID-19 led to a higher Case Fatality Rate (CFR) among Non-Hispanic whites than among others (the overall conclusion). Only 35.3% of the infected people were Non-Hispanic whites, whereas 49.5% of the infected people who died from COVID-19 were Non-Hispanic whites. It seems that COVID-19 is more dangerous to Non-Hispanic whites. But Dana Mackenzie points out [18] that we obtain the opposite conclusion from every age group because the CFR of Non-Hispanic whites is lower than that of other people in every age group. So, Simpson's Paradox exists. The reason is that Non-Hispanic whites have longer lifespans and a relatively large proportion of elderly people, while COVID-19 is more dangerous to the elderly.
Kügelgen et al. [19] also point out the existence of Simpson's paradox after they compared the CFRs of COVID-19 (reported in 2020) in China and Italy. Although the overall conclusion is that the CFR in Italy is higher than in China, the CFR of every age group in China is higher than in Italy. The reason is that the proportion of the elderly in Italy is larger than in China.
According to Rubin's potential outcomes model or Pearl's causal graph, if we think that the reason for Non-Hispanic whites' longevity is good medical conditions instead of their race, then the lifespan is a confounder. Therefore, we should accept the grouping conclusion. On the other hand, if we believe that Non-Hispanic whites live longer because they are white, then the lifespan is a mediator, so we should accept the overall conclusion.
Example 1 is similar to Example 4, but the former is not easy to understand. The data show that the female applicants tended to choose majors with low admission rates (perhaps because lower thresholds resulted in more intense competition). This tendency is like the lifespan of the whites. If we regard this tendency as a confounder, the UCB had no gender bias against female applicants. On the other hand, if we regard gender as the cause of this tendency, the overall conclusion is acceptable, and a gender bias should have existed. Which of the two judgments is right depends on one's perspective.
Pearl's causal graph [3] makes it clear that for the same data, if the supposed causal relationships are different, the conclusions are also different. So it is not enough to have data only; we also need a structural causal model. However, the incremental school's philosopher Fitelson argues that, from the perspective of Bayesian confirmation, we should accept the overall conclusion according to the data without considering causation; according to his rational explanation, Simpson's Paradox does not exist. His reason is that we can measure causality with the measure [5]

D(x1, y1) = P(y1|x1) - P(y1).    (4)

He proves (see Fact 3 of the Appendix in [5]) that if there is

P(y1|x1, g) > P(y1|g) for every group g,    (5)

then there must be P(y1|x1) > P(y1). The result is the same when ">" is replaced with "<". Therefore, he affirms that, unlike RD and Pd, measure D does not result in the paradox. However, Equation (5) expresses a rigorous condition, which excludes all examples with joint distributions P(y, x, g) that cause the paradox, including his simplified example about the admissions of the UCB.
One can't help asking:
- For Example 2 about kidney stones, is it reasonable to accept the overall conclusion without considering the difficulties of the treatments?
- Is it necessary to extend or apply a Bayesian confirmation measure that is incompatible with the ECIT and medical practice to causal confirmation?
- Apart from the incompatible confirmation measure, are there no compatible confirmation measures? Besides the incremental school's confirmation measures, there are also the inductive school's confirmation measures, such as F, proposed by Kemeny and Oppenheim in 1952, and b*, provided by the author in 2020.
This paper mainly aims at:
- combining the ECIT to derive the causal confirmation measure Cc(x1=>y1) ("C" stands for confirmation and "c" for cause), which is similar to Pd but can also measure negative causal relationships, such as "vaccine => infected";
- explaining, through some examples with Simpson's Paradox, that measures Cc and Pd are more suitable for causal confirmation than measure D;
- supporting the inductive school of Bayesian confirmation in turn. When the author proposed measure b*, he also provided measure c* for eliminating the Raven Paradox [8]. To extend c* to causal confirmation, this paper presents measure Ce(x1=>y1), which indicates the outcome's inevitability or the cause's sufficiency.

Bayesian confirmation: Incremental school and inductive school
A universal judgment is equivalent to a hypothetical judgment or a rule; for example, "All ravens are black" is equivalent to "For every x, if x is a raven, then x is black". Both can be used as a major premise for a syllogism. Because of the criticism of Hume and Popper, most philosophers no longer expect to obtain absolutely correct universal judgments or major premises by induction but hope to get their degrees of belief. A degree of belief supported by a sample or sampling distribution is the degree of confirmation.
It is worth noting that a proposition does not need confirmation. Its truth value comes from its usage or definition [8]. An analytical judgment also does not need confirmation. For example, "People over 18 are adults" does not need confirmation; whether it is correct depends on the government's definition. Only major premises (such as "All ravens are black" and "If a person's Nucleic Acid Test is positive, he is likely to be infected with COVID-19") need confirmation.
A natural idea is to use the conditional probability P(h|e) to confirm a major premise or rule denoted by e→h. This measure is also recommended by Fitelson [5] and called confirm f:

confirm f = f(e, h) = P(h|e). (Carnap, 1962 [6]; Fitelson, 2017 [5])

However, P(h|e) depends very much on the prior probability P(h) of h. For example, where COVID-19 is prevalent, P(h) is large, and P(h|e) is also large. Therefore, P(h|e) cannot reflect the necessity of e. An extreme example: h and e are independent of each other, but if P(h) is large, P(h|e) = P(h, e)/P(e) = P(h) is also large. In this case, P(h|e) does not reflect the credibility of the causal relationship. For example, let h = "There will be no earthquake tomorrow" with P(h) = 0.999, and e = "The grapes are ripe". Although e and h are unrelated, P(h|e) = P(h) = 0.999 is very large. However, we cannot say that the ripe grapes support the claim that no earthquake will happen.
On the other hand, the inductive school's researchers use the difference (or likelihood ratio) between two conditional probabilities, representing the proportions of positive and negative examples, to express confirmation measures. These measures include

S(e1→h1) = P(h1|e1) - P(h1|e0) (Christensen, 1999 [24]).

They are all positively related to the Likelihood Ratio LR+ = P(e1|h1) / P(e1|h0); for example, L = log LR+ and F = (LR+ - 1) / (LR+ + 1) [7]. Therefore, these measures are compatible with the risk (or reliability) measures, such as Pd, used in medical tests and disease control. Although the author has studied semantic information theory for a long time [27,28,29], he is on the side of the inductive school of Bayesian confirmation. The reason is that information evaluation occurs before classification, whereas confirmation is needed after classification [8,25].
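The relation between these measures and LR+ can be checked numerically. The sketch below uses hypothetical channel probabilities and a hypothetical prior, derives P(h|e) by Bayes' theorem, and computes S, L, and F.

```python
import math

# Inductive-school measures from a hypothetical Shannon channel P(e|h)
# and prior P(h1); L and F are monotone functions of LR+ = P(e1|h1)/P(e1|h0).

p_h1 = 0.2
p_e1_h1, p_e1_h0 = 0.9, 0.1             # sensitivity and false-positive rate

# Derive P(h|e) from the channel and the prior (Bayes' theorem).
p_e1 = p_h1 * p_e1_h1 + (1 - p_h1) * p_e1_h0
p_h1_e1 = p_h1 * p_e1_h1 / p_e1
p_h1_e0 = p_h1 * (1 - p_e1_h1) / (1 - p_e1)

LR = p_e1_h1 / p_e1_h0                  # likelihood ratio LR+
S = p_h1_e1 - p_h1_e0                   # Christensen's S
L = math.log(LR)                        # log-ratio measure
F = (LR - 1) / (LR + 1)                 # Kemeny and Oppenheim's F [7]

print(LR, S, L, F)
```

With these numbers LR+ = 9, so F = 8/10 = 0.8, and all three measures are positive, agreeing in sign with the support of e1 for h1.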
Although the researchers understand confirmation differently, they all agree to use a sample including four types of examples, (e1, h1), (e0, h1), (e1, h0), and (e0, h0), with different proportions as the evidence to construct confirmation measures [10,8]. The main problem with the incremental school is that they do not distinguish well between the evidence for a major premise and that for the consequent of the major premise. When they use the four examples' proportions to construct confirmation measures, e is regarded as the major premise's antecedent, whose negation e0 is meaningful. However, when they say "to evaluate the supporting strength of e to h", e is understood as a sample, whose negation e0 is meaningless. It is even more meaningless to put a sample e or e0 in an example (e1, h1) or (e0, h1).
We compare D and S to show the main difference between the two schools' measures. Since P(h1) = P(e1)P(h1|e1) + P(e0)P(h1|e0), there is

D = P(h1|e1) - P(h1) = P(e0)[P(h1|e1) - P(h1|e0)] = P(e0)S,

so we can find that D changes with P(e0) or P(e1), but S does not. P(e) represents the source, and P(h|e) represents the channel. D is related to both the source and the channel, but S is only related to the channel. Measures F and b* are also only related to the channel P(e|h). Therefore, the author calls b* the channels' confirmation measure.
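The identity D = P(e0)·S follows from the total probability formula, and a quick numerical check makes the source dependence concrete; the channel probabilities below are made up.

```python
# Verifying D = P(h1|e1) - P(h1) = P(e0) * S for several sources P(e),
# with the channel P(h|e) held fixed: S stays constant while D shrinks
# as P(e0) shrinks.

def d_and_s(p_e1, p_h1_e1, p_h1_e0):
    p_e0 = 1 - p_e1
    p_h1 = p_e1 * p_h1_e1 + p_e0 * p_h1_e0   # total probability
    D = p_h1_e1 - p_h1
    S = p_h1_e1 - p_h1_e0
    return D, S, p_e0

for p_e1 in (0.1, 0.5, 0.9):
    D, S, p_e0 = d_and_s(p_e1, 0.8, 0.3)
    assert abs(D - p_e0 * S) < 1e-12          # the identity holds
    print(p_e1, D, S)                         # D varies, S does not
```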

The P-T probability framework and the methods of semantic information and cross-entropy for channels' confirmation measure b*(e→h)
In the P-T probability framework [25], there are both statistical probability P and logical probability (or truth value) T; the truth function of a predicate is also a membership function of a fuzzy set [30]. Therefore, the truth function also changes between 0 and 1. The purpose of proposing this probability framework is to set up the bridge between statistics and fuzzy logic.
The logical probability of yj is

T(θj) = Σx P(x) T(θj|x);

Zadeh calls it the probability of a fuzzy event [31]. When yj is true, the conditional probability of x is

P(x|θj) = P(x) T(θj|x) / T(θj).

Fuzzy set θj can also be understood as a model parameter; hence P(x|θj) is a likelihood function.
The differences between logical probability and statistical probability are as follows. The statistical probability is normalized (its sum is 1), whereas the logical probability is not; generally, we have T(θ0) + T(θ1) + ... > 1.
We can use the sample distribution to optimize the model parameters. For example, we use x to represent the age, use a logistic function as the truth function of the elderly: T("elderly"|x) = 1 / [1+exp (-bx+a)], and use a sampling distribution to optimize a and b.
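As a minimal sketch of this optimization, the code below fits a and b of the logistic truth function T("elderly"|x) = 1/(1 + exp(-b·x + a)) to a made-up labeled sample by a coarse grid search minimizing squared error; a real implementation would instead maximize likelihood or semantic information.

```python
import math

# Hypothetical (age, judged-elderly) sample; labels are 0/1 membership judgments.
sample = [(20, 0), (30, 0), (40, 0), (50, 0), (55, 0),
          (60, 1), (65, 1), (70, 1), (80, 1), (90, 1)]

def T(x, a, b):
    """Logistic truth function of the fuzzy predicate 'elderly'."""
    return 1.0 / (1.0 + math.exp(-b * x + a))

# Coarse grid search over a in [0, 60] and b in (0, 1].
best = (float("inf"), 0.0, 0.0)
for ai in range(0, 601):
    for bi in range(1, 101):
        a, b = ai * 0.1, bi * 0.01
        loss = sum((T(x, a, b) - y) ** 2 for x, y in sample)
        if loss < best[0]:
            best = (loss, a, b)
loss, a, b = best
print(a, b, loss)   # the fitted 0.5-crossing a/b lands near age 55-60
```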
The (amount of) semantic information about xi conveyed by yj is

I(xi; θj) = log [T(θj|xi) / T(θj)] = log [P(xi|θj) / P(xi)].

For different x, the average semantic information conveyed by yj is

I(X; θj) = Σi P(xi|yj) log [P(xi|θj) / P(xi)] = Σi P(xi|yj) log [1/P(xi)] - H(X|θj),    (10)

where H(X|θj) is a cross-entropy:

H(X|θj) = -Σi P(xi|yj) log P(xi|θj).

The cross-entropy has an important property: when we change P(x|θj) so that P(x|θj) = P(x|yj), H(X|θj) reaches its minimum. It is easy to find from Equation (10) that I(X; θj) reaches its maximum as H(X|θj) reaches its minimum. The author has proved that if P(x|θj) = P(x|yj), then T(θj|x) ∝ P(yj|x) [24]. If for all j, T(θj|x) ∝ P(yj|x), we say that the semantic channel matches the Shannon channel.
We use the medical test as an example to deduce the channels' confirmation measure b*. We define h ∈ {h0, h1} = {uninfected, infected} and e ∈ {e0, e1} = {negative, positive}. The Shannon channel is P(e|h), and the semantic channel is T(e|h). The major premise to be confirmed is e1→h1, which means "If one's test is positive, then he is infected." We regard the fuzzy predicate e1(h) as the linear combination of a clear predicate (whose truth value is 0 or 1) and a tautology (whose truth value is always 1). Let the tautology's proportion be b1' and the clear predicate's proportion be 1 - b1'. Then we have

T(θe1|h1) = 1, T(θe1|h0) = b1'.

The b1' is also called the degree of disbelief in rule e1→h1. The degree of disbelief optimized by a sample, denoted by b1'*, is the degree of disconfirmation. Let b1* denote the degree of confirmation; we have b1'* = 1 - |b1*|. By maximizing the average semantic information I(H; θ1) or minimizing the cross-entropy H(H|θ1), we can deduce (see Section 3.2 in [8])

b1* = b*(e1→h1) = [P(e1|h1) - P(e1|h0)] / max(P(e1|h1), P(e1|h0)) = (LR+ - 1) / max(LR+, 1).

Suppose that the likelihood function P(h|e1) is decomposed into an equiprobable part and a 0-1 part. Then we can deduce the predictions' confirmation measure c*:

c1* = c*(e1→h1) = [P(h1|e1) - P(h0|e1)] / max(P(h1|e1), P(h0|e1)).

Measure b* is compatible with the likelihood ratio and suitable for evaluating medical tests. In contrast, measure c* is appropriate for assessing the consequent's inevitability of a rule and can be used to clarify the Raven Paradox [8]. Moreover, both measures have the normalizing property and the symmetry mentioned above.
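The two measures can be sketched as follows, assuming the forms b* = [P(e1|h1) - P(e1|h0)] / max(P(e1|h1), P(e1|h0)) and c* = [P(h1|e1) - P(h0|e1)] / max(P(h1|e1), P(h0|e1)) from [8]; the test characteristics used are hypothetical.

```python
# Channels' confirmation b* (from the channel P(e|h)) and predictions'
# confirmation c* (from the prediction P(h|e)); both range in [-1, 1].

def b_star(p_e1_h1, p_e1_h0):
    return (p_e1_h1 - p_e1_h0) / max(p_e1_h1, p_e1_h0)

def c_star(p_h1_e1):
    p_h0_e1 = 1.0 - p_h1_e1
    return (p_h1_e1 - p_h0_e1) / max(p_h1_e1, p_h0_e1)

# Hypothetical test: sensitivity P(e1|h1) = 0.9, false-positive rate P(e1|h0) = 0.1.
print(b_star(0.9, 0.1))      # (LR+ - 1)/LR+ with LR+ = 9
print(b_star(0.1, 0.9))      # swapping the channel rows flips the sign
print(c_star(0.7))           # prediction-side confirmation for P(h1|e1) = 0.7
```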

Causal inference: talking from Simpson's Paradox
According to the ECIT, the grouping conclusion is acceptable for Example 2 (about kidney stones), whereas the overall conclusion is acceptable for Example 3 (about blood pressure). The reason is that P(y1|x1) and P(y1|x0) may not reflect causality well; besides the observed data or joint probability distribution P(y, x, g), we also need to suppose the causal structure behind the data [3].
Suppose there is the third variable, u. Figure 2 shows the causal relationships in Examples 2, 3, and 4. Figure 2 (a) shows the causal structure of Example 2, where u (kidney stones' size) is a confounder that affects both x and y. Figure 2 (b) describes the causal structure of Example 3, where u (blood pressure) is a mediator that affects y but is affected by x. In Figure 2 (c), u can be interpreted as either a confounder or a mediator. The causality will differ from different perspectives, and P(y1|do(x)) will also differ. In all cases, we should replace P(y|x) with P(y|do(x)) (if they are different) to get RD, RR, and Pd. We should accept the overall conclusion for the example where u is a mediator. But for the example where u is a confounder, how do we obtain a suitable P(y|do(x))? According to Rubin's potential outcomes model, we use Figure 3 to explain the difference between P(y|do(x)) and P(y|x).
To find the difference in the outcomes caused by x1 and x2, we should compare the two outcomes against the same background. However, there is often no situation where all other conditions remain unchanged except for the cause. For this reason, we need to replace x1 with x2 in our imagination and see the shift in y1 or its probability. If u is a confounder and not affected by x, the number of members in g1 and g2 should be unchanged with x, as shown in Figure 3. The solution is to use P(g) instead of P(g|x) for the weighting operation so that the overall conclusion is consistent with the grouping conclusion. Hence, the paradox no longer exists.

Figure 3. Eliminating Simpson's Paradox when a confounder exists by modifying the weighting coefficients. After replacing P(gk|xi) with P(gk) (k = 1, 2; i = 1, 2), the overall conclusion is consistent with the grouping conclusion; the average success rate of x2, P(y1|do(x2)) = 0.7, is higher than that of x1, P(y1|do(x1)) = 0.65.
Rubin's reason [2] for replacing P(g|x) with P(g) is that for each group, such as g1, the two subgroups' members (patients) treated by x1 and x2 are interchangeable (i.e., Pearl's causal independence assumption mentioned in [5]). If a member is divided into the subgroup with x1, its success rate should be P(y1|g, x1); if it is divided into the subgroup with x2, the success rate should be P(y1|g, x2). P(g|x1) and P(g|x2) are different only because half of the data are missing. But we can fill in the missing data using our imagination.
If u is a mediator, as shown in Figure 2 (b), a member in g1 may enter g2 because of x, and vice versa. P(g|x0) and P(g|x1) are hence different without needing to be replaced with P(g). We can let P(y1|do (x)) = P(y1|x) directly and accept the overall conclusion.
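The reweighting above is the adjustment P(y1|do(x)) = Σg P(g) P(y1|x, g). The sketch below applies it to the kidney stone counts commonly reported for this example [15,16] (x2 = Treatment A, x1 = Treatment B; g1 = small stones, g2 = large stones), first showing the paradox in the raw averages and then its elimination.

```python
# (successes, patients) per (treatment, group); classic kidney stone counts.
counts = {
    ("x2", "g1"): (81, 87),   ("x2", "g2"): (192, 263),
    ("x1", "g1"): (234, 270), ("x1", "g2"): (55, 80),
}

def p_do(treatment):
    """P(y1|do(x)) = sum over g of P(g) * P(y1|x, g)."""
    n_total = sum(n for (_, n) in counts.values())
    result = 0.0
    for g in ("g1", "g2"):
        n_g = sum(n for (t, gg), (_, n) in counts.items() if gg == g)
        s, n = counts[(treatment, g)]
        result += (n_g / n_total) * (s / n)
    return result

# Raw overall success rates: the paradox (x1 looks better).
overall = {t: sum(counts[(t, g)][0] for g in ("g1", "g2")) /
              sum(counts[(t, g)][1] for g in ("g1", "g2"))
           for t in ("x1", "x2")}
print(overall)                      # x1 ~0.83 vs x2 ~0.78
print(p_do("x2"), p_do("x1"))       # after adjustment: x2 ~0.83 vs x1 ~0.78
```

After replacing P(g|x) with P(g), the ordering reverses and agrees with both groups, just as Figure 3 describes.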

Probability measures for causation
In Rubin and Greenland's article [13], the quantity

Pd(t) = (R(t) - 1)/R(t)

is explained as the probability of causation, where t is one's age at exposure to some harmful environment, and R(t) is the age-specific relative risk of infection (the exposed population's infection rate divided by the unexposed population's). If y1 denotes the infection, x1 the exposure, and x0 no exposure, then R(t) = P(y1|do(x1), t) / P(y1|do(x0), t). The lower limit of Pd is 0 because a probability cannot be negative. When the change of t is neglected, considering the lower limit, we can write the probability of causation as

Pd = max(0, (R - 1)/R).

Pearl uses PN to represent Pd and explains PN as the probability of necessity [3]. Pd is very similar to confirmation measure b* [8]; the main difference is that b* ranges between -1 and 1. Robert van Rooij and Katrin Schulz [32] argue that conditionals of the form "If x, then y" are assertable only if

Δ*P(y|x) = [P(y1|x1) - P(y1|x0)] / [1 - P(y1|x0)]

is high. This measure is similar to confirmation measure Z. The difference between Pd and Δ*P is that Pd, like b*, is sensitive to the counterexamples' proportion P(y1|x0), whereas Δ*P is not. Table 1 shows their differences. The authors of [33] support the Ramsey test hypothesis, implying that the subjective probability of a natural-language conditional, P(if p then q), is the conditional subjective probability, P(q|p). This measure is confirm f in [5].
The author suggests [8] that we should distinguish two types of confirmation measures for x→y. One stands for the necessity of x compared with x0; the other stands for the inevitability of y. P(y|x) may be good for the latter but not for the former. The former should be independent of P(x) and P(y). Pd is such a measure.
However, there is a problem with Pd. If Pd is 0 when y is uncorrelated with x, then Pd should be negative instead of 0 when x inversely affects y (e.g., a vaccine restrains infection). Therefore, we need a confirmation measure between -1 and 1 instead of a probability measure between 0 and 1.
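The problem can be seen directly by comparing Pd = max(0, (R - 1)/R) with the measure Cc = (R - 1)/max(R, 1) stated in the abstract, using hypothetical vaccine data where P(y1|x1) = 0.01 (vaccinated, infected) and P(y1|x0) = 0.10 (unvaccinated).

```python
# For a restraining cause R < 1, so Pd clips to 0 while Cc stays negative.

def Pd(R):
    return max(0.0, (R - 1) / R)

def Cc(R):
    return (R - 1) / max(R, 1.0)

R = 0.01 / 0.10          # R = 0.1: the vaccine strongly restrains infection
print(Pd(R), Cc(R))      # Pd = 0.0; Cc is about -0.9
```

For R > 1 the two measures coincide, so Cc loses nothing in the positive-cause case while remaining informative in the negative-cause case.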

Defining Causal Posterior Probability (CPP)
To avoid treating association as causality, we first explain what kinds of posterior probabilities indicate causality. Posterior probability and conditional probability are often regarded as the same. But Rubin emphasizes that probability P(y_x) is not conditional; it is still marginal. To distinguish P(y_x) from the marginal probability P(y), we call P(y_x), i.e., P(y|do(x)), the causal posterior probability (CPP). Which posterior probabilities are CPPs? We use the following example to explain.
Consider the population age distribution: let z be the age and p(z) be the population age distribution. We may define that a person with z ≥ z0 = 60 is called elderly, that is, P(y1|z) = 1 for z ≥ z0. The label of an elderly person is y1, and the label of a non-elderly person is y0. The probability of the elderly is

P(y1) = Σ_{z ≥ z0} p(z).

Let x1 denote an improved medical condition. After a period, p(z) becomes p(z_x1) = p(z|do(x1)), and P(y1) becomes

P(y1|do(x1)) = Σ_{z ≥ z0} p(z|do(x1)).

Let x0 be the already existing medical condition; we have P(y1|do(x0)) = P(y1). There are similar examples:
- About whether a drug (x1) can lower blood pressure, blood sugar, blood lipids, or uric acid (z): if z drops to a certain level z0, we say that the drug is effective (y1).
- About whether a fertilizer (x1) can increase the grain yield (z): if z increases to a certain extent z0, the grain yield is regarded as a bumper harvest (y1).
- Can a process x1 reduce the size deviation z of a product? If the deviation is smaller than the tolerance (z0), we consider the product qualified (y1).

From the above examples, we can find that the action x can be the cause in a causal relationship because it can change the probability distribution p(z) of the objective result z, rather than changing the probability distribution P(y|.) of the outcome y. The reason is that P(y|.) also changes with the dividing boundary z0. For example, if the dividing boundary for the elderly changes from z0 = 60 to z0' = 65, the posterior probability P(y1|z0') of y1 will become smaller. This change seemingly also reflects causality. However, the author thinks this change is due to a mathematical cause, which does not reflect the causal relationship we want to study. Therefore, we need to define the CPP more specifically.

Definition 1. Random variable Z takes a value z ∈ {z1, z2, …}, and p(z) is the probability distribution of the objective result. Random variable Y takes a value y ∈ {y0, y1} and represents the outcome, i.e., the classification label of z. The cause or treatment is x ∈ {x0, x1} or {x1, x2}.
If replacing x0 with x1 (or x1 with x2) can cause a change in the probability distribution p(z), we call x the cause, p(z|x) or p(z_x) the CPP distribution, and P(y_x) = P(y|do(x)) the CPP.
According to the above definition, given y1, the conditional probability distribution p(z|y1) is not the CPP distribution because the probability distribution of z does not change with y.
Suppose that x1 is the COVID-19 vaccine, y1 is the infection, and e1 is a positive test result. Then P(y1|x1) or P(y1|do(x1)) is the CPP, whereas P(y1|e1) is not. We may regard y1 as the conclusion obtained by the best test, e1 as the result of a common test, and P(y1|e1) as the probability prediction of y1. P(y1|e1) is not a CPP because e1 changes neither p(z) nor the conclusion from the best test.
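Definition 1 can be sketched numerically: the cause shifts the distribution p(z), and the CPP is the mass at or above the dividing boundary z0. The distributions below are hypothetical.

```python
# CPP from a shifted objective-result distribution (z = age, z0 = 60).

def cpp(p_z, z0=60):
    """P(y1|do(x)) = sum of p(z) over z >= z0."""
    return sum(p for z, p in p_z.items() if z >= z0)

# p(z|do(x0)): before improved medical conditions; p(z|do(x1)): after.
p_z_x0 = {40: 0.3, 50: 0.3, 60: 0.25, 70: 0.1, 80: 0.05}
p_z_x1 = {40: 0.2, 50: 0.3, 60: 0.25, 70: 0.15, 80: 0.1}

print(cpp(p_z_x0), cpp(p_z_x1))   # the CPP of "elderly" rises from 0.40 to 0.50
```

Moving the boundary to z0 = 65 also lowers P(y1), but as the text notes, that change has a mathematical cause and is excluded from the CPP.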

Using x2/x1=>y1 to compare the influences of two causes on an outcome
In associated relationships, x0 is the negation of x1; they are complementary. But in causal relationships, x1 is the substitute for x0. For example, consider taking medicines to cure the disease. Let x0 denote taking nothing, and x1 and x2 represent taking two different medicines. Each of x1 and x2 is a possible alternative to x0 instead of the negation of x0. Furthermore, in some cases, x1 may include x0 (see Table 6 in Section 4.3).
To compare x1 with x0, we may selectively use "x1/x0=>y1" or "x1=>y1". For Example 2 with a confounder, if we consider the treatment as replacing x2 with x1 in our imagination, we can easily understand why the number of patients in each group should be unchanged, that is, P(g|x1) = P(g|x2) = P(g). The reason is that the replacement will not change everyone's kidney stone size.
In Example 3, u is a mediator, and the number of people in each group (with high or low blood pressure) is also affected by taking an antihypertensive drug x1. When we replace x0 with x1, P(g|x1) ≠ P(g|x0) ≠ P(g) is reasonable, and hence the weighting coefficients need not be adjusted. In this case, we can directly let P(y1|do(x)) = P(y1|x).

Deducing causal confirmation measure Cc by the methods of semantic information and cross-entropy
We use x1=>y1 as an example to deduce the causal confirmation measure Cc. If we need to compare any two causes, xi and xk, we may assume that one is default as x0.
The logical probability of s1 is (see Equation (7))

T(θ1) = Σx P(x) T(s1|x).

The probability of x1 predicted by y1 and s1 is

P(x1|θ1) = P(x1) T(s1|x1) / T(θ1),

where θj can be regarded as the parameter of the truth function T(sj|x).
The average semantic information conveyed by y1 and s1 about x is where H(X|θ1) is a cross-entropy. We suppose that sampling distribution P(x, y) has be modified so that P(y|x) = P(y|do(x)). According to the property of cross-entropy, H(X|θ1) reaches its minimum so that I(X; θj) reaches its maximum as P(x|θ1) = P(x|y1), i.e., From the above two equations, we obtain which represents the degree of correlation between xi and yj and may be independent of P(x) and P(y), unlike P(xi, yj). From Equations (25) and (26), we obtain the optimized degree of disbelief, i.e., the degree of disconfirmation: b1'* = m(x0,y1)/m(x1,y1).
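As a numerical check of the derivation above, the following sketch uses a toy joint distribution (assumed for illustration, with P(y|x) already equal to P(y|do(x))) to compute the degrees of correlation m(xi, yj) and the degree of disconfirmation b1'*:

```python
# Toy joint distribution P(x, y); keys are (x, y) pairs. The numbers are
# illustrative, not from the paper's tables.
P_xy = {("x0", "y0"): 0.30, ("x0", "y1"): 0.20,
        ("x1", "y0"): 0.10, ("x1", "y1"): 0.40}
P_x = {x: P_xy[(x, "y0")] + P_xy[(x, "y1")] for x in ("x0", "x1")}
P_y = {y: P_xy[("x0", y)] + P_xy[("x1", y)] for y in ("y0", "y1")}

# Degree of correlation m(xi, yj) = P(xi, yj)/(P(xi)P(yj))
m = {(x, y): P_xy[(x, y)] / (P_x[x] * P_y[y]) for (x, y) in P_xy}

# Degree of disconfirmation b1'* = m(x0, y1)/m(x1, y1) = P(y1|x0)/P(y1|x1) = 1/R
b1_prime = m[("x0", "y1")] / m[("x1", "y1")]
R = (P_xy[("x1", "y1")] / P_x["x1"]) / (P_xy[("x0", "y1")] / P_x["x0"])
assert abs(b1_prime - 1 / R) < 1e-12
print(round(b1_prime, 4))   # → 0.5
```

The assertion confirms that b1'* equals 1/R, which is why the resulting measure depends only on the relative risk.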
In the above formulas, we assume b1* > 0 and hence m(x1, y1) ≥ m(x0, y1). If m(x1, y1) < m(x0, y1), b1* should be negative, and b1'* should be m(x1, y1)/m(x0, y1). Then we have

b1* = 1 - b1'* (as b1* > 0) and b1* = b1'* - 1 (as b1* < 0).

Combining the above two equations, we derive the confirmation measure

Cc(x1=>y1) = Cc1 = b1* = (R - 1)/max(R, 1),

where R = P(y1|x1)/P(y1|x0) is the relative risk or the likelihood ratio used for Pd. Measure Cc has the normalizing property since its maximum is 1 as m(x0, y1) = 0, and its minimum is -1 as m(x1, y1) = 0. It has Cause Symmetry since

Cc(x0/x1=>y1) = -Cc(x1/x0=>y1).

Similarly, letting probability distribution P(y|x1) be the linear combination of a uniform probability distribution and a 0-1 distribution, we can obtain another causal confirmation measure:

Ce(x1=>y1) = [P(y1|x1) - P(y0|x1)]/max(P(y1|x1), P(y0|x1)) = [2P(y1|x1) - 1]/max(P(y1|x1), 1 - P(y1|x1)).

This measure can be regarded as the direct extension of Bayesian confirmation measure c*(e1→h1) [8]. It increases monotonically with the Bayesian confirmation measure f(h1, e1) = P(h1|e1), which is used by Fitelson et al. [8,33]. However, Ce has the normalizing property and the Outcome Symmetry:

Ce(x1=>y0) = -Ce(x1=>y1).
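The stated properties can be verified numerically. The sketch below (the helper names `Cc` and `Ce` are assumed; the Ce form used is [2P(y1|x1) - 1]/max(P(y1|x1), 1 - P(y1|x1)), the direct causal analog of c*) checks the normalizing property and the two symmetries:

```python
def Cc(p1, p0):
    """Cc(x1=>y1) = (R - 1)/max(R, 1), with R = P(y1|do(x1))/P(y1|do(x0))."""
    R = p1 / p0
    return (R - 1) / max(R, 1)

def Ce(p1):
    """Ce(x1=>y1) = (2*P(y1|x1) - 1)/max(P(y1|x1), 1 - P(y1|x1))."""
    return (2 * p1 - 1) / max(p1, 1 - p1)

# Cause Symmetry: Cc(x0/x1=>y1) = -Cc(x1/x0=>y1)
assert abs(Cc(0.8, 0.4) + Cc(0.4, 0.8)) < 1e-12
# Outcome Symmetry: Ce(x1=>y0) = -Ce(x1=>y1), since P(y0|x1) = 1 - P(y1|x1)
assert abs(Ce(0.7) + Ce(0.3)) < 1e-12
# Normalizing property: both stay within [-1, 1] even for extreme rates
assert -1 <= Cc(0.99, 0.01) <= 1 and -1 <= Cc(0.01, 0.99) <= 1
print(Cc(0.8, 0.4))   # R = 2, so Cc = 0.5
```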

Causal confirmation measures Cc and Ce for probability predictions
From y1, b1*, and P(x), we can make the probability prediction about x:

P(x|θ1) = P(x)T(θ1|x)/[P(x1) + b1'*P(x0)], with T(θ1|x1) = 1 and T(θ1|x0) = b1'*,

where b1* > 0, θ1 represents y1 with b1'*, and θ0 represents y0 with b0'*. If b1* < 0, we let T(s1|x1) = b1' and T(s1|x0) = 1, and then use the above formula. Following the probability prediction with Bayesian confirmation measure c* [8], we can also make the probability prediction for given x1 and Ce1. For example, when Ce1 is greater than 0, there is

P(y1|θx1) = 1/(2 - Ce1),

where θx1 denotes x1 with Ce1. Given the semantic channel ascertained by b1 > 0 and b0 > 0, as shown in Table 2, we can obtain the corresponding Shannon channel P(y|x). According to Equation (32), we can deduce

P(y1|x1) = (1 - b0'*)/(1 - b1'*b0'*), P(y1|x0) = b1'*(1 - b0'*)/(1 - b1'*b0'*).   (38)

Table 3 shows Example 2 with detailed data about kidney stone treatments [15]. The data were initially provided in [16]. In Table 3, *% means a success rate, and the number behind it is the number of patients. The stone size is a confounder. The conclusion from every group (with small or large stones) is that Treatment x2 (i.e., Treatment A in [15]) is better than Treatment x1 (i.e., Treatment B in [15]); whereas, according to the average success rates, P(y1|x2) = 0.78 and P(y1|x1) = 0.83, Treatment x1 is better than Treatment x2. There seems to be a paradox.
We tested Equation (38) with this example. The Shannon channel P(y|x) derived from the two degrees of disconfirmation b1'* and b0'* is the same as P(y|do(x)) shown in the last two rows of Table 3.

Table 4 shows the example with the detailed data about CFRs of COVID-19. The original data were obtained from the website of the Centers for Disease Control and Prevention (CDC) in the United States up to July 2, 2022 [34]. The data only include reported cases; otherwise, the CFRs would be lower. Here, x1 represents the Non-Hispanic white population, and x2 represents the other races. P(y1|x1, g) and P(y1|x2, g) are the CFRs of x1 and x2 in an age group g. See Appendix I for the original data and intermediate results. Table 5 shows how the overall (average) CFRs vary before and after we change the weighting coefficient from P(g|x) to P(g). From Table 4, we can find that for every age group, the CFR of the Non-Hispanic white population is lower than or close to that of the other races. However, over all age groups (see Table 5), the overall (average) CFR (1.04) of the Non-Hispanic white population is higher than the CFR (0.73) of the other races. After replacing P(g|x) with P(g), the overall CFR (0.80) of the Non-Hispanic white population becomes lower than that (1.05) of the other races.
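To make the Section 4.1 computation concrete, the following sketch uses the classic kidney-stone counts from the literature behind Table 3 (Treatment A = x2: 81/87 small stones, 192/263 large; Treatment B = x1: 234/270 small, 55/80 large; these counts are quoted from the well-known dataset and should be checked against Table 3). It performs the backdoor adjustment, computes Cc for x2 against the default cause x1, and verifies the channel recovery of Equation (38):

```python
def rate(s_n):
    """Success rate from a (successes, patients) pair."""
    return s_n[0] / s_n[1]

small = {"x1": (234, 270), "x2": (81, 87)}   # Treatment B, Treatment A
large = {"x1": (55, 80), "x2": (192, 263)}

# P(g) over all 700 patients, used as the adjusted weighting coefficient
n_small = small["x1"][1] + small["x2"][1]    # 357 patients with small stones
n_large = large["x1"][1] + large["x2"][1]    # 343 patients with large stones
P_small = n_small / (n_small + n_large)      # P(g = small) = 0.51

def do_rate(x):
    """Backdoor adjustment: P(y1|do(x)) = sum_g P(g) P(y1|x, g)."""
    return P_small * rate(small[x]) + (1 - P_small) * rate(large[x])

p2, p1 = do_rate("x2"), do_rate("x1")        # x2 now beats x1, as in each group
R = p2 / p1
Cc1 = (R - 1) / max(R, 1)                    # Cc(x2/x1 => y1) > 0

# Equation (38): recover P(y1|do(x2)) from b1'* and b0'*
b1p = p1 / p2                                # b1'* = 1/R (here p2 > p1)
b0p = (1 - p2) / (1 - p1)                    # b0'*
assert abs((1 - b0p) / (1 - b1p * b0p) - p2) < 1e-12
print(round(p2, 3), round(p1, 3), round(Cc1, 3))   # → 0.833 0.779 0.064
```

After adjustment, the overall conclusion agrees with the grouping conclusion, and the paradox disappears.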

COVID-19: the vaccine's negative influences on CFR and mortality
Using causal probability measure Pd, it is not convenient to measure the "probability" of "vaccine => infection" or "vaccine => death". The reason is that Pd is regarded as a probability, whose minimum value is 0, whereas the vaccine's influence is negative. However, there is no problem using Cc because Cc can be negative. Table 6 shows data obtained from the website of the US CDC [35] and the two degrees of causal confirmation. The numbers of cases and deaths are per 100,000 people (ages over 5) in a week (from June 20 to 26, 2022). The negative degree of causal confirmation -0.63 means that the vaccine reduced the infection rate by 63%. The -0.79 means that the vaccine reduced the CFR by 79%.
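The sketch below illustrates this negative degree of confirmation. The per-100,000 rates are made-up numbers chosen only so that R = 0.37, reproducing the -0.63 quoted above; they are not the CDC figures:

```python
def Cc_measure(r_cause, r_default):
    """Cc = (R - 1)/max(R, 1); negative when the cause restrains the outcome."""
    R = r_cause / r_default
    return (R - 1) / max(R, 1)

# Hypothetical weekly cases per 100,000 (illustrative, NOT the Table 6 data)
cases_vaccinated = 148.0
cases_unvaccinated = 400.0
R = cases_vaccinated / cases_unvaccinated    # 0.37
cc = Cc_measure(cases_vaccinated, cases_unvaccinated)
Pd = max(0.0, (R - 1) / R)   # Pd clips to 0 and loses the negative influence
print(round(cc, 2), Pd)      # → -0.63 0.0
```

The comparison shows why Cc is preferable here: Pd collapses to its lower limit 0, while Cc reports the 63% reduction directly.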
We obtained the above results without considering the vaccine's side effects, possibly resulting in chronic deaths.

Why can Pd and Cc indicate the strength of causation better than D in theory?
We call m(xi, yj) (i=0,1; j=0,1) the probability correlation matrix, which is not symmetrical. Although there exists P(x, y) first and then m(x, y) from the perspective of calculation, there exists m(x, y) first and then P(x, y) from the perspective of existence. That is, given P(x), m(x, y) only allows specific P(y) to happen.
We can also make probability predictions with m(x, y) (like using Bayes' formula):

P(yj|xi) = m(xi, yj)P(yj), P(xi|yj) = m(xi, yj)P(xi).

From Equations (29) and (30), we can find that Pd and Cc only depend on m(x, y) and are independent of P(x) and P(y). The two degrees of disconfirmation, b1'* and b0'*, ascertain a semantic channel and a Shannon channel. Therefore, the two degrees of causal confirmation, Cc1 = b1* and Cc0 = b0*, indicate the strength of the constraint relationship (causality) from x to y. Like Cc, measure Pd is also related only to m(x, y). Measures D and Δ*Pxy are different; they are related to P(x), so they do not indicate the strength of causation well.
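This independence can be illustrated numerically. The toy conditional rates below are assumed only for illustration; the check shows that Cc and Pd are unchanged when the cause's prior P(x1) changes, while D = P(y1|x1) - P(y1) is not:

```python
# Fixed conditional rates P(y1|do(x1)) and P(y1|do(x0)); illustrative values
p_y1_x1, p_y1_x0 = 0.02, 0.01
R = p_y1_x1 / p_y1_x0
Cc = (R - 1) / max(R, 1)     # depends only on the rate ratio
Pd = max(0.0, (R - 1) / R)   # likewise

def D(p_x1):
    """D = P(y1|x1) - P(y1), where P(y1) mixes the rates by the prior P(x1)."""
    p_y1 = p_x1 * p_y1_x1 + (1 - p_x1) * p_y1_x0
    return p_y1_x1 - p_y1

# Cc and Pd never mention P(x1); D varies with the coverage rate
assert D(0.2) != D(0.8)
print(round(Cc, 2), round(Pd, 2), round(D(0.2), 4), round(D(0.8), 4))
```

With these values, Cc = Pd = 0.5 for any prior, while D shrinks as P(x1) grows, so D measured in one region would not transfer to another.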
For example, considering the vaccine's effect on the CFR of COVID-19 (see Table 6), Pd and Cc are unrelated to the vaccination coverage rate P(x1), whereas measure Δ*Pxy is related to P(x1). Measure D is associated with P(y) and hence is also related to P(x1). Pd and Cc1 obtained from one region also fit other regions for the same variant of COVID-19. In contrast, Δ*Pxy and D are not universal because the vaccination coverage rate P(x1) differs between areas.
According to the incremental school's view of Bayesian confirmation, P(y1) is a prior probability, and P(y1|x) - P(y1) is its increment. However, when measure D is used for causal confirmation, P(y1) is obtained from P(x) and P(y1|x) after the treatment, so P(y1) is no longer a prior probability. This is also a fatal problem with the incremental school.
In addition, as results of induction, Cc and Pd can indicate the degree of belief of a fuzzy major premise and can be used for probability predictions, whereas D and Δ*Pxy cannot.

Why are Pd and Cc better than D in practice?
Two calculation examples in Sections 4.1 and 4.2 support the conclusion that measures Pd and Cc are better than D in practice. The reasons are as follows. For example, Table 6 shows that, according to the virulence of the virus, COVID-19 increases the mortality rate of vaccinated people from 1.3% to 1.318%. Therefore, the degree of causal confirmation is Cc1 = Pd = 0.014, which means that 1.4% of the deaths will be due to COVID-19. However, the meanings of D and Δ*Pxy are not so clear.
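The arithmetic quoted above can be reproduced directly:

```python
# r0: baseline mortality rate; r1: mortality rate with COVID-19 (from the text)
r0, r1 = 0.013, 0.01318
R = r1 / r0
Cc1 = (R - 1) / max(R, 1)
Pd = max(0.0, (R - 1) / R)       # equals Cc1 whenever R > 1
print(round(Cc1, 3), round(Pd, 3))   # → 0.014 0.014
```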
Unlike measure RD (see Equation (1)), Pd and Cc indicate the relative risk or the relative change of the outcome. Many people think COVID-19 is very dangerous because it can kill millions in a country. However, the mortality rate it brings is much lower than that caused by common causes. Pd and Cc can tell us the relative change in the mortality rate (see Table 6). Although it is essential to reduce or delay deaths, it is also vital to decrease the economic loss due to the fierce fight against the epidemic. Therefore, Pd and Cc can help decision-makers balance reducing or delaying deaths against reducing financial losses.

The confounder's influence is removed from Pd and Cc
When there is a confounder, as shown in Section 4.1, using Pd or Cc, we can eliminate Simpson's Paradox and make the overall conclusion consistent with the grouping conclusion: Treatment x2 is better than Treatment x1. In contrast, if we use D to compare the success rates of the two treatments, although we can avoid Simpson's Paradox, the conclusion is unreasonable. The reason is that we neglect the difficulties of treating size-different kidney stones. If a hospital only accepts patients who are easy to treat, its overall success rate must be high; however, such a hospital may not be a good one.

Pd and Cc allow us to view the third factor, u, from different perspectives
For the example in Section 4.2, if we think that one's longevity is related to one's race, we can take the lifespan as a mediator and then directly accept the overall conclusion (Non-Hispanic whites have a higher CFR than other people), as Fitelson does by using D.
On the other hand, if we believe that one's longevity is not due to one's race, then the lifespan is a confounder. In that case, we can make the overall conclusion consistent with the grouping conclusion by using Pd and Cc.
It is worth noting that the conclusion that the CFR of Non-Hispanic whites is lower than that of other people is probably reached because medical conditions affect the CFRs. However, the existing data contain no information about the medical conditions of different races. Otherwise, with the medical condition as a confounder, the CFRs of different races might be similar. This issue is worth researching further.
Why should we replace Pd with Cc?
Section 4.3 provides the calculation of the two negative degrees of causal confirmation that reflect the impacts of the vaccine on infection and mortality. The negative degrees of confirmation mean that the vaccine can reduce infections and deaths. However, if we use Pd as the probability of causation, Pd can only take its lower limit 0. Although we can replace Pd(vaccinated => death) with Pd(unvaccinated => death) to ensure Pd > 0, it does not conform to our thinking habits to take being vaccinated as the default cause. In addition, Cc has cause symmetry, whereas Pd does not.
When we use Pd to compare two causes x1 and x2, such as two treatments for kidney stones (see Section 4.1), we have to consider which of P(y1|x2) and P(y1|x1) is larger. Using Cc, however, we need not consider that, because there is no problem if (R - 1)/max(R, 1) is negative.
The correlation coefficient in mathematics is between -1 and 1. Cc can be understood as a probability correlation coefficient. The difference is that the former provides only one coefficient between x and y, whereas the latter provides two coefficients: Cc1 = Cc(x1=>y1) and Cc0 = Cc(x0=>y0).

Necessity and sufficiency in causality
Measures Pd and Cc only indicate the necessity of cause x to outcome y; they do not reflect the sufficiency of x or the inevitability of y. On the other hand, measures f = P(y|x) and Ce can indicate the outcome's inevitability.
The medical industry uses the odds ratio to indicate both the necessity and sufficiency of the cause to the outcome. The odds ratio [2] is

OR = [P(y1|x1)/P(y0|x1)] / [P(y1|x0)/P(y0|x0)] = [P(y1|x1)/P(y1|x0)] × [P(y0|x0)/P(y0|x1)].

It is the product of two likelihood ratios. We can use

ORN = (OR - 1)/max(OR, 1)

as the confirmation measure of both x0 => y0 and x1 => y1 for the same purpose. Unlike OR, ORN has the normalizing property and symmetry.
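A short sketch (with illustrative probabilities) verifies that OR is the product of the two likelihood ratios and that ORN = (OR - 1)/max(OR, 1) is normalized and sign-symmetric:

```python
# Illustrative conditional probabilities, not data from the paper's tables
p11, p10 = 0.8, 0.4          # P(y1|x1), P(y1|x0)
R1 = p11 / p10               # likelihood ratio for x1 => y1
R0 = (1 - p10) / (1 - p11)   # likelihood ratio for x0 => y0
OR = (p11 / (1 - p11)) / (p10 / (1 - p10))
assert abs(OR - R1 * R0) < 1e-12          # OR is the product of the two ratios

ORN = (OR - 1) / max(OR, 1)
# Swapping x1 and x0 inverts OR and flips the sign of ORN
OR_swap = (p10 / (1 - p10)) / (p11 / (1 - p11))
ORN_swap = (OR_swap - 1) / max(OR_swap, 1)
assert abs(ORN + ORN_swap) < 1e-12
print(round(OR, 2), round(ORN, 2))   # → 6.0 0.83
```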
Comparing the causal confirmation measures with the corresponding Bayesian confirmation measures, we have

Cc(x1=>y1) = b*(x1→y1) and Ce(x1=>y1) = c*(x1→y1),

where the probabilities P(y1|x1) and P(y1|x0) in b* and c* are replaced with P(y1|do(x1)) and P(y1|do(x0)). Their antecedents and consequents are the same. However, from the right sides' values of the above two equations, we may not be able to obtain the left sides' values because an associated relationship may not be a causal relationship.

Conclusions
Fitelson, a representative of the incremental school of Bayesian confirmation, uses D(x1, y1) = P(y1|x1) - P(y1) to denote the supporting strength of the evidence to the consequence and extends this measure to causal confirmation without considering the confounder. This paper has shown that measure D is incompatible with the ECIT and popular risk measures, such as Pd = max(0, (R - 1)/R). Using D, one can only avoid Simpson's Paradox but cannot eliminate it or provide a reasonable explanation as the ECIT does.
On the other hand, Rubin et al. use Pd as the probability of causation. Pd is better than D, but it is improper to call Pd a probability and to use a probability measure to measure the strength of causation. If we use Pd as a causal confirmation measure, it lacks the normalizing property and symmetry that an ideal confirmation measure should have. This paper has deduced causal confirmation measure Cc(x1=>y1) = (R - 1)/max(R, 1) by the semantic information method with the minimum cross-entropy criterion. Cc is similar to the inductive school's confirmation measure b* proposed by the author earlier, but the positive examples' proportion P(y1|x1) and the counterexamples' proportion P(y1|x0) are replaced with P(y1|do(x1)) and P(y1|do(x0)), so that Cc is an improved Pd. Compared with Pd, Cc has the normalizing property (it changes between -1 and 1) and cause symmetry (Cc(x0/x1=>y1) = -Cc(x1/x0=>y1)). Since Cc may be negative, it is also suitable for evaluating the inhibition relationship between cause and outcome, such as between vaccine and infection. This paper has provided some examples with Simpson's Paradox for calculating the degrees of causal confirmation. The calculation results show that Pd and Cc are more reasonable and meaningful than D, and that Cc is better than Pd mainly because Cc may be less than zero. In addition, this paper has also provided a causal confirmation measure Ce(x1=>y1) that indicates the inevitability of the outcome y1.
Since measure Cc and the ECIT support each other, the inductive school of Bayesian confirmation is also supported by the ECIT and the epidemiological risk theory.
However, like all Bayesian confirmation measures, causal confirmation measures Cc and Ce also rely on size-limited samples; hence, the degrees of causal confirmation are not strictly reliable. Therefore, replacing a degree of causal confirmation with a degree interval is necessary to reflect the inevitable uncertainty. This work needs further study combining existing theories.
Author Contributions: The author is the sole contributor.
Funding: This research received no external funding.
Data Availability Statement: Not applicable.