A Definition of Conditional Probability with Non-Stochastic Information

The current definition of conditional probability enables one to update probabilities only on the basis of stochastic information. This paper provides a definition of conditional probability with non-stochastic information. The definition is derived from a set of axioms, in which the information is connected to the outcome of interest via a loss function. An illustration is presented.


Introduction
The theory of conditional probability is a well-established mathematical theory that provides a procedure to update probabilities by taking new information into account. The new work in this paper is motivated by the fact that such a procedure is available only if the information which is used to update the probability concerns stochastic events, that is, events to which a probability is assigned. In other words, such information needs to be already included in the probability model. The statistical implications are most pertinent to Bayesian inference where non-stochastic information, obtained by experts for example, is often employed.
To set the scene, Ω denotes the set of possible outcomes of an experiment, which we assume is finite, p(ω) is a probability mass function on Ω, and I is an arbitrary piece of information about the unknown outcome of the experiment, i.e., information for which one would consider revising p(ω). We argue that there must be a formal way to derive the updated probability mass function, which we write as p_I(ω). For example, Ω might represent a horse race and a bookmaker has set p. It starts to rain, so I = "it is raining", and the bookmaker revises p to p_I. One imagines that the bookmaker does so from experience and uses no formal rule to implement the change. This lack of a formal rule then questions the extent of the relevance of subjective probability, if one can appeal to no formal mechanism for revising belief distributions with arbitrary, yet relevant, information. Recall that the formal Bayes conditional probability rule requires the specification of p(I|ω) for all possible pieces of information I that one could receive; to us, at least, this is an impossible task.
It is impossible to set the probabilities p(I|ω) a priori for all possible I, yet this is what the mathematical implementation of the Bayes rule requires. However, setting a loss function l(ω, I) once I has been received is, we argue, an entirely viable task. This establishes the loss, on some scale, e.g., monetary, incurred if the elementary event ω occurs once I has been received. Our formal updating rule relies on the construction of such an l(ω, I).
Here, I is a piece of information for which it is possible to construct a loss function l(·, I) on Ω that takes values in [0, ∞] and is not identically infinite. So, l(ω, I) is the loss that the experimenter assigns to the outcome ω in Ω when holding the piece of information I. We look at an illustration involving such an l(ω, I) later on in the paper. The probability to be updated once I has been received is P, where P(B) = ∑_{ω∈B} p(ω) for every B ⊂ Ω.
For example, if I is the information "event B ⊂ Ω has occurred", then the appropriate loss is

l(ω, I) = 0 for ω ∈ B, and l(ω, I) = ∞ for ω ∈ B^c, (1)

and our formal rule will coincide with the usual definition of conditional probability. When the standard definition of conditional probability does not apply, e.g., when I is not such a statement or B is not a subset of Ω, an alternative definition based on a mathematical decision-theoretic framework can be used. When the information received is non-stochastic, but relevant to an outcome of interest, we cannot use a probability distribution, and so we need an alternative way to connect the information I with the outcome of interest ω. We do so using loss functions and a set of axioms.
This paper provides a definition for conditional probability on the basis of I. Reference [1] addressed this issue considering the minimization of a cumulative loss function which involved g-divergences. In this paper, instead, the definition of conditional probability is solely based on a set of axioms. This idea was developed by reference [2] who defined a framework for general Bayesian inference and discussed its application to important statistical problems. The present paper also addresses the issue of calibrating the conditional probability with non-stochastic information.

Relationship to the Literature
In the literature, alternative definitions of conditional probability, such as Jeffrey's rule of conditioning, are given where the new information is not put in terms of the occurrence of an event included in the model. These definitions rely on the assumption that the information can be given in the form of a constraint on the probability. Constraints considered are of the type

∑_{ω∈Ω} g(ω) p_I(ω) = c, (2)

where g is a measurable real function on Ω, c is a given constant, and p_I is the updated probability function. The idea is to minimize the Kullback-Leibler divergence between p_I and p subject to the constraint (2), which represents the information I. This problem can be solved using Lagrange multipliers. For more detail about conditionalization based upon constraints on the conditional distribution, see references [3][4][5][6][7][8]. Our approach is different, as we can deal with more types of information. Indeed, our definition can encompass potentially arbitrary information about the outcome; all we need is to construct a loss function l(ω, I) for each ω in Ω once I has been received.
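The constraint-based approach above can be sketched numerically. The following is a minimal illustration (function and variable names are ours, not from the references): minimizing the Kullback-Leibler divergence subject to a mean constraint of the form (2) yields an exponential tilting of the prior, with the Lagrange multiplier found here by simple bisection.

```python
import math

def tilt_to_constraint(p, g, c, tol=1e-12):
    """Revise p subject to sum_w g(w) p_I(w) = c, minimizing KL(p_I || p).
    The solution is the exponential tilting p_I(w) ∝ p(w) exp(theta g(w)),
    with the Lagrange multiplier theta chosen (by bisection) to meet the
    constraint."""
    def mean_g(theta):
        w = [pi * math.exp(theta * gi) for pi, gi in zip(p, g)]
        z = sum(w)
        return sum(wi * gi for wi, gi in zip(w, g)) / z

    lo, hi = -50.0, 50.0  # assumes the multiplier lies in this bracket
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_g(mid) < c:  # mean_g is increasing in theta
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    w = [pi * math.exp(theta * gi) for pi, gi in zip(p, g)]
    z = sum(w)
    return [wi / z for wi in w]

# Example: uniform prior on four outcomes; constrain the mean of g to 2.0.
p_uniform = [0.25, 0.25, 0.25, 0.25]
g_vals = [0.0, 1.0, 2.0, 3.0]
p_I = tilt_to_constraint(p_uniform, g_vals, 2.0)
```

The returned p_I is the member of the exponential family through p that satisfies the constraint exactly, which is the standard Lagrange-multiplier solution mentioned in the text.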

Motivation
The need for a definition of conditional probability outside of the usual set-up is the avoidance of paradoxes, where one knows that an event B has occurred, yet B is not a subset of Ω. Paradoxes arise when B is not deemed to be part of the stochastic model yet is subsequently forced to be so.
Such difficulties arise in different puzzles, such as, for instance, Freund's puzzle of the two aces, introduced by reference [9]. For other puzzles about conditional probability, see, for instance, Gardner [10].
These puzzles have been widely used to discuss the concept of conditional probability. According to reference [11], such a concept is justifiable only on the basis of "a set of rules that tell, at each step, what can happen next", which he calls a "protocol". For conditioning on an event B, he asserts that "we are assuming B is in your probability model, i.e., in the field of events to which you assign probabilities. So it is implicit in the principle of total evidence that your probability model should include a model for what you learn". On the other hand, Hutchison [12,13] emphasizes that the updating process needs to take into account the circumstances under which the truth of I was conveyed. Also, Bar-Hillel and Falk [14] claim that knowing how knowledge was obtained is "a crucial ingredient to select the appropriate model". These authors present different views about the concept of conditionalization, but all agree on the fact that there would be no problem if it were known how the information I had become available, since one could then build a model including I.
The concept of conditional probability distributions is certainly appropriate as a procedure to update probabilities on the basis of any new information that has already been included in the probability model. However, it can be difficult to construct a model that considers all possible relevant information that could become available in the future. Therefore, a problem arises when one obtains some new and possibly unexpected information and wants to use it to update a probability distribution. Indeed, it does not seem appropriate to assess the probability of something which has been already observed. Our basic assumption is that the information I can be connected to the outcome of interest via a loss function l(ω, I). In this way, it is possible to update the probability P, even if I is some new unexpected information that was not included in the probabilistic framework.
Clearly, we are dealing with information about outcomes of interest which can take many forms. It is well known that Bayesian inference can rely on expert opinions, which often take the form of non-stochastic information. In particular, our framework allows coherent updates of probabilities involving such information; hence the practical relevance of our framework to Bayesian inference.

Results
This section reports the current definition of conditional probability and presents and motivates our definition for conditional probability with non-stochastic information.

The New Definition
If p(ω) represents prior beliefs about ω, then we argue that a valid and coherent update of p(·) on the basis of I is the posterior, p_I(·), where

p_I(ω) = p(ω) exp{−λ l(ω, I)} / ∑_{ω′∈Ω} p(ω′) exp{−λ l(ω′, I)}, (3)

and λ is a positive constant. Looking at this form, it seems that we are proposing the equivalent of a Bayes rule with

p(I|ω) ∝ exp{−λ l(ω, I)}. (4)

This is correct only if the piece of information I is an element of a known space of possible outcomes. Indeed, only in such a case would it be possible to assess a probability distribution on such a space.
In our proposal, a much less complex framework is considered, since one is only required to set the loss l(ω, I) once I has been received. Our axioms, which lead to (3), do not regard (4) as a probability.
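The update (3) is straightforward to compute. A minimal implementation might look as follows (names ours); note that infinite losses are handled naturally, since exp(−∞) = 0:

```python
import math

def update(p, loss, lam=1.0):
    """Update a prior p (dict: outcome -> probability) with a loss l(., I)
    (dict: outcome -> loss in [0, inf]) via p_I(w) ∝ p(w) exp(-lam * loss[w]),
    as in (3). Outcomes with infinite loss receive zero updated probability."""
    w = {o: p[o] * math.exp(-lam * loss[o]) for o in p}
    z = sum(w.values())
    return {o: w[o] / z for o in w}

# Example with invented numbers: outcome 'w3' is ruled out by an infinite loss.
prior = {'w1': 0.5, 'w2': 0.3, 'w3': 0.2}
l_I = {'w1': 0.0, 'w2': 1.0, 'w3': float('inf')}
posterior = update(prior, l_I)
```

Here 'w1', being the zero-loss outcome, gains probability relative to the prior, while 'w3' is disregarded entirely.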
We shall now go into the details of how some natural assumptions imply (3). The following axioms are considered:
1. The posterior is of the form p_I(·) = ψ(l(·, I), p(·)), for some function ψ. Since l(·, I) and p(·) are all that are available and set, it is clear that the update must solely depend on these functions. We are, hence, asking for the unique ψ which provides the update for all Ω, i.e., is Ω invariant. This is a reasonable requirement, since how one updates should not depend on Ω.
2. ψ(l(·, I_1) + l(·, I_2), p(·)) = ψ(l(·, I_2), ψ(l(·, I_1), p(·))). (5)
This ensures we end up with p_{I_1,I_2}(ω) as the same object whether we update with (I_1, I_2) together or with I_1 and I_2 one after the other.
3. If l(ω, I) = ∞ for every ω ∈ B^c and some B ⊂ Ω, then ψ(l(·, I), p(·)) = ψ(l(·, I)|_B, p_{J(B)}(·)), where l(·, I)|_B is the restriction of l(·, I) to B, J(B) is the information that B certainly occurs or has occurred, and p_{J(B)} is p restricted and normalized to B, i.e., p_{J(B)}(ω) = p(ω) 1(ω ∈ B)/∑_{ω′∈B} p(ω′). In other words, outcomes with infinite loss are disregarded in the updating.
4. For fixed p(·), the updated probability p_I(ω) is a monotone non-increasing function of the loss l(ω, I); a larger loss on ω cannot yield a larger updated probability of ω.
5. If l(ω, I) ≡ constant, then ψ(l(·, I), p(·)) ≡ p(·). That is, if the observation provides no information about ω, since the loss function is constant, then the posterior is the same as the prior.
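For the exponential update singled out by Theorem 1 below, the key axioms can be checked numerically. The following is a minimal sketch (outcome labels and all numbers are ours): it verifies that joint and sequential updating coincide (axiom 2), and that an infinite loss off a set B reduces to conditioning on B combined with the restricted loss (axiom 3).

```python
import math

def update(p, loss, lam=1.0):
    # p_I(w) ∝ p(w) exp(-lam * l(w)); the form singled out by the axioms
    w = {o: p[o] * math.exp(-lam * loss[o]) for o in p}
    z = sum(w.values())
    return {o: w[o] / z for o in w}

p  = {'a': 0.2, 'b': 0.5, 'c': 0.3}
l1 = {'a': 1.0, 'b': 0.3, 'c': 2.0}
l2 = {'a': 0.4, 'b': 1.1, 'c': 0.0}

# Axiom 2: updating with l1 + l2 jointly equals updating with l1, then l2.
joint = update(p, {o: l1[o] + l2[o] for o in p})
seq   = update(update(p, l1), l2)

# Axiom 3: infinite loss off B = {'a', 'b'} reduces to conditioning on B
# combined with the restricted loss.
linf  = {'a': l1['a'], 'b': l1['b'], 'c': float('inf')}
upd   = update(p, linf)
pB    = {'a': 0.2 / 0.7, 'b': 0.5 / 0.7}   # p restricted and normalized to B
restr = update(pB, {'a': l1['a'], 'b': l1['b']})
```

Both pairs (joint vs. seq, and upd restricted to B vs. restr) agree up to floating-point error, as the axioms require.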
As a consequence of these, we have the following theorem:
Theorem 1. For |Ω| < ∞, with axioms 1 to 5, it is uniquely implied that

p_I(ω) = p(ω) exp{−λ l(ω, I)} / ∑_{ω′∈Ω} p(ω′) exp{−λ l(ω′, I)},

for some λ > 0.
The proof of the theorem relies on the following intermediate result.
Lemma 1. Axioms 1 to 5 imply the following two further conditions:
6. ψ(l(·, I) + c, p(·)) = ψ(l(·, I), p(·)) for every constant c; that is, the loss can be translated by a constant without changing the update.
7. For every B ⊂ Ω, conditioning the updated probability to B yields the same result as updating the conditional prior with the restricted loss; that is,

(p_I)_{J(B)}(·) = ψ(l(·, I)|_B, p_{J(B)}(·)), (8)

where p_I = ψ(l(·, I), p(·)).
Proof of Lemma 1. The combination of axioms 2 and 5 yields condition 6. Indeed, one just needs to consider in (5) the case where the loss function l(·, I_1) or l(·, I_2) is constant. Combining conditions 3 and 5, one obtains that updating with the loss (1) corresponding to the information J(B) yields the conditional probability p_{J(B)}(ω) = p(ω) 1(ω ∈ B)/P(B), that is, the probability obtained by restricting and normalizing the prior to B. Now, consider I_1 = J(B) and let I_2 be any piece of information associated with a given loss function l. The sum of the two corresponding losses is infinite on B^c and coincides with l on B, which is the loss considered in condition 3. Therefore, based on condition 2, condition 7 is obtained.
Proof of Theorem 1. To deal with the odds, (5) becomes

φ(l + h, t) = φ(h, φ(l, t)), (9)

where φ(l, t) is the updated value of the odds t under the loss values (l, 0), and h is a replication of l with a possibly different I. Moreover, with a constant loss, the posterior is equal to the prior (condition 5), i.e.,

φ(0, t) = t, (10)

for every t > 0. At this stage, we consider a prior with three mass points, say Ω = {ω_1, ω_2, ω_3}. The prior is given by {z_1, z_2, 1 − z_1 − z_2} (i.e., p(ω_1) = z_1 and p(ω_2) = z_2, and we set z_3 := 1 − z_1 − z_2), with l(ω_i, I) = l_i, for i = 1, 2, 3. The loss can be taken to be zero at one point without loss of generality (condition 6), so it takes values in the set {l_1, l_2, 0}. Let us consider the updating rule, φ, for priors with just two mass points. We can use it to update the conditional probability of {ω_1} given {ω_1, ω_3}, i.e., z_1/(z_1 + z_3). To this aim, we update the prior with masses z_1/(z_1 + z_3) and z_3/(z_1 + z_3), considering just the loss values (l_1, 0), i.e., disregarding the point ω_2 with its loss l_2. In other words, we aim to obtain the right-hand side of (8) and then apply condition 7 given by Lemma 1. Let t_{1,3} denote the odds corresponding to the conditional probability of ω_1 given {ω_1, ω_3}, that is, t_{1,3} = z_1/z_3; t_{1,3} is updated on the basis of the loss values l_1 and zero, i.e., with φ(l_1, t_{1,3}). Similarly, we define t_{1,2} = z_1/z_2 and t_{2,3} = z_2/z_3. We update t_{1,2} on the basis of l_1 and l_2, i.e., with φ(l_1 − l_2, t_{1,2}), and t_{2,3} with φ(l_2, t_{2,3}). Clearly, t_{1,3} = t_{1,2} t_{2,3}, and this factorization of conditional odds has to hold also after updating, i.e.,

φ(l_1, t_{1,2} t_{2,3}) = φ(l_1 − l_2, t_{1,2}) φ(l_2, t_{2,3}), (11)

which states that the updated odds satisfy t̃_{1,3} = t̃_{1,2} · t̃_{2,3}. Formally, this identity is a consequence of (8), i.e., updating the conditional probability is the same as conditioning the updated probability. Since (11) must hold for every t_{1,2}, t_{2,3} > 0, and for every l_1, l_2 ∈ R,

φ(l_1, ts) = φ(l_1 − l_2, t) φ(l_2, s), (12)

for every t, s > 0 and l_1, l_2 ∈ R. If l_2 = 0, and say l_1 = l, recalling (10),

φ(l, ts) = φ(l, t) s, (13)

for every t, s > 0 and every real l. Letting t = 1, we find that

φ(l, s) = φ(l, 1) s,

for every s > 0 and every l ∈ R.
A combination of (9) and (13) yields φ(l + h, 1) = φ(l, φ(h, 1)) = φ(l, 1)φ(h, 1) for every h, l ∈ R, which, in turn, implies by monotonicity that φ(l, 1) = exp(−λ l) for some λ ∈ R. Indeed, according to condition 4, φ(l, 1) is a monotone non-increasing function of l, which implies that λ must be positive. Hence,

φ(l, t) = t exp{−λ l}, (14)

for every l ∈ R and every t > 0. In this way, we are basically done with the two-point case. Let us now consider the general finite case. We want to update the prior p(·) with mass points at {ω_1, . . . , ω_m} given by {z_1, . . . , z_{m−1}, 1 − (z_1 + · · · + z_{m−1})}, where z_1, . . . , z_{m−1} are non-negative and their sum is less than or equal to one; it is convenient to set z_m := 1 − (z_1 + · · · + z_{m−1}). In terms of odds, the prior is given by the vector (t_1, . . . , t_{m−1}), being t_i = z_i/(1 − z_i), or equivalently, z_i = t_i/(1 + t_i). Moreover, we consider an R^{m−1}-valued function φ̄(l, t) = (φ̄_1(l, t), . . . , φ̄_{m−1}(l, t)), which provides the vector of the updated odds, being l = (l_1, . . . , l_m) and t = (t_1, . . . , t_{m−1}). Now the question is how we could recover φ̄ from φ, where the latter gives the updating rule for the two-point case. Recall the notation used for the conditional odds, i.e., t_{i,j} = z_i/z_j is the odds corresponding to the conditional probability of {ω_i} given {ω_i, ω_j}, for distinct i, j = 1, . . . , m. We can see that

z_j = (1 + ∑_{i≠j} t_{i,j})^{−1}. (15)

According to (8), this identity will also have to be satisfied by the updated odds. Since we update t_{i,j} with φ(l_i − l_j, t_{i,j}), we must have t̃_{i,j} = φ(l_i − l_j, t_{i,j}), which by (14) becomes

t̃_{i,j} = t_{i,j} exp{−λ(l_i − l_j)}.

Since t̃_{i,j} = z̃_i/z̃_j, (15) becomes z̃_j = (1 + ∑_{i≠j} t̃_{i,j})^{−1}, and the updated probability of {ω_j} is

p_I(ω_j) = z̃_j = p(ω_j) exp{−λ l_j} / ∑_{i=1}^m p(ω_i) exp{−λ l_i}.

In this way, we have shown how to extend our coherent updating rule from the two-point case to the general finite case.
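The two-point rule derived above, φ(l, t) = t exp(−λ l), can be checked numerically against (10) and the factorization of conditional odds in (11). A small sketch (the prior masses, losses, and the value of λ are ours):

```python
import math

LAM = 1.3  # any positive calibration constant

def phi(l, t):
    # two-point odds-updating rule derived in the proof: phi(l, t) = t exp(-LAM l)
    return t * math.exp(-LAM * l)

# Three mass points with invented values, as in the three-point argument.
z1, z2, l1, l2 = 0.2, 0.5, 0.7, 0.2
z3 = 1.0 - z1 - z2
t12, t23, t13 = z1 / z2, z2 / z3, z1 / z3

# (10): a zero (constant) loss leaves the odds unchanged.
unchanged = phi(0.0, 2.5)

# (11): the factorization of conditional odds survives the update.
lhs = phi(l1, t13)                       # update t_{1,3} with losses (l1, 0)
rhs = phi(l1 - l2, t12) * phi(l2, t23)   # update t_{1,2} and t_{2,3} separately
```

Both sides agree, confirming that the exponential rule is consistent with conditioning the updated probability.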
The proof implies the existence of a λ > 0, and this is no surprise, since loss functions are only defined up to scalar factors. Hence, the setting of λ is a calibration issue, and to set it, we need a loss that is not associated with any specific piece of information I. So, l(ω) is set to be the loss function that represents initial beliefs at the outset, before any further information I is obtained, and this l(ω) is defined to be on the same scale as any l(ω, I). We regard l(ω) = l(ω, I_0), where I_0 is the prior information used to set p.
The prior expected loss is then defined by ∑_{ω∈Ω} p(ω) l(ω). Now, entropy is also regarded as an expected loss, based on the self-information loss function −log p(ω). Hence, we also have the expected loss/entropy, which is measured as −∑_{ω∈Ω} p(ω) log p(ω), and so calibration is achieved by matching these expected losses, so that

λ ∑_{ω∈Ω} p(ω) l(ω) = −∑_{ω∈Ω} p(ω) log p(ω).

Finally, in this section, we note that it is quite straightforward to extend the uniqueness argument to countably infinite Ω. However, more work would be needed to extend the uniqueness argument to general Ω.
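The calibration of λ by matching the scaled prior expected loss with the entropy can be sketched as follows (function name ours):

```python
import math

def calibrate_lambda(p, baseline_loss):
    """Set lambda by matching lambda * E_p[l] to the entropy of the prior,
    i.e., lambda * sum_w p(w) l(w) = -sum_w p(w) log p(w)."""
    entropy = -sum(pi * math.log(pi) for pi in p if pi > 0)
    expected_loss = sum(pi * li for pi, li in zip(p, baseline_loss))
    return entropy / expected_loss

# Example: uniform prior on four outcomes with a constant baseline loss of 2;
# the entropy is log 4, so lambda = (log 4) / 2.
lam = calibrate_lambda([0.25, 0.25, 0.25, 0.25], [2.0, 2.0, 2.0, 2.0])
```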

An Illustration
The loss function l is chosen by the decision-maker on the basis of the available information. Such information sometimes happens to be stochastic, i.e., the occurrence of an event B ⊂ Ω to which a probability is assigned. If this is the case, one should update the probability by means of the usual conditional probability, which is tantamount to using the loss function (1) in (3). If the available information is not stochastic, then one can resort to the approach described in the present paper to properly assess the loss function l. A simple and very concrete example is now presented. Consider a horse race in which six horses participate. In order to decide how to bet, one assesses the probability for each horse to win. Let p(j) denote the probability that horse number j wins, for j ∈ {1, . . . , 6}. In this example, the elements of Ω are the numbers corresponding to the horses participating in the race, of which one will win.
Before the race begins, it starts raining. Since conditions have changed, the probabilities need to be updated. It is problematic to pursue this aim by resorting to the current definition of conditional probability; in fact, this requires knowledge of the probability that it rains and that horse number j wins. As an alternative, one could calculate the conditional probability of a win for each horse by applying Bayes' theorem, which requires the probability that it rains given the win of horse j. However, it is raining and the race has not yet been run! It is therefore appropriate to resort to the definition of conditional probability given in this paper. So, p(j) is the prior probability that horse j wins, and I is the information "it is raining". Let l(j) be the baseline loss function, i.e., the loss incurred by the bookmaker (that is, the governor of the probabilities) if horse j wins. To elaborate, such an l(j) could be the assessed loss if the bookmaker puts all the takings received, or a fixed amount, on horse j to win.
The bookmaker then, on the same scale, assesses l(j, I) for each j, which adjusts the loss l(j). Obviously, a higher loss will be assigned to those horses whose ability to run is more adversely affected by the rain. In this way, one can use the ideas for setting λ to get l̃(j) = λ l(j, I). The updated probability that the j-th horse wins is then

p_I(j) = p(j) exp{−l̃(j)} / ∑_{i=1}^6 p(i) exp{−l̃(i)}, for j = 1, . . . , 6.
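As a numerical sketch of this illustration (all prior probabilities and losses below are invented for the example, not taken from the paper):

```python
import math

# Hypothetical numbers for the six-horse illustration.
p        = [0.30, 0.25, 0.15, 0.12, 0.10, 0.08]  # prior win probabilities
baseline = [1.00, 1.20, 0.80, 1.50, 0.90, 1.10]  # l(j): baseline bookmaker loss
rain     = [2.00, 0.10, 1.00, 0.30, 1.50, 0.20]  # l(j, I): losses once it rains

# Calibrate lambda: lambda * E_p[baseline loss] = entropy of the prior.
entropy = -sum(pj * math.log(pj) for pj in p)
lam = entropy / sum(pj * lj for pj, lj in zip(p, baseline))

# Updated win probabilities: p_I(j) ∝ p(j) exp(-lam * l(j, I)).
w = [pj * math.exp(-lam * lj) for pj, lj in zip(p, rain)]
z = sum(w)
p_rain = [wj / z for wj in w]
```

With these invented numbers, horse 2, hardly affected by the rain, gains probability, while horse 1, strongly affected, loses probability, exactly the qualitative behavior described above.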

Discussion
We have established a framework in which we can update probabilities in the light of general, i.e., non-stochastic, information. Given that we cannot connect the information and the outcome of interest via a probability model, we do so through a loss function. Minimizing a cumulative loss function, involving the information on one side and the probability distribution on the other, yields the updated probability distribution. When the information is stochastic, we employ the self-information loss function; the solution then reverts to the standard definition of conditional probability.
We believe the framework has direct implications for Bayesian inference where "word of mouth" information from experts is often obtained. It is interesting to note that our framework allows this information to update p at any point. Usually, the non-stochastic information I comes before the stochastic information given by an event B included in the probability framework, but in our approach, we can have B before I.

Author Contributions:
The order of the authors is alphabetical. They contributed equally.
Funding: This research received no external funding.