Update of Prior Probabilities by Minimal Divergence

The present paper investigates the update of an empirical probability distribution with the results of a new set of observations. The update reproduces the new observations and interpolates using prior information. The optimal update is obtained by minimizing either the Hellinger distance or the quadratic Bregman divergence. The results obtained by the two methods differ. Updates with information about conditional probabilities are considered as well.


Introduction
The present work is inspired by the current practices in Information Geometry [1][2][3] where minimization of divergences is an important tool. In Statistical Physics a divergence is called a relative entropy. Its importance was noted rather late in the twentieth century, after the work of Jaynes on the maximal entropy principle [4]. Estimation in the presence of hidden variables by minimizing a divergence function is briefly discussed in Chapter 8 of [2].
Assume now that some observation or experiment yields new statistical data. The approach is then to look for a probability distribution that reproduces the newly observed probabilities and that interpolates the data with missing information coming from a prior.
No further model assumptions are imposed. Hence, the statistical model under consideration consists of all probability distributions that are consistent with the newly obtained empirical data. Internal consistency of the empirical data ensures that the model is not empty. The update is the model point that minimizes the chosen divergence function from the prior to the manifold of the model.
In the context of Maximum Likelihood Estimation (MLE) one usually adopts a parameterized model. The dimension of the model can be kept low and properties of the model can be used to ease the calculations. One assumes that the new data can lead to a more accurate estimation of the limited number of model parameters. It can then happen that the model is misspecified [5] and that the update is only a good approximation of the empirical data.
Here, the model is dictated by the newly acquired empirical data and the update is forced to reproduce the measured data. Finding the probability distribution is then an underdetermined problem. Minimization of the divergence from the prior probability distribution solves the underdetermination.
In Bayesian statistics, the update q(B) of the probability p(B) of an event B equals

q(B) = p(B|A) p^emp(A) + p(B|A^c) p^emp(A^c). (1)

The quantities p^emp(A) and p^emp(A^c) are the empirical probabilities obtained after repeated measurement of the event A and its complement A^c. Expression (1) has been called Jeffrey conditioning [6]. It implies the sufficiency conditions q(B|A) = p(B|A) and q(B|A^c) = p(B|A^c). It is an updating rule used in Radical Probabilism [7]. This expression is also obtained when minimizing the Hellinger distance between the prior and the model manifold. A proof of the latter follows later on in Section 4.
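Expression (1) is easy to check numerically. The following Python sketch uses a made-up joint distribution of two binary events; none of the numbers are taken from the paper.

```python
# A minimal numeric sketch of Jeffrey conditioning; the joint distribution
# below is an invented example.
p = {("A", "B"): 0.20, ("A", "Bc"): 0.30,
     ("Ac", "B"): 0.10, ("Ac", "Bc"): 0.40}

p_A = p[("A", "B")] + p[("A", "Bc")]          # prior p(A) = 0.5
p_B_given_A = p[("A", "B")] / p_A             # p(B|A) = 0.4
p_B_given_Ac = p[("Ac", "B")] / (1.0 - p_A)   # p(B|A^c) = 0.2

p_emp_A = 0.7   # newly observed empirical probability of the event A

# Expression (1): q(B) = p(B|A) p^emp(A) + p(B|A^c) p^emp(A^c)
q_B = p_B_given_A * p_emp_A + p_B_given_Ac * (1.0 - p_emp_A)
```

The update keeps the conditional probabilities of B fixed and only replaces the weights of A and A^c by their empirical values.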
The present approach is a special case of minimizing a divergence function in the presence of linear constraints. See the introduction of [8] for an overview of early applications of this technique. Two classes of generalized distance functions satisfy a natural set of axioms: the f-divergences of Csiszár and the generalized Bregman divergences. The squared Hellinger distance belongs to the former class. The other divergence function considered here is the square Bregman divergence. Both Hellinger and square Bregman have special properties that make it easy to work with them.
A broad class of generalized Bregman divergences satisfies the Pythagorean equality [8,9]. Pythagorean inequalities hold for an even larger class [10]. The Pythagorean relations derived in the present work make use of the specific properties of the Hellinger distance and of the quadratic Bregman divergence. It is unclear how to prove them for more general divergences.
One incentive for starting the present work is a paper of Banerjee, Guo, and Wang [11,12]. They consider the problem of predicting a random variable Z 1 given observations of a random variable Z 2 . It is well-known that the conditional expectation, as defined by Kolmogorov, is the optimal predictor with respect to the mean squared error. They show that this statement remains true when the squared error is replaced by a Bregman divergence. It is shown in Theorem 2 below that a proof in a more general context yields a deviating result.
The next Section fixes notations. Section 3 collects some results about the squared Hellinger distance and the quadratic Bregman divergence. Section 4 discusses the optimal choice and contains the Theorems 1 and 2. The proof of the theorems can be adapted to cover the situation that a subsequent measurement also yields information on conditional probabilities. This is shown in Section 4.3. Section 5 treats a simple example. A final section summarizes the results of the paper.

Empirical Data
Consider a probability space Ω, µ. A measurable subset A of Ω is called an event. Its probability is denoted p(A) and is given by

p(A) = ∫_Ω I_A(x) dµ(x),

where I_A(x) equals 1 when x ∈ A and 0 otherwise. The conditional expectation of a random variable f given an event A with non-vanishing probability p(A) is given by

E_µ [f|A] = (1/p(A)) ∫_Ω f(x) I_A(x) dµ(x).

The probability space Ω, µ reflects the prior knowledge of the system at hand. When new data become available an update procedure is used to select the posterior probability space. The latter is denoted Ω, ν in what follows. The corresponding probability of an event A is denoted q(A).
The outcome of repeated experiments is the empirical probability distribution of the events, denoted p emp (A). The question at hand is then to establish a criterion for finding the update ν of the probability distribution µ that is as close as possible to µ while reproducing the empirical results.
The event A defines a partition A, A^c of the probability space Ω, µ. As before A^c denotes the complement of A in Ω. In what follows a slightly more general situation is considered in which the event A is replaced by a partition (O_i)_{i=1}^n of the measure space Ω, µ into subsets with non-vanishing probability. The notations p_i and µ_i are used, with

p_i = p(O_i) and dµ_i(x) = (1/p_i) I_{O_i}(x) dµ(x). (2)

Introduce the random variable g defined by g(x) = i when x ∈ O_i. Repeated measurement of the random variable g yields the empirical probabilities

p^emp_i = empirical probability of the event g(x) = i.

They may deviate from the prior probabilities p_i. In some cases one also measures the conditional probabilities p^emp(B|O_i) of some other event B, given that g(x) = i.

A Geometric Approach
In this section two divergences are reviewed, the squared Hellinger distance and the quadratic Bregman divergence.

Squared Hellinger Distance
For simplicity the present section is restricted to the case that the sample space Ω is the real line.
Given two probability measures µ and σ, both absolutely continuous w.r.t. the Lebesgue measure, the squared Hellinger distance is the divergence D²(σ||µ) defined by

D²(σ||µ) = (1/2) ∫ ( √(dσ/dx) − √(dµ/dx) )² dx.

It satisfies 0 ≤ D²(σ||µ) ≤ 1. Let (O_i)_i be a partition of Ω, µ and let g(x) = i when x belongs to O_i, as before. Let p_i and µ_i be defined by (2).

Proposition 1. Let σ_i, with i in {1, . . . , n}, be probability measures such that σ_i has support in O_i, and let σ = Σ_i p^emp_i σ_i. Then

D²(σ||µ) ≥ 1 − Σ_i √(p^emp_i p_i),

with equality if and only if σ_i = µ_i for all i.
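On a finite sample space the integral reduces to a sum over the points. A minimal Python sketch with toy distributions (not taken from the paper):

```python
import math

def hellinger_sq(sigma, mu):
    """Squared Hellinger distance between two discrete distributions,
    given as lists of probabilities over the same finite sample space."""
    return 0.5 * sum((math.sqrt(s) - math.sqrt(m)) ** 2
                     for s, m in zip(sigma, mu))

sigma = [0.4, 0.4, 0.2]
mu = [0.5, 0.3, 0.2]
d = hellinger_sq(sigma, mu)
```

The value d lies between 0 (equal distributions) and 1 (disjoint supports).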
First prove the following two lemmas.

Lemma 1.
Assume that the probability measures ν_i are absolutely continuous w.r.t. the measures µ_i, with Radon-Nikodym derivatives given by dν_i(x) = f_i(x) dµ_i(x), and let ν = Σ_i p^emp_i ν_i. Then one has

D²(ν||µ) = 1 − Σ_i √(p^emp_i p_i) E_{µ_i} √f_i.

Proof. One calculates
Now take σ i = ν i to obtain the desired results.

Lemma 2. (Pythagorean relation) For any i one has
Proof. The proof follows by taking ν i = µ i in the previous lemma.

Bregman Divergence
In the present section the squared Hellinger distance, which is an f-divergence, is replaced by a divergence of the Bregman type. In addition, let Ω be a finite set equipped with the counting measure ρ. It assigns to each subset A of Ω the number of elements in A. This number is denoted |A|. The expectation value E_µ f of a random variable f w.r.t. the probability measure µ is given by

E_µ f = Σ_{x ∈ Ω} f(x) µ(x).

Given a partition of Ω into sets O_i one can define conditional probability measures with probability mass function ρ_i given by

ρ_i(x) = I_{O_i}(x) / |O_i|.

Similarly, conditional probability measures with probability mass function µ_i are given by

µ_i(x) = I_{O_i}(x) µ(x) / p_i.

Fix a strictly convex function φ : R → R. The Bregman divergence of the probability measures σ and µ is defined by

D_φ(σ||µ) = Σ_{x ∈ Ω} [ φ(σ(x)) − φ(µ(x)) − (σ(x) − µ(x)) φ′(µ(x)) ].

In the case that φ(x) = x²/2, which is used below, it becomes

D_φ(σ||µ) = (1/2) Σ_{x ∈ Ω} (σ(x) − µ(x))².

For convenience, this case is referred to as the quadratic Bregman divergence.
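A short sketch of both forms of the divergence on a finite set; the toy distributions are invented, and the two functions should agree for φ(x) = x²/2.

```python
def bregman(sigma, mu, phi, dphi):
    """General Bregman divergence D_phi(sigma||mu) on a finite set."""
    return sum(phi(s) - phi(m) - (s - m) * dphi(m)
               for s, m in zip(sigma, mu))

def bregman_quadratic(sigma, mu):
    """phi(x) = x^2/2: half the squared Euclidean distance."""
    return 0.5 * sum((s - m) ** 2 for s, m in zip(sigma, mu))

sigma = [0.4, 0.4, 0.2]
mu = [0.5, 0.3, 0.2]
d1 = bregman(sigma, mu, lambda x: x * x / 2, lambda x: x)
d2 = bregman_quadratic(sigma, mu)
```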
The following result, obtained with the quadratic Bregman divergence, is more elegant than the result of Lemma 2.

Lemma 3. (Pythagorean relation) Let σ_i be any probability measure with support in O_i and let ν_i be defined by p^emp_i ν_i(x) = p_i µ_i(x) + (p^emp_i − p_i) ρ_i(x). Then the following Pythagorean relation holds:

D_φ(p^emp_i σ_i || p_i µ_i) = D_φ(p^emp_i σ_i || p^emp_i ν_i) + D_φ(p^emp_i ν_i || p_i µ_i).

Proof. Expand the three divergences, use that φ′(u) = u, and use the normalization of the probability measures ν_i and σ_i to see that the cross terms cancel. This yields the desired result.

Updated Probabilities
The following result proves that the standard Kolmogorovian definition of the conditional probability minimizes the Hellinger distance between the prior probability measure µ and the updated probability measure ν.

Theorem 1. The optimal choice of the updated probability measure ν is given by

ν = Σ_i p^emp_i µ_i,

with corresponding probabilities q(B). They satisfy

q(B) = Σ_i p^emp_i p(B|O_i).

Note that the probability measure ν given by this expression uses the Kolmogorovian conditional probability as the predictor, because the probabilities determined by the µ_i are obtained from the prior probability distribution µ by µ_i(x) = p(x|O_i).
By the above theorem this predictor is the optimal one w.r.t. the squared Hellinger distance.
Proof. With the notations of the previous section, any measure that reproduces the empirical data is of the form σ = Σ_i p^emp_i σ_i with σ_i supported in O_i. Proposition 1 shows that the divergence D²(σ||µ) is minimal if and only if σ_i = µ_i for all i.
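A minimal numeric sketch of the Theorem 1 update on a four-point sample space; all numbers are toy values invented for illustration.

```python
# Hellinger-optimal update: nu(x) = (p_emp_i / p_i) mu(x) for x in O_i,
# i.e. rescale each partition cell to carry its empirical probability.
mu = [0.1, 0.2, 0.3, 0.4]            # prior on Omega = {0, 1, 2, 3}
partition = [[0, 1], [2, 3]]         # the sets O_1 and O_2
p = [mu[0] + mu[1], mu[2] + mu[3]]   # prior probabilities p_i = 0.3, 0.7
p_emp = [0.5, 0.5]                   # newly observed probabilities

nu = list(mu)
for i, O in enumerate(partition):
    for x in O:
        nu[x] = p_emp[i] / p[i] * mu[x]
```

The update reproduces the empirical probabilities of the O_i while keeping the prior conditional probabilities inside each cell.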
Next, consider the use of the quadratic Bregman divergence in the context of a finite probability space.

Theorem 2. Assume that the empirical probabilities p^emp_i are strictly positive and that

µ(x) + (p^emp_i − p_i)/|O_i| ≥ 0 for all x ∈ O_i. (6)

Then the following hold.
(a) A probability distribution ν is defined by ν = Σ_i p^emp_i ν_i with

p^emp_i ν_i(x) = p_i µ_i(x) + (p^emp_i − p_i) ρ_i(x).

(b) Let σ be any probability measure on Ω satisfying σ = Σ_{i=1}^n p^emp_i σ_i, where each of the σ_i is a probability distribution with support in O_i. Then the quadratic Bregman divergence satisfies the Pythagorean relation

D_φ(σ||µ) = D_φ(σ||ν) + D_φ(ν||µ).

(c) The measure ν is the unique minimizer of D_φ(σ||µ) among the measures σ of item (b).

Proof.
(a) The assumption (6) guarantees that the ν_i(x) are probabilities.
(b) Expand D_φ(σ||µ) by writing σ(x) − µ(x) = (σ(x) − ν(x)) + (ν(x) − µ(x)). The cross term vanishes: the difference ν(x) − µ(x) is constant on each O_i by the definition of ν_i, while the differences σ(x) − ν(x) sum to zero within each O_i. This gives the Pythagorean relation.
(c) The empirical probabilities are strictly positive by assumption. Hence, for a minimizer σ it follows that D_φ(µ||σ_i) = D_φ(µ||ν_i) for all i and hence, that σ_i = ν_i for all i. The latter implies σ = ν.
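The Pythagorean relation of item (b) can be verified numerically. The following Python sketch uses invented toy values on a four-point sample space.

```python
# Numerical check of the Pythagorean relation for the quadratic Bregman
# divergence; all numbers are toy values.
mu = [0.1, 0.2, 0.3, 0.4]
partition = [[0, 1], [2, 3]]
p = [0.3, 0.7]
p_emp = [0.5, 0.5]

def d_quad(a, b):
    return 0.5 * sum((x - y) ** 2 for x, y in zip(a, b))

# Update of Theorem 2: nu(x) = mu(x) + (p_emp_i - p_i)/|O_i| on O_i.
nu = list(mu)
for i, O in enumerate(partition):
    for x in O:
        nu[x] += (p_emp[i] - p[i]) / len(O)

# A competitor sigma that also reproduces the empirical probabilities:
sigma = [0.2, 0.3, 0.25, 0.25]

lhs = d_quad(sigma, mu)
rhs = d_quad(sigma, nu) + d_quad(nu, mu)
```

Since the two sides agree and d_quad(sigma, nu) ≥ 0, the measure nu indeed achieves the smallest divergence from mu among the admissible sigma.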
The optimal update ν can be written as

ν(x) = µ(x) + (p^emp_i − p_i)/|O_i| for x ∈ O_i. (7)

This result is in general quite different from the update proposed by Theorem 1, which is

ν(x) = (p^emp_i / p_i) µ(x) for x ∈ O_i.

The updates proposed by the two theorems coincide only in the special cases that either p^emp_i = p_i for all i or that µ_i = ρ_i for all i. In the latter case the prior distribution µ = Σ_i p_i ρ_i is replaced by the update ν = Σ_i p^emp_i ρ_i.

The entropy of the update when the event O_i is observed, according to Theorem 1, equals S(ν_i) = S(µ_i). According to Theorem 2 it equals

S(ν_i) = S( (p_i/p^emp_i) µ_i + (1 − p_i/p^emp_i) ρ_i ).

If p_i ≤ p^emp_i then it follows that

S(ν_i) ≥ (p_i/p^emp_i) S(µ_i) + (1 − p_i/p^emp_i) S(ρ_i) ≥ S(µ_i).

The former inequality follows because the entropy is a concave function. The latter follows because the entropy is maximal for the uniform distribution ρ_i. On the other hand, if p_i > p^emp_i then one has S(ν_i) ≤ S(µ_i). In the latter case the decrease of the entropy is stronger than in the case of the update based on the squared Hellinger distance.

In conclusion, the update relying on the quadratic Bregman divergence loses details of the prior distribution by making a convex combination with a uniform distribution, weighted with the probabilities of the observation. It does this more strongly for the events with observed probability larger than predicted; this is when p^emp_i > p_i. Note that Theorem 2 cannot always be applied because it contains restrictions on the empirical probabilities. In particular, if the prior probability µ(x) of some point x in Ω vanishes then the condition (6) requires that the empirical probability p^emp_i of the partition set O_i to which the point x belongs is larger than or equal to the prior probability p_i.
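The entropy comparison can be illustrated numerically. The sketch below uses invented toy values for a single partition cell with p_i ≤ p^emp_i, where the quadratic Bregman update mixes the conditional prior with the uniform distribution.

```python
import math

# Entropy comparison of the two updates inside one cell O_i with |O_i| = 3;
# toy values chosen such that p_i <= p_emp_i.
mu_i = [0.7, 0.2, 0.1]      # conditional prior on O_i
p_i, p_emp_i = 0.2, 0.4     # prior and empirical probability of O_i

# Theorem 1 keeps mu_i unchanged; Theorem 2 mixes mu_i with the uniform
# distribution rho_i, with weight t = p_i / p_emp_i on mu_i.
t = p_i / p_emp_i
rho_i = [1.0 / 3.0] * 3
nu_i = [t * m + (1.0 - t) * r for m, r in zip(mu_i, rho_i)]

def entropy(dist):
    return -sum(x * math.log(x) for x in dist if x > 0.0)
```

For these values the Bregman update has a larger cell entropy than the Hellinger update, which keeps S(µ_i) unchanged.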

Update of Conditional Probabilities
The two previous theorems assume that no empirical information about conditional probabilities is available. If such information is present then an optimal choice should make use of it. In one case the solution of the problem is straightforward: if the probabilities p^emp_i are available together with all conditional probabilities p^emp(B|O_i), and there exists an update ν which reproduces these results, then it is unique. Two cases remain: (1) the information about the conditional probabilities is incomplete; (2) the information is internally inconsistent, so that no update exists which reproduces the data.
Let us tackle the problem by considering the case that the only information that is available besides the probabilities p emp i is the vector of conditional probabilities p emp (B|O i ) of a fixed event B, given the outcome of the measurement of the random variable g as introduced in Section 2.
The following result is independent of the choice of divergence function.

Proposition 3.
Fix an event B in Ω. Assume that the conditional probabilities p(B|O_i), i = 1, . . . , n, are strictly positive and strictly less than 1. Assume in addition that p^emp_i p^emp(B|O_i) ≤ 1 for all i. Then there exists an update ν with corresponding probabilities q(·) such that q(O_i) = p^emp_i and q(B|O_i) = p^emp(B|O_i), i = 1, . . . , n.
Proof. An obvious choice is to take ν of the form ν = Σ_i p^emp_i ν_i with ν_i of the form

dν_i(x) = ( a_i I_{B ∩ O_i}(x) + b_i I_{B^c ∩ O_i}(x) ) dµ(x),

with a_i ≥ 0 and b_i ≥ 0. Normalization of the ν_i gives the conditions

a_i p(B ∩ O_i) + b_i p(B^c ∩ O_i) = 1. (9)

Reproduction of the conditional probabilities gives the conditions

a_i p(B ∩ O_i) = p^emp(B|O_i).

The latter gives a_i = p^emp(B|O_i)/p(B ∩ O_i). The normalization condition (9) becomes

b_i p(B^c ∩ O_i) = 1 − p^emp(B|O_i).

It has a non-negative solution for b_i because p^emp(B|O_i) ≤ 1 and p(B^c ∩ O_i) > 0.
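The construction of the weights a_i and b_i can be checked numerically on a single set O_i. The sketch below uses invented toy values: one weight reproduces the conditional probability of B, the other restores normalization.

```python
# Numeric sketch for one set O_i; all numbers are toy values.
p_B_and_O = 0.12        # prior p(B ∩ O_i)
p_Bc_and_O = 0.18       # prior p(B^c ∩ O_i), strictly positive
p_emp_B_given_O = 0.5   # observed conditional probability p^emp(B|O_i)

a = p_emp_B_given_O / p_B_and_O          # reproduces the conditional
b = (1.0 - p_emp_B_given_O) / p_Bc_and_O # restores normalization

nu_B = a * p_B_and_O      # mass that nu_i assigns to B
nu_Bc = b * p_Bc_and_O    # mass that nu_i assigns to B^c
```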

The Hellinger Case
The optimal update can be derived easily from Theorem 1. Double the partition by introduction of the sets

O_i^+ = B ∩ O_i and O_i^- = B^c ∩ O_i.

They have prior probabilities

p(O_i^+) = p_i p(B|O_i) and p(O_i^-) = p_i p(B^c|O_i).

The empirical probability of the set O_i^+ is taken equal to p^emp_i p^emp(B|O_i), that of O_i^- equal to p^emp_i (1 − p^emp(B|O_i)). The optimal update ν follows from Theorem 1 and is given by

ν = Σ_i [ p^emp_i p^emp(B|O_i) µ_i^+ + p^emp_i (1 − p^emp(B|O_i)) µ_i^- ],

where µ_i^± denote the prior measure µ conditioned on O_i^±. By construction it is q(O_i^+) = p^emp_i p^emp(B|O_i) and q(O_i^-) = p^emp_i (1 − p^emp(B|O_i)). One now verifies that q(O_i) = p^emp_i and q(B|O_i) = p^emp(B|O_i), which is the intended result.
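A minimal numeric sketch of the doubled-partition construction, tracking only the total mass of each refined cell (inside a refined cell the prior conditional distribution is kept, as in Theorem 1). All numbers are toy values.

```python
# Empirical data (toy values).
p_emp = {"O1": 0.40, "O2": 0.60}           # observed q(O_i)
p_emp_B_given = {"O1": 0.50, "O2": 0.25}   # observed q(B|O_i)

# Masses assigned to the refined cells O_i^+ = B ∩ O_i and O_i^- = B^c ∩ O_i.
nu_mass = {}
for i in ("O1", "O2"):
    nu_mass[(i, "+")] = p_emp[i] * p_emp_B_given[i]
    nu_mass[(i, "-")] = p_emp[i] * (1.0 - p_emp_B_given[i])

q_O1 = nu_mass[("O1", "+")] + nu_mass[("O1", "-")]
q_B_given_O1 = nu_mass[("O1", "+")] / q_O1
```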

Example
Assume that the prior probability distribution is binomial with parameters n, λ, where n is known with certainty. The probability mass function is given by

µ(k) = (n choose k) λ^k (1 − λ)^{n−k}, k = 0, 1, . . . , n.

The probability distribution and the value of the parameter λ are for instance the result of theoretical modeling of the experiment, or they are obtained from a different kind of experiment.
The experiment under consideration yields accurate values for the probabilities p^emp_1 and p^emp_2 of the two events X = 1 and X = 2. The problem at hand is to predict by extrapolation the probability of the event X = k for other values of k. A fit of the data with a binomial distribution is likely to fail because two accurate data points are available to determine a single parameter λ. The binomial model can be misspecified.
The geometric approach followed in the present paper yields an update from the binomial distribution to another distribution, one which reproduces the data. The update is conducted in an unbiased manner. Quite often one is tempted to replace the model, in this case the binomial model, by a model with one extra free parameter.
Let us see what the results of minimizing the divergence functions are. The probability space Ω is the set of integers 0, 1, 2, . . . , n equipped with the counting measure. Choose the events

O_1 = {1}, O_2 = {2}, O_3 = Ω \ (O_1 ∪ O_2).

This gives for p_i := Prob (X ∈ O_i)

p_1 = n λ (1 − λ)^{n−1}, p_2 = (1/2) n (n − 1) λ² (1 − λ)^{n−2}, p_3 = 1 − p_1 − p_2.

The optimal update according to Theorem 1, minimizing the Hellinger distance, is given by the probabilities q(O_i) = p^emp_i, with p^emp_3 = 1 − p^emp_1 − p^emp_2. In particular, the probability mass function ν(k) := ν({k}) becomes

ν(1) = p^emp_1, ν(2) = p^emp_2, and ν(k) = (p^emp_3 / p_3) µ(k) for all other k.

The optimal update according to Theorem 2, minimizing the quadratic Bregman divergence, is given by (7). The auxiliary measures µ_i, ρ_i, and ν_i are supported on the sets O_i; in particular ρ_3 is the uniform distribution on the n − 1 points of O_3. The probability mass function ν(k) := ν({k}) becomes

ν(1) = p^emp_1, ν(2) = p^emp_2, and ν(k) = µ(k) + (p^emp_3 − p_3)/(n − 1) for all other k.

The condition (6) is the requirement that all ν(k) are non-negative. Because the probabilities µ(k) can become very small, this essentially means that p^emp_3 should be larger than p_3. The amount of probability missing in the empirical probabilities p^emp_1 and p^emp_2 is equally distributed over the remaining n − 1 points of Ω. On the other hand, when minimizing the Hellinger distance the excess or shortage of probability is compensated by multiplying all remaining probabilities by a constant factor.
A numerical comparison with n = 20 and λ = 1/8 is found in Figure 1.
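The two updates of this example can be compared numerically. The Python sketch below uses n = 20 and λ = 1/8 as in Figure 1; the empirical values p^emp_1 and p^emp_2 are hypothetical, chosen such that p^emp_3 > p_3 so that condition (6) holds.

```python
from math import comb

# Binomial prior with n = 20 and lambda = 1/8.
n, lam = 20, 1.0 / 8.0
mu = [comb(n, k) * lam**k * (1.0 - lam)**(n - k) for k in range(n + 1)]

p1, p2 = mu[1], mu[2]
p3 = 1.0 - p1 - p2
p1_emp, p2_emp = 0.15, 0.20   # hypothetical accurate observations
p3_emp = 1.0 - p1_emp - p2_emp

# Hellinger update: rescale the remaining probabilities by a common factor.
nu_h = [p1_emp if k == 1 else p2_emp if k == 2 else mu[k] * p3_emp / p3
        for k in range(n + 1)]

# Quadratic Bregman update: distribute the missing probability equally
# over the remaining n - 1 points.
nu_b = [p1_emp if k == 1 else p2_emp if k == 2
        else mu[k] + (p3_emp - p3) / (n - 1)
        for k in range(n + 1)]
```

The Hellinger update preserves the shape of the binomial tail, while the Bregman update lifts it by a constant, which flattens the small prior probabilities.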

Summary
It is well known that the use of unmodified prior conditional probabilities is the optimal way for updating a probability distribution after new data become available. The update procedure minimizes the Hellinger distance between prior and posterior probability distributions. For the sake of completeness a proof is given in Theorem 1.
Alternatively, one can minimize the quadratic Bregman divergence instead of the Hellinger distance. The result is given in Theorem 2. The conservation of probability is handled in a different way in the two cases, either by multiplying prior probabilities with a suitable factor or by adding an appropriate term.
The example of Section 5 shows that the two update procedures have different effects and that neither of them may be satisfactory. This raises the question whether the present approach should be improved by choosing divergences other than Hellinger or Bregman.
In the present research, the work of Banerjee, Guo, and Wang [11] was considered as well. They prove that minimization of the mean squared distance can be replaced by minimization of a Bregman divergence without modifying the outcome: the conditional expectation remains the optimal predictor. It is shown in Theorem 2 that, in a different context, the use of the Bregman divergence yields results quite distinct from those obtained by minimizing the Hellinger distance.
Funding: This research received no external funding.