The Information Geometry of Bregman Divergences and Some Applications in Multi-Expert Reasoning

Adamčík, Martin

doi:10.3390/e16126338


by

Martin Adamčík

Martin de Tours School of Management and Economics, Assumption University, MSME Building, 4th Floor, 88 Moo 8 Bang Na-Trad Km. 26 Bangsaothong, 10540 Samuthprakarn, Thailand

Entropy 2014, 16(12), 6338-6381; https://doi.org/10.3390/e16126338

Submission received: 19 October 2014 / Revised: 24 November 2014 / Accepted: 25 November 2014 / Published: 1 December 2014

(This article belongs to the Special Issue Maximum Entropy Applied to Inductive Logic and Reasoning)


Abstract


The aim of this paper is to develop a comprehensive study of the geometry involved in combining Bregman divergences with pooling operators over closed convex sets in a discrete probabilistic space. A particular connection we develop leads to an iterative procedure, which is similar to the alternating projection procedure by Csiszár and Tusnády. Although such iterative procedures are well studied over much more general spaces than the one we consider, only a few authors have investigated combining projections with pooling operators. We aspire to achieve here a comprehensive study of such a combination. Moreover, since pooling operators combine the opinions of several rational experts, we are able to discuss possible applications in multi-expert reasoning.

Keywords:

Bregman divergence; information geometry; pooling operator; discrete probability function; probabilistic merging; multi-expert reasoning


1. Background

1.1. Introduction

Information geometry has been studied as a powerful tool for tackling various problems. It has been applied in neuroscience [1], expert systems [2], logistic regression [3], clustering [4] and probabilistic merging [5]. In this paper, we aim to present a comprehensive study of information geometry over a discrete probabilistic space in order to provide some specialized tools for researchers working in the area of multi-expert reasoning.

In the context of this paper, the domain of information geometry is the Euclidean space ℝ^J, for some fixed natural number J ≥ 2, where we measure a divergence from one point to another. A divergence is a notion of distance that is, in general, asymmetric, and we will represent it here by an arrow. A divergence can represent a cost function subject to various constraints, so many engineering problems correspond to the minimization of a divergence.

For example, in the areas of neuroscience and expert systems, given evidence v and a training set of known instances W, we may search for an instance w ∈ W, which is “closest” to the evidence v, so as to represent it in the given training set W. An illustration is depicted in Figure 1.

A similar pattern of minimization appears also in the areas of clustering and regression. The aim of the former is to categorize several points into a given number of nodes in such a way that the sum of divergences from each point to its associated node is minimal. The aim of regression is to predict an unknown distribution of events based on the previously obtained statistical data by defining a function whose values minimize a sum of divergences to the data.

While several domains for divergences are considered in the literature, in the current presentation of information geometry we will confine ourselves to the domain of positive discrete probability functions ⅅ^J, where ⅅ^J is the set of all w ∈ ℝ^J restricted by

\sum_{j = 1}^{J} w_{j} = 1

and w₁ > 0, …, w_J > 0. In our presentation, J ≥ 2 will be always fixed, but otherwise arbitrary.

Although in information geometry it does not make sense to talk about beliefs, applications in multi-expert reasoning are often developed from that perspective. It is then argued that rational beliefs should obey the laws of probability; the Dutch book argument by Ramsey and de Finetti [6] is perhaps the most compelling argument for this. It is therefore of particular interest to develop information geometry over a probabilistic space if we wish to eventually apply it to multi-expert reasoning.

In addition to our restriction to discrete probability functions, we will confine ourselves to a special type of divergence, called a Bregman divergence [7], which has recently attracted attention in machine learning and plays a major role in optimization; cf. [3]. A Bregman divergence over a discrete probabilistic space is defined by a given strictly convex function f: (0, 1)^J → ℝ, which is differentiable over ⅅ^J. For any v, w ∈ ⅅ^J, the Bregman divergence generated by the function f is given by:

D_{f} (w ‖ v) = f (w) - f (v) - (w - v) \cdot \nabla f (v),

where

\nabla f (v)

is the gradient of f and · denotes the inner (dot) product of two vectors, i.e.,

(w - v) \cdot \nabla f (v) = \sum_{j = 1}^{J} (w_{j} - v_{j}) \frac{\partial f (v)}{\partial v_{j}} .

We say that D_f(w‖v) is a Bregman divergence from v ∈ ⅅ^J to w ∈ ⅅ^J. Figure 2 depicts a geometrical interpretation of a Bregman divergence.

By the first-order convexity condition applied to the (strictly convex and differentiable) function f (see, e.g., [8]), D_f(w‖v) ≥ 0 with equality holding only if w = v. This is the condition that makes D_f(·‖·) a divergence as defined in information geometry. Note that, since a differentiable convex function is necessarily continuously differentiable (see [9]), D_f(w‖v) is a continuous function. However, this is not sufficient to establish the differentiability of D_f.

It is worth mentioning that the restriction w₁ > 0,…, w_J > 0 for a probability function w that we have adopted here is important for the definition of a Bregman divergence. Some Bregman divergences do not have their generating function f differentiable over the whole space of probability functions. However, it is possible to define the notion of a Bregman divergence even if this condition is left out, but at the cost of some restrictions on f. We kindly refer the interested reader to [10] for further details. Nonetheless, the setting developed in [10] uses a rather complicated notation, which could prove to be impenetrable at first glance if it were adopted in the current paper.

In this paper, we study mainly Bregman divergences D_f(·‖·), which are convex, i.e., for all λ ∈ [0, 1] and all w⁽¹⁾, w⁽²⁾, v⁽¹⁾, v⁽²⁾ ∈ ⅅ^J:

λ D_{f} (w^{(1)} ‖ v^{(1)}) + (1 - λ) D_{f} (w^{(2)} ‖ v^{(2)}) \geq D_{f} (λ w^{(1)} + (1 - λ) w^{(2)} ‖ λ v^{(1)} + (1 - λ) v^{(2)}) .

Note that if D(·‖·) is a convex function, then D(·‖·) is a convex function also in each argument separately.

The following are examples of a convex Bregman divergence.

Example 1 (Squared Euclidean Distance). For any J ≥ 2, let

f (x) = \sum_{j = 1}^{J} {(x_{j})}^{2}

. Then, the divergence:

D_{f} (w ‖ v) = \sum_{j = 1}^{J} {(w_{j} - v_{j})}^{2}

will be denoted by E2, and exceptionally, this divergence is symmetric.

Example 2 (Kullback–Leibler Divergence). For any J ≥ 2, let

f (x) = \sum_{j = 1}^{J} x_{j} \log x_{j}

, where log denotes the natural logarithm. (Note that in the information theory literature, this logarithm is often taken with base two. However, this does not affect the results of this paper in any way.) The well-known divergence:

D_{f} (w ‖ v) = \sum_{j = 1}^{J} w_{j} \log \frac{w_{j}}{v_{j}}

will be denoted by KL.

The convexity of the KL-divergence is easy to observe and is well known; see, e.g., [10].
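The definition of D_f and the two examples above can be checked numerically. The following is a minimal sketch, not part of the original paper: the helper names (`bregman`, `sq`, `negent` and their gradients) and the test vectors are illustrative assumptions. It evaluates the general formula f(w) − f(v) − (w − v)·∇f(v) and compares it with the closed forms of E2 and KL.

```python
import math

def bregman(f, grad_f, w, v):
    """D_f(w || v) = f(w) - f(v) - (w - v) . grad f(v)."""
    inner = sum((wj - vj) * gj for wj, vj, gj in zip(w, v, grad_f(v)))
    return f(w) - f(v) - inner

# Generator of the squared Euclidean distance E2 (Example 1).
def sq(x):
    return sum(xj ** 2 for xj in x)

def sq_grad(x):
    return [2 * xj for xj in x]

# Generator of the Kullback-Leibler divergence KL (Example 2).
def negent(x):
    return sum(xj * math.log(xj) for xj in x)

def negent_grad(x):
    return [math.log(xj) + 1 for xj in x]

# Two arbitrary positive probability vectors (illustrative values only).
w = [0.2, 0.3, 0.5]
v = [0.5, 0.25, 0.25]

e2_general = bregman(sq, sq_grad, w, v)
kl_general = bregman(negent, negent_grad, w, v)

# Closed forms stated in Examples 1 and 2.
e2_closed = sum((wj - vj) ** 2 for wj, vj in zip(w, v))
kl_closed = sum(wj * math.log(wj / vj) for wj, vj in zip(w, v))

# KL is asymmetric: swapping the arguments changes the value.
kl_swapped = bregman(negent, negent_grad, v, w)
```

For probability vectors, the term Σ(w_j − v_j) produced by the "+1" in the gradient of Σ x_j log x_j vanishes, which is why the general formula collapses to the familiar expression for KL.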

1.2. Projections

For given v ∈ ⅅ^J, a Bregman divergence D_f(w‖v) is a strictly convex function in the first argument. This can be easily seen by considering

D_{f} (w ‖ v) = f (w) - f (v) - \sum_{j = 1}^{J} (w_{j} - v_{j}) \frac{\partial f (v)}{\partial v_{j}}

where v is constant. f(v) is therefore constant, as well, and the claim follows, since strict convexity of f is not affected by adding the linear term

- \sum_{j = 1}^{J} (w_{j} - v_{j}) \frac{\partial f (v)}{\partial v_{j}}

.

Owing to the observation above, if v ∈ ⅅ^J is given and W ⊆ ⅅ^J is a closed convex nonempty set, we can define the D_f-projection of v into W. It is that unique point w ∈ W that minimizes D_f(w‖v) subject only to w ∈ W. This property is crucial for the applicability of Bregman divergences. Note, however, that D_f(·‖·) is not necessarily convex in its second argument; for a counterexample, consider the case

f (x) = \sum_{j = 1}^{4} {(x_{j})}^{3}

.

Perhaps the most useful property that a D_f-projection has is the extended Pythagorean property:

Theorem 1 (Extended Pythagorean Property). Let D_f be a Bregman divergence. Let w be the D_f-projection of v ∈ ⅅ^J into a closed convex nonempty set W ⊆ ⅅ^J. Let a ∈ W. Then:

D_{f} (a ‖ w) + D_{f} (w ‖ v) \leq D_{f} (a ‖ v) .

This property, in the case of the Kullback–Leibler divergence, was proven first by Csiszár in [11]. The proof of the generalized theorem above is given in [1,12], where the interested reader can find a comprehensive study of Bregman divergences within the context of differential geometry. We illustrate the theorem in Figure 3.
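Theorem 1 can be illustrated numerically for the KL-divergence. The sketch below rests on assumptions that are not in the paper: the set W is a hypothetical line segment chosen for convenience, and the projection is approximated by a ternary search, which is adequate here only because the restriction of KL(·‖v) to a segment is a strictly convex function of one variable.

```python
import math

def kl(w, v):
    """Kullback-Leibler divergence KL(w || v) for positive probability vectors."""
    return sum(wj * math.log(wj / vj) for wj, vj in zip(w, v))

def segment(x):
    """Hypothetical closed convex set W = {(x, 1/4, 3/4 - x) : 2/5 <= x <= 13/20}."""
    return (x, 0.25, 0.75 - x)

def kl_project(v, lo=0.4, hi=0.65, iters=200):
    """Approximate the KL-projection of v into W by ternary search.

    Relies on x -> KL(segment(x) || v) being strictly convex on [lo, hi].
    """
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if kl(segment(m1), v) < kl(segment(m2), v):
            hi = m2
        else:
            lo = m1
    return segment((lo + hi) / 2)

v = (1 / 3, 1 / 6, 1 / 2)
w = kl_project(v)

# Gap KL(a || v) - KL(a || w) - KL(w || v) for several a in W.
# Theorem 1 states that this gap is non-negative.
gaps = []
for x in (0.4, 0.45, 0.5, 0.6, 0.65):
    a = segment(x)
    gaps.append(kl(a, v) - kl(a, w) - kl(w, v))
```

In this particular configuration the projection lands on the endpoint of the segment, so the inequality is strict for points a other than w itself.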

Notice that the squared Euclidean distance has a special role among all other Bregman divergences. It is symmetric, and it interprets the extended Pythagorean property “classically” as the relation of the sizes of the squares constructed on the sides of a triangle.

It is well-known that the Kullback–Leibler divergence is closely connected to the Shannon entropy defined for any w ∈ ⅅ^J by:

H (w) = - \sum_{j = 1}^{J} w_{j} \log w_{j},

where log denotes the natural logarithm. The importance of the Shannon entropy is that it could be described as a measure of the level of disorder, which in the context of information theory, can be interpreted as a measure of informational content. The higher the entropy of w is, the less information is carried by w. In some contexts, one can then argue that given several seemingly equally probable choices of a probability function, one should choose the one that carries the least additional information [13]. Given a closed convex nonempty set W, the most entropic point in W will be denoted by ME(W).

Now, trying to find the most entropic point in a closed convex nonempty set W ⊆ ⅅ^J is, in fact, equivalent to finding a special KL-projection (the KL-projection of the uniform probability function

\underbrace{(\frac{1}{J}, \dots, \frac{1}{J})}_{J}

) since:

\arg min_{w \in W} \sum_{j = 1}^{J} w_{j} \log \frac{w_{j}}{\frac{1}{J}} = \arg max_{w \in W} - \sum_{j = 1}^{J} w_{j} \log w_{j} = ME (W),

where arg min_{x∈X} f(x) denotes that unique argument x ∈ X, where f has its global minimum, whenever such a unique point exists. The expression arg max is defined accordingly.
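The equality of the two optimization problems follows from a one-line computation, spelled out here for convenience. It uses only the definition of H and the fact that the coordinates of w sum to one:

```latex
\sum_{j=1}^{J} w_j \log \frac{w_j}{1/J}
  = \sum_{j=1}^{J} w_j \log w_j + \log J \sum_{j=1}^{J} w_j
  = \log J - H(w) .
```

Since log J does not depend on w, minimizing the KL-divergence to the uniform probability function over W is the same as maximizing H over W.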

Given the extensive justification of the Shannon entropy in various frameworks (see, e.g., [14,15]), it is perhaps not surprising that a common method of projecting in probabilistic expert systems is by means of the KL-projection; see [2,16]. In connection to the Shannon entropy, the KL-divergence is often referred to as the cross-entropy, and the projecting is called updating.

The above may perhaps be also an appealing reason to use projections in general to “represent” a given closed convex set of probability functions by a single point, in particular in expert reasoning. Moreover, recent use of projections by a Bregman divergence has become popular in other contexts; see, e.g., [4]. Remarkably, projections by a Bregman divergence also provide a unifying framework for a variety of techniques used in expert systems, such as logistic regression; see [3]. It is therefore of particular interest to investigate the geometry of Bregman divergences.

1.3. Pooling

In this subsection, we introduce probabilistic pooling, which is a method of aggregating several probability functions. Formally, a pooling operator Pool is defined for each n ≥ 1 as a mapping:

Pool : \underbrace{D^{J} \times \dots \times D^{J}}_{n} \to D^{J} .

Recall that J is a fixed natural number greater than or equal to two, which is otherwise arbitrary.

One possibility for choosing a pooling operator is to define one by means of a Bregman divergence. In particular, given a Bregman divergence

D_{f}, w^{(1)}, \dots, w^{(n)} \in D^{J}

and a ∈ ⅅⁿ, we can ask which point v ∈ ⅅ^J has the least sum of Bregman divergences D_f from w⁽¹⁾,…, w⁽ⁿ⁾ weighted by a₁,…, a_n, respectively. It turns out that the resulting probability function is unique, and in each coordinate, it is simply the weighted arithmetic mean of the corresponding coordinates of w⁽¹⁾,…, w⁽ⁿ⁾ ∈ ⅅ^J. In other words:

\arg min_{v \in D^{J}} \sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ v) = (\sum_{i = 1}^{n} a_{i} w_{1}^{(i)}, \dots, \sum_{i = 1}^{n} a_{i} w_{J}^{(i)}) .

(1)

For a given family

A = {a_{n} : a_{n} \in D^{n}, n = 1, 2, \dots}

of weighting vectors, we define the pooling operator

{LinOp}_{A}

by Equation (1) for every a ∈

A

. Instead of the right-hand side of Equation (1), we will simply write LinOp_a(w⁽¹⁾,…, w⁽ⁿ⁾) if a ∈

A

. A special choice for

A

is the family

N = {a_{n} = (\frac{1}{n}, \dots, \frac{1}{n}) : n = 1, 2, \dots}

, and the pooling operator

{LinOp}_{N}

is well known in the literature as the LinOp-pooling operator.

The fact that Equation (1) actually holds can be observed by employing the following theorem, which is folklore in information theory.

Theorem 2 (Parallelogram Theorem). Let D_f be a Bregman divergence, w⁽¹⁾,…, w⁽ⁿ⁾, v ∈ ⅅ^J and a ∈ ⅅⁿ. Then:

\begin{matrix} \sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ v) = \sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ {LinOp}_{a} (w^{(1)}, \dots, w^{(n)})) + \\ + D_{f} ({LinOp}_{a} (w^{(1)}, \dots, w^{(n)}) ‖ v) . \end{matrix}

Proof. Let w = LinOp_a(w⁽¹⁾,…, w⁽ⁿ⁾). The equality is easy to observe by:

\begin{matrix} \sum_{i = 1}^{n} a_{i} [f (w^{(i)}) - f (v) - \sum_{j = 1}^{J} (w_{j}^{(i)} - v_{j}) \frac{\partial f (v)}{\partial v_{j}}] = \\ = \sum_{i = 1}^{n} a_{i} [f (w^{(i)}) - f (w) - (w^{(i)} - w) \cdot \nabla f (w)] + \\ + [f (w) - f (v) - \sum_{j = 1}^{J} (w_{j} - v_{j}) \frac{\partial f (v)}{\partial v_{j}}] \end{matrix}

since

\sum_{i = 1}^{n} a_{i} (w^{(i)} - w) \cdot \nabla f (w) = 0

. □

Since D_f(w‖v) = 0, only if w = v, and otherwise, it is positive, the unique minimum of the left-hand side of Equation (1) is at the point v = LinOp_a(w⁽¹⁾,…, w⁽ⁿ⁾).
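The parallelogram theorem, and with it Equation (1), can be checked numerically. The following sketch is illustrative only: the function names and the sample vectors are assumptions, and the check is carried out for just two Bregman divergences, KL and E2.

```python
import math

def kl(w, v):
    """Kullback-Leibler divergence KL(w || v)."""
    return sum(wj * math.log(wj / vj) for wj, vj in zip(w, v))

def e2(w, v):
    """Squared Euclidean distance E2(w || v)."""
    return sum((wj - vj) ** 2 for wj, vj in zip(w, v))

def lin_op(a, ws):
    """Weighted arithmetic mean of the vectors ws, coordinate by coordinate."""
    return [sum(ai * w[j] for ai, w in zip(a, ws)) for j in range(len(ws[0]))]

def weighted_sum(d, a, ws, v):
    """sum_i a_i d(w_i || v), the objective minimized in Equation (1)."""
    return sum(ai * d(w, v) for ai, w in zip(a, ws))

def parallelogram_gap(d, a, ws, v):
    """Left-hand side minus right-hand side of Theorem 2 for the divergence d."""
    m = lin_op(a, ws)
    return weighted_sum(d, a, ws, v) - (weighted_sum(d, a, ws, m) + d(m, v))

# Illustrative data: three probability functions, weights a, and a point v.
ws = [[0.2, 0.3, 0.5], [0.6, 0.3, 0.1], [0.1, 0.7, 0.2]]
a = [0.5, 0.3, 0.2]
v = [0.4, 0.4, 0.2]

gap_kl = parallelogram_gap(kl, a, ws, v)
gap_e2 = parallelogram_gap(e2, a, ws, v)
pooled = lin_op(a, ws)
```

Both gaps vanish up to rounding, and the weighted sum of divergences evaluated at `pooled` is no larger than at any other point v, in line with the argument above.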

The situation above can be naturally interpreted in terms of random variables. Assume that X is a random variable taking values in {w⁽¹⁾,…, w⁽ⁿ⁾} ⊆ ⅅ^J with the probability distribution a ∈ ⅅⁿ, and we are given the problem of finding a random variable Y, such that the expected value:

E (D_{f} (X ‖ Y))

is minimal. The unique answer to this question is then

Y = E (X) = \sum_{i = 1}^{n} a_{i} w^{(i)}

. This underlines the reason why the

{LinOp}_{A}

-pooling operator is so popular in the decision theory literature, where several experts, each with his own probability function w⁽ⁱ⁾ representing his beliefs, seek to find a single probability function to represent their joint beliefs. The

{LinOp}_{A}

-pooling operator simply yields the expected value, as if the experts’ beliefs were statistically obtained.

It is certainly interesting that the result above holds for any Bregman divergence, but as is shown in [17], Theorem 4, it is even more remarkable that Bregman divergences are the only divergences with such a property. However, we note that in order to establish this claim, a slightly more general setting was considered and that we have restricted the formulation of the original theorem to the only domain considered here (0, 1)^J:

Theorem 3 (Banerjee, Guo, Wang). Let F: (0, 1)^J × (0, 1)^J → ℝ be a divergence. Assume that

F (x ‖ y), \frac{\partial^{2} F (x ‖ y)}{\partial x_{i} \partial x_{j}}, 1 \leq i, j \leq J

are all continuous. Let (Ω, P,

ℱ

) be an arbitrary probability space, and let

G

be a sub-σ-algebra of

ℱ

. For all random variables X taking values in (0, 1)^J, if:

\arg min_{Y \in G} F (X ‖ Y) = E (X ‖ G)

then F (x‖y) = D_f(x‖y) for some strictly convex and differentiable function f : (0, 1)^J → ℝ.

While in the statistical sense, the

{LinOp}_{A}

-pooling operator, where

A

is a family of weighting vectors, seems to be well placed, in the fields of multi-expert reasoning and probabilistic merging, the so-called

{LogOp}_{A}

-pooling operator often appeals more. For every n ≥ 1 and every a ∈

A

, it is defined by:

{LogOp}_{a} (w^{(1)}, \dots, w^{(n)}) = (\frac{\prod_{i = 1}^{n} {(w_{1}^{(i)})}^{a_{i}}}{\sum_{j = 1}^{J} \prod_{i = 1}^{n} {(w_{j}^{(i)})}^{a_{i}}}, \dots, \frac{\prod_{i = 1}^{n} {(w_{J}^{(i)})}^{a_{i}}}{\sum_{j = 1}^{J} \prod_{i = 1}^{n} {(w_{j}^{(i)})}^{a_{i}}}) .

If w⁽¹⁾,…, w⁽ⁿ⁾ are considered to be the beliefs of n experts, respectively, then the

{LogOp}_{A}

-pooling operator appears to favor agreement over the expected value. For instance, consider the following example from utility theory. Say that Eleanor and George are looking for a film to watch and they have three options, A, B and C. Eleanor hates Movie A and under no circumstances would agree to watch it, while George absolutely loves it. Now, consider that the situation with respect to Film C is swapped: George hates it, while Eleanor would prefer to see it. They both consider Movie B uninteresting, but are willing to see it. The following probability functions could represent the preferences of Eleanor and George towards Movies A, B and C: (0, 0.1, 0.9) and (0.9, 0.1, 0), respectively. Moreover, we value the opinions of both of them equally, i.e.,

A = N

. Now, while the

{LinOp}_{N}

-pooling operator gives the inconclusive (0.45, 0.1, 0.45), by the

{LogOp}_{N}

-pooling operator (in the literature, this operator is simply known as the LogOp-pooling operator), we obtain (0, 1, 0). If we take the advice, then Eleanor and George should see the only film that is acceptable for both of them.
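The two pooled vectors in the film example can be reproduced with a few lines of code. This is a minimal sketch; the function names `lin_op` and `log_op` are illustrative. Note that the example vectors contain zeros, so strictly speaking they lie on the boundary of the domain rather than in ⅅ^J; the formulas still evaluate without difficulty in this case.

```python
import math

def lin_op(a, ws):
    """Weighted arithmetic mean, coordinate by coordinate."""
    return [sum(ai * w[j] for ai, w in zip(a, ws)) for j in range(len(ws[0]))]

def log_op(a, ws):
    """Weighted geometric mean, coordinate by coordinate, renormalized to sum to one."""
    prods = [math.prod(w[j] ** ai for ai, w in zip(a, ws))
             for j in range(len(ws[0]))]
    total = sum(prods)
    return [p / total for p in prods]

# Preferences over Movies A, B and C, as given in the text.
eleanor = (0.0, 0.1, 0.9)
george = (0.9, 0.1, 0.0)
a = (0.5, 0.5)  # both opinions are valued equally

lin = lin_op(a, [eleanor, george])  # approximately (0.45, 0.1, 0.45)
log = log_op(a, [eleanor, george])  # (0, 1, 0)
```

A single zero in any expert's vector forces the corresponding coordinate of the geometric pool to zero, which is the sense in which this operator acts like a veto and favors agreement.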

The example above illustrates why taking products rather than the arithmetic mean is popular when considering utilities. However, recently, the

{LogOp}_{N}

-pooling operator has attracted attention also in multi-expert probabilistic reasoning; a prominent example here is the social entropy process by Wilmers [18]. An intriguing idea that originates in the social entropy process is to swap the direction of the Kullback–Leibler projections and establish the corresponding conjugated KL-projection of w ∈ ⅅ^J into V ⊆ ⅅ^J as arg min_{v∈V} KL(w‖v) (it is easy to check that KL(·‖·) is strictly convex in its second argument) and the conjugated parallelogram theorem [10]:

Theorem 4. Let w⁽¹⁾,…, w⁽ⁿ⁾, v ∈ ⅅ^J and a ∈ ⅅⁿ. Then:

\begin{matrix} \sum_{i = 1}^{n} a_{i} KL (v ‖ w^{(i)}) = \sum_{i = 1}^{n} a_{i} KL ({LogOp}_{a} (w^{(1)}, \dots, w^{(n)}) ‖ w^{(i)}) \\ + KL (v ‖ {LogOp}_{a} (w^{(1)}, \dots, w^{(n)})) . \end{matrix}

Proof. Let w = LogOp_a(w⁽¹⁾,…, w⁽ⁿ⁾). First note that:

\sum_{i = 1}^{n} a_{i} \sum_{j = 1}^{J} v_{j} \log \frac{v_{j}}{w_{j}^{(i)}} = \sum_{j = 1}^{J} v_{j} \log \frac{v_{j}}{\prod_{i = 1}^{n} {(w_{j}^{(i)})}^{a_{i}}} .

Now:

\begin{matrix} \sum_{j = 1}^{J} v_{j} \log \frac{v_{j}}{\prod_{i = 1}^{n} {(w_{j}^{(i)})}^{a_{i}}} = \sum_{j = 1}^{J} v_{j} \log \frac{v_{j}}{w_{j}} - \\ - (\sum_{j = 1}^{J} v_{j}) \log (\sum_{j = 1}^{J} \prod_{i = 1}^{n} {(w_{j}^{(i)})}^{a_{i}}) = \\ = \sum_{j = 1}^{J} v_{j} \log \frac{v_{j}}{w_{j}} + \sum_{i = 1}^{n} a_{i} \sum_{j = 1}^{J} w_{j} \log \frac{w_{j}}{w_{j}^{(i)}}, \end{matrix}

where we have used the fact that

\sum_{j = 1}^{J} v_{j} = 1

. □

As a consequence, for given w⁽¹⁾,…, w⁽ⁿ⁾ ∈ ⅅ^J, we get:

\arg min_{v \in D^{J}} \sum_{i = 1}^{n} a_{i} KL (v ‖ w^{(i)}) = {LogOp}_{a} (w^{(1)}, \dots, w^{(n)}) .

Therefore, the

{LogOp}_{A}

-pooling operator can be naturally interpreted in information geometry. The question now arises as to whether this can be done with some other Bregman divergences. We will investigate this later.
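Theorem 4 and its consequence can likewise be verified on concrete numbers. The sketch below is only an illustration under assumed sample data; the helper names are not taken from the paper.

```python
import math

def kl(w, v):
    """Kullback-Leibler divergence KL(w || v)."""
    return sum(wj * math.log(wj / vj) for wj, vj in zip(w, v))

def log_op(a, ws):
    """Normalized weighted geometric mean of the vectors ws."""
    prods = [math.prod(w[j] ** ai for ai, w in zip(a, ws))
             for j in range(len(ws[0]))]
    total = sum(prods)
    return [p / total for p in prods]

def conjugated_sum(v, a, ws):
    """sum_i a_i KL(v || w_i): the point v sits in the FIRST argument of KL."""
    return sum(ai * kl(v, w) for ai, w in zip(a, ws))

# Illustrative data.
ws = [[0.2, 0.3, 0.5], [0.6, 0.3, 0.1], [0.1, 0.7, 0.2]]
a = [0.5, 0.3, 0.2]
v = [0.4, 0.4, 0.2]

w = log_op(a, ws)
lhs = conjugated_sum(v, a, ws)
rhs = conjugated_sum(w, a, ws) + kl(v, w)
```

The two sides agree up to rounding, and since KL(v‖w) ≥ 0, the pooled point w attains a value of the conjugated sum no larger than any other v.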

The reader perhaps wonders which are the main practical differences in using different pooling operators. The

{LinOp}_{A}

-pooling operator, for example, satisfies the marginalization property; that is, the values on the coordinates of the resulting probability function depend only on the corresponding coordinates of the probability functions that are pooled. The

{LogOp}_{A}

-pooling operator does not have this property. On the other hand, the

{LogOp}_{A}

-pooling operator, unlike the

{LinOp}_{A}

-pooling operator, is externally Bayesian. That is, the order in which we combine pooling and Bayesian updating is irrelevant. See [19] for more details.

We, however, do not seek any conclusive answer as to which pooling operator to use in any particular context. In this paper, we only aim to provide geometric tools that can be used in multi-expert reasoning. For elaborate work on pooling operators, we refer to the literature, e.g., [19] for a survey, [20] for a classical problem of the relationship between pooling and probabilistic independence or [18] for a modern account on

{LinOp}_{N}

and

{LogOp}_{N}

-pooling operators in probabilistic knowledge merging.

2. Projections and Pooling Combined

2.1. Averaging Projective Procedures

While the geometry of projections and the theory of pooling operators have been extensively studied in the literature (see the previous section), much less attention has been devoted to their combination. A detailed study of this problem and a comprehensive analysis of the geometry involved is the main aim of this paper.

The central geometrical notion connecting projections and pooling in this paper is an averaging projective procedure F, which consists of a family of mappings

F_{[W_{1}, \dots, W_{n}]} : D^{J} \to D^{J}

, where sets W₁,…, W_n ⊆ ⅅ^J are closed convex and nonempty. A particular F is given by a family of strictly convex functions d_v, v ∈ ⅅ^J and a pooling operator Pool and is defined by the following two-stage process.

For an argument v ∈ ⅅ^J, put $w^{(i)} = \arg min_{w \in W_{i}} d_{v} (w)$ , $1 \leq i \leq n$ .
Set $F_{[W_{1}, \dots, W_{n}]} (v) = Pool (w^{(1)}, \dots, w^{(n)})$ .

For instance, the function d_v(·) can be D_f(·‖v) for some Bregman divergence D_f and in such a particular case

F_{[W_{1}, \dots, W_{n}]} (v)

first D_f-projects the argument v into each of W₁,…, W_n, and then, it “averages” the resulting probability functions by a pooling operator Pool. Hence, the name: an averaging projective procedure. An illustration of F is depicted in Figure 4.

Note that W₁,…, W_n play dual roles in the definition above, which may perhaps appear clumsy. When they are fixed,

F_{[W_{1}, \dots, W_{n}]}

is a mapping ⅅ^J → ⅅ^J. However, the option to consider them also as variables will be the key to our following investigation and to the applicability of an averaging projective procedure in multi-expert reasoning, where W₁,…, W_n will represent the respective knowledge of n experts. A straightforward interpretation is that the first stage simplifies sets to single probability functions, which then are being merged to a final social belief function of the college of experts.

With regard to previous research, the cases of d_v(·) being KL(·‖v) and KL(v‖·) with Pool taken to be the

{LinOp}_{A}

-pooling operator and the

{LogOp}_{A}

-pooling operator, respectively, were introduced and investigated by Matúš in [21]. The idea of combining the projections by means of the squared Euclidean distance E2 with the

{LinOp}_{A}

-pooling operator was first introduced by Predd et al. in [22].

Example 3. In the definition of an averaging projective procedure, take d_v to be KL(·‖v) and Pool to be the

{LinOp}_{N}

-pooling operator. Now, F is the mapping ⅅ^J → ⅅ^J for every n ≥ 1 and all closed convex nonempty sets W₁,…, W_n ⊆ ⅅ^J given by

F_{[W_{1}, \dots, W_{n}]} (v)

above.

In particular, take

J = 3, n = 2, W_{1} = {(x, \frac{1}{2} - x, \frac{1}{2}), \frac{1}{10} \leq x \leq \frac{2}{5}}, W_{2} = {(x, \frac{1}{4}, \frac{3}{4} - x), \frac{1}{10} \leq x \leq \frac{13}{20}}

and

v = (\frac{1}{3}, \frac{1}{6}, \frac{1}{2})

. Then, the KL-projection of v into W₁ is actually v itself, since v ∈ W₁, and the KL-projection of v into W₂ is

(\frac{3}{10}, \frac{1}{4}, \frac{9}{20})

. Therefore:

F_{[W_{1}, W_{2}]} (v) = {LinOp}_{(\frac{1}{2}, \frac{1}{2})} ((\frac{1}{3}, \frac{1}{6}, \frac{1}{2}), (\frac{3}{10}, \frac{1}{4}, \frac{9}{20})) = (\frac{\frac{1}{3} + \frac{3}{10}}{2}, \frac{\frac{1}{6} + \frac{1}{4}}{2}, \frac{\frac{1}{2} + \frac{9}{20}}{2}) .
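The two stages of Example 3 can be reproduced numerically. The following is a sketch under stated assumptions: because W₁ and W₂ are line segments, each KL-projection is approximated by a one-dimensional ternary search (a convenience of this example, not a general projection method), and all function names are illustrative.

```python
import math

def kl(w, v):
    """Kullback-Leibler divergence KL(w || v)."""
    return sum(wj * math.log(wj / vj) for wj, vj in zip(w, v))

def kl_project(param, lo, hi, v, iters=200):
    """Stage 1: approximate the KL-projection of v into {param(x) : lo <= x <= hi}.

    Ternary search; relies on x -> KL(param(x) || v) being strictly convex.
    """
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if kl(param(m1), v) < kl(param(m2), v):
            hi = m2
        else:
            lo = m1
    return param((lo + hi) / 2)

def lin_op(a, ws):
    """Stage 2: weighted arithmetic mean, coordinate by coordinate."""
    return [sum(ai * w[j] for ai, w in zip(a, ws)) for j in range(len(ws[0]))]

def w1(x):
    return (x, 0.5 - x, 0.5)    # W1, with 1/10 <= x <= 2/5

def w2(x):
    return (x, 0.25, 0.75 - x)  # W2, with 1/10 <= x <= 13/20

v = (1 / 3, 1 / 6, 1 / 2)
p1 = kl_project(w1, 0.1, 0.4, v)   # close to v itself, since v lies in W1
p2 = kl_project(w2, 0.1, 0.65, v)  # close to (3/10, 1/4, 9/20)
result = lin_op((0.5, 0.5), [p1, p2])
```

Simplifying the fractions in the formula above gives (19/60, 5/24, 19/40), which the numerical result matches to within the accuracy of the search.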

2.2. Obdurate Operators

In this section, we approach averaging projective procedures using the framework of probabilistic knowledge merging as defined in [5]. A probabilistic merging operator:

Δ : \underbrace{P (D^{J}) \times \dots \times P (D^{J})}_{n} \to P (D^{J}),

is a mapping that maps a finite collection of closed convex nonempty subsets of ⅅ^J, say W₁,…, W_n, to a single closed convex nonempty subset of ⅅ^J. In the area of multi-expert reasoning, we can perhaps interpret ∆(W₁,…, W_n) as a representation of W₁,…, W_n, which themselves individually represent knowledge bases of n experts.

A merging operator O is obdurate if, for every n ≥ 1 and any W₁,…, W_n ⊆ ⅅ^J, we have that

O (W_{1}, \dots, W_{n}) = {F_{[W_{1}, \dots, W_{n}]} (v)}

, where v is some fixed argument and F is an averaging projective procedure. Note that this operator always produces a singleton. Obdurate processes thus first represent sets as single probability functions, and then, they pool them by a pooling operator.

Although this may sound like a fairly restrictive setting, many existing natural probabilistic merging operators are of this form. The prominent example is the merging operator of Kern-Isberner and Rödder (KIRP) [23]. In this particular case, v is the uniform probability function, d_v(·) is KL(·‖v) and Pool is given by:

Pool (w^{(1)}, \dots, w^{(n)}) = (\sum_{k = 1}^{n} \frac{H (w^{(k)})}{\sum_{i = 1}^{n} H (w^{(i)})} w_{1}^{(k)}, \dots, \sum_{k = 1}^{n} \frac{H (w^{(k)})}{\sum_{i = 1}^{n} H (w^{(i)})} w_{J}^{(k)}) .

Recall that H(w⁽ⁱ⁾) is the Shannon entropy of w⁽ⁱ⁾, which is, in fact, the most entropic point in W_i.
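The pooling stage of KIRP is a linear pool whose weights are proportional to the Shannon entropies of the pooled points. A minimal sketch follows, assuming the most entropic points w⁽¹⁾,…, w⁽ⁿ⁾ have already been computed in the first stage; the names `entropy` and `kirp_pool` and the sample vectors are illustrative and do not come from [23].

```python
import math

def entropy(w):
    """Shannon entropy H(w) with the natural logarithm."""
    return -sum(wj * math.log(wj) for wj in w)

def kirp_pool(ws):
    """Entropy-weighted arithmetic mean of the vectors ws.

    The vector w_k receives the weight H(w_k) / sum_i H(w_i).
    """
    hs = [entropy(w) for w in ws]
    total = sum(hs)
    return [sum(h / total * w[j] for h, w in zip(hs, ws))
            for j in range(len(ws[0]))]

# Equal entropies: the pool reduces to the plain arithmetic mean.
same = kirp_pool([(0.2, 0.3, 0.5), (0.5, 0.3, 0.2)])

# Unequal entropies: the more entropic (here uniform) point gets more weight.
skewed = kirp_pool([(1 / 3, 1 / 3, 1 / 3), (0.8, 0.1, 0.1)])
```

In the second call, the first coordinate of the result lies closer to 1/3 than the unweighted average of 1/3 and 0.8 would, which shows how the less informative point is given more influence.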

In [23], Kern-Isberner and Rödder argue that W₁,…, W_n ⊆ ⅅ^J can be considered as marginal probabilities in a subset U ⊆ ⅅ^J⁺ⁿ, such that every probability function v ∈ U marginalizes to a ⅅ^J-probability function belonging to one and only one set W_i. The point that KIRP produces is then, in fact, the ⅅ^J-marginal of the most entropic point in U; following the justification of the Shannon entropy, they conclude that such a point is a natural interpretation of W₁,…, W_n by a single probability function. KIRP thus maps the uniform probability function to the ⅅ^J-marginal of the most entropic point in U. To date, KIRP has received much attention in the area of probabilistic knowledge merging.

However, any obdurate merging operator seems to be challenged by its violation of the following principle.

(CP) Consistency Principle. Let ∆ be a probabilistic merging operator. Then, we say that ∆ satisfies the consistency principle if, for every n ≥ 1 and all W₁,…, W_n ⊆ ⅅ^J:

\cap_{i = 1}^{n} W_{i} \neq \emptyset implies Δ (W_{1}, \dots, W_{n}) \subseteq \cap_{i = 1}^{n} W_{i} .

(CP) can be interpreted as saying that if the knowledge bases of a set of experts are collectively consistent, then the merged knowledge base should not consist of anything else than what the experts agree on.

This principle often falls under the following philosophical criticism. One might imagine a situation where several experts consider a large set of probability functions as admissible, while one believes in a single probability function. Although this one is consistent with the beliefs of the rest of the group, one might argue that it is not justified to merge the knowledge of the whole group into that single probability function.

More rigorously, Williamson [24] introduces a particular interpretation of the epistemological status of an expert’s knowledge base, which he calls “granting”. He rejects (CP), as several experts may grant the same piece of knowledge for inconsistent reasons.

On the other hand, Adamčík and Wilmers in [5] assume that the way in which the knowledge was obtained is considered irrelevant, and each expert has incorporated all of his relevant knowledge into what he is declaring, contrary to Williamson’s granting. This is sometimes referred to as the principle of total evidence [25] or the Watts assumption [26]. They argue that, although overall knowledge of any human expert can never be fully formalized, as a formalization is always an abstraction from reality, the principle of total evidence needs to be imposed in order to avoid confusion in any discussion related to methods of representing the collective knowledge of experts. Otherwise, there would be an inexhaustible supply of invalid arguments produced by a philosophical opponent challenging one’s reasoning using implicit background information, which is not included in the formal representation of a knowledge base.

However, in this paper, we do not wish to probe further into this philosophical argument, and instead, we present the following rather surprising theorem, which appeared for the first time in [10].

Theorem 5. There is no obdurate merging operator O that satisfies the consistency principle (CP).

Proof. Suppose that J ≥ 3. Let d be the function to minimize from the definition of O, where, for simplicity, we suppress the constant subscript. Let v ∈ ⅅ^J be the unique minimizer of d over some sufficiently large closed convex subset W of ⅅ^J. Let w, u ∈ W be such that d(v) < d(w) < d(u) and w = λv + (1 − λ)u for some 0 < λ < 1 (in particular, w is a linear combination of v and u).

Let s ∈ W be such that d(v) < d(s) < d(w) and s is not a linear combination of v and u. Then, there is s′, such that s′ = λs + (1 − λ)w for some 0 < λ ≤ 1, and d is strictly increasing along the line from s′ to w. This is because d is strictly convex and d(s) < d(w). Note that if J = 2, then s would be always a linear combination of v and u. Moreover, for sufficiently large W ⊆ ⅅ³, we can always choose w, u, s and s′ in W as above.

Now, we show that d is also strictly increasing along the line from s′ to u. Assume this is not the case. Then, by the same argument as before, there is s″, such that d(s″) < d(s′). Due to the construction, the line from v to s″ intersects the line from s′ to w; let us denote the point of intersection as r. Since d is strictly increasing along the line from s′ to w, we have that d(r) > d(s′) > d(s″) > d(v). This, however, contradicts the convexity of d. The situation is depicted in Figure 5.

Now, assume that W₁ = {λv + (1 − λ)w : λ ∈ [0, 1]}, W₂ = {λs′ + (1 − λ)w : λ ∈ [0, 1]}, V₁ = {λv + (1 − λ)u : λ ∈ [0, 1]} and V₂ = {λs′ + (1 − λ)u : λ ∈ [0, 1]}. Since v minimizes d and along the lines from s′ to w and from s′ to u, the function d is strictly increasing, we have that:

O (W_{1}, W_{2}) = {Pool (v, s^{'})} = O (V_{1}, V_{2}),

(2)

where Pool is a pooling operator used in the second stage of O. Suppose that O satisfies (CP). Then, O(W₁, W₂) = {w} and O(V₁, V₂) = {u}, which contradicts Equation (2).

The theorem above in some philosophical contexts can be used as an argument against the consistency principle, while from another perspective, it casts a shadow on the notion of an obdurate merging operator. This unfortunately includes the natural merging operator OSEP, or obdurate social entropy process, defined as follows. For every n ≥ 1 and all W₁,…, W_n ⊆ ⅅ^J:

OSEP (W_{1}, \dots, W_{n}) = {{LogOp}_{N} (ME (W_{1}), \dots, ME (W_{n}))} .

Recall that ME(W_i) denotes the most entropic point in W_i or equivalently the KL-projection of the uniform probability function into W_i, and

N

is the family of weighting vectors

(\frac{1}{n}, \dots, \frac{1}{n})

, one for every n ≥ 1. It is easy to observe that OSEP is really an obdurate merging operator.

In [10], it is proven that OSEP is (thus far, the only known) probabilistic merging operator satisfying a particular version of the independence principle, a principle that is an attempt to resurrect the notion of the independence preservation of pooling operators [20] in the context of probabilistic merging operators.

One may say that the reason behind an obdurate merging operator not satisfying (CP) is its “forgetting” nature. In the first stage, it transforms sets W₁,…, W_n into w⁽¹⁾,…, w⁽ⁿ⁾ individually without taking into account other sets, thus “forgetting” any existing connections, such as the consistency. However, instead of changing the definition of an averaging projective procedure so as to make it not “forgetting”, we will take a different viewpoint on the procedure itself in the following subsection.

2.3. Fixed Points

Our second approach to an averaging projective procedure F consists of considering the set of the fixed points of F. That is, for given n ≥ 1 and given closed convex nonempty sets W₁,…, W_n ⊆ ⅅ^J, we are interested in whether there are any points v ∈ ⅅ^J, such that:

F_{[W_{1}, \dots, W_{n}]} (v) = v .

Following the convincing justification for combining Bregman projections with the

{LinOp}_{A}

-pooling operator (see Section 1.3), for every convex Bregman divergence D_f and a family of weighting vectors

A

, we consider here the averaging projective procedure

F^{D_{f}, A}

defined for every n ≥ 1 and all closed convex nonempty sets W₁,…, W_n ⊆ ⅅ^J by the following.

For an argument v ∈ ⅅ^J, take w⁽ⁱ⁾ the D_f-projection of v into W_i for all 1 ≤ i ≤ n.
Set $F_{[W_{1}, \dots, W_{n}]}^{D_{f}, A} (v) = {LinOp}_{a} (w^{(1)}, \dots, w^{(n)})$ , where a ∈ $A$ .
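The two steps above can be sketched in code for one concrete case. The following is a minimal illustration, assuming the squared Euclidean distance E2 as the divergence (so that the D_f-projection is the ordinary Euclidean projection) and assuming each W_i is a line segment of positive probability functions; the function names and the data are hypothetical and chosen only for this example.

```python
def project_segment_e2(v, p, q):
    """E2-projection of v onto the segment {p + t (q - p) : t in [0, 1]}."""
    d = [qj - pj for pj, qj in zip(p, q)]
    t = sum((vj - pj) * dj for vj, pj, dj in zip(v, p, d)) / sum(dj * dj for dj in d)
    t = min(1.0, max(0.0, t))  # clamp so the result stays on the segment
    return [pj + t * dj for pj, dj in zip(p, d)]


def lin_op(a, points):
    """LinOp_a: weighted arithmetic mean of the given probability functions."""
    return [sum(ai * w[j] for ai, w in zip(a, points)) for j in range(len(points[0]))]


def averaging_step(v, segments, a):
    """One application of F: project v into every W_i, then pool with LinOp_a."""
    projections = [project_segment_e2(v, p, q) for p, q in segments]
    return lin_op(a, projections)


# Hypothetical data: two disjoint segments of positive probability functions (J = 3).
segments = [((0.7, 0.2, 0.1), (0.5, 0.3, 0.2)),
            ((0.1, 0.2, 0.7), (0.2, 0.3, 0.5))]
a = (0.5, 0.5)
v_next = averaging_step((1 / 3, 1 / 3, 1 / 3), segments, a)
print(v_next)
```

For other divergences, only the projection routine would change; the pooling step stays the same.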

The restriction to convex Bregman divergences is needed for some later theorems and is adopted ad hoc. Therefore, unfortunately, we cannot provide any elaborate justification for it.

Given closed convex nonempty sets W₁,…, W_n ⊆ ⅅ^J, we will denote the set of all fixed points of

F^{D_{f}, A}

defined above by

Θ_{a}^{D_{f}} (W_{1}, \dots W_{n})

, where a ∈

A

.

On the other hand, the conjugated parallelogram theorem (Theorem 4), suggesting the combination of the conjugated KL-projection with the LogOp-pooling operator, leads us to the consideration of those convex Bregman divergences, which are strictly convex also in the second argument. The squared Euclidean distance and the Kullback–Leibler divergence are instances of such divergences. A fairly general example is a Bregman divergence D_f, such that

f (v) = \sum_{j = 1}^{J} g (v_{j})

, where g is a strictly convex function (0, 1) → ℝ, which is three times differentiable, and g″(v_j) − (w_j − v_j)g‴(v_j) > 0 for all 1 ≤ j ≤ J and all w, v ∈ ⅅ^J (this is easy to check by the Hessian matrix). Apart from the two divergences mentioned above, this condition is satisfied in particular if g(v) = v^r, 2 ≥ r > 1. Note that the Bregman divergence generated by such a function g is also convex in both arguments.
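As a sketch of this check, write v and w for single coordinates in (0, 1). The first two lines treat the power generator g(x) = x^r with 1 < r ≤ 2, and the last line treats g(x) = x log x, which generates the KL-divergence. The computation is elementary calculus; only the shorthand v, w for v_j, w_j is an assumption of this illustration.

```latex
g''(v) = r (r - 1) v^{r - 2}, \qquad g'''(v) = r (r - 1) (r - 2) v^{r - 3},
\\
g''(v) - (w - v) g'''(v) = r (r - 1) v^{r - 3} [v + (2 - r) (w - v)] = r (r - 1) v^{r - 3} [(r - 1) v + (2 - r) w] > 0,
\\
g (x) = x \log x : \quad g''(v) - (w - v) g'''(v) = \frac{1}{v} + \frac{w - v}{v^{2}} = \frac{w}{v^{2}} > 0.
```

In the power case, positivity uses r − 1 > 0 and 2 − r ≥ 0 together with v, w > 0, which is where the restriction 2 ≥ r > 1 enters.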

Assuming strict convexity in the second argument of D_f, we can define the conjugated D_f-projection of v ∈ ⅅ^J into a closed convex nonempty set W ⊆ ⅅ^J as that unique w ∈ W that minimizes D_f(v‖w) subject only to w ∈ W. Moreover, since a sum of strictly convex functions is a strictly convex function, for any w⁽¹⁾,…, w⁽ⁿ⁾ ∈ ⅅ^J, there exists a unique minimizer of:

\sum_{i = 1}^{n} a_{i} D_{f} (v ‖ w^{(i)})

which we denote

{Pool}_{a}^{D_{f}} (w^{(1)}, \dots, w^{(n)})

. Thus, for a family of weighting vectors

A

, we can define the

{Pool}_{A}^{D_{f}}

-pooling operator. Note that

{Pool}_{A}^{KL} = {LogOp}_{A}, {Pool}_{A}^{E2} = {LinOp}_{A}

and that we do not need strict convexity in the second argument in these cases.

Theorem 6 (Conjugated Parallelogram Theorem). Let D_f be a Bregman divergence,

w^{(1)}, \dots, w^{(n)}, v \in

ⅅ^J and a ∈ ⅅⁿ. Then:

\begin{matrix} \sum_{i = 1}^{n} a_{i} D_{f} (v ‖ w^{(i)}) = \sum_{i = 1}^{n} a_{i} D_{f} ({Pool}_{a}^{D_{f}} (w^{(1)}, \dots, w^{(n)}) ‖ w^{(i)}) + \\ + D_{f} (v ‖ {Pool}_{a}^{D_{f}} (w^{(1)}, \dots, w^{(n)})) . \end{matrix}

Proof. Let

w = {Pool}_{a}^{D_{f}} (w^{(1)}, \dots, w^{(n)})

. We need to prove that:

\begin{matrix} \sum_{i = 1}^{n} a_{i} [f (v) - f (w^{(i)}) - \sum_{j = 1}^{J} (v_{j} - w_{j}^{(i)}) \frac{\partial f (w^{(i)})}{\partial w_{j}^{(i)}}] = \\ \sum_{i = 1}^{n} a_{i} [f (w) - f (w^{(i)}) - \sum_{j = 1}^{J} (w_{j} - w_{j}^{(i)}) \frac{\partial f (w^{(i)})}{\partial w_{j}^{(i)}}] + \\ + [f (v) - f (w) - \sum_{j = 1}^{J} (v_{j} - w_{j}) \frac{\partial f (w)}{\partial w_{j}}], \end{matrix}

or equivalently:

\sum_{j = 1}^{J} (v_{j} - w_{j}) (\sum_{i = 1}^{n} a_{i} \frac{\partial f (w^{(i)})}{\partial w_{j}^{(i)}} - \frac{\partial f (w)}{\partial w_{j}}) = 0.

(3)

Since

w = \arg {min}_{w \in D^{J}} \sum_{i = 1}^{n} a_{i} D_{f} (w ‖ w^{(i)})

, differentiation using the Lagrange multiplier method (since a differentiable convex function f is necessarily continuously differentiable (see [9]), the partial derivatives used above are all continuous and the Lagrange multiplier method is permissible) applied to the condition

\sum_{j = 1}^{J} w_{j} = 1

produces

\sum_{i = 1}^{n} a_{i} \frac{\partial f (w^{(i)})}{\partial w_{j}^{(i)}} - \frac{\partial f (w)}{\partial w_{j}} = λ, 1 \leq j \leq J

, where λ is a constant independent of j. Therefore, since both v and w sum to one, the left-hand side of Equation (3) is equal to

\sum_{j = 1}^{J} (v_{j} - w_{j}) λ = 0

, and the theorem follows.
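The identity of Theorem 6 can be checked numerically in the KL case, where the pooling operator is LogOp. The sketch below is only a sanity check on hypothetical data, not part of the proof; the helper names `kl` and `log_op` are assumptions of this example.

```python
import math


def kl(w, v):
    """KL(w || v) for positive probability functions."""
    return sum(wj * math.log(wj / vj) for wj, vj in zip(w, v))


def log_op(a, points):
    """LogOp_a: normalized weighted geometric mean."""
    J = len(points[0])
    raw = [math.exp(sum(ai * math.log(w[j]) for ai, w in zip(a, points))) for j in range(J)]
    total = sum(raw)
    return [x / total for x in raw]


# Hypothetical weights and probability functions (J = 3, n = 3).
a = (0.2, 0.5, 0.3)
points = [(0.6, 0.3, 0.1), (0.2, 0.3, 0.5), (0.1, 0.8, 0.1)]
v = (0.25, 0.25, 0.5)

pool = log_op(a, points)
lhs = sum(ai * kl(v, w) for ai, w in zip(a, points))
rhs = sum(ai * kl(pool, w) for ai, w in zip(a, points)) + kl(v, pool)
gap = abs(lhs - rhs)
print(gap)
```

Because the last term KL(v‖pool) is nonnegative and vanishes only at v = pool, the same identity also shows directly that LogOp_a minimizes the weighted sum.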

The idea of defining a spectrum of pooling operators where the pooling operators LinOp and LogOp are special cases was developed previously in a similar manner, but in a slightly different framework of alpha-divergences; cf. [27].

Here, following [1,12], we will point out a geometrical relationship between pooling operators LinOp and Pool^Df, which will be helpful in illustrating some results of this paper.

Recall that the generator of a Bregman divergence D_f is a strictly convex function f : (0, 1)^J → ℝ, which is differentiable over ⅅ^J. Let w ∈ ⅅ^J. We define w^∗ = ∇f(w). Since f is a strictly convex function, the mapping w → ∇ f (w) is injective; thus, the coordinates of w^∗ form a coordinate system. There are two kinds of affine structures in ⅅ^J. D_f(w‖v) is convex in w with respect to the first structure and is convex in v^∗ with respect to the second structure.

Therefore, the proof above, in fact, gives

v^{*} = [{Pool}_{a}^{D_{f}} (w^{(1)}, \dots, w^{(n)})]^{*} = {LinOp}_{a} ([w^{(1)}]^{*}, \dots, [w^{(n)}]^{*}) + c

, where

c = (\underbrace{λ, \dots, λ}_{J\text{-times}})

is a normalizing vector induced by

\sum_{j = 1}^{J} v_{j} = 1

.
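For the KL-divergence, the dual coordinates are w^∗_j = log w_j + 1, so this relationship says that LogOp is an arithmetic mean in log-coordinates up to a constant shift. A small numerical sketch on hypothetical data follows; the function names are assumptions of this illustration.

```python
import math


def log_op(a, points):
    """LogOp_a: normalized weighted geometric mean."""
    J = len(points[0])
    raw = [math.exp(sum(ai * math.log(w[j]) for ai, w in zip(a, points))) for j in range(J)]
    total = sum(raw)
    return [x / total for x in raw]


def dual(w):
    """Dual coordinates w* = grad f(w) for f(x) = sum_j x_j log x_j."""
    return [math.log(wj) + 1.0 for wj in w]


a = (0.2, 0.5, 0.3)
points = [(0.6, 0.3, 0.1), (0.2, 0.3, 0.5), (0.1, 0.8, 0.1)]

pooled_dual = dual(log_op(a, points))
mean_of_duals = [sum(ai * dual(w)[j] for ai, w in zip(a, points)) for j in range(3)]

# The difference should be the same number in every coordinate (the vector c).
offsets = [x - y for x, y in zip(pooled_dual, mean_of_duals)]
spread = max(offsets) - min(offsets)
print(offsets, spread)
```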

The only other type of averaging projective procedure

{\hat{F}}^{D_{f}, A}

that we consider here will be generated by a convex differentiable Bregman divergence D_f, which is strictly convex in its second argument, and a family of weighting vectors A, and is defined for every n ≥ 1 and all closed convex nonempty sets W₁,…, W_n ⊆ ⅅ^J by the following.

For an argument v ∈ ⅅ^J, take w⁽ⁱ⁾ the conjugated D_f-projection of v into W_i for all 1 ≤ i ≤ n.
Set ${\hat{F}}_{[W_{1}, \dots, W_{n}]}^{D_{f}, A} (v) = {Pool}_{a}^{D_{f}} (w^{(1)}, \dots, w^{(n)})$ , where a ∈ $A$ .

Given closed convex nonempty sets W₁,…, W_n ⊆ ⅅ^J, we will denote the set of all fixed points of $\hat{F}^{D_{f}, A}$ defined above by

{\hat{Θ}}_{a}^{D_{f}} (W_{1}, \dots W_{n})

, where a ∈

A

.

Note that we always require an additional assumption of D_f being differentiable for this type of averaging projective procedure. This assumption is essential to the proofs of some results concerning this procedure. We note that both divergences KL and E2 are differentiable.

Given a family of weighting vectors

A

, our aim is to investigate

Θ_{A}^{D_{f}} = {Θ_{a}^{D_{f}} : a \in A}

and

{\hat{Θ}}_{A}^{D_{f}} = {{\hat{Θ}}_{a}^{D_{f}} : a \in A}

as operators acting on

P (D^{J}) \times \dots \times P (D^{J})

. In particular, we ask the following questions. Given any closed convex nonempty sets W₁,…, W_n ⊆ ⅅ^J and a ∈

A

:

Are $Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n})$ and ${\hat{Θ}}_{a}^{D_{f}} (W_{1}, \dots, W_{n})$ always nonempty?
Are these sets always closed and convex?

If both answers are positive, then we can consider

Θ_{A}^{D_{f}}

and

{\hat{Θ}}_{A}^{D_{f}}

as probabilistic merging operators. In such a case, the following question makes sense.

As probabilistic merging operators, do they satisfy the consistency principle (CP)?

The fact that the answer to all three questions is “yes” is perhaps surprising, given that the much simpler obdurate merging operators do not satisfy (CP). We prove the above results in the following sequence of theorems, which conclude Section 2.

The following well-known lemma is a simple, but useful observation.

Lemma 1. Let D_f be a Bregman divergence and a, v, w ∈ ⅅ^J. Then:

D_{f} (a ‖ v) - D_{f} (a ‖ w) - D_{f} (w ‖ v) = (a - w) \cdot (\nabla f (w) - \nabla f (v)) .
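Although the lemma is well known, its verification is short. The following sketch expands the three divergences using the definition D_f(w‖v) = f(w) − f(v) − (w − v) · ∇f(v), in the same notation as the lemma.

```latex
\begin{aligned}
D_{f} (a ‖ v) - D_{f} (a ‖ w) - D_{f} (w ‖ v)
&= [f (a) - f (v) - (a - v) \cdot \nabla f (v)] - [f (a) - f (w) - (a - w) \cdot \nabla f (w)] \\
&\quad - [f (w) - f (v) - (w - v) \cdot \nabla f (v)] \\
&= (a - w) \cdot \nabla f (w) - (a - v) \cdot \nabla f (v) + (w - v) \cdot \nabla f (v) \\
&= (a - w) \cdot (\nabla f (w) - \nabla f (v)) .
\end{aligned}
```

All values of f cancel, and the two terms involving ∇f(v) combine into −(a − w) · ∇f(v).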

Theorem 7. Let D_f be a convex Bregman divergence, W₁,…, W_n ⊆ ⅅ^J be closed convex nonempty sets and a ∈ ⅅⁿ. Let v, w ∈ ⅅ^J, u⁽¹⁾ ∈ W₁,…, u⁽ⁿ⁾ ∈ W_n and w⁽¹⁾ ∈ W₁,…, w⁽ⁿ⁾ ∈ W_n be such that v = LinOp_a(u⁽¹⁾,…, u⁽ⁿ⁾), w = LinOp_a(w⁽¹⁾,…, w⁽ⁿ⁾) and each u⁽ⁱ⁾ is the D_f-projection of v into W_i, 1 ≤ i ≤ n. Then:

\sum_{i = 1}^{n} a_{i} D_{f} (u^{(i)} ‖ v) \leq \sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ w) .

Proof. First of all, by the extended Pythagorean property, we have that:

D_{f} (w^{(i)} ‖ v) - D_{f} (u^{(i)} ‖ v) - D_{f} (w^{(i)} ‖ u^{(i)}) \geq 0.

By the parallelogram theorem:

\sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ v) = \sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ w) + D_{f} (w ‖ v) .

Hence:

\begin{matrix} \sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ w) - \sum_{i = 1}^{n} a_{i} D_{f} (u^{(i)} ‖ v) + D_{f} (w ‖ v) - \\ - \sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ u^{(i)}) \geq 0. \end{matrix}

(4)

Since we assume that D_f(·‖·) is a convex function in both arguments, by the Jensen inequality, we have that:

D_{f} (w ‖ v) - \sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ u^{(i)}) \leq 0.

(5)

The Inequalities (4) and (5) give:

\sum_{i = 1}^{n} a_{i} D_{f} (u^{(i)} ‖ v) \leq \sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ w)

as required. □

Figure 6 depicts the situation in the proof above for n = 2. Arrows indicate corresponding divergences.

An interesting question related to conjugated Bregman projections arises as to whether a similar property to the Pythagorean property holds. It turns out that the corresponding property is the so-called four-point property, due to Csiszár and Tusnády. The following theorem in the case of the KL-divergence is a specific instance of a result in [28], Lemma 3, but the formulation using the term “conjugated KL-projection” first appeared in [21]. An illustration is depicted in Figure 7.

Theorem 8 (Four-Point Property). Let D_f be a convex differentiable Bregman divergence, which is strictly convex in its second argument. Let V be a convex closed nonempty subset of ⅅ^J, and let v, u, w, s ∈ ⅅ^J be such that v is the conjugated D_f-projection of w into V and u ∈ V is arbitrary. Then:

D_{f} (s ‖ v) \leq D_{f} (s ‖ u) + D_{f} (s ‖ w) .

Proof. By Lemma 1, we have that:

D_{f} (s ‖ w) = D_{f} (s ‖ v) - D_{f} (w ‖ v) - (s - w) \cdot (\nabla f (w) - \nabla f (v)) .

We can rewrite the above as:

\begin{matrix} D_{f} (s ‖ w) - D_{f} (s ‖ v) + D_{f} (s ‖ u) = \\ = D_{f} (s ‖ u) - D_{f} (w ‖ v) - (s - w) \cdot (\nabla f (w) - \nabla f (v)) \end{matrix}

(6)

Since D_f(·‖·) is a convex differentiable function, by applying the first convexity condition twice, we have that:

\begin{matrix} D_{f} (s ‖ u) \geq D_{f} (w ‖ v) + \\ + \sum_{j = 1}^{J} (s_{j} - w_{j}) \frac{\partial}{\partial x_{j}} [D_{f} (x ‖ v)] |_{x = w} + \\ + \sum_{j = 1}^{J} (u_{j} - v_{j}) \frac{\partial}{\partial x_{j}} [D_{f} (w ‖ x)] |_{x = v} \end{matrix}

(7)

Expressions (6) and (7) give that:

\begin{matrix} D_{f} (s ‖ v) \leq D_{f} (s ‖ u) + D_{f} (s ‖ w) - \\ - \sum_{j = 1}^{J} (u_{j} - v_{j}) \frac{\partial}{\partial x_{j}} [D_{f} (w ‖ x)] |_{x = v} . \end{matrix}

However, since v is the conjugated D_f-projection of w into V, the directional derivative of D_f(w‖·) at v in the direction of u must be greater than or equal to zero:

\sum_{j = 1}^{J} (u_{j} - v_{j}) \frac{\partial}{\partial x_{j}} [D_{f} (w ‖ x)] |_{x = v} \geq 0

and the theorem follows. □
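The four-point property can be probed numerically for the KL-divergence. In the sketch below, V is assumed to be a line segment and the conjugated KL-projection is approximated by a ternary search along it, which is an illustrative choice rather than a method from the paper; all names and data are hypothetical. A small tolerance absorbs the error of the approximate projection.

```python
import math
import random


def kl(w, v):
    """KL(w || v) for positive probability functions."""
    return sum(wj * math.log(wj / vj) for wj, vj in zip(w, v))


def point_on_segment(p, q, t):
    return [pj + t * (qj - pj) for pj, qj in zip(p, q)]


def conjugated_kl_projection(w, p, q, iterations=200):
    """Approximate argmin of KL(w || x) over x in the segment [p, q].

    KL(w || x) is convex in x, so a ternary search on t in [0, 1] applies.
    """
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if kl(w, point_on_segment(p, q, m1)) <= kl(w, point_on_segment(p, q, m2)):
            hi = m2
        else:
            lo = m1
    return point_on_segment(p, q, (lo + hi) / 2)


def random_prob(rng, J):
    x = [rng.uniform(0.05, 1.0) for _ in range(J)]
    total = sum(x)
    return [xj / total for xj in x]


rng = random.Random(0)
p, q = (0.6, 0.3, 0.1), (0.2, 0.3, 0.5)   # endpoints of V (hypothetical)
w = (0.1, 0.7, 0.2)
v = conjugated_kl_projection(w, p, q)

all_hold = True
for _ in range(500):
    s = random_prob(rng, 3)                       # arbitrary s
    u = point_on_segment(p, q, rng.random())      # arbitrary u in V
    if kl(s, v) > kl(s, u) + kl(s, w) + 1e-6:
        all_hold = False
print(all_hold)
```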

The following result appeared for the first time in [10], but without considering the weighting.

Theorem 9 (Characterization Theorem for

Θ_{a}^{D_{f}}

). Let D_f be a convex Bregman divergence, a ∈ ⅅⁿ and W₁,…, W_n ⊆ ⅅ^J be closed convex nonempty sets. Then:

Θ_{a}^{D_{f}} (W_{1}, \dots W_{n}) = {\arg min_{v \in D^{J}} \sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ v) : w^{(i)} \in W_{i}, 1 \leq i \leq n},

where the right-hand side denotes the set of all possible minimizers, that is, the set of all probability functions v ∈ ⅅ^J, which globally minimize

\sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ v)

, subject only to w⁽¹⁾ ∈ W₁,…,w⁽ⁿ⁾ ∈ W_n.

Proof. It is easy to see that, given closed convex nonempty sets W₁,…, W_n ⊆ ⅅ^J, we have that those w⁽¹⁾ ∈ W₁,…, w⁽ⁿ⁾ ∈ W_n, which together with v ∈ ⅅ^J, globally minimize:

\sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ v),

are also the D_f-projections of v into W₁,…, W_n respectively. This, together with Equation (1) (the equation preceding Theorem 2), gives:

Θ_{a}^{D_{f}} (W_{1}, \dots W_{n}) \supseteq {\arg min_{v \in D^{J}} \sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ v) : w^{(i)} \in W_{i}, 1 \leq i \leq n} .

Now, assume that

v \in Θ_{a}^{D_{f}} (W_{1}, \dots W_{n})

and

u \in {\arg min_{v \in D^{J}} \sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ v) : w^{(i)} \in W_{i}, 1 \leq i \leq n} .

Let us denote the D_f-projections of v into W₁,…, W_n by w⁽¹⁾,…, w⁽ⁿ⁾, respectively. Accordingly, let us denote the D_f-projections of u into W₁,…, W_n by r⁽¹⁾,…, r⁽ⁿ⁾, respectively. Suppose that

\sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ v) > \sum_{i = 1}^{n} a_{i} D_{f} (r^{(i)} ‖ u)

, i.e.,

v \notin {\arg min_{v \in D^{J}} \sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ v) : w^{(i)} \in W_{i}, 1 \leq i \leq n} .

This contradicts Theorem 7, and therefore:

Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n}) \subseteq {\arg min_{v \in D^{J}} \sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ v) : w^{(i)} \in W_{i}, 1 \leq i \leq n} .

□

Let us now deviate for a while from the goals of this subsection and stress the importance of the restriction to the positive discrete probability functions, which was detailed in Section 1.1. The problem with the KL-divergence is that the function

f (x) = \sum_{j = 1}^{J} x_{j} \log x_{j}

is not differentiable if some x_j = 0. Without the adopted restriction, the KL-divergence is therefore usually defined by:

KL (w ‖ v) = \begin{cases} \sum_{j : v_{j} \neq 0} w_{j} \log \frac{w_{j}}{v_{j}}, & \text{if } v_{j} = 0 \text{ implies } w_{j} = 0 \text{ for all } 1 \leq j \leq J, \\ + \infty, & \text{otherwise} . \end{cases}

If v_j = 0 implies w_j = 0 for all 1 ≤ j ≤ J, we say that v dominates w and write

v ≫ w

.

The first problem we would face with this definition is whether the notion of the KL-projection makes sense. For given v ∈ ⅅ^J and closed convex nonempty set W ⊆ ⅅ^J, the KL-projection of v into W makes sense only if there is at least one w ∈ W, such that

v ≫ w

.

However, even if this condition were added to all of the preceding discussion concerning the KL-projection (this is perfectly possible, as seen in [10]), Theorem 9 would still fail to hold, as the following example demonstrates.

Example 4. Let

W_{1} = {λ (0, 0, \frac{1}{6}, \frac{5}{6}) + (1 - λ) (0, \frac{1}{3}, \frac{1}{3}, \frac{1}{3}) : λ \in [0, 1]}

and

W_{2} = {λ (0, 0, \frac{1}{3}, \frac{2}{3}) + (1 - λ) (0, \frac{1}{3}, \frac{1}{3}, \frac{1}{3}) : λ \in [0, 1]}

. Assume that

a = (\frac{1}{2}, \frac{1}{2})

. It is easy to check that

(0, 0, \frac{1}{4}, \frac{3}{4})

and

(0, \frac{1}{3}, \frac{1}{3}, \frac{1}{3})

are both fixed points, but the former does not belong to the set of global minimizers v of

KL (w^{(1)} ‖ v) + KL (w^{(2)} ‖ v)

subject to w⁽¹⁾ ∈ W₁ and w⁽²⁾ ∈ W₂. An illustration is depicted in Figure 8.
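The claims of Example 4 can be checked with a few lines of code. The sketch below assumes the usual convention 0 log 0 = 0 in the extended definition of KL; the helper name `kl_extended` is hypothetical. For v = (0, 0, 1/4, 3/4), the only points of W₁ and W₂ dominated by v are the endpoints with λ = 1, so these are its KL-projections.

```python
import math


def kl_extended(w, v):
    """KL(w || v) on the whole simplex: 0 log 0 = 0, and +inf unless v dominates w."""
    total = 0.0
    for wj, vj in zip(w, v):
        if wj == 0.0:
            continue              # convention 0 log 0 = 0 (assumption of this sketch)
        if vj == 0.0:
            return math.inf       # v does not dominate w
        total += wj * math.log(wj / vj)
    return total


# First candidate fixed point and its KL-projections into W1 and W2.
v1 = (0.0, 0.0, 1 / 4, 3 / 4)
w1 = (0.0, 0.0, 1 / 6, 5 / 6)
w2 = (0.0, 0.0, 1 / 3, 2 / 3)
average = [0.5 * x + 0.5 * y for x, y in zip(w1, w2)]
is_fixed = all(abs(x - y) < 1e-12 for x, y in zip(average, v1))
value1 = kl_extended(w1, v1) + kl_extended(w2, v1)

# Second fixed point: it lies in both W1 and W2, so it is its own projection.
v2 = (0.0, 1 / 3, 1 / 3, 1 / 3)
value2 = kl_extended(v2, v2) + kl_extended(v2, v2)

print(is_fixed, value1, value2)
```

The first value is strictly positive while the second is zero, so the first fixed point cannot be a global minimizer.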

Moreover, some variant of the above example would show that the set

Θ_{a}^{KL} (W_{1}, W_{2})

is not convex, which would wreck our aims; more details are given in [10].

On the other hand, those Bregman divergences whose generating functions are differentiable over the whole space of discrete probability functions (e.g., the squared Euclidean distance) would not encounter the difficulties of the KL-divergence. In particular, Theorem 9 formulated over the whole space of discrete probability functions (as opposed to only the positive ones) would still hold for such Bregman divergences.

Now, we shall go back and prove a theorem similar to Theorem 9 for the

{\hat{Θ}}_{A}^{D_{f}}

-operator. In order to do that, we will need the following analogue of Theorem 7.

Theorem 10. Let D_f be a convex differentiable Bregman divergence, which is strictly convex in its second argument, and let W₁,…, W_n ⊆ ⅅ^J be closed convex nonempty sets and a ∈ ⅅⁿ. Let v, w ∈ ⅅ^J and u⁽¹⁾ ∈ W₁,…, u⁽ⁿ⁾ ∈ W_n and w⁽¹⁾ ∈ W₁,…, w⁽ⁿ⁾ ∈ W_n be such that

v = {Pool}_{a}^{D_{f}} (u^{(1)}, \dots, u^{(n)})

,

w = {Pool}_{a}^{D_{f}} (w^{(1)}, \dots, w^{(n)})

and u⁽ⁱ⁾ are the conjugated D_f-projection of v into W_i, 1 ≤ i ≤ n. Then:

\sum_{i = 1}^{n} a_{i} D_{f} (v ‖ u^{(i)}) \leq \sum_{i = 1}^{n} a_{i} D_{f} (w ‖ w^{(i)}) .

Proof. By Theorem 6, we have that:

\sum_{i = 1}^{n} a_{i} D_{f} (w ‖ u^{(i)}) = \sum_{i = 1}^{n} a_{i} D_{f} (v ‖ u^{(i)}) + D_{f} (w ‖ v)

which, by the four-point property (Theorem 8; notice that we need the differentiability of D_f to employ it), becomes:

\sum_{i = 1}^{n} a_{i} D_{f} (w ‖ w^{(i)}) + D_{f} (w ‖ v) \geq \sum_{i = 1}^{n} a_{i} D_{f} (v ‖ u^{(i)}) + D_{f} (w ‖ v)

and hence:

\sum_{i = 1}^{n} a_{i} D_{f} (v ‖ u^{(i)}) \leq \sum_{i = 1}^{n} a_{i} D_{f} (w ‖ w^{(i)})

as required, see Figure 9. □

The theorem above is fairly similar to Theorem 7. Let us use the dual affine structure in ⅅ^J defined after the proof of Theorem 6 to analyze this more closely. For W ⊂ ⅅ^J, define W ^∗ = {w^∗; w ∈ W} and define the dual divergence

D_{f}^{*}

to the divergence D_f by

D_{f}^{*} (v^{*} ‖ w^{*}) = D_{f} (w ‖ v)

. Since, by Theorem 6, we have that

v^{*} = [{Pool}_{a}^{D_{f}} (w^{(1)}, \dots, w^{(n)})]^{*} = {LinOp}_{a} ([w^{(1)}]^{*}, \dots, [w^{(n)}]^{*}) + c_{v}

, where

c_{v} = (\underbrace{λ, \dots, λ}_{J\text{-times}})

is a normalizing vector induced by

\sum_{j = 1}^{J} v_{j} = 1

, the theorem above can be rewritten as follows.

Let D_f be a convex differentiable Bregman divergence, which is strictly convex in its second argument, and let W₁,…, W_n ⊆ ⅅ^J be closed convex nonempty sets and a ∈ ⅅⁿ. Let v, w ∈ ⅅ^J, u⁽¹⁾ ∈ W₁,…, u⁽ⁿ⁾ ∈ W_n and w⁽¹⁾ ∈ W₁,…, w⁽ⁿ⁾ ∈ W_n be such that v^∗ = LinOp_a([u⁽¹⁾]^∗,…, [u⁽ⁿ⁾]^∗) + c_v, w^∗ = LinOp_a([w⁽¹⁾]^∗,…, [w⁽ⁿ⁾]^∗) + c_w and [u⁽ⁱ⁾]^∗ are the

D_{f}^{*}

-projection of v^∗ into

W_{i}^{*}

, 1 ≤ i ≤ n. Then:

\sum_{i = 1}^{n} a_{i} D_{f}^{*} ([u^{(i)}]^{*} ‖ v^{*}) \leq \sum_{i = 1}^{n} a_{i} D_{f}^{*} ([w^{(i)}]^{*} ‖ w^{*}) .

This illustrates that if D_f is a convex differentiable Bregman divergence that is strictly convex in its second argument, then Theorems 7 and 10 are dual with respect to ^∗.

Theorem 11 (Characterization Theorem for

{\hat{Θ}}_{a}^{D_{f}}

). Let D_f be a convex differentiable Bregman divergence, which is strictly convex in its second argument, and let W₁,…, W_n ⊆ ⅅ^J be closed convex nonempty sets and a ∈ ⅅⁿ. Then:

{\hat{Θ}}_{a}^{D_{f}} (W_{1}, \dots, W_{n}) = {\arg min_{v \in D^{J}} \sum_{i = 1}^{n} a_{i} D_{f} (v ‖ w^{(i)}) : w^{(i)} \in W_{i}, 1 \leq i \leq n},

where the right hand-side denotes the set of all possible minimizers.

Proof. The proof is similar to the proof of Theorem 9. First, given closed convex nonempty sets W₁,…, W_n ⊆ ⅅ^J, we have that those w⁽¹⁾ ∈ W₁,…, w⁽ⁿ⁾ ∈ W_n, which together with v ∈ ⅅ^J, globally minimize:

\sum_{i = 1}^{n} a_{i} D_{f} (v ‖ w^{(i)}),

are also the conjugated D_f-projections of v into W₁,…, W_n, respectively. This together with the definition of

{Pool}_{a}^{D_{f}}

gives:

{\hat{Θ}}_{a}^{D_{f}} (W_{1}, \dots, W_{n}) \supseteq {\arg min_{v \in D^{J}} \sum_{i = 1}^{n} a_{i} D_{f} (v ‖ w^{(i)}) : w^{(i)} \in W_{i}, 1 \leq i \leq n} .

Second, assume that

v \in {\hat{Θ}}_{a}^{D_{f}} (W_{1}, \dots, W_{n})

and:

u \in {\arg min_{v \in D^{J}} \sum_{i = 1}^{n} a_{i} D_{f} (v ‖ w^{(i)}) : w^{(i)} \in W_{i}, 1 \leq i \leq n} .

Let us denote the conjugated D_f-projections of v into W₁,…, W_n by w⁽¹⁾, …, w⁽ⁿ⁾, respectively. Accordingly, let us denote the conjugated D_f-projections of u into W₁,…, W_n by r⁽¹⁾,…, r⁽ⁿ⁾, respectively. Suppose that:

\sum_{i = 1}^{n} a_{i} D_{f} (u ‖ r^{(i)}) < \sum_{i = 1}^{n} a_{i} D_{f} (v ‖ w^{(i)}),

i.e.,

v \notin {\arg min_{v \in D^{J}} \sum_{i = 1}^{n} a_{i} D_{f} (v ‖ w^{(i)}) : w^{(i)} \in W_{i}, 1 \leq i \leq n}

. This contradicts Theorem 10, and therefore:

{\hat{Θ}}_{a}^{D_{f}} (W_{1}, \dots, W_{n}) \subseteq {\arg min_{v \in D^{J}} \sum_{i = 1}^{n} a_{i} D_{f} (v ‖ w^{(i)}) : w^{(i)} \in W_{i}, 1 \leq i \leq n} .

□

The following simple observation originally from [10] based on Equation (1) (alternatively on the parallelogram theorem) will be used in the proof of the forthcoming theorem.

Lemma 2. Let D_f be a convex Bregman divergence and a ∈ ⅅⁿ. Then, the following are equivalent:

The probability functions v, w⁽¹⁾,…, w⁽ⁿ⁾ ∈ ⅅ^J minimize the quantity:

$\sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ v)$

subject to w⁽¹⁾ ∈ W₁,…, w⁽ⁿ⁾ ∈ W_n.
The probability functions w⁽¹⁾,…, w⁽ⁿ⁾ ∈ ⅅ^J minimize the quantity:

$\sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ {LinOp}_{a} (w^{(1)}, \dots, w^{(n)}))$

subject to w⁽¹⁾ ∈ W₁,…, w⁽ⁿ⁾ ∈ W_n and v = LinOp_a(w⁽¹⁾,…, w⁽ⁿ⁾).

Theorem 12. Let D_f be a convex Bregman divergence. Then, for all nonempty closed convex sets W₁,…, W_n ⊆ ⅅ^J and a ∈ ⅅⁿ, the set

{\arg {min}_{v \in D^{J}} \sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ v) : w^{(i)} \in W_{i}, 1 \leq i \leq n}

is a nonempty closed convex region of ⅅ^J.

Proof. This proof is from [10]. Let v, s ∈

{\arg {min}_{v \in D^{J}} \sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ v) : w^{(i)} \in W_{i}, 1 \leq i \leq n}

, as the set is clearly nonempty. For convexity, we need to show that λv + (1 − λ)s ∈

{\arg {min}_{v \in D^{J}} \sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ v) : w^{(i)} \in W_{i}, 1 \leq i \leq n}

for any λ ∈ [0, 1].

Assume that w⁽¹⁾ ∈ W₁,…, w⁽ⁿ⁾ ∈ W_n are such that v = LinOp_a(w⁽¹⁾,…, w⁽ⁿ⁾) and u⁽¹⁾ ∈ W₁,…, u⁽ⁿ⁾ ∈ W_n are such that s = LinOp_a(u⁽¹⁾,…, u⁽ⁿ⁾). It is easy to observe that the convexity of D_f(·‖·) implies convexity of:

g (x^{(1)}, \dots, x^{(n)}) = \sum_{i = 1}^{n} a_{i} D_{f} (x^{(i)} ‖ {LinOp}_{a} (x^{(1)}, \dots, x^{(n)}))

over the convex region specified by constraints x⁽ⁱ⁾ ∈ W_i, 1 ≤ i ≤ n. Moreover, the function g attains its minimum over this convex region at points (w⁽¹⁾,…, w⁽ⁿ⁾) and (u⁽¹⁾,…, u⁽ⁿ⁾). We need to show that g also attains its minimum at the point:

λ (w^{(1)}, \dots, w^{(n)}) + (1 - λ) (u^{(1)}, \dots, u^{(n)})

for any λ ∈ [0, 1]. Since g is convex, by the Jensen inequality, we have that:

\begin{array}{l} λ g (w^{(1)}, \dots, w^{(n)}) + (1 - λ) g (u^{(1)}, \dots, u^{(n)}) \geq \\ \geq g (λ (w^{(1)}, \dots, w^{(n)}) + (1 - λ) (u^{(1)}, \dots, u^{(n)})) . \end{array}

Since g(w⁽¹⁾,…, w⁽ⁿ⁾) = g(u⁽¹⁾,…, u⁽ⁿ⁾), the inequality above can only hold with equality, and therefore, by Lemma 2,

λ v + (1 - λ) s \in {\arg {min}_{v \in D^{J}} \sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ v) : w^{(i)} \in W_{i}, 1 \leq i \leq n}

for any λ ∈ [0, 1].

Moreover, since convexity implies continuity, the minimization of a convex function over a closed convex region produces a closed convex set. Therefore, the fact that W₁,…, W_n are all closed and convex implies that the set of n-tuples (w⁽¹⁾,…, w⁽ⁿ⁾), which are global minimizers of g over the region specified by w⁽ⁱ⁾ ∈ W_i, 1 ≤ i ≤ n, is closed. Additionally, since closed regions are preserved by projections in the Euclidean space, the set given by LinOp_a(w⁽¹⁾,…, w⁽ⁿ⁾) is closed, as well.

The following observation immediately follows by the definition of

{Pool}_{a}^{D_{f}}

.

Lemma 3. Let D_f be a convex Bregman divergence and a ∈ ⅅⁿ. Then, the following are equivalent:

The probability functions v, w⁽¹⁾,…, w⁽ⁿ⁾ ∈ ⅅ^J minimize the quantity:

$\sum_{i = 1}^{n} a_{i} D_{f} (v ‖ w^{(i)})$

subject to w⁽¹⁾ ∈ W₁,…, w⁽ⁿ⁾ ∈ W_n.
The probability functions w⁽¹⁾,…, w⁽ⁿ⁾ ∈ ⅅ^J minimize the quantity:

$\sum_{i = 1}^{n} a_{i} D_{f} ({Pool}_{a}^{D_{f}} (w^{(1)}, \dots, w^{(n)}) ‖ w^{(i)})$

subject to w⁽¹⁾ ∈ W₁,…, w⁽ⁿ⁾ ∈ W_n and $v = {Pool}_{a}^{D_{f}} (w^{(1)}, \dots, w^{(n)}) .$

Theorem 13. Let D_f be a convex Bregman divergence. Then, for all nonempty closed convex sets W₁,…, W_n ⊆ ⅅ^J and a ∈ ⅅⁿ, the set

{\arg {min}_{v \in D^{J}} \sum_{i = 1}^{n} a_{i} D_{f} (v ‖ w^{(i)}) : w^{(i)} \in W_{i}, 1 \leq i \leq n}

is a nonempty closed convex region of ⅅ^J.

Proof. Let v, s ∈

{\arg {min}_{v \in D^{J}} \sum_{i = 1}^{n} a_{i} D_{f} (v ‖ w^{(i)}) : w^{(i)} \in W_{i}, 1 \leq i \leq n}

, as the set is clearly nonempty. For convexity, we need to show that λv + (1 − λ)s ∈

{\arg {min}_{v \in D^{J}} \sum_{i = 1}^{n} a_{i} D_{f} (v ‖ w^{(i)}) : w^{(i)} \in W_{i}, 1 \leq i \leq n}

for any λ ∈ [0, 1].

Assume that w⁽¹⁾ ∈ W₁,…, w⁽ⁿ⁾ ∈ W_n are such that

v = {Pool}_{a}^{D_{f}} (w^{(1)}, \dots, w^{(n)})

and u⁽¹⁾ ∈ W₁,…, u⁽ⁿ⁾ ∈ W_n are such that

s = {Pool}_{a}^{D_{f}} (u^{(1)}, \dots, u^{(n)})

. Now, for any λ ∈ [0, 1],

\begin{matrix} λ \sum_{i = 1}^{n} a_{i} D_{f} ({Pool}_{a}^{D_{f}} (w^{(1)}, \dots, w^{(n)}) ‖ w^{(i)}) + (1 - λ) \sum_{i = 1}^{n} a_{i} D_{f} ({Pool}_{a}^{D_{f}} (u^{(1)}, \dots, u^{(n)}) ‖ u^{(i)}) \geq \\ \geq \sum_{i = 1}^{n} a_{i} D_{f} (λ {Pool}_{a}^{D_{f}} (w^{(1)}, \dots, w^{(n)}) + (1 - λ) {Pool}_{a}^{D_{f}} (u^{(1)}, \dots, u^{(n)}) ‖ λ w^{(i)} + (1 - λ) u^{(i)}) \geq \\ \geq \sum_{i = 1}^{n} a_{i} D_{f} ({Pool}_{a}^{D_{f}} (λ w^{(1)} + (1 - λ) u^{(1)}, \dots, λ w^{(n)} + (1 - λ) u^{(n)}) ‖ λ w^{(i)} + (1 - λ) u^{(i)}), \end{matrix}

where the first inequality follows by convexity of D_f(·‖·) and the second by the definition of

{Pool}_{a}^{D_{f}}

as the unique minimizer. However, the inequality above can only hold with equality and, by Lemma 3,

λ v + (1 - λ) s \in {\arg {min}_{v \in D^{J}} \sum_{i = 1}^{n} a_{i} D_{f} (v ‖ w^{(i)}) : w^{(i)} \in W_{i}, 1 \leq i \leq n}

for any λ ∈ [0, 1].

Moreover, since convexity implies continuity, the minimization of a convex function over a closed convex region produces a closed convex set. Therefore, the fact that W₁,…, W_n are all closed and convex implies that the set of n-tuples (w⁽¹⁾,…, w⁽ⁿ⁾), which are global minimizers of

\sum_{i = 1}^{n} a_{i} D_{f} ({Pool}_{a}^{D_{f}} (w^{(1)}, \dots, w^{(n)}) ‖ w^{(i)})

over the region specified by w⁽ⁱ⁾ ∈ W_i, 1 ≤ i ≤ n, is closed. Additionally, since closed regions are preserved by projections in the Euclidean space, the set given by

{Pool}_{a}^{D_{f}} (w^{(1)}, \dots, w^{(n)})

is closed, as well. □

Finally, we can establish our initial claims:

Theorem 14. Let

A

be a family of weighting vectors. The operator

Θ_{A}^{D_{f}}

, where D_f is a convex Bregman divergence, and the operator

{\hat{Θ}}_{A}^{D_{f}}

, where D_f is a convex differentiable Bregman divergence, which is strictly convex in its second argument, are well-defined probabilistic merging operators that satisfy (CP).

Proof. First, the fact that

Θ_{A}^{D_{f}}

is well defined as a probabilistic merging operator follows from Theorems 9 and 12. Accordingly,

{\hat{Θ}}_{A}^{D_{f}}

is a well-defined probabilistic merging operator by Theorems 11 and 13.

Second, let a ∈

A

(in particular a ∈ ⅅⁿ) and W₁,…, W_n ⊆ ⅅ^J be closed, convex, nonempty and have a nonempty intersection. Clearly, every point in that intersection minimizes

\sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ v)

and

\sum_{i = 1}^{n} a_{i} D_{f} (v ‖ w^{(i)})

subject to w⁽¹⁾ ∈ W₁,…, w⁽ⁿ⁾ ∈ W_n with both expressions attaining the zero value. Since D_f(w‖v) = 0 only if w = v, those points in the intersection are the only points minimizing the above quantities. □

It turns out that, given closed convex nonempty sets W₁,…, W_n ⊆ ⅅ^J and weighting a, the sets of fixed points

Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n})

and

{\hat{Θ}}_{a}^{D_{f}} (W_{1}, \dots, W_{n})

possess attractive properties, which make the operators

Θ_{A}^{D_{f}}

and

{\hat{Θ}}_{A}^{D_{f}}

suitable for probabilistic merging. The following example taken from [10] illustrates a possible philosophical justification for considering the set of all fixed points of a mapping consisting of a convex Bregman projection and a pooling operator.

Example 5. Assume that there are n experts, each with his own knowledge represented by closed convex nonempty sets W₁,…, W_n ⊆ ⅅ^J, respectively. Say that an independent chairman of the college has announced a probability function v to represent the agreement of the college of experts. Each expert then naturally updates his own knowledge by what seems to be the right probability function. In other words, the expert “i” projects v to W_i, obtaining the probability function w⁽ⁱ⁾. Each expert subsequently accepts w⁽ⁱ⁾ as his working hypothesis, but he does not discard his knowledge base W_i; he only takes into account other people’s opinions. Then, it is easy for the chairman to identify the average of the actual beliefs w⁽¹⁾,…, w⁽ⁿ⁾ of the experts. If he found that this average v′ did not coincide with the originally announced probability function v, then he would naturally feel unhappy about such a choice, so he would be tempted to iterate the process in the hope that, eventually, he will find a fixed point.
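The chairman's procedure in Example 5 can be sketched as a loop. The code below is an illustration under several assumptions that are not part of the example itself: the divergence is taken to be KL, each knowledge base W_i is a line segment, the KL-projection is approximated by a ternary search along the segment, and the pooling is LinOp with equal weights. All names and data are hypothetical.

```python
import math


def kl(w, v):
    """KL(w || v) for positive probability functions."""
    return sum(wj * math.log(wj / vj) for wj, vj in zip(w, v))


def point_on_segment(p, q, t):
    return [pj + t * (qj - pj) for pj, qj in zip(p, q)]


def kl_projection(v, p, q, iterations=200):
    """Approximate argmin of KL(w || v) over w in the segment [p, q] (ternary search)."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if kl(point_on_segment(p, q, m1), v) <= kl(point_on_segment(p, q, m2), v):
            hi = m2
        else:
            lo = m1
    return point_on_segment(p, q, (lo + hi) / 2)


def lin_op(a, points):
    return [sum(ai * w[j] for ai, w in zip(a, points)) for j in range(len(points[0]))]


def chairman_iteration(v, segments, a, tol=1e-12, max_rounds=1000):
    """Announce v, let every expert project it, average, and repeat until v stops moving."""
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        beliefs = [kl_projection(v, p, q) for p, q in segments]  # experts' working hypotheses
        v_new = lin_op(a, beliefs)                               # chairman's average
        change = max(abs(x - y) for x, y in zip(v_new, v))
        v = v_new
        if change < tol:
            break
    return v, rounds


# Hypothetical knowledge bases of two experts (J = 3).
segments = [((0.7, 0.2, 0.1), (0.5, 0.3, 0.2)),
            ((0.1, 0.2, 0.7), (0.2, 0.3, 0.5))]
fixed_point, rounds = chairman_iteration((1 / 3, 1 / 3, 1 / 3), segments, (0.5, 0.5))
print(fixed_point, rounds)
```

For these particular data, each projection lands on the endpoint of the segment nearest to the other expert, so the loop stabilizes almost immediately; other inputs may need many more rounds.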

It seems that, in a broad philosophical setting, such as in the example above, we ought to study any possible combination of Bregman projections with pooling operators. The question as to which other combination produces a well-defined probabilistic merging operator satisfying the consistency principle (CP) is open to investigation.

3. Convergence

3.1. Iterative Processes

In this section, we continue the investigation of the averaging projective procedures

F^{D_{f}, A}

and

{\hat{F}}^{D_{f}, A}

. Recall that, given a convex Bregman divergence D_f and a family of weighting vectors

A

,

F^{D_{f}, A}

was defined in the previous section for every n ≥ 1 and all closed convex nonempty sets W₁,…, W_n ⊆ ⅅ^J by the following.

For an argument v ∈ ⅅ^J, take w⁽ⁱ⁾ as the D_f-projection of v into W_i for all 1 ≤ i ≤ n.
Set $F_{[W_{1}, \dots, W_{n}]}^{D_{f}, A} (v) = {LinOp}_{a} (w^{(1)}, \dots, w^{(n)})$ , where a ∈ $A$ .

For D_f, which is moreover differentiable and strictly convex in the second argument,

{\hat{F}}^{D_{f}, A}

was defined analogously by conjugated projections and the

{Pool}_{A}^{D_{f}}

-pooling operator.

Our current aim is to find out what will happen if we iterate the application of averaging projective procedures

F^{D_{f}, A}

and

{\hat{F}}^{D_{f}, A}

. In particular:

Will the resulting sequences converge?

We shall find the answer in this subsection.

It is intriguing that we can abstractly define a “conjugated projection” with respect to a summation of a convex differentiable Bregman divergence D_f. Let w⁽¹⁾,…, w⁽ⁿ⁾ ∈ ⅅ^J and a ∈ ⅅⁿ. Then, the “conjugated projection” of (w⁽¹⁾,…, w⁽ⁿ⁾) into ⅅ^J is defined by the global minimizer of

\sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ v)

, which, by Equation (1), is v = LinOp_a(w⁽¹⁾,…, w⁽ⁿ⁾).
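That the minimizer is LinOp_a regardless of the generator can be probed numerically. The sketch below uses the generator f(x) = Σ_j x_j^{1.5}, one of the power generators mentioned in Section 2.3, and compares the weighted sum at LinOp_a against randomly drawn probability functions; the names and data are hypothetical, and the random search is only a sanity check, not a proof.

```python
import random


def f(x):
    return sum(xj ** 1.5 for xj in x)


def grad_f(x):
    return [1.5 * xj ** 0.5 for xj in x]


def bregman(w, v):
    """D_f(w || v) = f(w) - f(v) - (w - v) . grad f(v)."""
    return f(w) - f(v) - sum((wj - vj) * gj for wj, vj, gj in zip(w, v, grad_f(v)))


def lin_op(a, points):
    return [sum(ai * w[j] for ai, w in zip(a, points)) for j in range(len(points[0]))]


def weighted_sum(v, a, points):
    """The quantity sum_i a_i D_f(w^(i) || v) as a function of v."""
    return sum(ai * bregman(w, v) for ai, w in zip(a, points))


def random_prob(rng, J):
    x = [rng.uniform(0.05, 1.0) for _ in range(J)]
    total = sum(x)
    return [xj / total for xj in x]


rng = random.Random(1)
a = (0.3, 0.7)
points = [(0.6, 0.3, 0.1), (0.2, 0.3, 0.5)]

center = lin_op(a, points)
best = weighted_sum(center, a, points)
never_beaten = all(
    weighted_sum(random_prob(rng, 3), a, points) >= best - 1e-12 for _ in range(1000)
)
print(center, best, never_beaten)
```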

The claim that this behaves as a “conjugated projection” is supported by the following analogue of the four-point property illustrated in Figure 10.

Theorem 15. Let D_f be a convex differentiable Bregman divergence. Let a ∈ ⅅⁿ, w⁽¹⁾,…, w⁽ⁿ⁾ ∈ ⅅ^J and v = LinOp_a(w⁽¹⁾,…, w⁽ⁿ⁾). Let u⁽¹⁾,…, u⁽ⁿ⁾ ∈ ⅅ^J and u ∈ ⅅ^J. Then:

\sum_{i = 1}^{n} a_{i} D_{f} (u^{(i)} ‖ v) \leq \sum_{i = 1}^{n} a_{i} D_{f} (u^{(i)} ‖ u) + \sum_{i = 1}^{n} a_{i} D_{f} (u^{(i)} ‖ w^{(i)}) .

Proof. The proof is similar to the one of the actual four-point property (Theorem 8) only with a slightly different argument at the end: after obtaining:

\begin{matrix} \sum_{i = 1}^{n} a_{i} D_{f} (u^{(i)} ‖ v) \leq \sum_{i = 1}^{n} a_{i} D_{f} (u^{(i)} ‖ u) + \sum_{i = 1}^{n} a_{i} D_{f} (u^{(i)} ‖ w^{(i)}) - \\ - \sum_{i = 1}^{n} a_{i} \sum_{j = 1}^{J} (u_{j} - v_{j}) \frac{\partial}{\partial x_{j}} [D_{f} (w^{(i)} ‖ x)] |_{x = v} \end{matrix}

we proceed with:

\begin{matrix} - \sum_{i = 1}^{n} a_{i} \sum_{j = 1}^{J} (u_{j} - v_{j}) \frac{\partial}{\partial x_{j}} [D_{f} (w^{(i)} ‖ x)] |_{x = v} = \\ = \sum_{j = 1}^{J} (u_{j} - v_{j}) [\sum_{k = 1}^{J} (\sum_{i = 1}^{n} a_{i} w_{k}^{(i)} - v_{k}) \frac{\partial^{2} f (x)}{\partial x_{k} \partial x_{j}} |_{x = v}] = 0, \end{matrix}

since

\sum_{i = 1}^{n} a_{i} w_{k}^{(i)} = v_{k}

for all 1 ≤ k ≤ J, and the theorem follows. □

Similarly, given w⁽¹⁾,…, w⁽ⁿ⁾ ∈ ⅅ^J, a ∈ ⅅⁿ and a convex differentiable Bregman divergence D_f, which is strictly convex in its second argument, we can consider

{Pool}_{a}^{D_{f}} (w^{(1)}, \dots, w^{(n)})

as the “projection” of (w⁽¹⁾,…, w⁽ⁿ⁾) into ⅅ^J, since Theorem 6 resembles (a special case of) the extended Pythagorean property: for any u ∈ ⅅ^J:

\begin{matrix} \sum_{i = 1}^{n} a_{i} D_{f} (u ‖ {Pool}_{a}^{D_{f}} (w^{(1)}, \dots, w^{(n)})) + \sum_{i = 1}^{n} a_{i} D_{f} ({Pool}_{a}^{D_{f}} (w^{(1)}, \dots, w^{(n)}) ‖ w^{(i)}) = \\ = \sum_{i = 1}^{n} a_{i} D_{f} (u ‖ w^{(i)}) . \end{matrix}
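The identity above can be verified numerically in the KL case. The sketch assumes that the KL-pooling operator is the normalised weighted geometric mean (often called LogOp); the function names and the sample probability functions are hypothetical.

```python
import math


def kl(p, q):
    """KL(p || q) for a strictly positive q."""
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)


def log_pool(a, ws):
    """Normalised weighted geometric mean, assumed here to be Pool_a^KL."""
    raw = [math.exp(sum(a_i * math.log(w[j]) for a_i, w in zip(a, ws)))
           for j in range(len(ws[0]))]
    total = sum(raw)
    return [x / total for x in raw]


def pythagorean_sides(a, ws, u):
    """Return both sides of the identity for an arbitrary u."""
    pool = log_pool(a, ws)
    lhs = (sum(a_i * kl(u, pool) for a_i in a)
           + sum(a_i * kl(pool, w) for a_i, w in zip(a, ws)))
    rhs = sum(a_i * kl(u, w) for a_i, w in zip(a, ws))
    return lhs, rhs


a = [0.5, 0.5]
ws = [[0.1, 0.2, 0.7], [0.3, 0.3, 0.4]]
lhs, rhs = pythagorean_sides(a, ws, [0.6, 0.3, 0.1])
print(abs(lhs - rhs) < 1e-12)
```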

The two observations above and the following lemma will be essential to the proofs of the two main theorems of this subsection.

Lemma 4. Let D_f be a convex Bregman divergence. Assume that we are given a closed convex nonempty set W, v^[i] ∈ ⅅ^J, i = 1, 2,… and w^[i] ∈ ⅅ^J, i = 1, 2,…, such that w^[i] is the D_f-projection of v^[i] into W for all i = 1, 2,…. Assume that

{v^{[i]}}_{i = 1}^{\infty}

converges to v ∈ ⅅ^J and

{w^{[i]}}_{i = 1}^{\infty}

converges to w ∈ ⅅ^J. Then, w is the D_f-projection of v into W.

Proof. For a contradiction, assume that the D_f-projection of v into W denoted by

\bar{w}

is distinct from w. Then, by the extended Pythagorean property,

D_{f} (w^{[i]} ‖ v^{[i]}) + D_{f} (\bar{w} ‖ w^{[i]}) \leq D_{f} (\bar{w} ‖ v^{[i]})

. Since D_f(·‖·) is continuous (see Section 1.1), we have that:

\begin{array}{l} \lim_{i \to \infty} D_{f} (w^{[i]} ‖ v^{[i]}) = D_{f} (w ‖ v), \\ \lim_{i \to \infty} D_{f} (\bar{w} ‖ w^{[i]}) = D_{f} (\bar{w} ‖ w) \text{ and} \\ \lim_{i \to \infty} D_{f} (\bar{w} ‖ v^{[i]}) = D_{f} (\bar{w} ‖ v) . \end{array}

Therefore:

D_{f} (w ‖ v) + D_{f} (\bar{w} ‖ w) \leq D_{f} (\bar{w} ‖ v)

. Since w ∈ W (W is closed) and D_f(w̄‖w) > 0 whenever w̄ ≠ w, this gives D_f(w‖v) < D_f(w̄‖v), which contradicts the assumption that

\bar{w}

is the D_f-projection of v into W. □

Finally, we are going to answer the question about whether the iteration of the averaging projective procedures

F^{D_{f}, A}

and

{\hat{F}}^{D_{f}, A}

converges; however, the result for

F^{D_{f}, A}

will be limited only to the case when D_f is differentiable. Both results below should be attributed to a number of people. First, the results are applications of the well-known alternating projections due to Csiszár and Tusnády; see [28], Theorem 3. In the particular case of the Kullback–Leibler divergence, the theorems were observed and proven by Matúš in [21]. Last, but not least, Eggermont and LaRiccia reformulated the original alternating projections in terms of Bregman divergences in [29].

Theorem 16. Let D_f be a convex differentiable Bregman divergence, A be a family of weighting vectors and a ∈

A

be such that a ∈ ⅅⁿ and W₁,…, W_n ⊆ ⅅ^J are closed, convex and nonempty. Then, for any v ∈ ⅅ^J, the sequence:

{v^{[i]}}_{i = 0}^{\infty},

where v^[0] = v and

v^{[i + 1]} = F_{[W_{1}, \dots, W_{n}]}^{D_{f}, A} (v^{[i]})

, converges to some probability function in

Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n})

. (Recall that

Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n})

is the set of the fixed points of

F_{[W_{1}, \dots, W_{n}]}^{D_{f}, A}

, i.e., all points v, such that

F_{[W_{1}, \dots, W_{n}]}^{D_{f}, A} (v) = v

.)

Proof. This proof is inspired by [21].

Denote the D_f-projections of v^[i] into W₁,…, W_n by π₁v^[i],…, π_nv^[i], respectively. Then, it is easy to observe that:

\sum_{k = 1}^{n} a_{k} D_{f} (π_{k} v^{[i]} ‖ v^{[i]}) \geq \sum_{k = 1}^{n} a_{k} D_{f} (π_{k} v^{[i]} ‖ v^{[i + 1]}) \geq \sum_{k = 1}^{n} a_{k} D_{f} (π_{k} v^{[i + 1]} ‖ v^{[i + 1]}),

for all i = 1, 2,…. Due to the monotonicity of this sequence, the limit

\lim_{i \to \infty} \sum_{k = 1}^{n} a_{k} D_{f} (π_{k} v^{[i]} ‖ v^{[i]})

exists. Thanks to the compactness of W₁,…, W_n, the sequence

{(π_{1} v^{[i]}, \dots, π_{n} v^{[i]}, v^{[i]})}_{i = 1}^{\infty}

has a convergent subsequence. Let us denote the limit of this subsequence (π₁v,…, π_nv, v). Due to Lemma 4, π_kv is really the D_f-projection of v into W_k for all 1 ≤ k ≤ n. Moreover

\lim_{i \to \infty} \sum_{k = 1}^{n} a_{k} D_{f} (π_{k} v^{[i]} ‖ v^{[i]}) = \sum_{k = 1}^{n} a_{k} D_{f} (π_{k} v ‖ v) .

By Theorem 15:

\sum_{k = 1}^{n} a_{k} D_{f} (π_{k} v ‖ v^{[i]}) \leq \sum_{k = 1}^{n} a_{k} D_{f} (π_{k} v ‖ v) + \sum_{k = 1}^{n} a_{k} D_{f} (π_{k} v ‖ π_{k} v^{[i - 1]}) .

(8)

This is because

v^{[i]} = {LinOp}_{a} (π_{1} v^{[i - 1]}, \dots, π_{n} v^{[i - 1]})

. Moreover, by the extended Pythagorean property:

\sum_{k = 1}^{n} a_{k} D_{f} (π_{k} v^{[i]} ‖ v^{[i]}) + \sum_{k = 1}^{n} a_{k} D_{f} (π_{k} v ‖ π_{k} v^{[i]}) \leq \sum_{k = 1}^{n} a_{k} D_{f} (π_{k} v ‖ v^{[i]}) .

(9)

An illustration of the situation is depicted in Figure 11.

Now, since:

\lim_{i \to \infty} \sum_{k = 1}^{n} a_{k} D_{f} (π_{k} v^{[i]} ‖ v^{[i]}) = \sum_{k = 1}^{n} a_{k} D_{f} (π_{k} v ‖ v)

and

\sum_{k = 1}^{n} a_{k} D_{f} (π_{k} v^{[i]} ‖ v^{[i]}) \geq \sum_{k = 1}^{n} a_{k} D_{f} (π_{k} v ‖ v)

for all i = 1, 2,…, Equations (8) and (9) give that:

\sum_{k = 1}^{n} a_{k} D_{f} (π_{k} v ‖ π_{k} v^{[i]}) \leq \sum_{k = 1}^{n} a_{k} D_{f} (π_{k} v ‖ π_{k} v^{[i - 1]})

(10)

for all i = 1, 2,…. We conclude that this is possible only if:

\lim_{_{i \to \infty}} \sum_{k = 1}^{n} a_{k} D_{f} (π_{k} v ‖ π_{k} v^{[i]})

exists.

However, we already know that a subsequence of

{(π_{1} v^{[i]}, \dots, π_{n} v^{[i]}, v^{[i]})}_{i = 1}^{\infty}

converges to (π₁v,…, π_nv, v); hence, a subsequence of the sequence

{\sum_{k = 1}^{n} a_{k} D_{f} (π_{k} v ‖ π_{k} v^{[i]})}_{i = 1}^{\infty}

decreases to zero, which, by Equation (10), forces the whole sequence to converge to zero. Since D_f(x‖y) = 0 only if x = y, continuity gives:

\lim_{_{i \to \infty}} π_{k} v^{[i]} = π_{k} v .

It follows that

\lim_{i \to \infty} v^{[i]}

exists and is equal to v. Moreover,

v = \lim_{i \to \infty} v^{[i + 1]} = \lim_{i \to \infty} {LinOp}_{a} (π_{1} v^{[i]}, \dots, π_{n} v^{[i]}) = {LinOp}_{a} (π_{1} v, \dots, π_{n} v)

, and therefore, v is a fixed point of the mapping

F_{[W_{1}, \dots, W_{n}]}^{D_{f}, A}

; hence,

v \in Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n})

.
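The iteration of Theorem 16 can be illustrated numerically. The following sketch assumes the KL-divergence and two hypothetical sets of the form {w : Σ_{j∈S} w_j = c}, whose KL-projections are available in closed form. It records the quantity Σ_k a_k D_f(π_k v^[i] ‖ v^[i]) along the way, which the proof shows to be non-increasing. All function names are invented for this sketch.

```python
import math


def kl(p, q):
    """KL(p || q) for a strictly positive q."""
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)


def project(v, event, c):
    """KL-projection of v into {w : w(event) = c}, assuming 0 < v(event) < 1."""
    mass = sum(v[j] for j in event)
    return [v[j] * c / mass if j in event else v[j] * (1 - c) / (1 - mass)
            for j in range(len(v))]


def step(v, sets, a):
    """One application of F: project into each set, then average linearly."""
    ws = [project(v, event, c) for event, c in sets]
    return [sum(a_k * w[j] for a_k, w in zip(a, ws)) for j in range(len(v))]


def gap(v, sets, a):
    """The monitored quantity sum_k a_k * KL(pi_k v || v)."""
    return sum(a_k * kl(project(v, event, c), v) for a_k, (event, c) in zip(a, sets))


def iterate(v, sets, a, steps=200):
    """Run the iteration and return the last iterate and the recorded gaps."""
    gaps = []
    for _ in range(steps):
        gaps.append(gap(v, sets, a))
        v = step(v, sets, a)
    return v, gaps


sets = [({0, 1}, 0.7), ({0, 2}, 0.4)]  # hypothetical W_1 and W_2
limit, gaps = iterate([0.25] * 4, sets, [0.5, 0.5])
print(limit)
print(all(later <= earlier + 1e-12 for earlier, later in zip(gaps, gaps[1:])))
```

In this example the two sets intersect, so the iterates approach a point satisfying both constraints; the sketch makes no claim about the speed of convergence in general.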

The following analogue of Lemma 4 will be needed in the forthcoming theorem.

Lemma 5. Let D_f be a convex differentiable Bregman divergence, which is strictly convex in its second argument. Assume that we are given a closed convex nonempty set W, v^[i] ∈ ⅅ^J, i = 1, 2,… and w^[i] ∈ ⅅ^J, i = 1, 2,…, such that w^[i] is the conjugated D_f-projection of v^[i] into W for all i = 1, 2,…. Assume that

{v^{[i]}}_{i = 1}^{\infty}

converges to v ∈ ⅅ^J and

{w^{[i]}}_{i = 1}^{\infty}

converges to w ∈ ⅅ^J. Then, w is the conjugated D_f-projection of v into W.

Proof. For a contradiction, assume that the conjugated D_f-projection of v into W denoted by

\bar{w}

is distinct from w. Then, by the four-point property,

D_{f} (v^{[i]} ‖ w^{[i]}) \leq D_{f} (v^{[i]} ‖ \bar{w}) + D_{f} (v^{[i]} ‖ v^{[i]})

. Since D_f(·‖·) is continuous, we have that:

\begin{array}{l} \lim_{i \to \infty} D_{f} (v^{[i]} ‖ w^{[i]}) = D_{f} (v ‖ w), \\ \lim_{i \to \infty} D_{f} (v^{[i]} ‖ \bar{w}) = D_{f} (v ‖ \bar{w}) \text{ and} \\ \lim_{i \to \infty} D_{f} (v^{[i]} ‖ v) = D_{f} (v ‖ v) = 0. \end{array}

Therefore:

D_{f} (v ‖ w) \leq D_{f} (v ‖ \bar{w})

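, which, since w ∈ W and the conjugated D_f-projection is unique (D_f is strictly convex in its second argument), contradicts the assumption that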

\bar{w}

is the conjugated D_f-projection of v into W. □

Theorem 17. Let D_f be a convex differentiable Bregman divergence, which is strictly convex in its second argument,

A

be a family of weighting vectors and a ∈

A

be such that a ∈ ⅅⁿ and W₁,…, W_n ⊆ ⅅ^J are closed, convex and nonempty. Then, for any v ∈ ⅅ^J, the sequence:

{v^{[i]}}_{i = 0}^{\infty},

where

v^{[0]} = v \text{ and } v^{[i + 1]} = {\hat{F}}_{[W_{1}, \dots, W_{n}]}^{D_{f}, A} (v^{[i]})

, converges to some probability function in

{\hat{Θ}}_{a}^{D_{f}} (W_{1}, \dots, W_{n})

. (Recall that

{\hat{Θ}}_{a}^{D_{f}} (W_{1}, \dots, W_{n})

is the set of the fixed points of

{\hat{F}}_{[W_{1}, \dots, W_{n}]}^{D_{f}, A}

, i.e., all points v, such that

{\hat{F}}_{[W_{1}, \dots, W_{n}]}^{D_{f}, A} (v) = v

).

Proof. Denote the conjugated D_f-projections of v^[i] into W₁,…, W_n by

π_{1} v^{[i]}, \dots, π_{n} v^{[i]}

, respectively. Then, it is easy to observe that:

\sum_{k = 1}^{n} a_{k} D_{f} (v^{[i]} ‖ π_{k} v^{[i]}) \geq \sum_{k = 1}^{n} a_{k} D_{f} (v^{[i + 1]} ‖ π_{k} v^{[i]}) \geq \sum_{k = 1}^{n} a_{k} D_{f} (v^{[i + 1]} ‖ π_{k} v^{[i + 1]}),

for all i = 1, 2,…. Due to the monotonicity of this sequence, the limit

\lim_{i \to \infty} \sum_{k = 1}^{n} a_{k} D_{f} (v^{[i]} ‖ π_{k} v^{[i]})

exists. Thanks to the compactness of W₁,…, W_n, the sequence

{(π_{1} v^{[i]}, \dots, π_{n} v^{[i]}, v^{[i]})}_{i = 1}^{\infty}

has a convergent subsequence. Let us denote the limit of this subsequence (π₁v,…, π_nv, v). Due to Lemma 5, π_kv is really the conjugated D_f-projection of v into W_k for all 1 ≤ k ≤ n. Moreover:

\lim_{_{i \to \infty}} \sum_{k = 1}^{n} a_{k} D_{f} (v^{[i]} ‖ π_{k} v^{[i]}) = \sum_{k = 1}^{n} a_{k} D_{f} (v ‖ π_{k} v) .

By the four-point property:

\sum_{k = 1}^{n} a_{k} D_{f} (v ‖ π_{k} v^{[i]}) \leq \sum_{k = 1}^{n} a_{k} D_{f} (v ‖ π_{k} v) + D_{f} (v ‖ v^{[i]}) .

(11)

Moreover, by Theorem 6:

\sum_{k = 1}^{n} a_{k} D_{f} (v^{[i + 1]} ‖ π_{k} v^{[i]}) + D_{f} (v ‖ v^{[i + 1]}) = \sum_{k = 1}^{n} a_{k} D_{f} (v ‖ π_{k} v^{[i]}) .

(12)

That is because

v^{[i + 1]} = {Pool}_{a}^{D_{f}} (π_{1} v^{[i]}, \dots, π_{n} v^{[i]})

. An illustration of the situation is depicted in Figure 12.

Now, since:

\lim_{i \to \infty} \sum_{k = 1}^{n} a_{k} D_{f} (v^{[i + 1]} ‖ π_{k} v^{[i]}) = \sum_{k = 1}^{n} a_{k} D_{f} (v ‖ π_{k} v)

and

\sum_{k = 1}^{n} a_{k} D_{f} (v^{[i + 1]} ‖ π_{k} v^{[i]}) \geq \sum_{k = 1}^{n} a_{k} D_{f} (v ‖ π_{k} v)

for all i = 1, 2,…, the expressions (11) and (12) give that:

D_{f} (v ‖ v^{[i + 1]}) \leq D_{f} (v ‖ v^{[i]})

(13)

for all i = 1, 2,…. We conclude that this is possible only if:

\lim_{i \to \infty} D_{f} (v ‖ v^{[i]})

exists.

However, we already know that a subsequence of

{v^{[i]}}_{i = 1}^{\infty}

converges to v; hence, a subsequence of the sequence

{D_{f} (v ‖ v^{[i]})}_{i = 1}^{\infty}

decreases to zero, which, by Equation (13), forces the whole sequence to converge to zero. Since D_f(x‖y) = 0 only if x = y, continuity gives:

\lim_{i \to \infty} v^{[i]} = v

and, subsequently,

\lim_{i \to \infty} π_{k} v^{[i]} = π_{k} v, 1 \leq k \leq n

(the subsequence of

{π_{k} v^{[i]}}_{i = 1}^{\infty}

has π_kv as a limit, and

{D_{f} (v^{[i]} ‖ π_{k} v^{[i]})}_{i = 1}^{\infty}

is monotonic).

Moreover,

v = \lim_{i \to \infty} v^{[i + 1]} = \lim_{i \to \infty} {Pool}_{a}^{D_{f}} (π_{1} v^{[i]}, \dots, π_{n} v^{[i]}) = {Pool}_{a}^{D_{f}} (π_{1} v, \dots, π_{n} v)

since

{Pool}_{a}^{D_{f}}

is continuous

(\sum_{k = 1}^{n} a_{k} D_{f} (\cdot ‖ \cdot)

is continuous and strictly convex in the first argument). Therefore, v is a fixed point of the mapping

{\hat{F}}_{[W_{1}, \dots, W_{n}]}^{D_{f}, A}

, and hence,

v \in {\hat{Θ}}_{a}^{D_{f}} (W_{1}, \dots, W_{n})

. □
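A numerical counterpart of Theorem 17 is sketched below under the same kind of assumptions as before: the divergence is the KL-divergence, the pooling operator is assumed to be the normalised weighted geometric mean, and the sets are hypothetical sets {w : Σ_{j∈S} w_j = c}. For such sets the conjugated KL-projection (the minimizer of KL(v‖w) over w) also rescales v inside and outside S. The function names are invented for illustration.

```python
import math


def kl(p, q):
    """KL(p || q) for a strictly positive q."""
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)


def conj_project(v, event, c):
    """Conjugated KL-projection of v into {w : w(event) = c}."""
    mass = sum(v[j] for j in event)
    return [v[j] * c / mass if j in event else v[j] * (1 - c) / (1 - mass)
            for j in range(len(v))]


def log_pool(a, ws):
    """Normalised weighted geometric mean, assumed here to be Pool_a^KL."""
    raw = [math.exp(sum(a_k * math.log(w[j]) for a_k, w in zip(a, ws)))
           for j in range(len(ws[0]))]
    total = sum(raw)
    return [x / total for x in raw]


def step_hat(v, sets, a):
    """One application of F-hat: conjugated projections, then pooling."""
    return log_pool(a, [conj_project(v, event, c) for event, c in sets])


def gap_hat(v, sets, a):
    """The monitored quantity sum_k a_k * KL(v || pi_k v)."""
    return sum(a_k * kl(v, conj_project(v, event, c)) for a_k, (event, c) in zip(a, sets))


def iterate_hat(v, sets, a, steps=200):
    """Run the iteration and return the last iterate and the recorded gaps."""
    gaps = []
    for _ in range(steps):
        gaps.append(gap_hat(v, sets, a))
        v = step_hat(v, sets, a)
    return v, gaps


sets = [({0, 1}, 0.7), ({0, 2}, 0.4)]  # hypothetical W_1 and W_2
limit, gaps = iterate_hat([0.25] * 4, sets, [0.5, 0.5])
print(limit)
```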

The problem of characterizing the limits of Theorems 16 and 17 more precisely remains open. On the other hand, the theorems suggest a way to compute at least some points in

Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n})

and

{\hat{Θ}}_{a}^{D_{f}} (W_{1}, \dots, W_{n})

, although we have not investigated how fast the sequences converge. The question of how efficiently D_f-projections and conjugated D_f-projections can be computed was also left unanswered. This latter problem has nevertheless been addressed in the literature, at least in the case of the KL-divergence and sets W₁,…, W_n generated by finite collections of marginal probability functions. In such a case, the well-known iterative proportional fitting procedure (IPFP) can be effectively employed [16].

3.2. Chairmen Theorems

In this section, for a convex differentiable Bregman divergence D_f, which is strictly convex in its second argument, and a family of weighting vectors

A

, we investigate the susceptibility of

Θ_{A}^{D_{f}}

and

{\hat{Θ}}_{A}^{D_{f}}

-merging operators to a small bias by an arbitrary probability function in ⅅ^J. This problem was first studied in [18]. There, Wilmers considered an independent adjudicator whose only knowledge consists of what is related to him by the given college of experts. He argued that such an adjudicator can rationally bias the agreement procedure by including himself as an additional expert, whose personal probability function is the uniform one (not an arbitrary one), in order to calculate a single social probability function, and then find what happens to this social probability function as his contribution becomes infinitesimally small relative to that of the other experts. He showed that in the case of the

{\hat{Θ}}_{N}^{KL}

-merging operator, this point of agreement is characterized by the most entropic point in the region defined by

{\hat{Θ}}_{N}^{KL}

. A similar theorem for the

Θ_{N}^{KL}

-merging operator was proven in [10]. In what follows, we adapt these results to our general situation.

The following theorem will tell us that, in some particular case of W₁,…, W_n ⊆ ⅅ^J, we can always tell that the set

Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n})

is a singleton.

Theorem 18. Let W₁,…, W_n ⊆ ⅅ^J be closed, convex, nonempty and such that, for at least one i, W_i is a singleton. Let D_f be a convex Bregman divergence, which is strictly convex in its second argument, and a ∈ ⅅⁿ. Then,

Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n})

is a singleton.

Proof. Without loss of generality, assume that W₁ = {v}. For a contradiction, suppose that w, r ∈

Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n})

and w ≠ r. Denote w⁽²⁾,…, w⁽ⁿ⁾ the D_f-projections of w into W₂,…, W_n, respectively, and r⁽²⁾,…, r⁽ⁿ⁾ the D_f-projections of r into W₂,…, W_n, respectively. By definition, w = LinOp_a(v, w⁽²⁾,…, w⁽ⁿ⁾) and r = LinOp_a(v, r⁽²⁾,…, r⁽ⁿ⁾).

Now, consider x = λw + (1 − λ)r for some λ ∈ (0, 1). By Theorems 9 and 12, we have that

x \in Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n})

. Since D_f(·‖·) is a convex function, by the Jensen inequality, we have that:

\begin{matrix} a_{1} D_{f} (v ‖ x) + \sum_{i = 2}^{n} a_{i} D_{f} (λ w^{(i)} + (1 - λ) r^{(i)} ‖ x) \leq \\ \leq λ (a_{1} D_{f} (v ‖ w) + \sum_{i = 2}^{n} a_{i} D_{f} (w^{(i)} ‖ w)) + (1 - λ) (a_{1} D_{f} (v ‖ r) + \sum_{i = 2}^{n} a_{i} D_{f} (r^{(i)} ‖ r)) . \end{matrix}

(14)

However, since

w, r, x \in Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n})

and λw⁽ⁱ⁾+ (1 − λ)r⁽ⁱ⁾ ∈ W_i, 1 ≤ i ≤ n, the above is possible only with the equality.

On the other hand, since D_f is strictly convex in its second argument, the following Jensen inequality is strict:

D_{f} (v ‖ x) < λ D_{f} (v ‖ w) + (1 - λ) D_{f} (v ‖ r) .

Note that the border points λ = 0, 1 are excluded. Therefore, Equation (14) yields:

\begin{matrix} \sum_{i = 2}^{n} a_{i} D_{f} (λ w^{(i)} + (1 - λ) r^{(i)} ‖ x) > \\ > λ (\sum_{i = 2}^{n} a_{i} D_{f} (w^{(i)} ‖ w)) + (1 - λ) (\sum_{i = 2}^{n} a_{i} D_{f} (r^{(i)} ‖ r)) . \end{matrix}

However, this contradicts the Jensen inequality.

Theorem 19 (Chairman Theorem for

Θ_{A}^{D_{f}}

). Let I ⊆ ⅅ^J be a singleton consisting of an arbitrary probability function t ∈ ⅅ^J. Let W₁,…, W_n ⊆ ⅅ^J be closed, convex and nonempty, a ∈

A

be such that a ∈ ⅅⁿ and D_f be a convex Bregman divergence, which is strictly convex in its second argument. For 1 > λ > 0, define (by the previous theorem, the following set is a singleton):

{v^{[λ]}} = Θ_{(λ, a_{1} - λ a_{1}, \dots, a_{n} - λ a_{n})}^{D_{f}} (I, W_{1}, \dots, W_{n}) .

Then,

\lim_{λ ↘ 0} v^{[λ]}

exists and equals

\arg min_{v \in Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n})} D_{f} (t ‖ v),

i.e., it equals the conjugated D_f-projection of the probability function t into

Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n})

.

Proof. This proof is inspired by [30], where a slightly stronger result is proven for the special case of

Θ_{N}^{D_{f}}

. We note that Theorem 9 from Section 2.3 is implicitly used in what follows.

First, denote

M_{a}^{D_{f}} (W_{1}, \dots, W_{n})

as the minimal value of:

\sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ v)

subject to w⁽ⁱ⁾ ∈ W_i, 1 ≤ i ≤ n and v ∈ ⅅ^J. Furthermore, we denote E_λ as the minimal value of:

(1 - λ) [\sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ v) - M_{a}^{D_{f}} (W_{1}, \dots, W_{n})] + λ D_{f} (t ‖ v)

(15)

subject to w⁽ⁱ⁾ ∈ W_i, 1 ≤ i ≤ n and v ∈ ⅅ^J. By the definition of

M_{a}^{D_{f}} (W_{1}, \dots, W_{n})

, we have that 0 ≤ E_λ for all 1 > λ > 0.

Note that for a fixed λ, if v ∈ ⅅ^J globally minimizes Equation (15) subject to w⁽ⁱ⁾ ∈ W_i, 1 ≤ i ≤ n, then

v \in Θ_{(λ, a_{1} - λ a_{1}, \dots, a_{n} - λ a_{n})}^{D_{f}} (I, W_{1}, \dots, W_{n})

(by Theorem 18, such a v is unique), and conversely, if

v \in Θ_{(λ, a_{1} - λ a_{1}, \dots, a_{n} - λ a_{n})}^{D_{f}} (I, W_{1}, \dots, W_{n})

, then v minimizes Equation (15), subject to the above constraints.

Now, let

r = \arg min_{v \in Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n})} D_{f} (t ‖ v)

. Since

r \in Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n})

, it follows that for all 1 > λ > 0, we have that:

E_{λ} \leq D_{f} (t ‖ r) .

(16)

Since ⅅ^J ⊆ ℝ^J is a compact space, there exists a sequence

{λ_{m}}_{m = 1}^{\infty}, 0 < λ_{m} < 1, \lim_{m \to \infty} λ_{m} = 0

, such that

{v^{[λ_{m}]}}_{m = 1}^{\infty}

converges. Let w^(i)[λ_m] be the D_f-projection of v^[λ_m] into W_i for all 1 ≤ i ≤ n and m = 1, 2,…. By Equation (16), the sequence:

{\sum_{i = 1}^{n} a_{i} D_{f} (w^{(i) [λ_{m}]} ‖ v^{[λ_{m}]})}_{m = 1}^{\infty}

converges to

M_{a}^{D_{f}} (W_{1}, \dots, W_{n})

.

Note that we already know that

\lim_{m \to \infty} v^{[λ_{m}]}

exists, and we denote it by v. However, we do not know whether the same is true for

\lim_{m \to \infty} w^{(i) [λ_{m}]}

, 1 ≤ i ≤ n. On the other hand, since W₁,…, W_n are compact, the considered sequences have convergent subsequences. Let us denote the corresponding limits w⁽¹⁾,…, w⁽ⁿ⁾. Since D_f(·‖·) is a continuous function in both variables, the value of

\sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ v)

must be equal to

M_{a}^{D_{f}} (W_{1}, \dots, W_{n})

. However, this means that we have found a global minimizer (w⁽¹⁾,…, w⁽ⁿ⁾, v) of

\sum_{i = 1}^{n} a_{i} D_{f} (w^{(i)} ‖ v)

subject to w⁽ⁱ⁾ ∈ W_i, 1 ≤ i ≤ n, and v ∈ ⅅ^J.

It follows that

v = \lim_{m \to \infty} v^{[λ_{m}]} \in Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n})

. By Equation (16):

0 \leq (1 - λ_{m}) [\sum_{i = 1}^{n} a_{i} D_{f} (w^{(i) [λ_{m}]} ‖ v^{[λ_{m}]}) - M_{a}^{D_{f}} (W_{1}, \dots, W_{n})] + λ_{m} D_{f} (t ‖ v^{[λ_{m}]}) \leq λ_{m} D_{f} (t ‖ r) .

Hence,

0 \leq λ_{m} [D_{f} (t ‖ r) - D_{f} (t ‖ v^{[λ_{m}]})]

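for all m = 1, 2,…, i.e., D_f(t‖v^[λ_m]) ≤ D_f(t‖r). By continuity, D_f(t‖v) ≤ D_f(t‖r). Since v ∈ Θ_a^{D_f}(W₁,…, W_n), by the definition of r as the minimizer of D_f(t‖·) over this set, this is possible only if r = v.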

In fact, we have proven that for every sequence

{λ_{m}}_{m = 1}^{\infty}

, such that lim_m→∞ λ_m = 0 and

{v^{[λ_{m}]}}_{m = 1}^{\infty}

is convergent,

{v^{[λ_{m}]}}_{m = 1}^{\infty}

must converge to r. Now, for a contradiction, assume that there is a sequence

{λ_{m}}_{m = 1}^{\infty}

, such that lim_m→∞ λ_m = 0, but {v^[λ_m]}_{m=1}^∞ does not converge to r. Then, there is an open neighborhood of the point r outside of which there are an infinite number of the members of the sequence

{v^{[λ_{m}]}}_{m = 1}^{\infty}

. Since ⅅ^J is compact, these members must have a convergent subsequence with a limit distinct from r. That, however, contradicts our previous claim.

The theorem above is illustrated in Figure 13. Indeed, if

Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n})

is a singleton, then the limit in the theorem above is obvious. By Theorem 18, this happens in particular when at least one of W₁,…, W_n is a singleton. A more interesting case arises when W₁,…, W_n have a nonempty intersection that is not a singleton. In this case, the limit above is, in fact, the conjugated D_f-projection of the probability function t into that intersection. Such a conjugated projection depends on t. In particular, we can recover any point in the intersection by setting it to be the point t.

The following analogue of Theorem 18 has a fairly similar proof.

Theorem 20. Let W₁,…, W_n ⊆ ⅅ^J be closed, convex, nonempty and such that, for at least one i, W_i is a singleton. Let D_f be a convex Bregman divergence, which is strictly convex in its second argument, and a ∈ ⅅⁿ. Then,

{\hat{Θ}}_{a}^{D_{f}} (W_{1}, \dots, W_{n})

is a singleton.

Theorem 21 (Chairman Theorem for

{\hat{Θ}}_{A}^{D_{f}}

). Let I ⊆ ⅅ^J be a singleton consisting of an arbitrary probability function t ∈ ⅅ^J. Let W₁,…, W_n ⊆ ⅅ^J be closed, convex and nonempty, a ∈

A

be such that a ∈ ⅅⁿ and D_f be a convex differentiable Bregman divergence, which is strictly convex in its second argument. For 1 > λ > 0, define:

{v^{[λ]}} = {\hat{Θ}}_{(λ, a_{1} - λ a_{1}, \dots, a_{n} - λ a_{n})}^{D_{f}} (I, W_{1}, \dots, W_{n}) .

Then,

\lim_{λ ↘ 0} v^{[λ]}

exists and equals:

\arg min_{v \in {\hat{Θ}}_{a}^{D_{f}} (W_{1}, \dots, W_{n})} D_{f} (v ‖ t),

i.e., it equals the D_f-projection of the probability function t into

{\hat{Θ}}_{a}^{D_{f}} (W_{1}, \dots, W_{n})

.

The proof is analogous to the one of Theorem 19, so we omit it.

4. Applications

4.1. Relationship to Inference Processes

In this subsection, we will discuss some striking relationships between the chairmen theorems and the framework of inference processes [26]. Inference processes are methods of reasoning by which an expert may select a single probability function from a nonempty closed convex set of possible options. In our framework, it is simply a problem of choosing a single probability function in a closed convex nonempty set W ⊆ ⅅ^J. This selection is, however, not arbitrary, and it is expected to satisfy some rational principles based on symmetry and consistency, as discussed in [15]. The maximum entropy (ME) inference process, which chooses the most entropic point in a given closed convex nonempty set, is uniquely justified by a list of such principles, as Paris and Vencovská showed [15].

As discussed in Section 1.2, the most entropic point in a closed convex nonempty set W ⊆ ⅅ^J coincides with the KL-projection of the uniform probability function into W. This can be immediately applied to the chairman theorem for

{\hat{Θ}}_{A}^{KL}

, where

A

is a family of weighting vectors:

Let I ⊆ ⅅ^J be a singleton consisting of the uniform probability function t ∈ ⅅ^J. Let W₁,…, W_n ⊆ ⅅ^J be closed, convex and nonempty and a ∈

A

be such that a ∈ ⅅⁿ. For 1 > λ > 0, define:

{v^{[λ]}} = {\hat{Θ}}_{(λ, a_{1} - λ a_{1}, \dots, a_{n} - λ a_{n})}^{KL} (I, W_{1}, \dots, W_{n}) .

Then:

\lim_{λ ↘ 0} v^{[λ]} = ME ({\hat{Θ}}_{a}^{KL} (W_{1}, \dots, W_{n})) .

For the family of weighting vectors:

N = {(\underbrace{\frac{1}{n}, \dots, \frac{1}{n}}_{n}) : n = 1, 2, \dots}

the operator that results by applying the ME-inference process to the operator

{\hat{Θ}}_{N}^{KL}

is, in fact, a probabilistic merging operator, which was introduced and studied by Wilmers in [18] under the name “social entropy process”, or SEP for short. In that paper, Wilmers argues that this merging operator is, to date, the most appealing with respect to symmetry and consistency, somewhat in the spirit of the original justification for the ME-inference process, although the problem of finding a complete justification is still open.

Whether SEP will turn out to be the most appealing probabilistic merging operator or not, by the same manner as above, we can define several probabilistic merging operators related to several other classical inference processes.

For example, the conjugated KL-projection of the uniform probability function into a closed convex nonempty set W ⊆ ⅅ^J in fact generates the so-called CM^∞-inference process (a limit version of the central mass process [26]). We write simply CM^∞(W ) to denote the point of the projection, which is explicitly given by:

{CM}^{\infty} (W) = \arg min_{w \in W} - \sum_{j = 1}^{J} \log w_{j} .
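As a small illustration of this formula, the sketch below computes CM^∞(W) for a hypothetical W that is a line segment between two strictly positive probability functions. Because the objective is convex along the segment, a ternary search is sufficient here; this choice of W and of the search method is an assumption made for the example, not a general-purpose procedure.

```python
import math


def cm_objective(w):
    """The objective -sum_j log w_j minimised by the CM-infinity process."""
    return -sum(math.log(wj) for wj in w)


def cm_infinity_on_segment(p, q, iterations=200):
    """Minimise the objective over W = {(1 - s) p + s q : 0 <= s <= 1}.

    Assumes p and q are strictly positive probability functions, so the
    objective is finite and convex in s on the whole segment.
    """
    def point(s):
        return [(1 - s) * pj + s * qj for pj, qj in zip(p, q)]

    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if cm_objective(point(m1)) <= cm_objective(point(m2)):
            hi = m2  # a minimiser lies in [lo, m2]
        else:
            lo = m1  # a minimiser lies in [m1, hi]
    return point((lo + hi) / 2)


print(cm_infinity_on_segment([0.1, 0.9], [0.9, 0.1]))
```

For the symmetric segment used here, the minimizer is the uniform probability function, which agrees with the description of CM^∞(W) as the conjugated KL-projection of the uniform probability function into W.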

The chairman theorem for

Θ_{N}^{KL}

then suggests considering the probabilistic merging operator defined for every n ≥ 1 and all closed convex nonempty sets W₁,…, W_n ⊆ ⅅ^J by:

coSEP (W_{1}, \dots, W_{n}) = {{CM}^{\infty} (Θ_{a}^{KL} (W_{1}, \dots, W_{n}))},

where a ∈ ⅅⁿ and a ∈

N

. We will call this operator the conjugated social entropy process coSEP.

What is really appealing about the operators SEP and coSEP is that their results are always singletons; we simply say that they satisfy the singleton principle (SP). Furthermore, the consistency principle (CP) is obviously satisfied by both of them. However, there is an interesting principle that can never be satisfied by a probabilistic merging operator that satisfies (CP) and always yields a singleton: the disagreement principle introduced in [5].

(DP) Disagreement Principle. Let Δ be a probabilistic merging operator. Then, we say that Δ satisfies the disagreement principle if, for every n, m ≥ 1 and all W₁,…, W_n ⊆ ⅅ^J and V₁,…, V_m ⊆ ⅅ^J:

Δ (W_{1}, \dots, W_{n}) \cap Δ (V_{1}, \dots, V_{m}) = \emptyset

implies:

Δ (W_{1}, \dots, W_{n}, V_{1}, \dots, V_{m}) \cap Δ (W_{1}, \dots, W_{n}) = \emptyset .

We cite [5] on the desirability of this principle: the principle (informally) says “… that a consistent group who disagrees with another group and then merges with them can be sure that they have influenced the opinions of the combined group.”

Theorem 22. There is no probabilistic merging operator that satisfies all (SP), (CP) and (DP).

Proof. Let Δ be a probabilistic merging operator. Assume that V ⊊ W ⊆ ⅅ^J and that V is a singleton. Suppose that Δ(W) ≠ V = Δ(V). Then, by (CP), Δ(W, V) = V, which contradicts (DP).

Theorem 23. The probabilistic merging operators

Θ_{N}^{D_{f}}

and

{\hat{Θ}}_{N}^{D_{f}}

, where D_f is a convex Bregman divergence for the former and is additionally differentiable and strictly convex in its second argument for the latter, satisfy (DP).

Proof. We prove the theorem only for

{\hat{Θ}}_{N}^{D_{f}}

. The proof for

Θ_{N}^{D_{f}}

is similar.

Let W₁,…, W_n, V₁,…, V_m ⊆ ⅅ^J be closed convex and nonempty. For a contradiction, assume that

v \in {\hat{Θ}}_{(\frac{1}{n}, \dots, \frac{1}{n})}^{D_{f}} (W_{1}, \dots, W_{n})

,

v \in {\hat{Θ}}_{(\frac{1}{n + m}, \dots, \frac{1}{n + m})}^{D_{f}} (W_{1}, \dots, W_{n}, V_{1}, \dots, V_{m})

and, at the same time,

v \notin {\hat{Θ}}_{(\frac{1}{m}, \dots, \frac{1}{m})}^{D_{f}} (V_{1}, \dots, V_{m})

.

Denote v⁽ⁱ⁾ the conjugated D_f-projection of v into V_i, 1 ≤ i ≤ m. Then, there is u ∈ ⅅ^J, such that

u = {Pool}_{(\frac{1}{m}, \dots, \frac{1}{m})}^{D_{f}} (v^{(1)}, \dots, v^{(m)})

, i.e.,

\sum_{i = 1}^{m} \frac{1}{m} D_{f} (v ‖ v^{(i)}) > \sum_{i = 1}^{m} \frac{1}{m} D_{f} (u ‖ v^{(i)})

. Since every Bregman divergence is strictly convex in its first argument, we have that:

\frac{\partial}{\partial λ} [\sum_{i = 1}^{m} D_{f} ((1 - λ) v + λ u ‖ v^{(i)})] |_{λ = 0} < 0.

(17)

Now, denote w⁽ⁱ⁾ the conjugated D_f-projection of v into W_i, 1 ≤ i ≤ n. Since

v = {Pool}_{(\frac{1}{n + m}, \dots, \frac{1}{n + m})}^{D_{f}} (w^{(1)}, \dots, w^{(n)}, v^{(1)}, \dots, v^{(m)})

and

v = {Pool}_{(\frac{1}{n}, \dots, \frac{1}{n})}^{D_{f}} (w^{(1)}, \dots, w^{(n)})

, the strict convexity of Bregman divergences in their first argument gives also:

\frac{\partial}{\partial λ} [\sum_{i = 1}^{n} D_{f} ((1 - λ) v + λ u ‖ w^{(i)}) + \sum_{i = 1}^{m} D_{f} ((1 - λ) v + λ u ‖ v^{(i)})] |_{λ = 0} \geq 0

and:

\frac{\partial}{\partial λ} [\sum_{i = 1}^{n} D_{f} ((1 - λ) v + λ u ‖ w^{(i)})] |_{λ = 0} \geq 0.

However, this contradicts Equation (17). □

We can conclude that, before deciding which probabilistic merging operator to use, we need to establish which two of the three properties we want the operator to satisfy. In this paper, we have seen instances of all three options, as listed in Table 1.

Recall that KIRP is the operator due to Kern-Isberner and Röder and OSEP is the obdurate social entropy process; see Section 2.2 for more details. A proof that KIRP and OSEP satisfy (DP) can be easily obtained as a modification of the proof of Theorem 23, so we omit it.

4.2. Computability

In this subsection, we would like to propose a method corresponding to the classical method of projection, but in the multi-expert context. The possible use could be similar; if the knowledge of a college of experts could be characterized by a closed convex nonempty set of probability functions, then we would like to find such a probability function in that set that is “closest” to a given piece of information represented by another probability function. We only need to specify a way to represent the knowledge of the college by such a single set and pair it with an appropriate method of projection.

Throughout this subsection, assume that we are given closed convex nonempty sets of probability functions W₁,…, W_n ⊆ ⅅ^J with weighting a ∈

A

, where a_i is the weight of W_i and a probability function v ∈ ⅅ^J to represent.

If the measure of “being close” is quantified by a projection by means of a convex differentiable Bregman divergence D_f, which is strictly convex in its second argument, our proposed method consists of the following. First, represent W₁,…, W_n by a single, closed and convex set

{\hat{Θ}}_{a}^{D_{f}} (W_{1}, \dots, W_{n})

, and then, take the D_f-projection of v into

{\hat{Θ}}_{a}^{D_{f}} (W_{1}, \dots, W_{n})

.

On the other hand, if the measure of “being close” is quantified by a conjugated projection by means of a convex differentiable Bregman divergence D_f, which is strictly convex in its second argument, we first represent W₁,…, W_n by a single, closed convex set

Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n})

and then take the conjugated D_f-projection of v into

Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n})

.

The methods have two distinguishing features:

If all of the sets W₁,…, W_n are singletons, the methods reduce to ${Pool}_{A}^{D_{f}}$ and ${LinOp}_{A}$ -pooling operators respectively.
If W₁,…, W_n have a nonempty intersection V, they reduce to D_f and conjugated D_f-projections into V, respectively.

In this subsection, we shall investigate how effective it is to compute the results of those two methods. Notice that SEP and coSEP, defined in Section 4.1, are specific instances of those procedures, respectively, in which case, we are interested in KL-projections and conjugated KL-projections of the uniform probability function.

There are indeed some serious computational issues. The most essential is the following. A closed convex nonempty set W ⊆ ⅅ^J is often given by a set of constraints on ⅅ^J. How can we effectively verify that the resulting set W is nonempty? Unfortunately, it is not even possible to find a random Turing machine running in polynomial time that upon input given by a set of constraints on probability functions verifies the consistency of this set of constraints (given that the problems solvable in a randomized polynomial time cannot be solved in a polynomial time); see Theorem 10.7 of [26].

However, some computational problems closely related to projections have been extensively studied in the literature. As we have noted in Section 3.1, this includes procedures for finding a KL-projection to a closed convex set of probability functions. These show that in many particular practical implementations, the problem of intractability does not arise, e.g., as in the case when given closed convex nonempty sets are generated by marginal probability functions and where the IPFP-procedure can be applied to effectively find a KL-projection; see [16]. Therefore, we will assume that some effective procedures for D_f-projections and conjugated D_f-projections are given.

Under this assumption, the iterative processes from Section 3.1 and the Chairmen theorems offer a way to compute (although possibly ineffectively) the results of the two methods above. We shall start with the latter.

By Theorem 16, we know that the sequence:

{v^{[i]}}_{i = 0}^{\infty},

where v^[0] = t is arbitrary in ⅅ^J and

v^{[i + 1]} = F_{[W_{1}, \dots, W_{n}]}^{D_{f}, A} (v^{[i]})

, converges to some probability function in

Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n})

. Notice that D_f is required to be differentiable in order to establish this conclusion.

Recall that by Theorem 18,

Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n})

is a singleton when at least one of W₁,…, W_n is a singleton. Let v ∈ ⅅ^J and let I = {v}. For every 1 > λ > 0, we define the sequence

{v_{[λ]}^{[i]}}_{i = 0}^{\infty}

by

{v_{[λ]}^{[0]}} = t

(t can be arbitrary) and:

v_{[λ]}^{[i + 1]} = F_{[I, W_{1}, \dots, W_{n}]}^{D_{f}, (λ, a_{1} - λ a_{1}, \dots, a_{n} - λ a_{n})} (v_{[λ]}^{[i]}) .

By Theorem 16:

{\lim_{i \to \infty} v_{[λ]}^{[i]}} = Θ_{(λ, a_{1} - λ a_{1}, \dots, a_{n} - λ a_{n})}^{D_{f}} (I, W_{1}, \dots, W_{n}) .

By the chairman theorem for

Θ_{A}^{D_{f}}

:

\lim_{λ ↘ 0} \lim_{i \to \infty} v_{[λ]}^{[i]} = \arg \min_{w \in Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n})} D_{f} (v ‖ w)

(18)

i.e., equals the conjugated D_f-projection of the probability function v into

Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n})

.

Now, notice that if the limits in Equation (18) were interchangeable, then this would offer an answer to the question from Section 3.1 of how to closely characterize the limit lim_{i→∞} v^[i] (but with no claims to any theoretical results on the complexity of the computation). Unfortunately, the following simple example introduced in [10] shows that these limits are not interchangeable.

Example 6. Let

J = 4, W_{1} = {(x, \frac{1}{4} - x, y, \frac{3}{4} - y), x \in [0.01, \frac{1}{4} - 0.01], y \in [0.01, \frac{3}{4} - 0.01]}

and

W_{2} = {(x, y, \frac{1}{4} - x, \frac{3}{4} - y), x \in [0.01, \frac{1}{4} - 0.01], y \in [0.01, \frac{3}{4} - 0.01]}

. Assume that the weighting is

N

, D_f = KL and the probability function v ∈ ⅅ⁴ to interpret is the uniform probability function. In other words, we are looking for coSEP(W₁, W₂).

Then, the members of the sequence

{v^{[i]}}_{i = 0}^{\infty}

can be computed by two minimization problems: find

x \in [0.01, \frac{1}{4} - 0.01]

and

y \in [0.01, \frac{3}{4} - 0.01]

that minimize:

x \log \frac{x}{v_{1}^{[i]}} + (\frac{1}{4} - x) \log \frac{\frac{1}{4} - x}{v_{2}^{[i]}} + y \log \frac{y}{v_{3}^{[i]}} + (\frac{3}{4} - y) \log \frac{\frac{3}{4} - y}{v_{4}^{[i]}}

and another couple

\bar{x} \in [0.01, \frac{1}{4} - 0.01]

and

\bar{y} \in [0.01, \frac{3}{4} - 0.01]

that minimize:

\bar{x} \log \frac{\bar{x}}{v_{1}^{[i]}} + \bar{y} \log \frac{\bar{y}}{v_{2}^{[i]}} + (\frac{1}{4} - \bar{x}) \log \frac{\frac{1}{4} - \bar{x}}{v_{3}^{[i]}} + (\frac{3}{4} - \bar{y}) \log \frac{\frac{3}{4} - \bar{y}}{v_{4}^{[i]}} .

Then,

v_{1}^{[i + 1]} = \frac{x + \bar{x}}{2}, v_{2}^{[i + 1]} = \frac{\frac{1}{4} - x + \bar{y}}{2}, v_{3}^{[i + 1]} = \frac{\frac{1}{4} - \bar{x} + y}{2}

and

v_{4}^{[i + 1]} = \frac{\frac{3}{2} - \bar{y} - y}{2} .

After setting

v^{[0]} = (\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4})

it turns out that in each iteration,

\bar{x} = x

and

\bar{y} = y

.

After performing the numerical computation for the first one hundred iterations, we obtain:

{v^{[100]}} \approx (0.0488395, 0.2011605, 0.2011605, 0.5488394) .
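The recursion of Example 6 can be reproduced with a few lines of code. The following is only a minimal sketch, not the computation used in [10]: it assumes that each one-dimensional minimization is solved in closed form (the unconstrained minimizer of x log(x/a) + (c − x) log((c − x)/b) is c·a/(a + b), clipped to the admissible interval because the objective is convex), and the helper names `kl_split`, `step` and `iterate` are hypothetical.

```python
def kl_split(total, a, b, lo, hi):
    """Minimise x*log(x/a) + (total - x)*log((total - x)/b) over x in [lo, hi].

    The unconstrained minimiser is total*a/(a + b); since the objective is
    convex in x, clipping to [lo, hi] gives the constrained minimiser.
    """
    return min(max(total * a / (a + b), lo), hi)


def step(v):
    """One step v^[i] -> v^[i+1] of Example 6 (equal weights on W1 and W2)."""
    v1, v2, v3, v4 = v
    # KL-projection into W1 = {(x, 1/4 - x, y, 3/4 - y)}
    x = kl_split(0.25, v1, v2, 0.01, 0.24)
    y = kl_split(0.75, v3, v4, 0.01, 0.74)
    # KL-projection into W2 = {(xb, yb, 1/4 - xb, 3/4 - yb)}
    xb = kl_split(0.25, v1, v3, 0.01, 0.24)
    yb = kl_split(0.75, v2, v4, 0.01, 0.74)
    # Arithmetic mean of the two projections
    return ((x + xb) / 2,
            (0.25 - x + yb) / 2,
            (0.25 - xb + y) / 2,
            (1.5 - y - yb) / 2)


def iterate(v, n):
    """Apply `step` n times starting from v."""
    for _ in range(n):
        v = step(v)
    return v


v100 = iterate((0.25, 0.25, 0.25, 0.25), 100)
print(v100)
```

Under these assumptions, the printed vector can be compared directly with the value of v^[100] reported above.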

The rate of convergence for the first coordinate of probability functions is depicted in Figure 14 by the bottom red line.

However, since W₁ and W₂ are jointly consistent, we have that:

Θ_{(\frac{1}{2}, \frac{1}{2})}^{D_{f}} (W_{1}, W_{2}) = W_{1} \cap W_{2} = {(x, \frac{1}{4} - x, \frac{1}{4} - x, \frac{1}{2} + x), x \in [0.01, \frac{0.96}{4}]} .

We compute that CM^∞(W₁ ∩ W₂) (the conjugated KL-projection of the uniform probability function) is approximately:

(0.091506, 0.15849, 0.15849, 0.5915),

which is obviously not equal to the limit of the sequence

{v^{[i]}}_{i = 0}^{\infty}

.
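The value of CM^∞(W₁ ∩ W₂) quoted in Example 6 can be cross-checked by hand. The following is a sketch under the assumption that the interval constraint on x is inactive (the result confirms this). For the uniform v, the divergence KL(v ‖ w) differs from −(1/4) Σ_j log w_j only by a constant, so the conjugated KL-projection into W₁ ∩ W₂ maximizes the sum of the logarithms of the coordinates:

```latex
\begin{aligned}
&\text{maximise } \log x + 2 \log\bigl(\tfrac{1}{4} - x\bigr) + \log\bigl(\tfrac{1}{2} + x\bigr),
  \qquad x \in [0.01, \tfrac{0.96}{4}], \\
&\frac{1}{x} - \frac{2}{\tfrac{1}{4} - x} + \frac{1}{\tfrac{1}{2} + x} = 0
  \;\Longleftrightarrow\; 4x^{2} + x - \tfrac{1}{8} = 0
  \;\Longrightarrow\; x = \frac{\sqrt{3} - 1}{8} \approx 0.091506 .
\end{aligned}
```

Substituting this x into (x, 1/4 − x, 1/4 − x, 1/2 + x) gives approximately (0.091506, 0.158494, 0.158494, 0.591506), in agreement with the vector stated above.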

It seems that the only viable way to use Equation (18) to estimate a result of the conjugated D_f-projection into

Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n})

is to choose a sufficiently small λ, and for this λ, iterate the sequence

{v_{[λ]}^{[i]}}_{i = 0}^{\infty}

. However, the rate of convergence heavily depends on λ, and in fact, this often materializes in a negative way for a practical computation [10]:

Example 7. Consider the situation from Example 6. We compute numerically the first coordinate of initial members of the sequence

{v_{[λ]}^{[i]}}_{i = 0}^{\infty}

for several values of λ, and we compare them with the first coordinate of the sequence

{v^{[i]}}_{i = 0}^{\infty}

. The algorithm we use is as follows. Note that, due to the design of the sets, it suffices to solve only one minimization problem in each iteration, as we have pointed out in the previous example.

v_{1} := \frac{1}{4}; v_{2} := \frac{1}{4}; v_{3} := \frac{1}{4}; v_{4} := \frac{1}{4};

for i from 1 by 1 to 200 do

Minimize(x \log \frac{x}{v_{1}} + (\frac{1}{4} - x) \log \frac{\frac{1}{4} - x}{v_{2}} + y \log \frac{y}{v_{3}} + (\frac{3}{4} - y) \log \frac{\frac{3}{4} - y}{v_{4}}, x = 0.01 .. \frac{0.96}{4}, y = 0.01 .. \frac{2.96}{4});

v_{1} := \frac{1}{4} \cdot λ + x \cdot (\frac{1}{2} - \frac{1}{2} λ) + x \cdot (\frac{1}{2} - \frac{1}{2} λ);

v_{2} := \frac{1}{4} \cdot λ + (\frac{1}{4} - x) \cdot (\frac{1}{2} - \frac{1}{2} λ) + y \cdot (\frac{1}{2} - \frac{1}{2} λ);

v_{3} := \frac{1}{4} \cdot λ + (\frac{1}{4} - x) \cdot (\frac{1}{2} - \frac{1}{2} λ) + y \cdot (\frac{1}{2} - \frac{1}{2} λ);

v_{4} := \frac{1}{4} \cdot λ + (\frac{3}{4} - y) \cdot (\frac{1}{2} - \frac{1}{2} λ) + (\frac{3}{4} - y) \cdot (\frac{1}{2} - \frac{1}{2} λ);

end do;
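The loop above translates almost line by line into runnable code. The sketch below is an illustration only and makes two assumptions that are not part of the original algorithm: the numerical minimization is replaced by its closed-form solution (the hypothetical helper `kl_split`), and the number of iterations is a free parameter `n_iter` so that small values of λ can be run to convergence.

```python
import math


def kl_split(total, a, b, lo, hi):
    """Minimise x*log(x/a) + (total - x)*log((total - x)/b) over x in [lo, hi]."""
    return min(max(total * a / (a + b), lo), hi)


def first_coordinates(lam, n_iter=200):
    """First coordinates of v_[lam]^[1], ..., v_[lam]^[n_iter] from Example 7."""
    v1 = v2 = v3 = v4 = 0.25           # uniform starting point
    w = 0.5 - 0.5 * lam                # weight of each of W1 and W2
    trace = []
    for _ in range(n_iter):
        # Single minimisation per iteration (x-bar = x and y-bar = y)
        x = kl_split(0.25, v1, v2, 0.01, 0.24)
        y = kl_split(0.75, v3, v4, 0.01, 0.74)
        # Pool the uniform function (weight lam) with the two projections
        v1 = 0.25 * lam + x * w + x * w
        v2 = 0.25 * lam + (0.25 - x) * w + y * w
        v3 = 0.25 * lam + (0.25 - x) * w + y * w
        v4 = 0.25 * lam + (0.75 - y) * w + (0.75 - y) * w
        trace.append(v1)
    return trace


# First coordinate of CM^inf(W1 ∩ W2), i.e. the root of 4x^2 + x - 1/8 = 0
target = (math.sqrt(3) - 1) / 8

for lam in (1 / 21, 1 / 41, 1 / 61):
    last = first_coordinates(lam, n_iter=5000)[-1]
    print(f"lambda = {lam:.5f}: first coordinate {last:.6f}, gap {abs(last - target):.6f}")
```

Calling `first_coordinates(lam)` with the default 200 iterations mirrors the loop above, while the larger `n_iter` used in the printout is only meant to expose the limit point for each λ.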

The numerical result for

λ = \frac{1}{21}, \frac{1}{41}, \frac{1}{61}

is plotted in Figure 14. We can see that as λ decreases, the limit points of sequences are converging to the first coordinate of CM^∞(W₁ ∩ W₂), which is denoted by the black dotted line. The red line denotes the first coordinate of the sequence

{v^{[i]}}_{i = 0}^{\infty}

.

The numerical result for

λ = \frac{1}{61}, \frac{1}{121}, \frac{1}{181}

is plotted in Figure 15. We can conclude that, although the eventual precision rises as λ decreases, the rate of convergence is affected severely. Therefore, there is a significant trade-off between the precision and the number of iterations.

Notice that, as λ decreases, the blue lines point-wise converge to the red line. This convergence is, however, obviously not uniform.

Now, consider the prior method, which follows a fairly similar computation idea. By Theorem 17, we know that the sequence:

{u^{[i]}}_{i = 0}^{\infty}

where u^[0] = t is arbitrary in ⅅ^J and

u^{[i + 1]} = {\hat{F}}_{[W_{1}, \dots, W_{n}]}^{D_{f}, A} (u^{[i]})

, converges to some probability function in

{\hat{Θ}}_{a}^{D_{f}} (W_{1}, \dots, W_{n})

. This procedure can be, for instance, immediately used to compute SEP(W₁,…, W_n) in a case when

{\hat{Θ}}_{(\frac{1}{n}, \dots, \frac{1}{n})}^{KL} (W_{1}, \dots, W_{n})

is a singleton. By Theorem 20, this happens when at least one of W₁,…, W_n is a singleton.

One may perhaps expect that if u^[0] is the uniform probability function, then lim_{i→∞} u^[i] = SEP(W₁,…, W_n). In the following example from [10], we will, however, see that this is not true in general. Note that we cannot use Example 6, since in that case, actually, lim_{i→∞} u^[i] = SEP(W₁,…, W_n).

Example 8. Let J = 8,

W_{1} = {(x, \frac{1}{12} - x, \frac{1}{12} - x, \frac{2}{6} + x, y, \frac{1}{6} - y, \frac{1}{6}, \frac{1}{6}), x \in [0.01, \frac{0.88}{12}], y \in [0.01, \frac{0.94}{6}]}

and:

W_{2} = {(x, \frac{1}{12} - x, \frac{1}{12} - x, \frac{2}{6} + x, \frac{1}{12}, \frac{1}{12}, y, \frac{2}{6} - y), x \in [0.01, \frac{0.88}{12}], y \in [0.01, \frac{1.94}{6}]} .

W₁ and W₂ have a nonempty intersection;

W_{1} \cap W_{2} = {(x, \frac{1}{12} - x, \frac{1}{12} - x, \frac{2}{6} + x, \frac{1}{12}, \frac{1}{12}, \frac{1}{6}, \frac{1}{6}), x \in [0.01, \frac{0.88}{12}]}

, and we can compute that SEP(W₁, W₂) is the most entropic probability function from the set above with x equal to approximately 0.013888.

However, the sequence

{u^{[i]}}_{i = 0}^{\infty}

is already constant after one iteration and equals CM^∞(W₁) = CM^∞(W₂) = CM^∞(W₁ ∩ W₂), in which case, x ≈ 0.029231.
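The two values of x quoted in Example 8 can be checked numerically. The sketch below is only an illustration: it assumes a plain ternary search over x (valid because both objectives are concave along the segment W₁ ∩ W₂), and the function names `point`, `entropy`, `neg_kl_from_uniform` and `argmax_concave` are hypothetical.

```python
import math


def point(x):
    """Member of W1 ∩ W2 from Example 8 for a given x."""
    return (x, 1 / 12 - x, 1 / 12 - x, 2 / 6 + x, 1 / 12, 1 / 12, 1 / 6, 1 / 6)


def entropy(w):
    """Shannon entropy of a strictly positive probability function."""
    return -sum(p * math.log(p) for p in w)


def neg_kl_from_uniform(w):
    """-KL(uniform ‖ w); its maximiser is the conjugated KL-projection."""
    u = 1 / len(w)
    return -sum(u * math.log(u / p) for p in w)


def argmax_concave(f, lo, hi, iters=200):
    """Ternary search for the maximiser of a concave function on [lo, hi]."""
    for _ in range(iters):
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if f(a) < f(b):
            lo = a
        else:
            hi = b
    return (lo + hi) / 2


lo, hi = 0.01, 0.88 / 12
x_sep = argmax_concave(lambda x: entropy(point(x)), lo, hi)
x_cm = argmax_concave(lambda x: neg_kl_from_uniform(point(x)), lo, hi)
print(x_sep, x_cm)
```

Setting the derivatives to zero by hand gives the same two points: the entropy condition reduces to (1/12 − x)² = x(1/3 + x), i.e., x = 1/72, and the conjugated projection reduces to 144x² + 30x − 1 = 0.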

With the aid of the chairman theorem for

{\hat{Θ}}_{a}^{D_{f}}

, we also suggest a way to approximate the D_f-projection of v into

{\hat{Θ}}_{a}^{D_{f}} (W_{1}, \dots, W_{n})

, but we have no claims to any theoretical results on the complexity of computation. Let I = {v}. For every 1 > λ > 0, we define the sequence

{u_{[λ]}^{[i]}}_{i = 0}^{\infty}

by

u_{[λ]}^{[0]} = t

, which is arbitrary, and:

u_{[λ]}^{[i + 1]} = {\hat{F}}_{[I, W_{1}, \dots, W_{n}]}^{D_{f}, (λ, a_{1} - λ a_{1}, \dots, a_{n} - λ a_{n})} (u_{[λ]}^{[i]}) .

By Theorem 17:

{\lim_{i \to \infty} u_{[λ]}^{[i]}} = {\hat{Θ}}_{(λ, a_{1} - λ a_{1}, \dots, a_{n} - λ a_{n})}^{D_{f}} (I, W_{1}, \dots, W_{n}) .

By the chairman theorem for

{\hat{Θ}}_{A}^{D_{f}}

:

\lim_{λ ↘ 0} \lim_{i \to \infty} u_{[λ]}^{[i]} = \arg \min_{w \in {\hat{Θ}}_{a}^{D_{f}} (W_{1}, \dots, W_{n})} D_{f} (w ‖ v)

(19)

i.e., equals the D_f-projection of the probability function v into

{\hat{Θ}}_{a}^{D_{f}} (W_{1}, \dots, W_{n})

.

In particular, to approximate SEP(W₁,…, W_n) using Equation (19), one needs to choose a sufficiently small λ and then iterate the sequence

{u_{[λ]}^{[i]}}_{i = 0}^{\infty}

, where

u_{[λ]}^{[0]} = v

is the uniform probability function,

A = N

and D_f = KL. However, the question of how to determine such a λ and i in order to achieve a specific level of accuracy merits further investigation.

The special case of the problem above when W₁,…, W_n have a nonempty intersection was extensively studied in the literature, and many scientific and engineering problems can be expressed as a problem of finding a point in such an intersection. Bregman in [7] showed the convergence of (what is now called) cyclic Bregman projections to a point in the intersection (the notion of a Bregman divergence is used only for the Euclidean space, but in [7], a more general topological vector space was considered). Many cyclic algorithms with appealing applications have been developed since then; see, e.g., [31,32].

Although the approach we propose also covers the case of an empty intersection, always leads to a meaningful point and, in particular, chooses a point inside the intersection whenever the intersection is nonempty, our study cannot be considered an extension of the classical method of cyclic projections, which was developed over (possibly infinite-dimensional) Banach spaces [33], in contrast to the limited discrete probabilistic space that we are considering.

It is also worth mentioning that the method of cyclic projections, even in the case of an empty intersection, often provides more useful results than our method. An example is the noise reduction algorithm from [34].

One can perhaps conclude that the approach offered in this paper is at best only another contribution to the problem of finding a point in a convex set by means of geometry, which, however, offers some interesting insights into the combination of Bregman projections with pooling operators.

Acknowledgments

The author is indebted to George Wilmers, whose support and wisdom allowed the creation of this paper. Thanks also go to Alena Vencovská and František Matúš for sharing their ideas with the author and to an anonymous reviewer for pointing out the connections to the dual affine structure in the probabilistic simplex.

The paper is an extension of some results that the author obtained as a Ph.D. student at the University of Manchester while supported by the (European Community’s) Seventh Framework Programme (FP7/2007-2013) under Grant Agreement No. 238381.

Last, but not least, the author is grateful for the support received from the Assumption University in Thailand, without which the paper could not be finished.

Conflicts of Interest

The author declares no conflict of interest other than those disclosed above in the Acknowledgments.

References

1. Amari, S. Divergence, Optimization and Geometry. In Neural Information Processing: 16th International Conference; Leung, C., Lee, M., Chan, J.H., Eds.; Iconip: Bangkok, Thailand, 2009; pp. 185–193.
2. Hájek, P.; Havránek, T.; Jiroušek, J. Uncertain Information Processing in Expert Systems; Raton, B., Arbor, A., Eds.; CRC Press: London, UK, 1992.
3. Collins, M.; Schapire, R.E. Logistic Regression, AdaBoost and Bregman Distances. Mach. Learn. 2002, 48, 253–285.
4. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman Divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749.
5. Adamčík, M.; Wilmers, G.M. 2015, in press.
6. De Finetti, B. Sul Significato Soggettivo della Probabilitá. Fund. Math. 1931, 17, 298–329.
7. Bregman, L.M. The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 200–217.
8. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004; pp. 1–716.
9. Rockafellar, R.T. Convex Analysis. Princeton Landmarks in Mathematics; Princeton University Press: Princeton, NJ, USA, 1997; pp. 1–469.
10. Adamčík, M. Collective Reasoning under Uncertainty and Inconsistency. Ph.D. Thesis, The University of Manchester, Manchester, UK, 2014; pp. 1–150.
11. Csiszár, I. I-Divergence Geometry of Probability Distribution and Minimization Problems. Ann. Probab. 1975, 3, 146–158.
12. Amari, S.; Nagaoka, H. Methods of Information Geometry; AMS and Oxford University Press: New York, NY, USA, 2000; pp. 1–206.
13. Jaynes, E.T. Where do we Stand on Maximum Entropy? In The Maximum Entropy Formalism; Levine, R.D., Tribus, M., Eds.; M.I.T. Press: Cambridge, MA, USA, 1979; pp. 15–118.
14. Paris, J.B.; Vencovská, A. On the Applicability of Maximum Entropy to Inexact Reasoning. Int. J. Approx. Reason. 1989, 3, 1–34.
15. Paris, J.B.; Vencovská, A. A Note on the Inevitability of Maximum Entropy. Int. J. Approx. Reason. 1990, 4, 183–224.
16. Vomlel, J. Methods of Probabilistic Knowledge Integration. Ph.D. Thesis, Czech Technical University, Prague, Czech, 1999; pp. 1–123.
17. Banerjee, A.; Guo, X.; Wang, H. On the optimality of conditional expectation as a Bregman predictor. IEEE Trans. Inf. Theory 2005, 51, 2664–2669.
18. Wilmers, G.M. The Social Entropy Process: Axiomatising the Aggregation of Probabilistic Beliefs. In Probability, Uncertainty and Rationality; Hosni, H., Montagna, F., Eds.; CRM Series; Pisa, Italy, 2010; pp. 87–104.
19. Genest, C.; Zidek, J.V. Combining probability distributions: A critique and an annotated bibliography. Stat. Sci. 1986, 1, 114–135.
20. Genest, C.; Wagner, C.G. Further Evidence Against Independence Preservation in Expert Judgement Synthesis. Aequ. Math. 1986, 32, 74–86.
21. Matúš, F. On iterated averages of I-projections. In Statistiek und Informatik; Universität Bielefeld: Bielefeld, Germany, 2007; pp. 1–12.
22. Predd, J.B.; Osherson, D.N.; Kulkarni, S.R.; Poor, H.V. Aggregating Probabilistic Forecasts from Incoherent and Abstaining Experts. Decis. Anal. 2008, 5, 177–189.
23. Kern-Isberner, G.; Rödder, W. Belief Revision and Information Fusion on Optimum Entropy. Int. J. Intel. Syst. 2004, 19, 837–857.
24. Williamson, J. Deliberation, Judgement and the Nature of Evidence. Econ. Philos. 2014, in press.
25. Carnap, R. On the application of inductive logic. Philos. Phenomenol. Res. 1947, 8, 133–148.
26. Paris, J.B. The Uncertain Reasoner Companion; Cambridge University Press: Cambridge, UK, 1994; pp. 1–224.
27. Amari, S. Integration of stochastic models by minimizing alpha-divergence. Neural Comput. 2007, 19, 2780–2796.
28. Csiszár, I.; Tusnády, G. Informational Geometry and Alternating Minimization Procedures. Stat. Decis. 1984, 1, 205–237.
29. Eggermont, P.P.B.; LaRiccia, V.N. On EM-like algorithms for minimum distance estimation; Preprint 1998; University of Delaware: Delaware, NC, USA; pp. 1–29.
30. Wilmers, G.M. Generalising the Maximum Entropy Inference Process to the Aggregation of Probabilistic Beliefs; Preprint 2011, Version 6; The University of Manchester: Manchester, UK; pp. 1–40.
31. Bauschke, H.H. Projection Algorithms and Monotone Operators. Ph.D. Thesis, Simon Fraser University, Burnaby, BC, Canada, 1996; pp. 1–223.
32. Censor, Y.; Zenios, S.A. Parallel Optimization: Theory, Algorithms, and Applications; Oxford University Press: New York, NY, USA, 1997; pp. 1–541.
33. Bauschke, H.H.; Borwein, J.M.; Combettes, P.L. Bregman monotone optimization algorithms. SIAM J. Control Optim. 2003, 42, 596–636.
34. Tofighi, M.; Kose, K.; Cetin, A.E. Denoising Using Projections Onto Convex Sets (POCS) Based Framework 2013, arXiv, 1309.0700.

Figure 1. An illustration of a divergence.

Figure 2. A Bregman divergence.

Figure 3. The extended Pythagorean property.

Figure 4. An illustration of an averaging projective procedure F.

Figure 5. The situation in the proof of Theorem 5.

Figure 6. The situation in the proof of Theorem 7 for n = 2.

Figure 7. The illustration of the four-point property.

Figure 8. The illustration of Example 4.

Figure 9. The situation in the proof of Theorem 10 for n = 2.

Figure 10. The illustration of Theorem 15.

Figure 11. The situation in the proof of Theorem 16.

Figure 12. The situation in the proof of Theorem 16.

Figure 13. The illustration of the chairman theorem for

v \in Θ_{a}^{D_{f}} (W_{1}, \dots, W_{n})

.* Note that the fact that v_[λ]-s lie on the arrow does not have any meaning.

Figure 14. The numerical computation for Example 7. Blue lines from the top are for λ = 1/21, 1/41 and 1/61. This graph is taken from [10].

Figure 15. The numerical computation for Example 7. Blue lines from the top are for λ = 1/61, 1/121 and 1/181. This graph is taken from [10].

**Table 1.** Examples for three saturated possibilities with respect to the consistency principle (CP), disagreement principle (DP) and singleton principle (SP). **KIRP**, Kern-Isberner and Rödder; **OSEP**, obdurate social entropy process; **SEP**, social entropy process; **coSEP**, conjugated social entropy process.

| Principles | Probabilistic Merging Operators |
|---|---|
| (DP), (CP) | $Θ_{N}^{D_{f}}$, ${\hat{Θ}}_{N}^{D_{f}}$ |
| (DP), (SP) | KIRP, OSEP |
| (CP), (SP) | SEP, coSEP |

© 2014 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).
