
*Entropy*
**2014**,
*16*(12),
6338-6381;
doi:10.3390/e16126338

## Abstract

The aim of this paper is to develop a comprehensive study of the geometry involved in combining Bregman divergences with pooling operators over closed convex sets in a discrete probabilistic space. A particular connection we develop leads to an iterative procedure, which is similar to the alternating projection procedure by Csiszár and Tusnády. Although such iterative procedures are well studied over much more general spaces than the one we consider, only a few authors have investigated combining projections with pooling operators. We aspire to achieve here a comprehensive study of such a combination. Besides, pooling operators combining the opinions of several rational experts allow us to discuss possible applications in multi-expert reasoning.

## 1. Background

#### 1.1. Introduction

Information geometry has been studied as a powerful tool for tackling various problems. It has been applied in neuroscience [1], expert systems [2], logistic regression [3], clustering [4] and probabilistic merging [5]. In this paper, we aim to present a comprehensive study of information geometry over a discrete probabilistic space in order to provide some specialized tools for researchers working in the area of multi-expert reasoning.

In the context of this paper, the domain of information geometry is the Euclidean space ℝ^{J}, for some fixed natural number J ≥ 2, where we measure a divergence from one point to another. A divergence is, in general, an asymmetric notion of distance, and we will represent it here by an arrow. A divergence can represent a cost function subject to various constraints, so many engineering problems correspond to the minimization of a divergence.

For example, in the areas of neuroscience and expert systems, given evidence **v** and a training set of known instances W, we may search for an instance **w** ∈ W, which is “closest” to the evidence **v**, so as to represent it in the given training set W. An illustration is depicted in Figure 1.

A similar pattern of minimization appears also in the areas of clustering and regression. The aim of the former is to categorize several points into a given number of nodes in such a way that the sum of divergences from each point to its associated node is minimal. The aim of regression is to predict an unknown distribution of events based on the previously obtained statistical data by defining a function whose values minimize a sum of divergences to the data.

While several domains for divergences are considered in the literature, in the current presentation of information geometry we will confine ourselves to the domain of positive discrete probability functions ⅅ^{J}, where ⅅ^{J} is the set of all **w** ∈ ℝ^{J} restricted by
${\sum}_{j=1}^{J}{w}_{j}=1$ and w_{1} > 0, …, w_{J} > 0. In our presentation, J ≥ 2 will be always fixed, but otherwise arbitrary.

Although in information geometry it does not make sense to talk about beliefs, applications in multi-expert reasoning are often developed from that perspective. It is then argued that rational beliefs should obey the laws of probability; the Dutch book argument by Ramsey and de Finetti [6] is perhaps the most compelling case for this. It is therefore of particular interest to develop information geometry over a probabilistic space if we wish to eventually apply it to multi-expert reasoning.

In addition to our restriction to discrete probability functions, we will confine ourselves to a special type of divergence, called a Bregman divergence [7], which has recently attracted attention in machine learning and plays a major role in optimization; cf. [3]. A Bregman divergence over a discrete probabilistic space is defined by a given strictly convex function f: (0, 1)^{J} → ℝ, which is differentiable over ⅅ^{J}. For any **v**, **w** ∈ ⅅ^{J}, the Bregman divergence generated by the function f is given by:

${D}_{f}(\mathbf{w}\Vert \mathbf{v})=f(\mathbf{w})-f(\mathbf{v})-{\displaystyle {\sum}_{j=1}^{J}({w}_{j}-{v}_{j})\frac{\partial f(\mathbf{v})}{\partial {v}_{j}}}.$

D_{f}(**w**‖**v**) is a Bregman divergence from **v** ∈ ⅅ^{J} to **w** ∈ ⅅ^{J}. Figure 2 depicts a geometrical interpretation of a Bregman divergence.

By the first convexity condition applied to the (convex and differentiable) function f (see, e.g., [8]), D_{f}(**w**‖**v**) ≥ 0 with equality holding only if **w** = **v**. This is the condition that makes D_{f}(·‖·) a divergence as defined in information geometry. Note that, since a differentiable convex function is necessarily continuously differentiable (see [9]), D_{f}(**w**‖**v**) is a continuous function. However, note that this is not sufficient to establish the differentiability of D_{f}.

It is worth mentioning that the restriction w_{1} > 0,…, w_{J} > 0 for a probability function **w** that we have adopted here is important for the definition of a Bregman divergence. Some Bregman divergences do not have their generating function f differentiable over the whole space of probability functions. However, it is possible to define the notion of a Bregman divergence even if this condition is left out, but at the cost of some restrictions on f. We kindly refer the interested reader to [10] for further details. Nonetheless, the setting developed in [10] uses a rather complicated notation, which could prove to be impenetrable at first glance if it were adopted in the current paper.

In this paper, we study mainly Bregman divergences D_{f}(·‖·), which are convex, i.e., for all λ ∈ [0, 1] and all **w**^{(1)}, **w**^{(2)}, **v**^{(1)}, **v**^{(2)} ∈ ⅅ^{J}:

${D}_{f}(\lambda {\mathbf{w}}^{(1)}+(1-\lambda ){\mathbf{w}}^{(2)}\Vert \lambda {\mathbf{v}}^{(1)}+(1-\lambda ){\mathbf{v}}^{(2)})\le \lambda {D}_{f}({\mathbf{w}}^{(1)}\Vert {\mathbf{v}}^{(1)})+(1-\lambda ){D}_{f}({\mathbf{w}}^{(2)}\Vert {\mathbf{v}}^{(2)}).$

The following are examples of a convex Bregman divergence.

**Example 1** (Squared Euclidean Distance). *For any J ≥ 2, let* $f(\mathbf{x})={\displaystyle {\sum}_{j=1}^{J}{({x}_{j})}^{2}}$. *Then, the divergence:*

${E}^{2}(\mathbf{w}\Vert \mathbf{v})={\displaystyle {\sum}_{j=1}^{J}{({w}_{j}-{v}_{j})}^{2}}$

*is the squared Euclidean distance.*

**Example 2** (Kullback–Leibler Divergence). *For any J ≥ 2, let* $f(\mathbf{x})={\displaystyle {\sum}_{j=1}^{J}{x}_{j}\mathrm{log}{x}_{j}}$, *where* log *denotes the natural logarithm. (Note that in the information theory literature, this logarithm is often taken with base two. However, this does not affect the results of this paper in any way.) The well-known divergence:*

$\mathrm{KL}(\mathbf{w}\Vert \mathbf{v})={\displaystyle {\sum}_{j=1}^{J}{w}_{j}\mathrm{log}\frac{{w}_{j}}{{v}_{j}}}$

*is the Kullback–Leibler divergence.*

The convexity of the KL-divergence is easy to observe and is well known; see, e.g., [10].
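The two examples above can be checked mechanically. The following minimal Python sketch (not part of the paper's formal development; the generators and their gradients are transcribed from Examples 1 and 2) builds D_{f} from a generator f and verifies that it recovers the familiar closed forms on the simplex:

```python
import math

def bregman(f, grad_f):
    """Bregman divergence D_f(w || v) = f(w) - f(v) - <w - v, grad f(v)>."""
    def D(w, v):
        return f(w) - f(v) - sum((wj - vj) * gj
                                 for wj, vj, gj in zip(w, v, grad_f(v)))
    return D

# Example 1: f(x) = sum x_j^2 generates the squared Euclidean distance
E2 = bregman(lambda x: sum(t * t for t in x),
             lambda x: [2 * t for t in x])

# Example 2: f(x) = sum x_j log x_j generates the KL-divergence
# (the closed form below uses that w and v both sum to 1)
KL = bregman(lambda x: sum(t * math.log(t) for t in x),
             lambda x: [math.log(t) + 1 for t in x])

w, v = (0.2, 0.3, 0.5), (0.4, 0.4, 0.2)
assert abs(E2(w, v) - sum((a - b) ** 2 for a, b in zip(w, v))) < 1e-12
assert abs(KL(w, v) - sum(a * math.log(a / b) for a, b in zip(w, v))) < 1e-12
```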

#### 1.2. Projections

For given **v** ∈ ⅅ^{J}, a Bregman divergence D_{f}(**w**‖**v**) is a strictly convex function in the first argument. This can be easily seen by considering
${D}_{f}(\mathbf{w}\Vert \mathbf{v})=f(\mathbf{w})-f(\mathbf{v})-{\displaystyle {\sum}_{j=1}^{J}({w}_{j}-{v}_{j})\frac{\partial f(\mathbf{v})}{\partial {v}_{j}}}$ where **v** is constant. f(**v**) is therefore constant, as well, and the claim follows, since strict convexity of f is not affected by adding the linear term
$-{\displaystyle {\sum}_{j=1}^{J}({w}_{j}-{v}_{j})\frac{\partial f(\mathbf{v})}{\partial {v}_{j}}}$.

Owing to the observation above, if **v** ∈ ⅅ^{J} is given and W ⊆ ⅅ^{J} is a closed convex nonempty set, we can define the D_{f}-projection of **v** into W. It is that unique point **w** ∈ W that minimizes D_{f}(**w**‖**v**) subject only to **w** ∈ W. This property is crucial for the applicability of Bregman divergences. Note, however, that D_{f}(·‖·) is not necessarily convex in its second argument; for a counterexample, consider the case
$f(\mathbf{x})={\displaystyle {\sum}_{j=1}^{4}{({x}_{j})}^{3}}$.

Perhaps the most useful property that a D_{f}-projection has is the extended Pythagorean property:

**Theorem 1** (Extended Pythagorean Property). *Let D_{f} be a Bregman divergence. Let **w** be the D_{f}-projection of **v** ∈ ⅅ^{J} into a closed convex nonempty set W ⊆ ⅅ^{J}. Let **a** ∈ W. Then:*

${D}_{f}(\mathbf{a}\Vert \mathbf{v})\ge {D}_{f}(\mathbf{a}\Vert \mathbf{w})+{D}_{f}(\mathbf{w}\Vert \mathbf{v}).$

This property, in the case of the Kullback–Leibler divergence, was proven first by Csiszár in [11]. The proof of the generalized theorem above is given in [1,12], where the interested reader can find a comprehensive study of Bregman divergences within the context of differential geometry. We illustrate the theorem in Figure 3.

Notice that the squared Euclidean distance has a special role among all other Bregman divergences. It is symmetric, and it interprets the extended Pythagorean property “classically” as the relation of the sizes of the squares constructed on the sides of a triangle.
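Theorem 1 is easy to test numerically for the squared Euclidean distance, where the D_{f}-projection onto a segment has a closed form. In the sketch below, the segment W and the point **v** are our own illustrative choices:

```python
def E2(w, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(w, v))

def proj_segment(v, p, q):
    """E2-projection of v onto the segment {t*p + (1-t)*q : t in [0, 1]}."""
    d = [pi - qi for pi, qi in zip(p, q)]
    t = sum((vi - qi) * di for vi, qi, di in zip(v, q, d)) / sum(di * di for di in d)
    t = max(0.0, min(1.0, t))
    return [t * pi + (1 - t) * qi for pi, qi in zip(p, q)]

# W: a segment inside the simplex; v: a probability function to project
p, q = (0.1, 0.4, 0.5), (0.5, 0.4, 0.1)
v = (0.2, 0.3, 0.5)
w = proj_segment(v, p, q)          # the E2-projection of v into W

# extended Pythagorean property: E2(a||v) >= E2(a||w) + E2(w||v) for all a in W
for k in range(11):
    t = k / 10
    a = [t * pi + (1 - t) * qi for pi, qi in zip(p, q)]
    assert E2(a, v) >= E2(a, w) + E2(w, v) - 1e-12
```

For the squared Euclidean distance the inequality holds with equality whenever **a** lies on the line through the projection, which matches the "classical" reading above.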

It is well known that the Kullback–Leibler divergence is closely connected to the Shannon entropy defined for any **w** ∈ ⅅ^{J} by:

$H(\mathbf{w})=-{\displaystyle {\sum}_{j=1}^{J}{w}_{j}\mathrm{log}{w}_{j}}.$

The more entropic **w** is, the less information is carried by **w**. In some contexts, one can then argue that given several seemingly equally probable choices of a probability function, one should choose the one that carries the least additional information [13]. Given a closed convex nonempty set W, the most entropic point in W will be denoted by ME(W).

Now, trying to find the most entropic point in a closed convex nonempty set W ⊆ ⅅ^{J} is, in fact, equivalent to finding a special KL-projection (the KL-projection of the uniform probability function
$\underset{J}{\underbrace{\left(\frac{1}{J},\dots ,\frac{1}{J}\right)}}$) since:

$\mathrm{ME}(W)=\mathrm{arg}\phantom{\rule{0.2em}{0ex}}{\mathrm{max}}_{\mathbf{w}\in W}\phantom{\rule{0.2em}{0ex}}H(\mathbf{w})=\mathrm{arg}\phantom{\rule{0.2em}{0ex}}{\mathrm{min}}_{\mathbf{w}\in W}\phantom{\rule{0.2em}{0ex}}\mathrm{KL}\left(\mathbf{w}\Vert (\tfrac{1}{J},\dots ,\tfrac{1}{J})\right).$

Here, arg min_{x∈X} f(x) denotes that unique argument x ∈ X where f has its global minimum, whenever such a unique point exists. The expression arg max is defined accordingly.
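This equivalence can be confirmed by brute force. The sketch below discretises a hypothetical segment W of probability functions and checks that the grid point of maximal entropy coincides with the grid point of minimal KL-divergence from the uniform function (the set W is our own illustrative choice):

```python
import math

def H(w):
    """Shannon entropy of a probability function."""
    return -sum(t * math.log(t) for t in w)

def KL(w, v):
    return sum(a * math.log(a / b) for a, b in zip(w, v))

J = 3
uniform = [1 / J] * J
# a hypothetical closed convex set W: a segment of probability functions
grid = [[0.1 + 0.6 * t, 0.2, 0.7 - 0.6 * t]
        for t in (k / 10000 for k in range(10001))]

most_entropic = max(grid, key=H)
kl_projection = min(grid, key=lambda w: KL(w, uniform))
assert most_entropic == kl_projection   # ME(W) is the KL-projection of uniform
```

This works because KL(**w**‖(1/J,…,1/J)) = log J − H(**w**), so minimizing one is maximizing the other.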

Given the extensive justification of the Shannon entropy in various frameworks (see, e.g., [14,15]), it is perhaps not surprising that a common method of projecting in probabilistic expert systems is by means of the KL-projection; see [2,16]. In connection to the Shannon entropy, the KL-divergence is often referred to as the cross-entropy, and the projecting is called updating.

The above may perhaps be also an appealing reason to use projections in general to “represent” a given closed convex set of probability functions by a single point, in particular in expert reasoning. Moreover, recent use of projections by a Bregman divergence has become popular in other contexts; see, e.g., [4]. Remarkably, projections by a Bregman divergence also provide a unifying framework for a variety of techniques used in expert systems, such as logistic regression; see [3]. It is therefore of particular interest to investigate the geometry of Bregman divergences.

#### 1.3. Pooling

In this subsection, we introduce probabilistic pooling, which is a method of aggregating several probability functions. Formally, a pooling operator **Pool** is defined for each n ≥ 1 as a mapping:

$\mathbf{Pool}:{\left({\mathbb{D}}^{J}\right)}^{n}\to {\mathbb{D}}^{J}.$

One possibility for choosing a pooling operator is to define one by means of a Bregman divergence. In particular, given a Bregman divergence
${D}_{f},{\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)}\in {\mathbb{D}}^{J}$ and **a** ∈ ⅅ^{n}, we can ask which point **v** ∈ ⅅ^{J} has the least sum of Bregman divergences D_{f} from **w**^{(1)},…, **w**^{(n)} weighted by a_{1},…, a_{n}, respectively. It turns out that the resulting probability function is unique, and in each coordinate, it is simply the weighted arithmetic mean of the corresponding coordinates of **w**^{(1)},…, **w**^{(n)} ∈ ⅅ^{J}. In other words:

$\mathrm{arg}\phantom{\rule{0.2em}{0ex}}{\mathrm{min}}_{\mathbf{v}\in {\mathbb{D}}^{J}}{\displaystyle {\sum}_{i=1}^{n}{a}_{i}{D}_{f}({\mathbf{w}}^{(i)}\Vert \mathbf{v})}={\displaystyle {\sum}_{i=1}^{n}{a}_{i}{\mathbf{w}}^{(i)}}.\phantom{\rule{2em}{0ex}}(1)$

Instead of the right-hand side of Equation (1), we will simply write **LinOp**_{**a**}(**w**^{(1)},…, **w**^{(n)}) if **a** ∈ $\mathcal{A}$. A special choice for $\mathcal{A}$ is the family $\mathcal{N}=\{{\mathbf{a}}_{n}=(\frac{1}{n},\dots ,\frac{1}{n}):n=1,2,\dots \}$, and the pooling operator ${\mathbf{LinOp}}_{\mathcal{N}}$ is well known in the literature as the LinOp-pooling operator.

The fact that Equation (1) actually holds can be observed by employing the following theorem, which is folklore in information theory.

**Theorem 2** (Parallelogram Theorem). *Let D_{f} be a Bregman divergence, **w**^{(1)},…, **w**^{(n)}, **v** ∈ ⅅ^{J} and **a** ∈ ⅅ^{n}. Then:*

${\displaystyle {\sum}_{i=1}^{n}{a}_{i}{D}_{f}({\mathbf{w}}^{(i)}\Vert \mathbf{v})}={\displaystyle {\sum}_{i=1}^{n}{a}_{i}{D}_{f}({\mathbf{w}}^{(i)}\Vert {\mathbf{LinOp}}_{\mathbf{a}}({\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)}))}+{D}_{f}({\mathbf{LinOp}}_{\mathbf{a}}({\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)})\Vert \mathbf{v}).$

**Proof.** Let **w** = **LinOp**_{a}(**w**^{(1)},…, **w**^{(n)}). The equality is easy to observe by:

Since D_{f}(**w**‖**v**) = 0 only if **w** = **v**, and is positive otherwise, the unique minimizer of the sum on the left-hand side of Equation (1) is the point **v** = **LinOp**_{**a**}(**w**^{(1)},…, **w**^{(n)}).
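A quick numerical sanity check of Theorem 2 for the KL-divergence, with illustrative points and weights of our own choosing:

```python
import math

def KL(w, v):
    return sum(x * math.log(x / y) for x, y in zip(w, v))

def linop(a, points):
    """Weighted arithmetic pooling LinOp_a."""
    return [sum(ai * p[j] for ai, p in zip(a, points))
            for j in range(len(points[0]))]

ws = [(0.2, 0.3, 0.5), (0.6, 0.3, 0.1), (0.25, 0.5, 0.25)]
a = (0.5, 0.3, 0.2)
v = (0.4, 0.4, 0.2)
w = linop(a, ws)

# parallelogram identity: sum a_i KL(w_i||v) = sum a_i KL(w_i||w) + KL(w||v)
lhs = sum(ai * KL(wi, v) for ai, wi in zip(a, ws))
rhs = sum(ai * KL(wi, w) for ai, wi in zip(a, ws)) + KL(w, v)
assert abs(lhs - rhs) < 1e-12
```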

The situation above can be naturally interpreted in terms of random variables. Assume that X is a random variable taking values in {**w**^{(1)},…, **w**^{(n)}} ⊆ ⅅ^{J} with the probability distribution **a** ∈ ⅅ^{n}, and we are given the problem of finding a random variable Y, such that the expected value E(D_{f}(X‖Y)) is minimal. By Equation (1), the unique solution is the constant random variable Y = E(X) = **LinOp**_{**a**}(**w**^{(1)},…, **w**^{(n)}). In multi-expert reasoning, n experts, each with a probability function **w**^{(i)} representing his beliefs, seek to find a single probability function to represent their joint beliefs. The ${\mathbf{LinOp}}_{\mathcal{A}}$-pooling operator simply yields the expected value as if the experts' beliefs were statistically obtained.

It is certainly interesting that the result above holds for any Bregman divergence, but as is shown in [17] (Theorem 4), it is even more remarkable that Bregman divergences are the only divergences with such a property. However, we note that in order to establish this claim, a slightly more general setting was considered and that we have restricted the formulation of the original theorem to the only domain considered here, (0, 1)^{J}:

**Theorem 3** (Banerjee, Guo, Wang). *Let F: (0, 1)^{J} × (0, 1)^{J} → ℝ be a divergence. Assume that* $F(\mathbf{x}\Vert \mathbf{y}),\frac{{\partial }^{2}F(\mathbf{x}\Vert \mathbf{y})}{\partial {x}_{i}\partial {x}_{j}},1\le i,j\le J$ *are all continuous. Let* (Ω, $\mathcal{F}$, P) *be an arbitrary probability space, and let* $\mathcal{G}$ *be a sub-σ-algebra of* $\mathcal{F}$. *If, for all random variables X taking values in* (0, 1)^{J}, *the conditional expectation* E(X | $\mathcal{G}$) *is the unique minimizer of* E(F(X‖Y) | $\mathcal{G}$) *over all* $\mathcal{G}$*-measurable random variables Y, then* F(**x**‖**y**) = D_{f}(**x**‖**y**) *for some strictly convex and differentiable function f :* (0, 1)^{J} *→ ℝ.*

While in the statistical sense, the ${\mathbf{LinOp}}_{\mathcal{A}}$-pooling operator, where $\mathcal{A}$ is a family of weighting vectors, seems to be well placed, in the fields of multi-expert reasoning and probabilistic merging, the so-called ${\mathbf{LogOp}}_{\mathcal{A}}$-pooling operator often appeals more. For every n ≥ 1 and every **a** ∈ $\mathcal{A}$, it is defined by:

${\mathbf{LogOp}}_{\mathbf{a}}({\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)})={\left(\frac{{\prod}_{i=1}^{n}{({w}_{j}^{(i)})}^{{a}_{i}}}{{\sum}_{k=1}^{J}{\prod}_{i=1}^{n}{({w}_{k}^{(i)})}^{{a}_{i}}}\right)}_{j=1}^{J}.$

If **w**^{(1)},…, **w**^{(n)} are considered to be the beliefs of n experts, respectively, then the ${\mathbf{LogOp}}_{\mathcal{A}}$-pooling operator appears to favor agreement over the expected value. For instance, consider the following example from utility theory. Say that Eleanor and George are looking for a film to watch and they have three options, A, B and C. Eleanor hates Movie A and under no circumstances would agree to watch it, while George absolutely loves it. Now, consider that the situation with respect to Film C is swapped: George hates it, while Eleanor would prefer to see it. They both consider Movie B uninteresting, but are willing to see it. The following probability functions could represent the preferences of Eleanor and George towards Movies A, B and C: (0, 0.1, 0.9) and (0.9, 0.1, 0), respectively. Moreover, we value the opinions of both of them equally, i.e., $\mathcal{A}=\mathcal{N}$. Now, while the ${\mathbf{LinOp}}_{\mathcal{N}}$-pooling operator gives the inconclusive (0.45, 0.1, 0.45), by the ${\mathbf{LogOp}}_{\mathcal{N}}$-pooling operator (in the literature, this operator is simply known as the LogOp-pooling operator) we obtain (0, 1, 0). If we take the advice, then Eleanor and George should see the only film that is acceptable for both of them.
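The Eleanor and George example can be reproduced directly; `linop` and `logop` below are equal-weight implementations of the two pooling operators (note that the example's probability functions touch the boundary of ⅅ^{J}, which the geometric mean handles by assigning those coordinates the product zero):

```python
import math

def linop(points):
    """Equal-weight LinOp: coordinate-wise arithmetic mean."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def logop(points):
    """Equal-weight LogOp: coordinate-wise geometric mean, renormalised."""
    n = len(points)
    geo = [math.prod(p[j] for p in points) ** (1 / n)
           for j in range(len(points[0]))]
    z = sum(geo)
    return [g / z for g in geo]

eleanor = (0.0, 0.1, 0.9)   # boundary points, exactly as in the example
george = (0.9, 0.1, 0.0)
assert max(abs(x - y) for x, y in
           zip(linop([eleanor, george]), (0.45, 0.1, 0.45))) < 1e-12  # inconclusive
assert logop([eleanor, george]) == [0.0, 1.0, 0.0]  # agreement on Movie B
```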

The example above illustrates why taking products rather than the arithmetic mean is popular when considering utilities. However, recently, the ${\mathbf{LogOp}}_{\mathcal{N}}$-pooling operator has attracted attention also in multi-expert probabilistic reasoning; a prominent example here is the social entropy process by Wilmers [18]. An intriguing idea that originates in the social entropy process is to swap the direction of the Kullback–Leibler projections and establish the corresponding conjugated KL-projection of **w** ∈ ⅅ^{J} into V ⊆ ⅅ^{J} as arg min_{**v**∈V} KL(**w**‖**v**) (it is easy to check that KL(·‖·) is strictly convex in its second argument) and the conjugated parallelogram theorem [10]:

**Theorem 4.** *Let **w**^{(1)},…, **w**^{(n)}, **v** ∈ ⅅ^{J} and **a** ∈ ⅅ^{n}. Then:*

${\displaystyle {\sum}_{i=1}^{n}{a}_{i}\mathrm{KL}(\mathbf{v}\Vert {\mathbf{w}}^{(i)})}={\displaystyle {\sum}_{i=1}^{n}{a}_{i}\mathrm{KL}(\mathbf{w}\Vert {\mathbf{w}}^{(i)})}+\mathrm{KL}(\mathbf{v}\Vert \mathbf{w}),$

*where* **w** = **LogOp**_{**a**}(**w**^{(1)},…, **w**^{(n)}).

**Proof**. Let **w** = **LogOp**_{**a**}(**w**^{(1)},…, **w**^{(n)}). First note that:

As a consequence, for given **w**^{(1)},…, **w**^{(n)} ∈ ⅅ^{J}, we get:

$\mathrm{arg}\phantom{\rule{0.2em}{0ex}}{\mathrm{min}}_{\mathbf{v}\in {\mathbb{D}}^{J}}{\displaystyle {\sum}_{i=1}^{n}{a}_{i}\mathrm{KL}(\mathbf{v}\Vert {\mathbf{w}}^{(i)})}={\mathbf{LogOp}}_{\mathbf{a}}({\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)}).$
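Theorem 4 admits the same kind of numerical check as Theorem 2; the weighted `logop` below normalises the coordinate-wise weighted geometric mean, and the points and weights are again our own illustrative choices:

```python
import math

def KL(w, v):
    return sum(x * math.log(x / y) for x, y in zip(w, v))

def logop(a, points):
    """Weighted LogOp_a: normalised coordinate-wise weighted geometric mean."""
    u = [math.prod(p[j] ** ai for ai, p in zip(a, points))
         for j in range(len(points[0]))]
    z = sum(u)
    return [t / z for t in u]

ws = [(0.2, 0.3, 0.5), (0.6, 0.3, 0.1)]
a = (0.4, 0.6)
v = (0.25, 0.25, 0.5)
w = logop(a, ws)

# conjugated parallelogram identity:
# sum a_i KL(v||w_i) = sum a_i KL(w||w_i) + KL(v||w)
lhs = sum(ai * KL(v, wi) for ai, wi in zip(a, ws))
rhs = sum(ai * KL(w, wi) for ai, wi in zip(a, ws)) + KL(v, w)
assert abs(lhs - rhs) < 1e-12
```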

The reader perhaps wonders what the main practical differences between the pooling operators are. The ${\mathbf{LinOp}}_{\mathcal{A}}$-pooling operator, for example, satisfies the marginalization property, that is, the values of the coordinates of the resulting probability function depend only on the corresponding coordinates of the probability functions that are pooled. The ${\mathbf{LogOp}}_{\mathcal{A}}$-pooling operator does not have this property. On the other hand, the ${\mathbf{LogOp}}_{\mathcal{A}}$-pooling operator, unlike the ${\mathbf{LinOp}}_{\mathcal{A}}$-pooling operator, is externally Bayesian; that is, the order in which we combine pooling and Bayesian updating is irrelevant. See [19] for more details.

We, however, do not seek any conclusive answer as to which pooling operator to use in any particular context. In this paper, we only aim to provide geometric tools that can be used in multi-expert reasoning. For elaborate work on pooling operators, we refer to the literature, e.g., [19] for a survey, [20] for a classical problem of the relationship between pooling and probabilistic independence or [18] for a modern account of ${\mathbf{LinOp}}_{\mathcal{N}}$- and ${\mathbf{LogOp}}_{\mathcal{N}}$-pooling operators in probabilistic knowledge merging.

## 2. Projections and Pooling Combined

#### 2.1. Averaging Projective Procedures

While the geometry of projections and the theory of pooling operators have been extensively studied in the literature (see the previous section), much less attention has been devoted to their combination. A detailed study of this problem and a comprehensive analysis of the geometry involved is the main aim of this paper.

The central geometrical notion connecting projections and pooling in this paper is an averaging projective procedure F, which consists of a family of mappings
${F}_{[{W}_{1},\dots ,{W}_{n}]}:{\mathbb{D}}^{J}\to {\mathbb{D}}^{J}$, where sets W_{1},…, W_{n} ⊆ ⅅ^{J} are closed convex and nonempty. A particular F is given by a family of strictly convex functions d_{v}, **v** ∈ ⅅ^{J} and a pooling operator **Pool** and is defined by the following two-stage process.

1. For an argument **v** ∈ ⅅ^{J}, put ${\mathbf{w}}^{(i)}=\mathrm{arg}\phantom{\rule{0.2em}{0ex}}{\mathrm{min}}_{\mathbf{w}\in {W}_{i}}{d}_{\mathbf{v}}(\mathbf{w})$, 1 ≤ i ≤ n.
2. Set ${F}_{[{W}_{1},\dots ,{W}_{n}]}(\mathbf{v})=\mathbf{Pool}({\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)})$.

For instance, the function d_{**v**}(·) can be D_{f}(·‖**v**) for some Bregman divergence D_{f}, and in this particular case, ${F}_{[{W}_{1},\dots ,{W}_{n}]}(\mathbf{v})$ first D_{f}-projects the argument **v** into each of W_{1},…, W_{n}, and then, it “averages” the resulting probability functions by a pooling operator **Pool**. Hence, the name: an averaging projective procedure. An illustration of F is depicted in Figure 4.
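The two-stage process can be sketched as follows for the squared Euclidean distance, where projections onto segments are available in closed form; the sets W_{1}, W_{2} and the argument **v** are hypothetical:

```python
def proj_segment(v, p, q):
    """E2-projection of v onto the segment {t*p + (1-t)*q : t in [0, 1]}."""
    d = [pi - qi for pi, qi in zip(p, q)]
    t = sum((vi - qi) * di for vi, qi, di in zip(v, q, d)) / sum(di * di for di in d)
    t = max(0.0, min(1.0, t))
    return [t * pi + (1 - t) * qi for pi, qi in zip(p, q)]

def F(v, segments):
    """Averaging projective procedure:
    stage 1: E2-project v into each W_i;
    stage 2: pool the projections by equal-weight LinOp."""
    projs = [proj_segment(v, p, q) for p, q in segments]
    n = len(projs)
    return [sum(w[j] for w in projs) / n for j in range(len(v))]

# hypothetical closed convex sets W_1, W_2 (segments in the simplex)
W1 = ((0.1, 0.4, 0.5), (0.5, 0.4, 0.1))
W2 = ((0.2, 0.2, 0.6), (0.2, 0.6, 0.2))
v = (1 / 3, 1 / 3, 1 / 3)
out = F(v, [W1, W2])
assert max(abs(a - b) for a, b in zip(out, (0.25, 0.4, 0.35))) < 1e-9
```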

Note that W_{1},…, W_{n} play dual roles in the definition above, which may perhaps appear clumsy. When they are fixed,
${F}_{[{W}_{1},\dots ,{W}_{n}]}$ is a mapping ⅅ^{J} → ⅅ^{J}. However, the option to consider them also as variables will be the key to our following investigation and to the applicability of an averaging projective procedure in multi-expert reasoning, where W_{1},…, W_{n} will represent the respective knowledge of n experts. A straightforward interpretation is that the first stage simplifies sets to single probability functions, which then are being merged to a final social belief function of the college of experts.

With regard to previous research, the cases of d_{**v**}(·) being KL(·‖**v**) and KL(**v**‖·) with **Pool** taken to be the ${\mathbf{LinOp}}_{\mathcal{A}}$-pooling operator and the ${\mathbf{LogOp}}_{\mathcal{A}}$-pooling operator, respectively, were introduced and investigated by Matúš in [21]. The idea of combining the projections by means of the squared Euclidean distance E^{2} with the ${\mathbf{LinOp}}_{\mathcal{A}}$-pooling operator was first introduced by Predd et al. in [22].

**Example 3.** *In the definition of an averaging projective procedure, take d_{**v**} to be KL(·‖**v**) and **Pool** to be the* ${\mathbf{LinOp}}_{\mathcal{N}}$*-pooling operator. Now, F is the mapping ⅅ^{J} → ⅅ^{J} for every n ≥ 1 and all closed convex nonempty sets W_{1},…, W_{n} ⊆ ⅅ^{J} given by* ${F}_{[{W}_{1},\dots ,{W}_{n}]}(\mathbf{v})$ *above.*

*In particular, take* $J=3,n=2,{W}_{1}=\{(x,\frac{1}{2}-x,\frac{1}{2}),\frac{1}{10}\le x\le \frac{2}{5}\},{W}_{2}=\{(x,\frac{1}{4},\frac{3}{4}-x),\frac{1}{10}\le x\le \frac{13}{20}\}$ *and* $\mathbf{v}=(\frac{1}{3},\frac{1}{6},\frac{1}{2})$. *Then, the KL-projection of **v** into W_{1} is actually **v** itself, since **v** ∈ W_{1}, and the KL-projection of **v** into W_{2} is* $(\frac{3}{10},\frac{1}{4},\frac{9}{20})$. *Therefore:*

${F}_{[{W}_{1},{W}_{2}]}(\mathbf{v})={\mathbf{LinOp}}_{{\mathbf{a}}_{2}}\left((\tfrac{1}{3},\tfrac{1}{6},\tfrac{1}{2}),(\tfrac{3}{10},\tfrac{1}{4},\tfrac{9}{20})\right)=(\tfrac{19}{60},\tfrac{5}{24},\tfrac{19}{40}).$

#### 2.2. Obdurate Operators

In this section, we approach averaging projective procedures using the framework of probabilistic knowledge merging as defined in [5]. A probabilistic merging operator:

∆ maps, for every n ≥ 1, n-tuples of closed convex nonempty subsets of ⅅ^{J}, say W_{1},…, W_{n}, to a single closed convex nonempty subset of ⅅ^{J}. In the area of multi-expert reasoning, we can perhaps interpret ∆(W_{1},…, W_{n}) as a representation of W_{1},…, W_{n}, which themselves individually represent knowledge bases of n experts.

A merging operator O is obdurate if, for every n ≥ 1 and any W_{1},…, W_{n} ⊆ ⅅ^{J}, we have that
$\mathrm{O}({W}_{1},\dots ,{W}_{n})=\{{F}_{[{W}_{1},\dots ,{W}_{n}]}(\mathbf{v})\}$, where **v** is some fixed argument and F is an averaging projective procedure. Note that this operator always produces a singleton. Obdurate processes thus first represent sets as single probability functions, and then, they pool them by a pooling operator.

Although this may sound like a fairly restrictive setting, many existing natural probabilistic merging operators are of this form. The prominent example is the merging operator of Kern-Isberner and Rödder (**KIRP**) [23]. In this particular case, **v** is the uniform probability function, d_{v}(·) is KL(·‖v) and Pool is given by:

where H(**w**^{(i)}) is the Shannon entropy of **w**^{(i)}, which is, in fact, the most entropic point in W_{i}.

In [23], Kern-Isberner and Rödder argue that W_{1},…, W_{n} ⊆ ⅅ^{J} can be considered as marginal probabilities in a subset U ⊆ ⅅ^{J+n}, such that every probability function **v** ∈ U marginalizes to a ⅅ^{J}-probability function belonging to one and only one set W_{i}. Since the point which **KIRP** produces is then, in fact, the ⅅ^{J}-marginal of the most entropic point in U, following the justification of the Shannon entropy, they conclude that such a point is a natural interpretation of W_{1},…, W_{n} by a single probability function. **KIRP** thus maps the uniform probability function to the ⅅ^{J}-marginal of the most entropic point in U. To date, **KIRP** has received much attention in the area of probabilistic knowledge merging.

However, any obdurate merging operator seems to be challenged by its violation of the following principle.

**(CP) Consistency Principle.** Let ∆ be a probabilistic merging operator. Then, we say that ∆ satisfies the consistency principle if, for every n ≥ 1 and all W_{1},…, W_{n} ⊆ ⅅ^{J}:

$\text{if}\phantom{\rule{0.2em}{0ex}}{\bigcap }_{i=1}^{n}{W}_{i}\ne \mathrm{\varnothing },\phantom{\rule{0.2em}{0ex}}\text{then}\phantom{\rule{0.2em}{0ex}}\mathrm{\Delta }({W}_{1},\dots ,{W}_{n})\subseteq {\bigcap }_{i=1}^{n}{W}_{i}.$

This principle often falls under the following philosophical criticism. One might imagine a situation where several experts consider a large set of probability functions as admissible, while one believes in a single probability function. Although this one is consistent with the beliefs of the rest of the group, one might argue that it is not justified to merge the knowledge of the whole group into that single probability function.

More rigorously, Williamson [24] introduces a particular interpretation of the epistemological status of an expert’s knowledge base, which he calls “granting”. He rejects (CP), as several experts may grant the same piece of knowledge for inconsistent reasons.

On the other hand, Adamčík and Wilmers in [5] assume that the way in which the knowledge was obtained is considered irrelevant, and each expert has incorporated all of his relevant knowledge into what he is declaring, contrary to Williamson’s granting. This is sometimes referred to as the principle of total evidence [25] or the Watts assumption [26]. They argue that, although overall knowledge of any human expert can never be fully formalized, as a formalization is always an abstraction from reality, the principle of total evidence needs to be imposed in order to avoid confusion in any discussion related to methods of representing the collective knowledge of experts. Otherwise, there would be an inexhaustible supply of invalid arguments produced by a philosophical opponent challenging one’s reasoning using implicit background information, which is not included in the formal representation of a knowledge base.

However, in this paper, we do not wish to probe further into this philosophical argument, and instead, we present the following rather surprising theorem, which appeared for the first time in [10].

**Theorem 5.** There is no obdurate merging operator O that satisfies the consistency principle (CP).

**Proof.** Suppose that J ≥ 3. Let d be the function to minimize from the definition of **O**, where, for simplicity, we suppress the constant superscript. Let **v** ∈ ⅅ^{J} be the unique minimizer of d over some sufficiently large closed convex subset W of ⅅ^{J}. Let **w**, **u** ∈ W be such that d(**v**) < d(**w**) < d(**u**) and **w** = λ**v** + (1 − λ)**u** for some 0 < λ < 1 (in particular, **w** is a linear combination of **v** and **u**).

Let **s** ∈ W be such that d(**v**) < d(**s**) < d(**w**) and **s** is not a linear combination of **v** and **u**. Then, there is **s**′, such that **s**′ = λ**s** + (1 − λ)**w** for some 0 < λ ≤ 1, and d is strictly increasing along the line from **s**′ to **w**. This is because d is strictly convex and d(**s**) < d(**w**). Note that if J = 2, then **s** would always be a linear combination of **v** and **u**. Moreover, for sufficiently large W ⊆ ⅅ^{3}, we can always choose **w**, **u**, **s** and **s**′ in W as above.

Now, we show that d is also strictly increasing along the line from **s**′ to **u**. Assume this is not the case. Then, by the same argument as before, there is **s**″, such that d(**s**″) < d(**s**′). Due to the construction, the line from **v** to **s**″ intersects the line from **s**′ to **w**; let us denote the point of intersection as **r**. Since d is strictly increasing along the line from **s**′ to **w**, we have that d(**r**) > d(**s**′) > d(**s**″) > d(**v**). This, however, contradicts the convexity of d. The situation is depicted in Figure 5.

Now, assume that W_{1} = {λ**v** + (1 − λ)**w** : λ ∈ [0, 1]}, W_{2} = {λ**s**′ + (1 − λ)**w** : λ ∈ [0, 1]}, V_{1} = {λ**v** + (1 − λ)**u** : λ ∈ [0, 1]} and V_{2} = {λ**s**′ + (1 − λ)**u** : λ ∈ [0, 1]}. Since **v** minimizes d, and along the lines from **s**′ to **w** and from **s**′ to **u** the function d is strictly increasing, we have that:

$\mathrm{O}({W}_{1},{W}_{2})=\{\mathbf{Pool}(\mathbf{v},{\mathbf{s}}^{\prime })\}=\mathrm{O}({V}_{1},{V}_{2}),\phantom{\rule{2em}{0ex}}(2)$

where **Pool** is the pooling operator used in the second stage of **O**. Suppose that **O** satisfies (CP). Then, **O**(W_{1}, W_{2}) = {**w**} and **O**(V_{1}, V_{2}) = {**u**}, which contradicts Equation (2).

The theorem above in some philosophical contexts can be used as an argument against the consistency principle, while from another perspective, it casts a shadow on the notion of an obdurate merging operator. This unfortunately includes the natural merging operator **OSEP**, or obdurate social entropy process, defined as follows. For every n ≥ 1 and all W_{1},…, W_{n} ⊆ ⅅ^{J}:

$\mathbf{OSEP}({W}_{1},\dots ,{W}_{n})=\{{\mathbf{LogOp}}_{\mathcal{N}}(\mathrm{ME}({W}_{1}),\dots ,\mathrm{ME}({W}_{n}))\},$

where ME(W_{i}) denotes the most entropic point in W_{i}, or equivalently the KL-projection of the uniform probability function into W_{i}, and $\mathcal{N}$ is the family of weighting vectors $(\frac{1}{n},\dots ,\frac{1}{n})$, one for every n ≥ 1. It is easy to observe that **OSEP** is really an obdurate merging operator.

In [10], it is proven that OSEP is (thus far, the only known) probabilistic merging operator satisfying a particular version of the independence principle, a principle that is an attempt to resurrect the notion of the independence preservation of pooling operators [20] in the context of probabilistic merging operators.

One may say that the reason behind an obdurate merging operator not satisfying (CP) is its “forgetting” nature. In the first stage, it transforms sets W_{1},…, W_{n} into **w**^{(1)},…, **w**^{(n)} individually without taking into account other sets, thus “forgetting” any existing connections, such as the consistency. However, instead of changing the definition of an averaging projective procedure so as to make it not “forgetting”, we will take a different viewpoint on the procedure itself in the following subsection.

#### 2.3. Fixed Points

Our second approach to an averaging projective procedure F consists of considering the set of the fixed points of F. That is, for given n ≥ 1 and given closed convex nonempty sets W_{1},…, W_{n} ⊆ ⅅ^{J}, we are interested in whether there are any points **v** ∈ ⅅ^{J}, such that:

${F}_{[{W}_{1},\dots ,{W}_{n}]}(\mathbf{v})=\mathbf{v}.$

Following the convincing justification for combining Bregman projections with the
${\mathbf{LinOp}}_{\mathcal{A}}$-pooling operator (see Section 1.3), for every convex Bregman divergence D_{f} and a family of weighting vectors
$\mathcal{A}$, we consider here the averaging projective procedure
${F}^{{D}_{f},\mathcal{A}}$ defined for every n ≥ 1 and all closed convex nonempty sets W_{1},…, W_{n} ⊆ ⅅ^{J} by the following.

1. For an argument **v** ∈ ⅅ^{J}, take **w**^{(i)} to be the D_{f}-projection of **v** into W_{i} for all 1 ≤ i ≤ n.
2. Set ${F}_{[{W}_{1},\dots ,{W}_{n}]}^{{D}_{f},\mathcal{A}}(\mathbf{v})={\mathbf{LinOp}}_{\mathbf{a}}({\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)})$, where **a** ∈ $\mathcal{A}$.

The restriction to convex Bregman divergences is needed for some later theorems and is adopted ad hoc. Therefore, unfortunately, we cannot provide any elaborate justification for it.

Given closed convex nonempty sets W_{1},…, W_{n} ⊆ ⅅ^{J}, we will denote the set of all fixed points of ${F}^{{D}_{f},\mathcal{A}}$ defined above by ${\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$, where **a** ∈ $\mathcal{A}$.
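For intuition, one can search for a fixed point by simply iterating ${F}^{{D}_{f},\mathcal{A}}$. The sketch below uses the squared Euclidean distance and the equal-weight LinOp-pooling operator on two hypothetical segments; their unique common point (0.2, 0.4, 0.4) is then the fixed point found by the iteration:

```python
def proj_segment(v, p, q):
    """E2-projection of v onto the segment {t*p + (1-t)*q : t in [0, 1]}."""
    d = [pi - qi for pi, qi in zip(p, q)]
    t = sum((vi - qi) * di for vi, qi, di in zip(v, q, d)) / sum(di * di for di in d)
    t = max(0.0, min(1.0, t))
    return [t * pi + (1 - t) * qi for pi, qi in zip(p, q)]

def F(v, segments):
    """One application: E2-project into each W_i, then equal-weight LinOp."""
    projs = [proj_segment(v, p, q) for p, q in segments]
    n = len(projs)
    return [sum(w[j] for w in projs) / n for j in range(len(v))]

# hypothetical sets W_1, W_2 meeting in the single point (0.2, 0.4, 0.4)
segments = [((0.1, 0.4, 0.5), (0.5, 0.4, 0.1)),
            ((0.2, 0.2, 0.6), (0.2, 0.6, 0.2))]
v = [1 / 3, 1 / 3, 1 / 3]
for _ in range(2000):
    v = F(v, segments)

# v is (numerically) a fixed point, namely the common point of W_1 and W_2
assert max(abs(a - b) for a, b in zip(v, F(v, segments))) < 1e-9
assert max(abs(a - b) for a, b in zip(v, (0.2, 0.4, 0.4))) < 1e-6
```

When the sets intersect, as here, any common point is trivially fixed; the interesting cases studied below are those where the sets are disjoint.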

On the other hand, the conjugated parallelogram theorem (Theorem 4), suggesting the combination of the conjugated KL-projection with the LogOp-pooling operator, leads us to the consideration of those convex Bregman divergences, which are strictly convex also in the second argument. The squared Euclidean distance and the Kullback–Leibler divergence are instances of such divergences. A fairly general example is a Bregman divergence D_{f}, such that
$f(\mathbf{v})={\displaystyle {\sum}_{j=1}^{J}g({v}_{j})}$, where g is a strictly convex function (0, 1) → ℝ, which is three times differentiable, and g″(v_{j}) − (w_{j} − v_{j})g‴(v_{j}) > 0 for all 1 ≤ j ≤ J and all **w**, **v** ∈ ⅅ^{J} (this is easy to check by the Hessian matrix). Apart from the two divergences mentioned above, this condition is satisfied in particular if g(v) = v^{r}, 2 ≥ r > 1. Note that the Bregman divergence generated by such a function g is also convex in both arguments.

Assuming strict convexity in the second argument of D_{f}, we can define the conjugated D_{f}-projection of **v** ∈ ⅅ^{J} into a closed convex nonempty set W ⊆ ⅅ^{J} as that unique **w** ∈ W that minimizes D_{f}(**v**‖**w**) subject only to **w** ∈ W. Moreover, since a sum of strictly convex functions is a strictly convex function, for any **w**^{(1)},…, **w**^{(n)} ∈ ⅅ^{J}, there exists a unique minimizer **v** ∈ ⅅ^{J} of:

${\displaystyle {\sum}_{i=1}^{n}{a}_{i}{D}_{f}(\mathbf{v}\Vert {\mathbf{w}}^{(i)})},$

which we will denote by ${\mathbf{Pool}}_{\mathbf{a}}^{{D}_{f}}({\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)})$.

**Theorem 6** (Conjugated Parallelogram Theorem). *Let D_{f} be a Bregman divergence, **w**^{(1)},…, **w**^{(n)}, **v** ∈ ⅅ^{J} and **a** ∈ ⅅ^{n}. Then:*

${\displaystyle {\sum}_{i=1}^{n}{a}_{i}{D}_{f}(\mathbf{v}\Vert {\mathbf{w}}^{(i)})}={\displaystyle {\sum}_{i=1}^{n}{a}_{i}{D}_{f}(\mathbf{w}\Vert {\mathbf{w}}^{(i)})}+{D}_{f}(\mathbf{v}\Vert \mathbf{w}),$

*where* $\mathbf{w}={\mathbf{Pool}}_{\mathbf{a}}^{{D}_{f}}({\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)})$.

**Proof.** Let
$\mathbf{w}={\mathbf{Pool}}_{\mathbf{a}}^{{D}_{f}}({\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)})$. We need to prove that:

The idea of defining a spectrum of pooling operators where the pooling operators LinOp and LogOp are special cases was developed previously in a similar manner, but in a slightly different framework of alpha-divergences; cf. [27].

Here, following [1,12], we will point out a geometrical relationship between pooling operators LinOp and Pool^{Df}, which will be helpful in illustrating some results of this paper.

Recall that the generator of a Bregman divergence D_{f} is a strictly convex function f : (0, 1)^{J} → ℝ, which is differentiable over ⅅ^{J}. Let **w** ∈ ⅅ^{J}. We define **w**^{∗} = ∇f(**w**). Since f is a strictly convex function, the mapping **w** → ∇f(**w**) is injective; thus, the coordinates of **w**^{∗} form a coordinate system. There are two kinds of affine structures in ⅅ^{J}: D_{f}(**w**‖**v**) is convex in **w** with respect to the first structure and is convex in **v**^{∗} with respect to the second structure.

Therefore, the proof above, in fact, gives ${\mathbf{v}}^{*}={[{\mathbf{Pool}}_{\mathbf{a}}^{{D}_{f}}({\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)})]}^{*}={\mathbf{LinOp}}_{\mathbf{a}}({[{\mathbf{w}}^{(1)}]}^{*},\dots ,{[{\mathbf{w}}^{(n)}]}^{*})+\mathbf{c}$, where $\mathbf{c}=(\underset{J\text{-times}}{\underbrace{\lambda ,\dots ,\lambda}})$ is a normalizing vector induced by ${\sum}_{j=1}^{J}{v}_{j}=1$.
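For the KL-divergence this dual-coordinate picture identifies the pooling operator with LogOp: the normalized weighted geometric mean should minimize ${\sum}_{i}{a}_{i}\mathrm{KL}(\mathbf{v}\Vert {\mathbf{w}}^{(i)})$ over the simplex. A numerical spot check, assuming ${\sum}_{i}{a}_{i}=1$ (helper names are ours):

```python
import math, random

def kl(v, w):
    # Kullback-Leibler divergence KL(v || w) for positive probability vectors
    return sum(vj * math.log(vj / wj) for vj, wj in zip(v, w))

def logop(a, ws):
    # LogOp pooling: weighted geometric mean, renormalized to the simplex
    g = [math.prod(w[j] ** ai for ai, w in zip(a, ws)) for j in range(len(ws[0]))]
    s = sum(g)
    return [x / s for x in g]

a = (0.3, 0.7)
ws = ([0.5, 0.3, 0.2], [0.1, 0.6, 0.3])
pool = logop(a, ws)
obj = lambda v: sum(ai * kl(v, w) for ai, w in zip(a, ws))

# LogOp should beat randomly drawn points of the simplex
random.seed(0)
for _ in range(1000):
    x = [random.random() + 1e-9 for _ in range(3)]
    v = [xi / sum(x) for xi in x]
    assert obj(pool) <= obj(v) + 1e-12
```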

The only other type of averaging projective procedure
${\widehat{F}}^{{D}_{f},\mathcal{A}}$ that we consider here will be generated by a convex differentiable Bregman divergence D_{f}, which is strictly convex in its second argument, and a family of weighting vectors $\mathcal{A}$, and is defined for every n ≥ 1 and all closed convex nonempty sets W_{1},…, W_{n} ⊆ ⅅ^{J} by the following.

For an argument **v** ∈ ⅅ^{J}, take **w**^{(i)} as the conjugated D_{f}-projection of **v** into W_{i} for all 1 ≤ i ≤ n. Set ${\widehat{F}}_{[{\mathrm{W}}_{1},\dots ,{\mathrm{W}}_{n}]}^{{D}_{f},\mathcal{A}}(\mathbf{v})={\mathbf{Pool}}_{\mathbf{a}}^{{D}_{f}}({\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)})$, where **a** ∈ $\mathcal{A}$.

Given closed convex nonempty sets W_{1},…, W_{n} ⊆ ⅅ^{J}, we will denote the set of all fixed points of ${\widehat{F}}^{{D}_{f},\mathcal{A}}$ defined above by ${\widehat{\mathrm{\Theta}}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$, where **a** ∈
$\mathcal{A}$.

Note that we always require an additional assumption of D_{f} being differentiable for this type of averaging projective procedure. This assumption is essential to the proofs of some results concerning this procedure. We note that both divergences KL and E2 are differentiable.

Given a family of weighting vectors
$\mathcal{A}$, our aim is to investigate
${\mathrm{\Theta}}_{\mathcal{A}}^{{D}_{f}}=\{{\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}:\mathbf{a}\in \mathcal{A}\}$ and
${\widehat{\mathrm{\Theta}}}_{\mathcal{A}}^{{D}_{f}}=\{{\widehat{\mathrm{\Theta}}}_{\mathbf{a}}^{{D}_{f}}:\mathbf{a}\in \mathcal{A}\}$ as operators acting on
$\mathcal{P}({\mathbb{D}}^{J})\times \dots \times \mathcal{P}({\mathbb{D}}^{J})$. In particular, we ask the following questions. Given any closed convex nonempty sets W_{1},…, W_{n} ⊆ ⅅ^{J} and **a** ∈
$\mathcal{A}$:

Are ${\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$ and ${\widehat{\mathrm{\Theta}}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$ always nonempty?

Are these sets always closed and convex?

If both answers are positive, then we can consider ${\mathrm{\Theta}}_{\mathcal{A}}^{{D}_{f}}$ and ${\widehat{\mathrm{\Theta}}}_{\mathcal{A}}^{{D}_{f}}$ as probabilistic merging operators. In such a case, the following question makes sense.

As probabilistic merging operators, do they satisfy the consistency principle (CP)?

The fact that the answer to all three questions is “yes” is perhaps surprising, given that the much simpler obdurate merging operators do not satisfy (CP). We prove the above results in the following sequence of theorems, which conclude Section 2.

The following well-known lemma is a simple, but useful observation.

**Lemma 1.** Let D_{f} be a Bregman divergence and **a**, **v**, **w** ∈ ⅅ^{J}. Then:

**Theorem 7.** Let D_{f} be a convex Bregman divergence, W_{1},…, W_{n} ⊆ ⅅ^{J} be closed convex nonempty sets and **a** ∈ ⅅ^{n}. Let **v**, **w** ∈ ⅅ^{J}, **u**^{(1)} ∈ W_{1},…, **u**^{(n)} ∈ W_{n} and **w**^{(1)} ∈ W_{1},…, **w**^{(n)} ∈ W_{n} be such that **v** = **LinOp**_{a}(**u**^{(1)},…, **u**^{(n)}), **w** = **LinOp**_{a}(**w**^{(1)},…, **w**^{(n)}) and the **u**^{(i)} are the D_{f}-projections of **v** into W_{i}, 1 ≤ i ≤ n. Then:

**Proof.** First of all, by the extended Pythagorean property, we have that:

Since we assume that D_{f}(·‖·) is a convex function in both arguments, by the Jensen inequality:

Figure 6 depicts the situation in the proof above for n = 2. Arrows indicate corresponding divergences.

An interesting question related to conjugated Bregman projections is whether a property similar to the Pythagorean property holds. It turns out that the corresponding property is the so-called four-point property, due to Csiszár and Tusnády. The following theorem in the case of the KL-divergence is a specific instance of a result in [28], Lemma 3, but the formulation using the term “conjugated KL-projection” first appeared in [21]. An illustration is depicted in Figure 7.

**Theorem 8** (Four-Point Property). Let D_{f} be a convex differentiable Bregman divergence, which is strictly convex in its second argument. Let V be a convex closed nonempty subset of ⅅ^{J}, and let **v**, **u**, **w**, **s** ∈ ⅅ^{J} be such that **v** is the conjugated D_{f}-projection of **w** into V and **u** ∈ V is arbitrary. Then:

**Proof.** By Lemma 1, we have that:

Since D_{f}(·‖·) is a convex differentiable function, by applying the first-order convexity condition twice, we have that:

Expressions (6) and (7) give that:

Since **v** is the conjugated D_{f}-projection of **w** into V, the gradient of D_{f}(**w**‖·) at (**w**, **v**) in the direction to (**w**, **u**) must be greater than or equal to zero:

The following result appeared for the first time in [10], but without considering the weighting.

**Theorem 9** (Characterization Theorem for
${\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}$). Let D_{f} be a convex Bregman divergence, **a** ∈ ⅅ^{n} and W_{1},…, W_{n} ⊆ ⅅ^{J} be closed convex nonempty sets. Then:

${\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$ is exactly the set of those **v** ∈ ⅅ^{J}, which globally minimize ${\sum}_{i=1}^{n}{a}_{i}{D}_{f}({\mathbf{w}}^{(i)}\Vert \mathbf{v})$, subject only to **w**^{(1)} ∈ W_{1},…, **w**^{(n)} ∈ W_{n}.

**Proof.** It is easy to see that, given closed convex nonempty sets W_{1},…, W_{n} ⊆ ⅅ^{J}, we have that those **w**^{(1)} ∈ W_{1},…, **w**^{(n)} ∈ W_{n}, which together with **v** ∈ ⅅ^{J}, globally minimize:

are such that **w**^{(1)},…, **w**^{(n)} are the D_{f}-projections of **v** into W_{1},…, W_{n}, respectively. This, together with Equation (1) (the equation preceding Theorem 2), gives:

Now, assume that $\mathbf{v}\in {\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$ and $\mathbf{u}\in {\mathbb{D}}^{J}$. Let us denote the D_{f}-projections of **v** into W_{1},…, W_{n} by **w**^{(1)},…, **w**^{(n)}, respectively. Accordingly, let us denote the D_{f}-projections of **u** into W_{1},…, W_{n} by **r**^{(1)},…, **r**^{(n)}, respectively. Suppose that ${\sum}_{i=1}^{n}{a}_{i}{D}_{f}({\mathbf{w}}^{(i)}\Vert \mathbf{v})>{\sum}_{i=1}^{n}{a}_{i}{D}_{f}({\mathbf{r}}^{(i)}\Vert \mathbf{u})$, i.e.,

Let us now deviate for a while from the goals of this subsection and stress the importance of the restriction to the positive discrete probability functions, which was detailed in Section 1.1. The problem with the KL-divergence is that the function
$f(\mathbf{x})={\displaystyle {\sum}_{j=1}^{J}{x}_{j}\mathrm{log}{x}_{j}}$ is not differentiable if some x_{j} = 0. Without the adopted restriction, the KL-divergence is therefore usually defined by:

$$\mathrm{KL}(\mathbf{w}\Vert \mathbf{v})={\sum}_{j=1}^{J}{w}_{j}\mathrm{log}\frac{{w}_{j}}{{v}_{j}},$$

with the conventions that $0\,\mathrm{log}\,0=0$ and that $\mathrm{KL}(\mathbf{w}\Vert \mathbf{v})=\infty$ whenever some v_{j} = 0 while w_{j} > 0. If v_{j} = 0 implies w_{j} = 0 for all 1 ≤ j ≤ J, we say that **v** dominates **w** and write $\mathbf{v}\gg \mathbf{w}$.

The first problem we would face with this definition is whether the notion of the KL-projection makes sense. For given **v** ∈ ⅅ^{J} and closed convex nonempty set W ⊆ ⅅ^{J}, the KL-projection of **v** into W makes sense only if there is at least one **w** ∈ W, such that
$\mathbf{v}\gg \mathbf{w}$.
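The role of domination in the extended definition can be illustrated in a few lines; `kl_ext` is our ad hoc name for the extended divergence, with the convention 0 log 0 = 0:

```python
import math

def kl_ext(w, v):
    # extended KL-divergence on the closed simplex (0 * log 0 := 0);
    # infinite unless v dominates w (v_j = 0 implies w_j = 0)
    total = 0.0
    for wj, vj in zip(w, v):
        if wj == 0.0:
            continue            # 0 * log 0 convention
        if vj == 0.0:
            return math.inf     # w puts mass where v does not
        total += wj * math.log(wj / vj)
    return total

assert kl_ext([0.0, 0.5, 0.5], [0.0, 0.25, 0.75]) < math.inf   # v dominates w
assert kl_ext([0.5, 0.5, 0.0], [0.0, 0.25, 0.75]) == math.inf  # not dominated
```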

However, even if adding this condition to all of the discussion concerning the KL-projection above (this is perfectly possible, as seen in [10]), Theorem 9 still could not hold, as the following example demonstrates.

**Example 4.** Let ${W}_{1}=\{\lambda (0,0,\frac{1}{6},\frac{5}{6})+(1-\lambda )(0,\frac{1}{3},\frac{1}{3},\frac{1}{3}):\lambda \in [0,1]\}$ and ${W}_{2}=\{\lambda (0,0,\frac{1}{3},\frac{2}{3})+(1-\lambda )(0,\frac{1}{3},\frac{1}{3},\frac{1}{3}):\lambda \in [0,1]\}$. Assume that $\mathbf{a}=(\frac{1}{2},\frac{1}{2})$. It is easy to check that $(0,0,\frac{1}{4},\frac{3}{4})$ and $(0,\frac{1}{3},\frac{1}{3},\frac{1}{3})$ are both fixed points, but the former does not belong to the set of global minimizers **v** of $\mathrm{KL}({\mathbf{w}}^{(1)}\Vert \mathbf{v})+\mathrm{KL}({\mathbf{w}}^{(2)}\Vert \mathbf{v})$ subject to **w**^{(1)} ∈ W_{1} and **w**^{(2)} ∈ W_{2}. An illustration is depicted in Figure 8.

Moreover, some variant of the above example would show that the set ${\mathrm{\Theta}}_{\mathbf{a}}^{\mathrm{KL}}({W}_{1},{W}_{2})$ is not convex, which would wreck our aims; more details are given in [10].

On the other hand, those Bregman divergences whose generating functions are differentiable over the whole space of discrete probability functions (e.g., the squared Euclidean distance) would not encounter the difficulties of the KL-divergence. In particular, Theorem 9 formulated over the whole space of discrete probability functions (as opposed to only the positive ones) would still hold for such Bregman divergences.

Now, we shall go back and prove a theorem similar to Theorem 9 for the ${\widehat{\mathrm{\Theta}}}_{\mathcal{A}}^{{D}_{f}}$-operator. In order to do that, we will need the following analogue of Theorem 7.

**Theorem 10.** Let D_{f} be a convex differentiable Bregman divergence, which is strictly convex in its second argument, and let W_{1},…, W_{n} ⊆ ⅅ^{J} be closed convex nonempty sets and **a** ∈ ⅅ^{n}. Let **v**, **w** ∈ ⅅ^{J} and **u**^{(1)} ∈ W_{1},…, **u**^{(n)} ∈ W_{n} and **w**^{(1)} ∈ W_{1},…, **w**^{(n)} ∈ W_{n} be such that $\mathbf{v}={\mathbf{Pool}}_{\mathbf{a}}^{{D}_{f}}({\mathbf{u}}^{(1)},\dots ,{\mathbf{u}}^{(n)})$, $\mathbf{w}={\mathbf{Pool}}_{\mathbf{a}}^{{D}_{f}}({\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)})$ and the **u**^{(i)} are the conjugated D_{f}-projections of **v** into W_{i}, 1 ≤ i ≤ n. Then:

**Proof.** By Theorem 6, we have that:

which, using the differentiability of D_{f} to employ the four-point property (Theorem 8), becomes:

The theorem above is fairly similar to Theorem 7. Let us use the dual affine structure in ⅅ^{J} defined after the proof of Theorem 6 to analyze this more closely. For W ⊂ ⅅ^{J}, define W^{∗} = {**w**^{∗} : **w** ∈ W}, and define the dual divergence ${D}_{f}^{*}$ to the divergence D_{f} by ${D}_{f}^{*}({\mathbf{v}}^{*}\Vert {\mathbf{w}}^{*})={D}_{f}(\mathbf{w}\Vert \mathbf{v})$. Since, by Theorem 6, we have that ${\mathbf{v}}^{*}={[{\mathbf{Pool}}_{\mathbf{a}}^{{D}_{f}}({\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)})]}^{*}={\mathbf{LinOp}}_{\mathbf{a}}({[{\mathbf{w}}^{(1)}]}^{*},\dots ,{[{\mathbf{w}}^{(n)}]}^{*})+{\mathbf{c}}_{\mathbf{v}}$, where ${\mathbf{c}}_{\mathbf{v}}=(\underset{J\text{-times}}{\underbrace{\lambda ,\dots ,\lambda}})$ is a normalizing vector induced by ${\sum}_{j=1}^{J}{v}_{j}=1$, the theorem above can be rewritten as follows.

Let D_{f} be a convex differentiable Bregman divergence, which is strictly convex in its second argument, and let W_{1},…, W_{n} ⊆ ⅅ^{J} be closed convex nonempty sets and **a** ∈ ⅅ^{n}. Let **v**, **w** ∈ ⅅ^{J}, **u**^{(1)} ∈ W_{1},…, **u**^{(n)} ∈ W_{n} and **w**^{(1)} ∈ W_{1},…, **w**^{(n)} ∈ W_{n} be such that ${\mathbf{v}}^{*}={\mathbf{LinOp}}_{\mathbf{a}}({[{\mathbf{u}}^{(1)}]}^{*},\dots ,{[{\mathbf{u}}^{(n)}]}^{*})+{\mathbf{c}}_{\mathbf{v}}$, ${\mathbf{w}}^{*}={\mathbf{LinOp}}_{\mathbf{a}}({[{\mathbf{w}}^{(1)}]}^{*},\dots ,{[{\mathbf{w}}^{(n)}]}^{*})+{\mathbf{c}}_{\mathbf{w}}$ and the ${[{\mathbf{u}}^{(i)}]}^{*}$ are the ${D}_{f}^{*}$-projections of ${\mathbf{v}}^{*}$ into ${W}_{i}^{*}$, 1 ≤ i ≤ n. Then:

This illustrates that if D_{f} is a convex differentiable Bregman divergence that is strictly convex in its second argument, then Theorems 7 and 10 are dual with respect to ^{∗}.

**Theorem 11** (Characterization Theorem for
${\widehat{\mathrm{\Theta}}}_{\mathbf{a}}^{{D}_{f}}$). Let D_{f} be a convex differentiable Bregman divergence, which is strictly convex in its second argument, and let W_{1},…, W_{n} ⊆ ⅅ^{J} be closed convex nonempty sets and **a** ∈ ⅅ^{n}. Then:

**Proof.** The proof is similar to the proof of Theorem 9. First, given closed convex nonempty sets W_{1},…, W_{n} ⊆ ⅅ^{J}, we have that those **w**^{(1)} ∈ W_{1},…, **w**^{(n)} ∈ W_{n}, which together with **v** ∈ ⅅ^{J}, globally minimize:

are such that **w**^{(1)},…, **w**^{(n)} are the conjugated D_{f}-projections of **v** into W_{1},…, W_{n}, respectively. This, together with the definition of ${\mathbf{Pool}}_{\mathbf{a}}^{{D}_{f}}$, gives:

Second, assume that $\mathbf{v}\in {\widehat{\mathrm{\Theta}}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$ and $\mathbf{u}\in {\mathbb{D}}^{J}$. Let us denote the conjugated D_{f}-projections of **v** into W_{1},…, W_{n} by **w**^{(1)},…, **w**^{(n)}, respectively. Accordingly, let us denote the conjugated D_{f}-projections of **u** into W_{1},…, W_{n} by **r**^{(1)},…, **r**^{(n)}, respectively. Suppose that:

The following simple observation, originally from [10] and based on Equation (1) (alternatively, on the parallelogram theorem), will be used in the proof of the forthcoming theorem.

**Lemma 2.** Let D_{f} be a convex Bregman divergence, **a** ∈ ⅅ^{n} and W_{1},…, W_{n} ⊆ ⅅ^{J} be closed convex nonempty sets. Then, the following are equivalent:

1. The probability functions **v**, **w**^{(1)},…, **w**^{(n)} ∈ ⅅ^{J} minimize the quantity:
$$\sum _{i=1}^{n}{a}_{i}{D}_{f}({\mathbf{w}}^{(i)}\Vert \mathbf{v})$$
subject to **w**^{(1)} ∈ W_{1},…, **w**^{(n)} ∈ W_{n}.
2. The probability functions **w**^{(1)},…, **w**^{(n)} ∈ ⅅ^{J} minimize the quantity:
$$\sum _{i=1}^{n}{a}_{i}{D}_{f}({\mathbf{w}}^{(i)}\Vert {\mathbf{LinOp}}_{\mathbf{a}}({\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)}))$$
subject to **w**^{(1)} ∈ W_{1},…, **w**^{(n)} ∈ W_{n}, and **v** = **LinOp**_{a}(**w**^{(1)},…, **w**^{(n)}).
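The fact behind Equation (1), that **LinOp**_{a}(**w**^{(1)},…, **w**^{(n)}) globally minimizes ${\sum}_{i}{a}_{i}{D}_{f}({\mathbf{w}}^{(i)}\Vert \mathbf{v})$ in **v**, can be spot-checked numerically for both KL and the squared Euclidean distance. A sketch, assuming ${\sum}_{i}{a}_{i}=1$ (helper names are ours):

```python
import math, random

def kl(w, v):
    # Kullback-Leibler divergence KL(w || v) for positive probability vectors
    return sum(wj * math.log(wj / vj) for wj, vj in zip(w, v))

def e2(w, v):
    # squared Euclidean distance
    return sum((wj - vj) ** 2 for wj, vj in zip(w, v))

a = (0.2, 0.5, 0.3)
ws = ([0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4])
linop = [sum(ai * w[j] for ai, w in zip(a, ws)) for j in range(3)]

# the weighted arithmetic mean beats randomly drawn points of the simplex
random.seed(1)
for d in (kl, e2):
    obj = lambda v: sum(ai * d(w, v) for ai, w in zip(a, ws))
    for _ in range(1000):
        x = [random.random() + 1e-9 for _ in range(3)]
        v = [xi / sum(x) for xi in x]
        assert obj(linop) <= obj(v) + 1e-12
```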

**Theorem 12.** Let D_{f} be a convex Bregman divergence. Then, for all nonempty closed convex sets W_{1},…, W_{n} ⊆ ⅅ^{J} and **a** ∈ ⅅ^{n}, the set$\left\{\mathrm{arg}{min}_{\mathbf{v}\in {\mathbb{D}}^{J}}{\displaystyle {\sum}_{i=1}^{n}{a}_{i}{D}_{f}({\mathbf{w}}^{(i)}\Vert \mathbf{v}):{\mathbf{w}}^{(i)}\in {W}_{i},1\le i\le n}\right\}$ is a nonempty closed convex region of ⅅ^{J}.

**Proof.** This proof is from [10]. Let **v**, **s** ∈ $\left\{\mathrm{arg}{min}_{\mathbf{v}\in {\mathbb{D}}^{J}}{\displaystyle {\sum}_{i=1}^{n}{a}_{i}{D}_{f}({\mathbf{w}}^{(i)}\Vert \mathbf{v}):{\mathbf{w}}^{(i)}\in {W}_{i},1\le i\le n}\right\}$; the set is clearly nonempty. For convexity, we need to show that λ**v** + (1 − λ)**s** belongs to this set for any λ ∈ [0, 1].

Assume that **w**^{(1)} ∈ W_{1},…, **w**^{(n)} ∈ W_{n} are such that **v** = **LinOp**_{a}(**w**^{(1)},…, **w**^{(n)}) and **u**^{(1)} ∈ W_{1},…, **u**^{(n)} ∈ W_{n} are such that **s** = **LinOp**_{a}(**u**^{(1)},…, **u**^{(n)}). It is easy to observe that the convexity of D_{f}(·‖·) implies the convexity of:

$$g({\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)})={\sum}_{i=1}^{n}{a}_{i}{D}_{f}({\mathbf{w}}^{(i)}\Vert {\mathbf{LinOp}}_{\mathbf{a}}({\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)}))$$

over the region specified by **w**^{(i)} ∈ W_{i}, 1 ≤ i ≤ n. Moreover, the function g attains its minimum over this convex region at the points (**w**^{(1)},…, **w**^{(n)}) and (**u**^{(1)},…, **u**^{(n)}). We need to show that g also attains its minimum at the point:

$$(\lambda {\mathbf{w}}^{(1)}+(1-\lambda ){\mathbf{u}}^{(1)},\dots ,\lambda {\mathbf{w}}^{(n)}+(1-\lambda ){\mathbf{u}}^{(n)}).$$

By the convexity of g, the value of g at this point is at most λg(**w**^{(1)},…, **w**^{(n)}) + (1 − λ)g(**u**^{(1)},…, **u**^{(n)}). Since g(**w**^{(1)},…, **w**^{(n)}) = g(**u**^{(1)},…, **u**^{(n)}) is the minimal value, the inequality above can only hold with equality, and therefore, by Lemma 2, λ**v** + (1 − λ)**s** belongs to the set of minimizers.

Moreover, since convexity implies continuity, the minimization of a convex function over a closed convex region produces a closed convex set. Therefore, the fact that W_{1},…, W_{n} are all closed and convex implies that the set of n-tuples (**w**^{(1)},…, **w**^{(n)}), which are global minimizers of g over the region specified by **w**^{(i)} ∈ W_{i}, 1 ≤ i ≤ n, is closed. Additionally, since closed regions are preserved by projections in the Euclidean space, the set given by **LinOp**_{a}(**w**^{(1)},…, **w**^{(n)}) is closed, as well.

The following observation immediately follows by the definition of ${\mathbf{Pool}}_{\mathbf{a}}^{{D}_{f}}$.

**Lemma 3.** Let D_{f} be a convex Bregman divergence, **a** ∈ ⅅ^{n} and W_{1},…, W_{n} ⊆ ⅅ^{J} be closed convex nonempty sets. Then, the following are equivalent:

1. The probability functions **v**, **w**^{(1)},…, **w**^{(n)} ∈ ⅅ^{J} minimize the quantity:
$$\sum _{i=1}^{n}{a}_{i}{D}_{f}(\mathbf{v}\Vert {\mathbf{w}}^{(i)})$$
subject to **w**^{(1)} ∈ W_{1},…, **w**^{(n)} ∈ W_{n}.
2. The probability functions **w**^{(1)},…, **w**^{(n)} ∈ ⅅ^{J} minimize the quantity:
$$\sum _{i=1}^{n}{a}_{i}{D}_{f}({\mathbf{Pool}}_{\mathbf{a}}^{{D}_{f}}({\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)})\Vert {\mathbf{w}}^{(i)})$$
subject to **w**^{(1)} ∈ W_{1},…, **w**^{(n)} ∈ W_{n}, and $\mathbf{v}={\mathbf{Pool}}_{\mathbf{a}}^{{D}_{f}}({\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)})$.

**Theorem 13.** Let D_{f} be a convex Bregman divergence. Then, for all nonempty closed convex sets W_{1},…, W_{n} ⊆ ⅅ^{J} and **a** ∈ ⅅ^{n}, the set$\left\{\mathrm{arg}{min}_{\mathbf{v}\in {\mathbb{D}}^{J}}{\displaystyle {\sum}_{i=1}^{n}{a}_{i}{D}_{f}(\mathbf{v}\Vert {\mathbf{w}}^{(i)}):{\mathbf{w}}^{(i)}\in {W}_{i},1\le i\le n}\right\}$ is a nonempty closed convex region of ⅅ^{J}.

**Proof.** Let **v**, **s** ∈ $\left\{\mathrm{arg}{min}_{\mathbf{v}\in {\mathbb{D}}^{J}}{\displaystyle {\sum}_{i=1}^{n}{a}_{i}{D}_{f}(\mathbf{v}\Vert {\mathbf{w}}^{(i)}):{\mathbf{w}}^{(i)}\in {W}_{i},1\le i\le n}\right\}$; the set is clearly nonempty. For convexity, we need to show that λ**v** + (1 − λ)**s** belongs to this set for any λ ∈ [0, 1].

Assume that **w**^{(1)} ∈ W_{1},…, **w**^{(n)} ∈ W_{n} are such that $\mathbf{v}={\mathbf{Pool}}_{\mathbf{a}}^{{D}_{f}}({\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)})$ and **u**^{(1)} ∈ W_{1},…, **u**^{(n)} ∈ W_{n} are such that $\mathbf{s}={\mathbf{Pool}}_{\mathbf{a}}^{{D}_{f}}({\mathbf{u}}^{(1)},\dots ,{\mathbf{u}}^{(n)})$. Now, for any λ ∈ [0, 1],

where the first inequality follows from the convexity of D_{f}(·‖·) and the second from the definition of ${\mathbf{Pool}}_{\mathbf{a}}^{{D}_{f}}$ as the unique minimizer. However, the inequality above can only hold with equality and, by Lemma 3, λ**v** + (1 − λ)**s** belongs to the set of minimizers.

Moreover, since convexity implies continuity, the minimization of a convex function over a closed convex region produces a closed convex set. Therefore, the fact that W_{1},…, W_{n} are all closed and convex implies that the set of n-tuples (**w**^{(1)},…, **w**^{(n)}), which are global minimizers of
${\sum}_{i=1}^{n}{a}_{i}{D}_{f}({\mathbf{Pool}}_{\mathbf{a}}^{{D}_{f}}({\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)})\Vert {\mathbf{w}}^{(i)})$ over the region specified by **w**^{(i)} ∈ W_{i}, 1 ≤ i ≤ n, is closed. Additionally, since closed regions are preserved by projections in the Euclidean space, the set given by
${\mathbf{Pool}}_{\mathbf{a}}^{{D}_{f}}({\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)})$ is closed, as well. □

Finally, we can establish our initial claims:

**Theorem 14.** Let $\mathcal{A}$ be a family of weighting vectors. The operator ${\mathrm{\Theta}}_{\mathcal{A}}^{{D}_{f}}$, where D_{f} is a convex Bregman divergence, and the operator ${\widehat{\mathrm{\Theta}}}_{\mathcal{A}}^{{D}_{f}}$, where D_{f} is a convex differentiable Bregman divergence, which is strictly convex in its second argument, are well-defined probabilistic merging operators that satisfy (CP).

**Proof.** First, the fact that
${\mathrm{\Theta}}_{\mathcal{A}}^{{D}_{f}}$ is well defined as a probabilistic merging operator follows from Theorems 9 and 12. Accordingly,
${\widehat{\mathrm{\Theta}}}_{\mathcal{A}}^{{D}_{f}}$ is a well-defined probabilistic merging operator by Theorems 11 and 13.

Second, let **a** ∈
$\mathcal{A}$ (in particular **a** ∈ ⅅ^{n}) and W_{1},…, W_{n} ⊆ ⅅ^{J} be closed, convex, nonempty and have a nonempty intersection. Clearly, every point in that intersection minimizes
${\sum}_{i=1}^{n}{a}_{i}{D}_{f}({\mathbf{w}}^{(i)}\Vert \mathbf{v})$ and
${\sum}_{i=1}^{n}{a}_{i}{D}_{f}(\mathbf{v}\Vert {\mathbf{w}}^{(i)})$ subject to **w**^{(1)} ∈ W_{1},…, **w**^{(n)} ∈ W_{n} with both expressions attaining the zero value. Since D_{f}(**w**‖**v**) = 0 only if **w** = **v**, those points in the intersection are the only points minimizing the above quantities. □

It turns out that, given closed convex nonempty sets W_{1},…, W_{n} ⊆ ⅅ^{J} and a weighting vector **a**, the sets of fixed points
${\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$ and
${\widehat{\mathrm{\Theta}}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$ possess attractive properties, which make the operators
${\mathrm{\Theta}}_{\mathcal{A}}^{{D}_{f}}$ and
${\widehat{\mathrm{\Theta}}}_{\mathcal{A}}^{{D}_{f}}$ suitable for probabilistic merging. The following example taken from [10] illustrates a possible philosophical justification for considering the set of all fixed points of a mapping consisting of a convex Bregman projection and a pooling operator.

**Example 5.** Assume that there are n experts, each with his own knowledge represented by closed convex nonempty sets W_{1},…, W_{n} ⊆ ⅅ^{J}, respectively. Say that an independent chairman of the college has announced a probability function **v** to represent the agreement of the college of experts. Each expert then naturally updates his own knowledge by what seems to be the right probability function. In other words, the expert “i” projects **v** to W_{i}, obtaining the probability function **w**^{(i)}. Each expert subsequently accepts **w**^{(i)} as his working hypothesis, but he does not discard his knowledge base W_{i}; he only takes into account other people’s opinions. Then, it is easy for the chairman to identify the average of the actual beliefs **w**^{(1)},…, **w**^{(n)} of the experts. If he found that this average v′ did not coincide with the originally announced probability function **v**, then he would naturally feel unhappy about such a choice, so he would be tempted to iterate the process in the hope that, eventually, he will find a fixed point.

It seems that, in a broad philosophical setting, such as in the example above, we ought to study any possible combination of Bregman projections with pooling operators. The question as to which other combination produces a well-defined probabilistic merging operator satisfying the consistency principle (CP) is open to investigation.

## 3. Convergence

#### 3.1. Iterative Processes

In this section, we continue the investigation of the averaging projective procedures
${F}^{{D}_{f},\mathcal{A}}$ and
${\widehat{F}}^{{D}_{f},\mathcal{A}}$. Recall that, given a convex Bregman divergence D_{f} and a family of weighting vectors
$\mathcal{A}$,
${F}^{{D}_{f},\mathcal{A}}$ was defined in the previous section for every n ≥ 1 and all closed convex nonempty sets W_{1},…, W_{n} ⊆ ⅅ^{J} by the following.

For an argument **v** ∈ ⅅ^{J}, take **w**^{(i)} as the D_{f}-projection of **v** into W_{i} for all 1 ≤ i ≤ n. Set ${F}_{[{W}_{1},\dots ,{W}_{n}]}^{{D}_{f},\mathcal{A}}(\mathbf{v})={\mathbf{LinOp}}_{\mathbf{a}}({\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)})$, where **a** ∈ $\mathcal{A}$.

For D_{f}, which is moreover differentiable and strictly convex in the second argument,
${\widehat{F}}^{{D}_{f},\mathcal{A}}$ was defined analogously by conjugated projections and the
${\mathbf{Pool}}_{\mathbf{a}}^{{D}_{f}}$-pooling operator.

Our current aim is to find out what will happen if we iterate the application of averaging projective procedures ${F}^{{D}_{f},\mathcal{A}}$ and ${\widehat{F}}^{{D}_{f},\mathcal{A}}$. In particular:

Will the resulting sequences converge?

We shall find the answer in this subsection.

It is intriguing that we can abstractly define a “conjugated projection” with respect to a weighted sum of a convex differentiable Bregman divergence D_{f}. Let **w**^{(1)},…, **w**^{(n)} ∈ ⅅ^{J} and **a** ∈ ⅅ^{n}. Then, the “conjugated projection” of (**w**^{(1)},…, **w**^{(n)}) into ⅅ^{J} is defined by the global minimizer of
${\sum}_{i=1}^{n}{a}_{i}{D}_{f}({\mathbf{w}}^{(i)}\Vert \mathbf{v})$, which, by Equation (1), is **v** = **LinOp**_{a}(**w**^{(1)},…, **w**^{(n)}).

The claim that this behaves as a “conjugated projection” is supported by the following analogue of the four-point property illustrated in Figure 10.

**Theorem 15.** Let D_{f} be a convex differentiable Bregman divergence. Let **a** ∈ ⅅ^{n}, **w**^{(1)},…, **w**^{(n)} ∈ ⅅ^{J} and **v** = **LinOp**_{a}(**w**^{(1)},…, **w**^{(n)}). Let **u**^{(1)},…, **u**^{(n)} ∈ ⅅ^{J} and **u** ∈ ⅅ^{J}. Then:

**Proof.** The proof is similar to the one of the actual four-point property (Theorem 8) only with a slightly different argument at the end: after obtaining:

Similarly, given **w**^{(1)},…, **w**^{(n)} ∈ ⅅ^{J}, **a** ∈ ⅅ^{n} and a convex differentiable Bregman divergence D_{f}, which is strictly convex in its second argument, we can consider
${\mathbf{Pool}}_{\mathbf{a}}^{{D}_{f}}({\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)})$ as the “projection” of (**w**^{(1)},…, **w**^{(n)}) into ⅅ^{J}, since Theorem 6 resembles (a special case of) the extended Pythagorean property: for any **u** ∈ ⅅ^{J}:

The two observations above and the following lemma will be essential to the proofs of the two main theorems of this subsection.

**Lemma 4.** Let D_{f} be a convex Bregman divergence. Assume that we are given a closed convex nonempty set W, **v**^{[i]} ∈ ⅅ^{J}, i = 1, 2,… and **w**^{[i]} ∈ ⅅ^{J}, i = 1, 2,…, such that **w**^{[i]} is the D_{f}-projection of **v**^{[i]} into W for all i = 1, 2,…. Assume that${\left\{{\mathbf{v}}^{[i]}\right\}}_{i=1}^{\infty}$ converges to **v** ∈ ⅅ^{J} and${\left\{{\mathbf{w}}^{[i]}\right\}}_{i=1}^{\infty}$ converges to **w** ∈ ⅅ^{J}. Then, **w** is the D_{f}-projection of **v** into W.

**Proof.** For a contradiction, assume that the D_{f}-projection of **v** into W denoted by
$\overline{\mathbf{w}}$ is distinct from w. Then, by the extended Pythagorean property,
${D}_{f}({\mathbf{w}}^{[i]}\Vert {\mathbf{v}}^{[i]})+{D}_{f}({\overline{\mathbf{w}}||\mathbf{w}}^{[i]})\le {D}_{f}({\overline{\mathbf{w}}||\mathbf{v}}^{[i]})$. Since D_{f}(·‖·) is continuous (see Section 1.1), we have that:

a contradiction with the assumption that $\overline{\mathbf{w}}\ne \mathbf{w}$ is the D_{f}-projection of **v** into W. □

Finally, we are going to answer the question about whether the iteration of the averaging projective procedures
${F}^{{D}_{f},\mathcal{A}}$ and
${\widehat{F}}^{{D}_{f},\mathcal{A}}$ converges; however, the result for
${F}^{{D}_{f},\mathcal{A}}$ will be limited only to the case when D_{f} is differentiable. Both results below should be attributed to a number of people. First, the results are applications of the well-known alternating projections due to Csiszár and Tusnády; see [28], Theorem 3. In the particular case of the Kullback–Leibler divergence, the theorems were observed and proven by Matúš in [21]. Last, but not least, Eggermont and LaRiccia reformulated the original alternating projections in terms of Bregman divergences in [29].

**Theorem 16.** Let D_{f} be a convex differentiable Bregman divergence, $\mathcal{A}$ be a family of weighting vectors and **a** ∈
$\mathcal{A}$ be such that **a** ∈ ⅅ^{n} and W_{1},…, W_{n} ⊆ ⅅ^{J} are closed, convex and nonempty. Then, for any **v** ∈ ⅅ^{J}, the sequence:

${\mathbf{v}}^{[0]}=\mathbf{v}$ and ${\mathbf{v}}^{[i+1]}={F}_{[{W}_{1},\dots ,{W}_{n}]}^{{D}_{f},\mathcal{A}}({\mathbf{v}}^{[i]})$ converges to some probability function in ${\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$. (Recall that ${\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$ is the set of the fixed points of ${F}_{[{W}_{1},\dots ,{W}_{n}]}^{{D}_{f},\mathcal{A}}$, i.e., all points **v**, such that ${F}_{[{W}_{1},\dots ,{W}_{n}]}^{{D}_{f},\mathcal{A}}(\mathbf{v})=\mathbf{v}$.)

**Proof.** This proof is inspired by [21].

Denote the D_{f}-projections of **v**^{[i]} into W_{1},…, W_{n} by π_{1}**v**^{[i]},…, π_{n}**v**^{[i]}, respectively. Then, it is easy to observe that:

Due to the compactness of W_{1},…, W_{n}, the sequence ${\{({\pi}_{1}{\mathbf{v}}^{[i]},\dots ,{\pi}_{n}{\mathbf{v}}^{[i]},{\mathbf{v}}^{[i]})\}}_{i=1}^{\infty}$ has a convergent subsequence. Let us denote the limit of this subsequence by (π_{1}**v**,…, π_{n}**v**, **v**). Due to Lemma 4, π_{k}**v** is really the D_{f}-projection of **v** into W_{k} for all 1 ≤ k ≤ n. Moreover:

By Theorem 15:

Now, since:

However, we already know that a subsequence of
${\{({\pi}_{1}{\mathbf{v}}^{[i]},\dots ,{\pi}_{n}{\mathbf{v}}^{[i]},{\mathbf{v}}^{[i]})\}}_{i=1}^{\infty}$ converges to (π_{1}**v**,…, π_{n}**v**); hence, a subsequence of the sequence
${\left\{{\displaystyle {\sum}_{k=1}^{n}{a}_{k}{D}_{f}({\pi}_{k}\mathbf{v}\Vert {\pi}_{k}{\mathbf{v}}^{[i]})}\right\}}_{i=1}^{\infty}$ decreases to zero, which, by Equation (10), forces the whole sequence to converge to zero. Due to the fact that D_{f}(**x**‖**y**) = 0 only if **x** = **y** and by continuity, we get:

It follows that ${\mathrm{lim}}_{i\to \infty}{\mathbf{v}}^{[i]}$ exists and is equal to **v**. Moreover, $\mathbf{v}={\mathrm{lim}}_{i\to \infty}{\mathbf{v}}^{[i+1]}={\mathrm{lim}}_{i\to \infty}{\mathbf{LinOp}}_{\mathbf{a}}({\pi}_{1}{\mathbf{v}}^{[i]},\dots ,{\pi}_{n}{\mathbf{v}}^{[i]})={\mathbf{LinOp}}_{\mathbf{a}}({\pi}_{1}\mathbf{v},\dots ,{\pi}_{n}\mathbf{v})$, and therefore, **v** is a fixed point of the mapping ${F}_{[{W}_{1},\dots ,{W}_{n}]}^{{D}_{f},\mathcal{A}}$; hence, $\mathbf{v}\in {\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$. □
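Theorem 16's iteration can be simulated for the squared Euclidean distance, for which the D_{f}-projection onto a segment has a closed form. A minimal sketch (the segment-shaped knowledge sets `W1`, `W2` and all helper names are our illustrative choices):

```python
import math

def proj_segment(v, p, q):
    # Euclidean projection of v onto the segment {t*p + (1-t)*q : t in [0, 1]},
    # i.e. the squared-Euclidean Bregman projection into that set
    d = [a - b for a, b in zip(p, q)]
    denom = sum(x * x for x in d)
    t = sum((vj - qj) * dj for vj, qj, dj in zip(v, q, d)) / denom if denom else 0.0
    t = max(0.0, min(1.0, t))
    return [t * a + (1 - t) * b for a, b in zip(p, q)]

# two segment-shaped "expert knowledge" sets inside the probability simplex
W1 = ([0.6, 0.2, 0.2], [0.2, 0.6, 0.2])
W2 = ([0.5, 0.1, 0.4], [0.1, 0.5, 0.4])
a = (0.5, 0.5)

v = [1 / 3, 1 / 3, 1 / 3]
for _ in range(5000):
    w1, w2 = proj_segment(v, *W1), proj_segment(v, *W2)
    v_next = [a[0] * x + a[1] * y for x, y in zip(w1, w2)]  # LinOp step
    if math.dist(v, v_next) < 1e-13:
        break
    v = v_next

# at convergence, v is (numerically) a fixed point of the procedure
w1, w2 = proj_segment(v, *W1), proj_segment(v, *W2)
assert math.dist(v, [a[0] * x + a[1] * y for x, y in zip(w1, w2)]) < 1e-9
```

Here the two sets have empty intersection (their third coordinates are pinned to 0.2 and 0.4), yet the iteration still settles on a fixed point, as the theorem predicts.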

The following analogue of Lemma 4 will be needed in the forthcoming theorem.

**Lemma 5.** Let D_{f} be a convex differentiable Bregman divergence, which is strictly convex in its second argument. Assume that we are given a closed convex nonempty set W, **v**^{[i]} ∈ ⅅ^{J}, i = 1, 2,… and **w**^{[i]} ∈ ⅅ^{J}, i = 1, 2,…, such that **w**^{[i]} is the conjugated D_{f}-projection of **v**^{[i]} into W for all i = 1, 2,…. Assume that${\left\{{\mathbf{v}}^{[i]}\right\}}_{i=1}^{\infty}$ converges to **v** ∈ ⅅ^{J} and${\left\{{\mathbf{w}}^{[i]}\right\}}_{i=1}^{\infty}$ converges to **w** ∈ ⅅ^{J}. Then, **w** is the conjugated D_{f}-projection of **v** into W.

**Proof.** For a contradiction, assume that the conjugated D_{f}-projection of **v** into W denoted by
$\overline{\mathbf{w}}$ is distinct from w. Then, by the four-point property,
${D}_{f}(\mathbf{v}\Vert {\mathbf{w}}^{[i]})\le {D}_{f}(\mathbf{v}\Vert \overline{\mathbf{w}})+{D}_{f}(\mathbf{v}\Vert {\mathbf{v}}^{[i]})$. Since D_{f}(·‖·) is continuous, we have that:

a contradiction with the assumption that $\overline{\mathbf{w}}\ne \mathbf{w}$ is the conjugated D_{f}-projection of **v** into W. □

**Theorem 17.** Let D_{f} be a convex differentiable Bregman divergence, which is strictly convex in its second argument,
$\mathcal{A}$ be a family of weighting vectors and **a** ∈
$\mathcal{A}$ be such that **a** ∈ ⅅ^{n} and W_{1},…, W_{n} ⊆ ⅅ^{J} are closed, convex and nonempty. Then, for any **v** ∈ ⅅ^{J}, the sequence:

${\mathbf{v}}^{[0]}=\mathbf{v}$ and ${\mathbf{v}}^{[i+1]}={\widehat{F}}_{[W{}_{1},\dots ,{W}_{n}]}^{{D}_{f},\mathcal{A}}({\mathbf{v}}^{[i]})$ converges to some probability function in ${\widehat{\mathrm{\Theta}}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$ (the set of the fixed points of ${\widehat{F}}_{[W{}_{1},\dots ,{W}_{n}]}^{{D}_{f},\mathcal{A}}$, i.e., all points **v**, such that ${\widehat{F}}_{[W{}_{1},\dots ,{W}_{n}]}^{{D}_{f},\mathcal{A}}(\mathbf{v})=\mathbf{v}$).

**Proof.** Denote the conjugated D_{f}-projections of **v**^{[i]} into W_{1},…, W_{n} by
${\pi}_{1}{\mathbf{v}}^{[i]},\dots ,{\pi}_{n}{\mathbf{v}}^{[i]}$, respectively. Then, it is easy to observe that:

Due to the compactness of W_{1},…, W_{n}, the sequence ${\{({\pi}_{1}{\mathbf{v}}^{[i]},\dots ,{\pi}_{n}{\mathbf{v}}^{[i]},{\mathbf{v}}^{[i]})\}}_{i=1}^{\infty}$ has a convergent subsequence. Let us denote the limit of this subsequence by (π_{1}**v**,…, π_{n}**v**, **v**). Due to Lemma 5, π_{k}**v** is really the conjugated D_{f}-projection of **v** into W_{k} for all 1 ≤ k ≤ n. Moreover:

By the four-point property:

Now, since:

However, we already know that a subsequence of
${\left\{{\mathbf{v}}^{[i]}\right\}}_{i=1}^{\infty}$ converges to v; hence, a subsequence of the sequence
${\{{D}_{f}(\mathbf{v}\Vert {\mathbf{v}}^{[i]})\}}_{i=1}^{\infty}$ decreases to zero, which, by Equation (13), forces the whole sequence to converge to zero. Due to the fact that D_{f}(**x**‖**y**) = 0 only if **x** = **y** and by continuity, we get:

_{k}v as a limit, and ${\{{D}_{f}({\mathbf{v}}^{[i]}\Vert {\pi}_{k}{\mathbf{v}}^{[i]})\}}_{i=1}^{\infty}$ is monotonic).

Moreover,
$\mathbf{v}={\mathrm{lim}}_{i\to \infty}{\mathbf{v}}^{[i+1]}={\mathrm{lim}}_{i\to \infty}{\mathbf{Pool}}_{a}^{{D}_{f}}({\pi}_{1}{\mathbf{v}}^{[i]},\dots ,{\pi}_{n}{\mathbf{v}}^{[i]})={\mathbf{Pool}}_{a}^{{D}_{f}}({\pi}_{1}\mathbf{v},\dots ,{\pi}_{n}\mathbf{v})$ since
${\mathbf{Pool}}_{a}^{{D}_{f}}$ is continuous
$({\displaystyle {\sum}_{k=1}^{n}{a}_{k}{D}_{f}(\cdot \Vert}\cdot )$ is continuous and strictly convex in the first argument). Therefore, **v** is a fixed point of the mapping
${\widehat{F}}_{[W{}_{1},\dots ,{W}_{n}]}^{{D}_{f},\mathcal{A}}$, and hence,
$\mathbf{v}\in {\widehat{\mathrm{\Theta}}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$. □

The problem of characterizing the limits of Theorems 16 and 17 more precisely remains open. On the other hand, the theorems suggest a way to compute at least some points in
${\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$ and
${\widehat{\mathrm{\Theta}}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$, although we have not investigated how fast the sequences converge. Moreover, the question of how efficiently D_{f}-projections and conjugated D_{f}-projections can be computed was left unanswered. This latter problem was nevertheless addressed in the literature, at least in the case of the KL-divergence and sets W_{1},…, W_{n} generated by finite collections of marginal probability functions. In such a case, the well-known iterative proportional fitting procedure (IPFP) can be effectively employed [16].

#### 3.2. Chairmen Theorems

In this section, for a convex differentiable Bregman divergence D_{f}, which is strictly convex in its second argument, and a family of weighting vectors
$\mathcal{A}$, we investigate the susceptibility of
${\mathrm{\Theta}}_{\mathcal{A}}^{{D}_{f}}$ and
${\widehat{\mathrm{\Theta}}}_{\mathcal{A}}^{{D}_{f}}$-merging operators to a small bias by an arbitrary probability function in ⅅ^{J}. The study of this problem first occurred in [18], where Wilmers argued that an independent adjudicator, whose only knowledge consists of what is related to him by the given college of experts, can rationally bias the agreement procedure by including himself as an additional expert whose personal probability function is the uniform one (not an arbitrary one). The adjudicator calculates a single social probability function and then asks what would happen to it if his contribution were infinitesimally small relative to that of the other experts. He showed that, in the case of the
${\widehat{\mathrm{\Theta}}}_{\mathcal{N}}^{\mathrm{KL}}$ -merging operator, this point of agreement is characterized by the most entropic point in the region defined by
${\widehat{\mathrm{\Theta}}}_{\mathcal{N}}^{\mathrm{KL}}$. A similar theorem for the
${\mathrm{\Theta}}_{\mathcal{N}}^{\mathrm{KL}}$-merging operator was proven in [10]. In what follows, we adapt these results to our general situation.

The following theorem tells us that, in a particular case of W_{1},…, W_{n} ⊆ ⅅ^{J}, the set
${\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$ is always a singleton.

**Theorem 18.** Let W_{1},…, W_{n} ⊆ ⅅ^{J} be closed, convex, nonempty and such that, for at least one i, W_{i} is a singleton. Let D_{f} be a convex Bregman divergence, which is strictly convex in its second argument, and **a** ∈ ⅅ^{n}. Then,
${\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$ is a singleton.

**Proof.** Without loss of generality, assume that W_{1} = {**v**}. For a contradiction, suppose that **w**, **r** ∈
${\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$ and **w** ≠ **r**. Denote by **w**^{(2)},…, **w**^{(n)} the D_{f}-projections of **w** into W_{2},…, W_{n}, respectively, and by **r**^{(2)},…, **r**^{(n)} the D_{f}-projections of **r** into W_{2},…, W_{n}, respectively. By definition, **w** = **LinOp**_{a}(**v**, **w**^{(2)},…, **w**^{(n)}) and **r** = **LinOp**_{a}(**v**, **r**^{(2)},…, **r**^{(n)}).

Now, consider **x** = λ**w** + (1 − λ)**r** for some λ ∈ (0, 1). By Theorems 9 and 12, we have that
$\mathbf{x}\phantom{\rule{0.2em}{0ex}}\in {\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$. Since D_{f}(·‖·) is a convex function, by the Jensen inequality, we have that:

Since λ**w**^{(i)} + (1 − λ)**r**^{(i)} ∈ W_{i}, 1 ≤ i ≤ n, the above is possible only with the equality.

On the other hand, since D_{f} is strictly convex in its second argument and **w** ≠ **r**, the corresponding Jensen inequality is strict, which is a contradiction. Hence, ${\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$ is a singleton. □

**Theorem 19** (Chairman Theorem for
${\mathrm{\Theta}}_{\mathcal{A}}^{{D}_{f}}$). Let I ⊆ ⅅ^{J} be a singleton consisting of an arbitrary probability function **t** ∈ ⅅ^{J}. Let W_{1},…, W_{n} ⊆ ⅅ^{J} be closed, convex and nonempty, **a** ∈
$\mathcal{A}$ be such that **a** ∈ ⅅ^{n} and D_{f} be a convex Bregman divergence, which is strictly convex in its second argument. For 1 > λ > 0, define (by the previous theorem, the following set is a singleton):

$\left\{{\mathbf{v}}^{[\lambda ]}\right\}={\mathrm{\Theta}}_{(\lambda ,{a}_{1}-\lambda {a}_{1},\dots ,{a}_{n}-\lambda {a}_{n})}^{{D}_{f}}(I,{W}_{1},\dots ,{W}_{n}).$

Then, ${\mathrm{lim}}_{\lambda \to 0}{\mathbf{v}}^{[\lambda ]}$ exists and is the conjugated D_{f}-projection of the probability function **t** into${\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$.

**Proof.** This proof is inspired by [30], where a slightly stronger result is proven for the special case of
${\mathrm{\Theta}}_{\mathcal{N}}^{{D}_{f}}$. We note that Theorem 9 from Section 2.3 is implicitly used in what follows.

First, denote ${\mathrm{M}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$ as the minimal value of ${\sum}_{i=1}^{n}{a}_{i}{D}_{f}({\mathbf{w}}^{(i)}\Vert \mathbf{v})$ subject to **w**^{(i)} ∈ W_{i}, 1 ≤ i ≤ n and **v** ∈ ⅅ^{J}. Furthermore, we denote E_{λ} as the minimal value of Equation (15) subject to **w**^{(i)} ∈ W_{i}, 1 ≤ i ≤ n and **v** ∈ ⅅ^{J}. By the definition of ${\mathrm{M}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$, we have that 0 ≤ E_{λ} for all 1 > λ > 0.

Note that for a fixed λ, if **v** ∈ ⅅ^{J} globally minimizes Equation (15) subject to **w**^{(i)} ∈ W_{i}, 1 ≤ i ≤ n, then
$\mathbf{v}\phantom{\rule{0.2em}{0ex}}\in {\mathrm{\Theta}}_{(\lambda ,{a}_{1}-\lambda {a}_{1},\dots ,{a}_{n}-\lambda {a}_{n})}^{{D}_{f}}(I,{W}_{1},\dots ,{W}_{n})$ (by Theorem 18, such a **v** is unique), and, conversely, if
$\mathbf{v}\phantom{\rule{0.2em}{0ex}}\in {\mathrm{\Theta}}_{(\lambda ,{a}_{1}-\lambda {a}_{1},\dots ,{a}_{n}-\lambda {a}_{n})}^{{D}_{f}}(I,{W}_{1},\dots ,{W}_{n})$, then **v** minimizes Equation (15), subject to the above constraints.

Now, let $\mathbf{r}=\mathrm{arg}\phantom{\rule{0.2em}{0ex}}{\mathrm{min}}_{\mathbf{v}\in {\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})}{D}_{f}(\mathbf{t}\Vert \mathbf{v})$. Since $\mathbf{r}\phantom{\rule{0.2em}{0ex}}\in {\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$, it follows that for all 1 > λ > 0, we have that:

Since ⅅ^{J} ⊆ ℝ^{J} is a compact space, there exists a sequence
${\left\{{\lambda}_{m}\right\}}_{m=1}^{\infty},0<{\lambda}_{m}<1,{\mathrm{lim}}_{m\to \infty}{\lambda}_{m}=0$, such that
${\left\{{\mathbf{v}}^{[{\lambda}_{m}]}\right\}}_{m=1}^{\infty}$ converges. Let ${\mathbf{w}}^{(i)[{\lambda}_{m}]}$ be the D_{f}-projection of ${\mathbf{v}}^{[{\lambda}_{m}]}$ into W_{i} for all 1 ≤ i ≤ n and m = 1, 2,…. By Equation (16), the sequence:

Note that we already know that
${\mathrm{lim}}_{m\to \infty}{\mathbf{v}}^{[{\lambda}_{m}]}$ exists, and we denote it by **v**. However, we do not know whether the same is true for
${\mathrm{lim}}_{m\to \infty}{\mathbf{w}}^{(i)[{\lambda}_{m}]}$, 1 ≤ i ≤ n. On the other hand, since W_{1},…, W_{n} are compact, the considered sequences have convergent subsequences. Let us denote the corresponding limits **w**^{(1)},…, **w**^{(n)}. Since D_{f}(·‖·) is a continuous function in both variables, the value of
${\sum}_{i=1}^{n}{a}_{i}{D}_{f}({\mathbf{w}}^{(i)}\Vert \mathbf{v})$ must be equal to
${\mathrm{M}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$. However, this means that we have found a global minimizer (**w**^{(1)},…, **w**^{(n)}, v) of
${\sum}_{i=1}^{n}{a}_{i}{D}_{f}({\mathbf{w}}^{(i)}\Vert \mathbf{v})$ subject to **w**^{(i)} ∈ W_{i}, 1 ≤ i ≤ n, and **v** ∈ ⅅ^{J}.

It follows that $\mathbf{v}={\mathrm{lim}}_{m\to \infty}{\mathbf{v}}^{[{\lambda}_{m}]}\in {\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$. By Equation (16):

In fact, we have proven that for every sequence
${\left\{{\lambda}_{m}\right\}}_{m=1}^{\infty}$, such that lim_{m→∞} λ_{m} = 0 and
${\left\{{\mathbf{v}}^{[{\lambda}_{m}]}\right\}}_{m=1}^{\infty}$ is convergent,
${\left\{{\mathbf{v}}^{[{\lambda}_{m}]}\right\}}_{m=1}^{\infty}$ must converge to **r**. Therefore, assume that there is a sequence
${\left\{{\lambda}_{m}\right\}}_{m=1}^{\infty}$, such that lim_{m→∞} λ_{m} = 0, but ${\left\{{\mathbf{v}}^{[{\lambda}_{m}]}\right\}}_{m=1}^{\infty}$ is not convergent. Then, there is an open neighborhood of the point **r** outside of which there are infinitely many members of the sequence
${\left\{{\mathbf{v}}^{[{\lambda}_{m}]}\right\}}_{m=1}^{\infty}$. Since ⅅ^{J} is compact, this sequence must have a convergent subsequence with a limit distinct from **r**. That, however, contradicts our previous claim. □

The theorem above is illustrated in Figure 13. Indeed, if
${\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$ is a singleton, then the limit in the theorem above is obvious. By Theorem 18, this happens in particular when at least one of W_{1},…, W_{n} is a singleton. There is, however, an interesting case to observe: consider W_{1},…, W_{n} with a nonempty intersection that is not a singleton. In this case, the limit above is, in fact, the conjugated D_{f}-projection of the probability function **t** into that intersection. Such a conjugated projection depends on **t**; in particular, we can recover any point of the intersection as the limit by choosing **t** to be that point.

The following analogue of Theorem 18 has a fairly similar proof.

**Theorem 20.** Let W_{1},…, W_{n} ⊆ ⅅ^{J} be closed, convex, nonempty and such that, for at least one i, W_{i} is a singleton. Let D_{f} be a convex Bregman divergence, which is strictly convex in its second argument, and **a** ∈ ⅅ^{n}. Then,
${\widehat{\mathrm{\Theta}}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$ is a singleton.

**Theorem 21** (Chairman Theorem for
${\widehat{\mathrm{\Theta}}}_{\mathcal{A}}^{{D}_{f}}$). Let I ⊆ ⅅ^{J} be a singleton consisting of an arbitrary probability function **t** ∈ ⅅ^{J}. Let W_{1},…, W_{n} ⊆ ⅅ^{J} be closed, convex and nonempty, **a** ∈
$\mathcal{A}$ be such that **a** ∈ ⅅ^{n} and D_{f} be a convex differentiable Bregman divergence, which is strictly convex in its second argument. For 1 > λ > 0, define (by the previous theorem, the following set is a singleton):

$\left\{{\mathbf{v}}^{[\lambda ]}\right\}={\widehat{\mathrm{\Theta}}}_{(\lambda ,{a}_{1}-\lambda {a}_{1},\dots ,{a}_{n}-\lambda {a}_{n})}^{{D}_{f}}(I,{W}_{1},\dots ,{W}_{n}).$

Then, ${\mathrm{lim}}_{\lambda \to 0}{\mathbf{v}}^{[\lambda ]}$ exists and is the D_{f}-projection of the probability function **t** into${\widehat{\mathrm{\Theta}}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$.

The proof is analogous to the one of Theorem 19, so we omit it.

## 4. Applications

#### 4.1. Relationship to Inference Processes

In this subsection, we will discuss some striking relationships between the chairmen theorems and the framework of inference processes [26]. Inference processes are methods of reasoning by which an expert may select a single probability function from a nonempty closed convex set of possible options. In our framework, it is simply a problem of choosing a single probability function in a closed convex nonempty set W ⊆ ⅅ^{J}. This selection is, however, not arbitrary, and it is expected to satisfy some rational principles based on symmetry and consistency, as discussed in [15]. The maximum entropy (ME) inference process, which chooses the most entropic point in a given closed convex nonempty set, is uniquely justified by a list of such principles, as Paris and Vencovská showed [15].
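For sets given by linear constraints, the ME inference process is often directly computable. A small sketch for a single block-sum constraint (the set W = {**p** ∈ ⅅ^{4} : p_{1} + p_{2} = 0.3} is an illustrative assumption), with a brute-force grid check of maximality:

```python
import math

# Most entropic point of W = {p in D^4 : p1 + p2 = 0.3}: within each block,
# the mass spreads uniformly, which amounts to rescaling the uniform function.
me_point = [0.3 / 2, 0.3 / 2, 0.7 / 2, 0.7 / 2]

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

# Brute-force check: no feasible grid point has higher entropy.
best = max(
    entropy([a, 0.3 - a, b, 0.7 - b])
    for a in (i / 1000 for i in range(1, 300))
    for b in (i / 1000 for i in range(1, 700))
)
```

The grid search confirms that spreading the mass uniformly within each block maximizes the entropy over W.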

As discussed in Section 1.2, the most entropic point in a closed convex nonempty set W ⊆ ⅅ^{J} coincides with the KL-projection of the uniform probability function into W. This can be immediately applied to the chairman theorem for
${\widehat{\mathrm{\Theta}}}_{\mathcal{A}}^{\mathrm{KL}}$, where
$\mathcal{A}$ is a family of weighting vectors:

Let I ⊆ ⅅ^{J} be a singleton consisting of the uniform probability function **t** ∈ ⅅ^{J}. Let W_{1},…, W_{n} ⊆ ⅅ^{J} be closed, convex and nonempty and **a** ∈
$\mathcal{A}$ be such that **a** ∈ ⅅ^{n}. For 1 > λ > 0, define:

For the family of weighting vectors $\mathcal{N}$, we call the resulting probabilistic merging operator the social entropy process, **SEP**.

Whether or not SEP turns out to be the most appealing probabilistic merging operator, we can, in the same manner as above, define several probabilistic merging operators related to other classical inference processes.

For example, the conjugated KL-projection of the uniform probability function into a closed convex nonempty set W ⊆ ⅅ^{J} in fact generates the so-called CM^{∞}-inference process (a limit version of the central mass process [26]). We write simply CM^{∞}(W) to denote the point of the projection, which is explicitly given by:

$\mathrm{C}{\mathrm{M}}^{\infty}(W)=\mathrm{arg}\phantom{\rule{0.2em}{0ex}}\underset{\mathbf{w}\in W}{\mathrm{max}}{\sum}_{j=1}^{J}\mathrm{log}\phantom{\rule{0.2em}{0ex}}{w}_{j}.$

Analogously to SEP, we define a probabilistic merging operator for closed convex nonempty sets W_{1},…, W_{n} ⊆ ⅅ^{J} by taking the conjugated KL-projection of the uniform probability function into ${\mathrm{\Theta}}_{\mathbf{a}}^{\mathrm{KL}}({W}_{1},\dots ,{W}_{n})$, where **a** ∈ ⅅ^{n} and **a** ∈ $\mathcal{N}$. We will call this operator the conjugated social entropy process, **coSEP**.

What is really appealing about the operators **SEP** and **coSEP** is that their results are singletons; we simply say that they satisfy the singleton principle (SP). Furthermore, the consistency principle (CP) is obviously satisfied by both of them. However, there is an interesting principle that can never be satisfied by a probabilistic merging operator that satisfies (CP) and always produces a singleton: the disagreement principle introduced in [5].

**(DP) Disagreement Principle.** Let Δ be a probabilistic merging operator. Then, we say that Δ satisfies the disagreement principle if, for every n, m ≥ 1 and all W_{1},…, W_{n} ⊆ ⅅ^{J} and V_{1},…, V_{m} ⊆ ⅅ^{J}:

We cite [5] on the desirability of this principle: the principle (informally) says “… that a consistent group who disagrees with another group and then merges with them can be sure that they have influenced the opinions of the combined group.”

**Theorem 22.** There is no probabilistic merging operator that satisfies all (SP), (CP) and (DP).

**Proof.** Let Δ be a probabilistic merging operator. Assume that V ⊊ W ⊆ ⅅ^{J}, where V is a singleton chosen so that Δ(W) ≠ V = Δ(V). Then, by (CP), Δ(W, V) = V, which contradicts (DP). □

**Theorem 23.** The probabilistic merging operators${\mathrm{\Theta}}_{\mathcal{N}}^{{D}_{f}}$ and${\widehat{\mathrm{\Theta}}}_{\mathcal{N}}^{{D}_{f}}$, where D_{f} is a convex Bregman divergence for the former and is additionally differentiable and strictly convex in its second argument for the latter, satisfy (DP).

**Proof.** We prove the theorem only for
${\widehat{\mathrm{\Theta}}}_{\mathcal{N}}^{{D}_{f}}$. The proof for
${\mathrm{\Theta}}_{\mathcal{N}}^{{D}_{f}}$ is similar.

Let W_{1},…, W_{n}, V_{1},…, V_{m} ⊆ ⅅ^{J} be closed, convex and nonempty. For a contradiction, assume that
$\mathbf{v}\in {\widehat{\mathrm{\Theta}}}_{(\frac{1}{n},\dots ,\frac{1}{n})}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$,
$\mathbf{v}\in {\widehat{\mathrm{\Theta}}}_{(\frac{1}{n+m},\dots ,\frac{1}{n+m})}^{{D}_{f}}({W}_{1},\dots ,{W}_{n},{V}_{1},\dots ,{V}_{m})$ and, at the same time,
$\mathbf{v}\notin {\widehat{\mathrm{\Theta}}}_{(\frac{1}{m},\dots ,\frac{1}{m})}^{{D}_{f}}({V}_{1},\dots ,{V}_{m})$.

Denote by **v**^{(i)} the conjugated D_{f}-projection of **v** into V_{i}, 1 ≤ i ≤ m. Then, there is **u** ∈ ⅅ^{J}, such that
$\mathbf{u}={\mathbf{Pool}}_{(\frac{1}{m},\dots ,\frac{1}{m})}^{{D}_{f}}({\mathbf{v}}^{(1)},\dots ,{\mathbf{v}}^{(m)})$ and **u** ≠ **v**, i.e.,
${\sum}_{i=1}^{m}\frac{1}{m}{D}_{f}(\mathbf{v}\Vert {\mathbf{v}}^{(i)})>{\sum}_{i=1}^{m}\frac{1}{m}{D}_{f}(\mathbf{u}\Vert {\mathbf{v}}^{(i)})$. Since every Bregman divergence is strictly convex in its first argument, we have that:

Now, denote by **w**^{(i)} the conjugated D_{f}-projection of **v** into W_{i}, 1 ≤ i ≤ n. Since
$\mathbf{v}={\mathbf{Pool}}_{(\frac{1}{n+m},\dots ,\frac{1}{n+m})}^{{D}_{f}}({\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)},{\mathbf{v}}^{(1)},\dots ,{\mathbf{v}}^{(m)})$ and
$\mathbf{v}={\mathbf{Pool}}_{(\frac{1}{n},\dots ,\frac{1}{n})}^{{D}_{f}}({\mathbf{w}}^{(1)},\dots ,{\mathbf{w}}^{(n)})$, the strict convexity of Bregman divergences in their first argument also gives:

We can conclude that, before deciding which probabilistic merging operator to use, we need to establish which two of the three properties we want the operator to satisfy. In this paper, we have seen instances of all three options, as listed in Table 1.

Recall that **KIRP** is the operator due to Kern-Isberner and Röder and **OSEP** is the obdurate social entropy process; see Section 2.2 for more details. A proof that **KIRP** and **OSEP** satisfy (DP) can be easily obtained as a modification of the proof of Theorem 23, so we omit it.

#### 4.2. Computability

In this subsection, we would like to propose a method corresponding to the classical method of projection, but in the multi-expert context. The possible use could be similar; if the knowledge of a college of experts could be characterized by a closed convex nonempty set of probability functions, then we would like to find such a probability function in that set that is “closest” to a given piece of information represented by another probability function. We only need to specify a way to represent the knowledge of the college by such a single set and pair it with an appropriate method of projection.

Throughout this subsection, assume that we are given closed convex nonempty sets of probability functions W_{1},…, W_{n} ⊆ ⅅ^{J} with weighting **a** ∈
$\mathcal{A}$, where a_{i} is the weight of W_{i} and a probability function **v** ∈ ⅅ^{J} to represent.

If the measure of “being close” is quantified by a projection by means of a convex differentiable Bregman divergence D_{f}, which is strictly convex in its second argument, our proposed method consists of the following. First, represent W_{1},…, W_{n} by a single, closed and convex set
${\widehat{\mathrm{\Theta}}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$, and then, take the D_{f}-projection of **v** into
${\widehat{\mathrm{\Theta}}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$.

On the other hand, if the measure of “being close” is quantified by a conjugated projection by means of a convex differentiable Bregman divergence D_{f}, which is strictly convex in its second argument, we first represent W_{1},…, W_{n} by a single, closed convex set
${\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$ and then take the conjugated D_{f}-projection of **v** into
${\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$.

The methods have two distinguishing features:

- If all of the sets W_{1},…, W_{n} are singletons, the methods reduce to the ${\mathbf{Pool}}_{\mathcal{A}}^{{D}_{f}}$ and ${\mathbf{LinOp}}_{\mathcal{A}}$-pooling operators, respectively.
- If W_{1},…, W_{n} have a nonempty intersection V, they reduce to D_{f}-projections and conjugated D_{f}-projections into V, respectively.
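The first reduction can be made concrete for the KL-divergence, where **LinOp**_{a} is the weighted arithmetic mean and, on our reading of the definitions, ${\mathbf{Pool}}_{\mathbf{a}}^{\mathrm{KL}}$ (the minimizer of ${\sum}_{i}{a}_{i}\mathrm{KL}(\cdot \Vert {\mathbf{w}}^{(i)})$ in its first argument) is the normalized weighted geometric mean. The two experts below are illustrative:

```python
import math

def linop(weights, experts):
    # Weighted arithmetic mean of the experts' probability functions.
    return [sum(a * w[j] for a, w in zip(weights, experts))
            for j in range(len(experts[0]))]

def kl_pool(weights, experts):
    # argmin_u sum_i a_i KL(u || w_i): the normalized weighted geometric mean.
    raw = [math.exp(sum(a * math.log(w[j]) for a, w in zip(weights, experts)))
           for j in range(len(experts[0]))]
    s = sum(raw)
    return [x / s for x in raw]

def pool_objective(weights, experts, u):
    # sum_i a_i KL(u || w_i), the quantity kl_pool minimizes.
    return sum(a * sum(u[j] * math.log(u[j] / w[j]) for j in range(len(u)))
               for a, w in zip(weights, experts))

experts = [[0.2, 0.8], [0.6, 0.4]]   # two illustrative experts, J = 2
weights = [0.5, 0.5]
lin = linop(weights, experts)
geo = kl_pool(weights, experts)
# Grid check that geo minimizes the pooling objective.
grid_best = min(pool_objective(weights, experts, [p, 1 - p])
                for p in (i / 1000 for i in range(1, 1000)))
```

The grid check confirms that the normalized weighted geometric mean attains the minimum of the pooling objective on this example.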

In this subsection, we shall investigate how effective it is to compute the results of those two methods. Notice that SEP and coSEP, defined in Section 4.1, are specific instances of those procedures, respectively, in which case, we are interested in KL-projections and conjugated KL-projections of the uniform probability function.

There are indeed some serious computational issues. The most essential is the following. A closed convex nonempty set W ⊆ ⅅ^{J} is often given by a set of constraints on ⅅ^{J}. How can we effectively verify that the resulting set W is nonempty? Unfortunately, there is not even a randomized Turing machine running in polynomial time that, upon an input given by a set of constraints on probability functions, verifies the consistency of this set of constraints (provided that the problems solvable in randomized polynomial time cannot all be solved in polynomial time); see Theorem 10.7 of [26].

However, some computational problems closely related to projections have been extensively studied in the literature. As we have noted in Section 3.1, this includes procedures for finding a KL-projection to a closed convex set of probability functions. These show that in many particular practical implementations, the problem of intractability does not arise, e.g., as in the case when given closed convex nonempty sets are generated by marginal probability functions and where the IPFP-procedure can be applied to effectively find a KL-projection; see [16]. Therefore, we will assume that some effective procedures for D_{f}-projections and conjugated D_{f}-projections are given.

Under such an assumption, the iterative processes from Section 3.1 and the chairmen theorems offer a way to compute (although possibly inefficiently) the results of the two methods above. We shall start with the latter.

By Theorem 16, we know that the sequence ${\left\{{\mathbf{v}}^{[i]}\right\}}_{i=0}^{\infty}$, where **v**^{[0]} = **t** is arbitrary in ⅅ^{J} and ${\mathbf{v}}^{[i+1]}={F}_{[{W}_{1},\dots ,{W}_{n}]}^{{D}_{f},\mathcal{A}}({\mathbf{v}}^{[i]})$, converges to some probability function in ${\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$. Notice that D_{f} is required to be differentiable in order to establish this conclusion.

Recall that by Theorem 18,
${\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$ is a singleton when at least one of W_{1},…, W_{n} is a singleton. Let I ⊆ ⅅ^{J} be such that I = {**v**}. For every 1 > λ > 0, we define the sequence
${\left\{{\mathbf{v}}_{[\lambda ]}^{[i]}\right\}}_{i=0}^{\infty}$ by
${\mathbf{v}}_{[\lambda ]}^{[0]}=\mathbf{t}$ (**t** can be arbitrary) and:

By Theorem 16 and the chairman theorem (Theorem 19), the iterated limit in Equation (18), ${\mathrm{lim}}_{\lambda \to 0}{\mathrm{lim}}_{i\to \infty}{\mathbf{v}}_{[\lambda ]}^{[i]}$, is the conjugated D_{f}-projection of the probability function **v** into ${\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$.

Now, notice that if the limits in Equation (18) were interchangeable, then this would offer an answer to the question from Section 3.1 of closely characterizing the limit lim_{i→∞}**v**^{[i]} (but with no claims to any theoretical results on the complexity of the computation). Unfortunately, the following simple example introduced in [10] shows that these limits are not interchangeable.

**Example 6.** Let$J=4,{W}_{1}=\{(x,\frac{1}{4}-x,y,\frac{3}{4}-y):x\in [0.01,\frac{1}{4}-0.01],y\in [0.01,\frac{3}{4}-0.01]\}$ and${W}_{2}=\{(x,y,\frac{1}{4}-x,\frac{3}{4}-y):x\in [0.01,\frac{1}{4}-0.01],y\in [0.01,\frac{3}{4}-0.01]\}$. Assume that the weighting is$\mathcal{N}$, D_{f} = KL and the probability function **v** ∈ ⅅ^{4} to represent is the uniform probability function. In other words, we are looking for coSEP(W_{1}, W_{2}).

Then, the members of the sequence${\left\{{\mathbf{v}}^{[i]}\right\}}_{i=0}^{\infty}$ can be computed by two minimization problems: find$x\in [0.01,\frac{1}{4}-0.01]$ and$y\in [0.01,\frac{3}{4}-0.01]$ that minimize:

After performing the numerical computation for the first one hundred iterations, we obtain:

However, since W_{1} and W_{2} are jointly consistent, we have that coSEP(W_{1}, W_{2}) = CM^{∞}(W_{1} ∩ W_{2}) (the conjugated KL-projection of the uniform probability function into the intersection), which is approximately:

It seems that the only viable way to use Equation (18) to estimate a result of the conjugated D_{f}-projection into
${\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$ is to choose a sufficiently small λ, and for this λ, iterate the sequence
${\left\{{\mathbf{v}}_{{}^{[\lambda ]}}^{{}^{[i]}}\right\}}_{i=0}^{\infty}$. However, the rate of convergence heavily depends on λ, and in fact, this often materializes in a negative way for a practical computation [10]:

**Example 7.** Consider the situation from Example 6. We compute numerically the first coordinate of the initial members of the sequence${\left\{{\mathbf{v}}_{{}^{[\lambda ]}}^{{}^{[i]}}\right\}}_{i=0}^{\infty}$ for several values of λ, and we compare them with the first coordinate of the sequence${\left\{{\mathbf{v}}^{[i]}\right\}}_{i=0}^{\infty}$. The algorithm we use is as follows. Note that, due to the design of the sets, it suffices to solve only one minimization problem in each iteration, as we have pointed out in the previous example.

**for** i **from** 1 **by** 1 **to** 200 **do**

Minimize $\left(x\phantom{\rule{0.2em}{0ex}}\mathrm{log}\frac{x}{{v}_{1}}+\left(\frac{1}{4}-x\right)\mathrm{log}\frac{\frac{1}{4}-x}{{v}_{2}}+y\phantom{\rule{0.2em}{0ex}}\mathrm{log}\frac{y}{{v}_{3}}+\left(\frac{3}{4}-y\right)\mathrm{log}\frac{\frac{3}{4}-y}{{v}_{4}},\phantom{\rule{0.2em}{0ex}}x=\mathrm{0.01..}\frac{0.96}{4},\phantom{\rule{0.2em}{0ex}}y=\mathrm{0.01..}\frac{2.96}{4}\right)$;

${v}_{1}:=\frac{1}{4}\cdot \lambda +x\cdot \left(\frac{1}{2}-\frac{1}{2}\lambda \right)+x\cdot \left(\frac{1}{2}-\frac{1}{2}\lambda \right)$; ${v}_{2}:=\frac{1}{4}\cdot \lambda +\left(\frac{1}{4}-x\right)\cdot \left(\frac{1}{2}-\frac{1}{2}\lambda \right)+y\cdot \left(\frac{1}{2}-\frac{1}{2}\lambda \right)$; ${v}_{3}:=\frac{1}{4}\cdot \lambda +\left(\frac{1}{4}-x\right)\cdot \left(\frac{1}{2}-\frac{1}{2}\lambda \right)+y\cdot \left(\frac{1}{2}-\frac{1}{2}\lambda \right)$; ${v}_{4}:=\frac{1}{4}\cdot \lambda +\left(\frac{3}{4}-y\right)\cdot \left(\frac{1}{2}-\frac{1}{2}\lambda \right)+\left(\frac{3}{4}-y\right)\cdot \left(\frac{1}{2}-\frac{1}{2}\lambda \right)$;

**end do**;

The numerical result for$\lambda =\frac{1}{21},\frac{1}{41},\frac{1}{61}$ is plotted in Figure 14. We can see that as λ decreases, the limit points of sequences are converging to the first coordinate of CM^{∞}(W_{1} ∩ W_{2}), which is denoted by the black dotted line. The red line denotes the first coordinate of the sequence${\left\{{\mathbf{v}}^{[i]}\right\}}_{i=0}^{\infty}$.

The numerical result for$\lambda =\frac{1}{61},\frac{1}{121},\frac{1}{181}$ is plotted in Figure 15. We can conclude that, although the eventual precision rises as λ decreases, the rate of convergence is affected severely. Therefore, there is a significant trade-off between the precision and the number of iterations.

Notice that, as λ decreases, the blue lines point-wise converge to the red line. This convergence is, however, obviously not uniform.

Now, consider the former method, which follows a fairly similar computational idea. By Theorem 17, we know that the sequence ${\left\{{\mathbf{u}}^{[i]}\right\}}_{i=0}^{\infty}$, where **u**^{[0]} = **t** is arbitrary in ⅅ^{J} and ${\mathbf{u}}^{[i+1]}={\widehat{F}}_{[{W}_{1},\dots ,{W}_{n}]}^{{}^{{D}_{f},\mathcal{A}}}({\mathbf{u}}^{[i]})$, converges to some probability function in ${\widehat{\mathrm{\Theta}}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$. This procedure can, for instance, be immediately used to compute SEP(W_{1},…, W_{n}) in the case when ${\widehat{\mathrm{\Theta}}}_{(\frac{1}{n},\dots ,\frac{1}{n})}^{\mathrm{KL}}({W}_{1},\dots ,{W}_{n})$ is a singleton. By Theorem 20, this happens when at least one of W_{1},…, W_{n} is a singleton.
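Assuming, as above, that conjugated D_{f}-projections are effectively computable, the iteration ${\mathbf{u}}^{[i+1]}={\widehat{F}}_{[{W}_{1},\dots ,{W}_{n}]}^{{D}_{f},\mathcal{A}}({\mathbf{u}}^{[i]})$ can be sketched directly for the KL-divergence. For sets fixing block sums of coordinates, the conjugated KL-projection is proportional rescaling, and pooling is the normalized weighted geometric mean (our reading, as before); the two constraint sets below are illustrative and intersect:

```python
import math

def scale_blocks(u, blocks, targets):
    # Conjugated KL-projection onto {w : the w-mass of each block is fixed}:
    # rescale every block of coordinates to its target mass.
    w = list(u)
    for block, target in zip(blocks, targets):
        s = sum(u[j] for j in block)
        for j in block:
            w[j] = u[j] * target / s
    return w

def kl_pool(weights, points):
    # Normalized weighted geometric mean (the KL pooling operator).
    raw = [math.exp(sum(a * math.log(p[j]) for a, p in zip(weights, points)))
           for j in range(len(points[0]))]
    s = sum(raw)
    return [x / s for x in raw]

# W1 fixes the sums of coordinates {0, 1} to 1/4 and {2, 3} to 3/4;
# W2 fixes {0, 2} to 1/4 and {1, 3} to 3/4 (illustrative, intersecting sets).
u = [0.25, 0.25, 0.25, 0.25]
for _ in range(500):
    p1 = scale_blocks(u, [(0, 1), (2, 3)], [0.25, 0.75])
    p2 = scale_blocks(u, [(0, 2), (1, 3)], [0.25, 0.75])
    u = kl_pool([0.5, 0.5], [p1, p2])
```

In this run, both block-sum constraints are satisfied in the limit, so the sequence settles inside W_{1} ∩ W_{2}; with an empty intersection, it would instead settle at a point of ${\widehat{\mathrm{\Theta}}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},{W}_{2})$.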

One may perhaps expect that if **u**^{[0]} is the uniform probability function, then {lim_{i→∞}u^{[}^{i}^{]}} = SEP(W_{1},…, W_{n}). In the following example from [10], we will, however, see that this is not true in general. Note that we cannot use Example 6, since in that case, actually, {lim_{i→∞}u^{[}^{i}^{]}} = SEP(W_{1},…, W_{n}).

**Example 8.** Let J = 8 and let W_{1}, W_{2} ⊆ ⅅ^{8} be as in [10]. The sets W_{1} and W_{2} have a nonempty intersection;${W}_{1}\cap {W}_{2}=\{(x,\frac{1}{12}-x,\frac{1}{12}-x,\frac{2}{6}+x,\frac{1}{12},\frac{1}{12},\frac{1}{6},\frac{1}{6}),x\in [0.01,\frac{0.88}{12}]\}$, and we can compute that SEP(W_{1}, W_{2}) is the most entropic probability function from the set above, with x equal to approximately 0.013888.

However, the sequence${\left\{{\mathrm{u}}^{[i]}\right\}}_{i=0}^{\infty}$ is already constant after one iteration and equals CM^{∞}(W_{1}) = CM^{∞}(W_{2}) = CM^{∞}(W_{1} ∩ W_{2}), in which case, x ≈ 0.029231.

With the aid of the chairman theorem for
${\widehat{\mathrm{\Theta}}}_{\mathbf{a}}^{{D}_{f}}$, we also suggest a way to approximate the D_{f}-projection of **v** into
${\widehat{\mathrm{\Theta}}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$, although we make no claims to any theoretical results on the complexity of the computation. Let I = {**v**}. For every 1 > λ > 0, we define the sequence
${\left\{{\mathbf{u}}_{[\lambda ]}^{[i]}\right\}}_{i=0}^{\infty}$ by
${\mathbf{u}}_{[\lambda ]}^{[0]}=\mathbf{t}$, which is arbitrary, and:

By Theorem 17 and the chairman theorem (Theorem 21), the iterated limit in Equation (19), ${\mathrm{lim}}_{\lambda \to 0}{\mathrm{lim}}_{i\to \infty}{\mathbf{u}}_{[\lambda ]}^{[i]}$, is the D_{f}-projection of the probability function **v** into ${\widehat{\mathrm{\Theta}}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$.

In particular, to approximate SEP(W_{1},…, W_{n}) using Equation (19), one needs to choose a sufficiently small λ and then iterate the sequence
${\left\{{\mathrm{u}}_{[\lambda ]}^{[i]}\right\}}_{i=0}^{\infty}$, where
${\mathrm{u}}_{[\lambda ]}^{[0]}=\mathbf{v}$ is the uniform probability function,
$\mathcal{A}=\mathcal{N}$ and D_{f} = KL. However, the question of how to determine such a λ and i in order to achieve a specific level of accuracy merits further investigation.

The special case of the problem above, when W_{1},…, W_{n} have a nonempty intersection, was extensively studied in the literature, and many scientific and engineering problems can be expressed as the problem of finding a point in such an intersection. Bregman showed in [7] the convergence of (what is now called) cyclic Bregman projections to a point in the intersection (in the present paper, the notion of a Bregman divergence is used only over the Euclidean space, whereas in [7], a more general topological vector space was considered). Many cyclic algorithms with appealing applications have been developed since then; see, e.g., [31,32].

Although the approach we propose allows for an empty intersection, it always leads to a meaningful point; in particular, if the intersection is nonempty, it chooses a point inside the intersection. Nevertheless, our study cannot be considered an extension of the classical method of cyclic projections, which was developed over (possibly infinite-dimensional) Banach spaces [33], in contrast to the limited discrete probabilistic space that we consider here.

It is also worth mentioning that the method of cyclic projections, even in the case of an empty intersection, often provides more useful results than our method. An example is the noise reduction algorithm from [34].

One can perhaps conclude that the approach offered in this paper is, at best, only another contribution to the problem of finding a point in a convex set by means of geometry, which, however, offers some interesting insights into the combination of Bregman projections with pooling operators.

## Acknowledgments

The author is indebted to George Wilmers, whose support and wisdom allowed the creation of this paper. Thanks goes also to Alena Vencovská and František Matúš for sharing their ideas with me and to an anonymous reviewer for pointing out the connections to the dual affine structure in the probabilistic simplex.

The paper is an extension of some results that the author obtained as a Ph.D. student at the University of Manchester while supported by the (European Community’s) Seventh Framework Programme (FP7/2007-2013) under Grant Agreement No. 238381.

Last, but not least, the author is grateful for the support received from the Assumption University in Thailand, without which the paper could not be finished.

## Conflicts of Interest

The author declares no conflict of interest other than disclosed above in acknowledgments.

## References

- Amari, S. Divergence, Optimization and Geometry. In Neural Information Processing: 16th International Conference; Leung, C., Lee, M., Chan, J.H., Eds.; ICONIP: Bangkok, Thailand, 2009; pp. 185–193.
- Hájek, P.; Havránek, T.; Jiroušek, J. Uncertain Information Processing in Expert Systems; CRC Press: Boca Raton, FL, USA, 1992.
- Collins, M.; Schapire, R.E. Logistic Regression, AdaBoost and Bregman Distances. Mach. Learn. **2002**, 48, 253–285.
- Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman Divergences. J. Mach. Learn. Res. **2005**, 6, 1705–1749.
- Adamčík, M.; Wilmers, G.M. **2015**; in press.
- De Finetti, B. Sul Significato Soggettivo della Probabilitá. Fund. Math. **1931**, 17, 298–329.
- Bregman, L.M. The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. **1967**, 7, 200–217.
- Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004; pp. 1–716.
- Rockafellar, R.T. Convex Analysis; Princeton Landmarks in Mathematics; Princeton University Press: Princeton, NJ, USA, 1997; pp. 1–469.
- Adamčík, M. Collective Reasoning under Uncertainty and Inconsistency. Ph.D. Thesis, The University of Manchester, Manchester, UK, 2014; pp. 1–150.
- Csiszár, I. I-Divergence Geometry of Probability Distributions and Minimization Problems. Ann. Probab. **1975**, 3, 146–158.
- Amari, S.; Nagaoka, H. Methods of Information Geometry; AMS and Oxford University Press: New York, NY, USA, 2000; pp. 1–206.
- Jaynes, E.T. Where do we Stand on Maximum Entropy? In The Maximum Entropy Formalism; Levine, R.D., Tribus, M., Eds.; M.I.T. Press: Cambridge, MA, USA, 1979; pp. 15–118.
- Paris, J.B.; Vencovská, A. On the Applicability of Maximum Entropy to Inexact Reasoning. Int. J. Approx. Reason. **1989**, 3, 1–34.
- Paris, J.B.; Vencovská, A. A Note on the Inevitability of Maximum Entropy. Int. J. Approx. Reason. **1990**, 4, 183–224.
- Vomlel, J. Methods of Probabilistic Knowledge Integration. Ph.D. Thesis, Czech Technical University, Prague, Czech Republic, 1999; pp. 1–123.
- Banerjee, A.; Guo, X.; Wang, H. On the optimality of conditional expectation as a Bregman predictor. IEEE Trans. Inf. Theory **2005**, 51, 2664–2669.
- Wilmers, G.M. The Social Entropy Process: Axiomatising the Aggregation of Probabilistic Beliefs. In Probability, Uncertainty and Rationality; Hosni, H., Montagna, F., Eds.; CRM Series; Pisa, Italy, 2010; pp. 87–104.
- Genest, C.; Zidek, J.V. Combining probability distributions: A critique and an annotated bibliography. Stat. Sci. **1986**, 1, 114–135.
- Genest, C.; Wagner, C.G. Further Evidence Against Independence Preservation in Expert Judgement Synthesis. Aequ. Math. **1986**, 32, 74–86.
- Matúš, F. On iterated averages of I-projections. In Statistik und Informatik; Universität Bielefeld: Bielefeld, Germany, 2007; pp. 1–12.
- Predd, J.B.; Osherson, D.N.; Kulkarni, S.R.; Poor, H.V. Aggregating Probabilistic Forecasts from Incoherent and Abstaining Experts. Decis. Anal. **2008**, 5, 177–189.
- Kern-Isberner, G.; Rödder, W. Belief Revision and Information Fusion on Optimum Entropy. Int. J. Intell. Syst. **2004**, 19, 837–857.
- Williamson, J. Deliberation, Judgement and the Nature of Evidence. Econ. Philos. **2014**; in press.
- Carnap, R. On the application of inductive logic. Philos. Phenomenol. Res. **1947**, 8, 133–148.
- Paris, J.B. The Uncertain Reasoner's Companion; Cambridge University Press: Cambridge, UK, 1994; pp. 1–224.
- Amari, S. Integration of stochastic models by minimizing alpha-divergence. Neural Comput. **2007**, 19, 2780–2796.
- Csiszár, I.; Tusnády, G. Informational Geometry and Alternating Minimization Procedures. Stat. Decis. **1984**, 1, 205–237.
- Eggermont, P.P.B.; LaRiccia, V.N. On EM-like algorithms for minimum distance estimation; Preprint, 1998; University of Delaware: Newark, DE, USA; pp. 1–29.
- Wilmers, G.M. Generalising the Maximum Entropy Inference Process to the Aggregation of Probabilistic Beliefs; Preprint, Version 6, 2011; The University of Manchester: Manchester, UK; pp. 1–40.
- Bauschke, H.H. Projection Algorithms and Monotone Operators. Ph.D. Thesis, Simon Fraser University, Burnaby, BC, Canada, 1996; pp. 1–223.
- Censor, Y.; Zenios, S.A. Parallel Optimization: Theory, Algorithms, and Applications; Oxford University Press: New York, NY, USA, 1997; pp. 1–541.
- Bauschke, H.H.; Borwein, J.M.; Combettes, P.L. Bregman monotone optimization algorithms. SIAM J. Control Optim. **2003**, 42, 596–636.
- Tofighi, M.; Kose, K.; Cetin, A.E. Denoising Using Projections Onto Convex Sets (POCS) Based Framework. arXiv **2013**, arXiv:1309.0700.

**Figure 13.** The illustration of the chairman theorem for $\mathbf{v}\in {\mathrm{\Theta}}_{\mathbf{a}}^{{D}_{f}}({W}_{1},\dots ,{W}_{n})$. Note that the fact that $\mathbf{v}_{[\lambda]}$ and $\mathbf{s}$ lie on the arrow does not have any meaning.

**Figure 14.** The numerical computation for Example 7. Blue lines from the top are for ${\left\{{\mathbf{v}}^{[i]}\right\}}_{i=1}^{\infty}$ and ${\left\{{\mathbf{w}}^{[i]}\right\}}_{i=1}^{\infty}$. This graph is taken from [10].

**Figure 15.** The numerical computation for Example 7. Blue lines from the top are for ${D}_{f}({\mathbf{v}}^{[i]}\Vert \overline{\mathbf{w}})$ and ${D}_{f}({\mathbf{v}}^{[i]}\Vert {\mathbf{w}}^{[i]})$. This graph is taken from [10].

**Table 1.** Examples for three saturated possibilities with respect to the consistency principle (CP), disagreement principle (DP) and singleton principle (SP). KIRP, Kern-Isberner and Rödder's process; OSEP, obdurate social entropy process; SEP, social entropy process; coSEP, conjugated social entropy process.

| Principles | Probabilistic Merging Operators |
|---|---|
| (DP), (CP) | ${\mathrm{\Theta}}_{\mathcal{N}}^{{D}_{f}}$, ${\widehat{\mathrm{\Theta}}}_{\mathcal{N}}^{{D}_{f}}$ |
| (DP), (SP) | KIRP, OSEP |
| (CP), (SP) | SEP, coSEP |

© 2014 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).