Abstract
Probabilistic graphical models allow us to encode a large probability distribution as a composition of smaller ones. It is often the case that we are interested in incorporating into the model the idea that some of these smaller distributions are likely to be similar to one another. In this paper we provide an information-geometric approach to incorporating this information, and we see that it allows us to reinterpret some already existing models. Our proposal relies on providing a formal definition of what it means for two distributions to be close. We provide an example of how this definition can be put into action for multinomial distributions. We use the results on multinomial distributions to reinterpret two already existing hierarchical models in terms of closeness distributions.
1. Introduction
Bayesian modeling [1] builds on our ability to describe a given process in probabilistic terms, known as probabilistic modeling. As stated in [2]: “Statistical methods and models commonly involve multiple parameters that can be regarded as related or connected in such a way that the problem implies dependence of the joint probability model for these parameters”. Hierarchical modeling [3] is widely used for that purpose in areas such as epidemiological modeling [4] or the modeling of oil and gas production [5]. The motivation for this paper comes from realizing that many hierarchical models can be understood, from a high level perspective, as defining a distribution over the multiple parameters that establishes that distributions which are closer to each other are more likely. Thus, the main motivation is to start providing the mathematical tools that allow a probabilistic modeler to build hierarchical (and non-hierarchical) models starting from these geometric concepts.
We start by introducing a simple example to illustrate the kind of problems we are interested in solving. Consider the problem of estimating a parameter using data from a small experiment and a prior distribution constructed from similar previous experiments. The specific problem description is borrowed from [2]:
In the evaluation of drugs for possible clinical application, studies are routinely performed on rodents. For a particular study drawn from the statistical literature, suppose the immediate aim is to estimate θ, the probability of a tumor in a population of female laboratory rats of type ‘F344’ that receive a zero dose of the drug (a control group). The data show that 4 out of 14 rats developed endometrial stromal polyps (a kind of tumor). (...) Typically, the mean and standard deviation of underlying tumor risks are not available. Rather, historical data are available on previous experiments on similar groups of rats. In the rat tumor example, the historical data were in fact a set of observations of tumor incidence in 70 groups of rats (Table 1). In the $i$th historical experiment, let the number of rats with tumors be $y_i$ and the total number of rats be $n_i$. We model the $y_i$’s as independent binomial data, given sample sizes $n_i$ and study-specific means $\theta_i$.
Table 1. Tumor incidence in 70 historical groups of rats and in the current group of rats (from [6]). The table displays the values of: (number of rats with tumors)/(total number of rats).
Example. Estimating the risk of tumor in a group of rats.
We can depict our graphical model (for more information on the interpretation of the graphical models in this paper the reader can consult [7,8]) as shown in Figure 1, where current and historical experiments are a random sample from a common population, having $\phi$ as hyperparameters, which follow $f$ as prior distribution. Equationally, our model can be described as:

$$\phi \sim f, \qquad \theta_i \mid \phi \sim g(\cdot \mid \phi), \qquad y_i \mid \theta_i \sim \mathrm{Binomial}(n_i, \theta_i), \qquad i = 1, \ldots, 71.$$
Figure 1.
General probabilistic graphical model for the rodents example.
The model used for this problem in [2] is the Beta-Binomial model, where $g$ is taken to be the Beta distribution, hence $\theta_i \mid \alpha, \beta \sim \mathrm{Beta}(\alpha, \beta)$ (see Figure 2). Furthermore, in [2] the prior $f$ over $(\alpha, \beta)$ is taken to be proportional to $(\alpha + \beta)^{-5/2}$, giving the model

$$f(\alpha, \beta) \propto (\alpha + \beta)^{-5/2}, \qquad \theta_i \mid \alpha, \beta \sim \mathrm{Beta}(\alpha, \beta), \qquad y_i \mid \theta_i \sim \mathrm{Binomial}(n_i, \theta_i).$$
Figure 2.
PGM for the rodents example proposed in [2].
The presentation of the model in [2] simply introduces the assumption that “the Beta prior distribution with parameters $(\alpha, \beta)$ is a good description of the population distribution of the $\theta_i$’s in the historical experiments” without further justification. In this paper we would like to show that a large part of this model can be obtained from the intuitive idea that the probability distributions for rats with tumors in each group are similar. To do that we develop a framework for encoding as a probability distribution the assumption that two probability distributions are close to each other, and rely on information-geometric concepts to model the idea of closeness.
We start by introducing the general concept of a closeness distribution in Section 2. Then, we analyze the particular case in which we choose to measure remoteness between distributions by means of the Kullback–Leibler divergence in the family of multinomial distributions in Section 3. The results from Section 3 are used in Section 4 to reinterpret the Beta-Binomial model proposed in [2] for the rodents example, and in Section 5 to reinterpret the hierarchical Dirichlet-Multinomial model proposed by Azzimonti et al. in [9,10,11]. We are convinced that closeness distributions could play a relevant role in probabilistic modeling, allowing for more explicitly geometrically inspired probabilistic models. This paper is just a first step towards a proper definition and understanding of closeness distributions.
2. Closeness Distributions
We start by introducing the formal framework required to discuss the probability distributions. Then, we formalize what we mean by remoteness through a remoteness function, and we introduce closeness distributions as those that implement a remoteness function.
2.1. Probabilities over Probabilities
Information geometry [12] has shown us that most families of probability distributions can be understood as a Riemannian manifold. Thus, we can work with probabilities over probabilities by defining random variables which take values in a Riemannian manifold. Here, we only introduce some fundamental definitions. For a more detailed overview of measures and probability see [13]; for Riemannian manifolds see [14]. Finally, Pennec provides a good overview of probability on Riemannian manifolds in [15].
We start by noting that each manifold $M$ has an associated $\sigma$-algebra $\mathcal{L}(M)$, the Lebesgue $\sigma$-algebra of $M$ (see Section 1, chapter XII in [16]). Furthermore, the existence of a metric $g$ induces a measure $\mu_g$ (see Section 1.3 in [15]). The volume of $M$ is defined as

$$\mathrm{Vol}(M) = \int_M d\mu_g.$$
Definition 1.
Let $(\Omega, \mathcal{A}, P)$ be a probability space and $M$ be a Riemannian manifold. A random variable $X$ (referred to as a random primitive in [15]) taking values in $M$ is a measurable function from $\Omega$ to $M$. Furthermore, we say that $X$ has a probability density function (p.d.f.) $p_X$ (a real, positive, and integrable function) if:

$$P(X \in A) = \int_A p_X \, d\mu_g \quad \text{for every } A \in \mathcal{L}(M), \qquad \text{and} \qquad \int_M p_X \, d\mu_g = 1.$$
We would like to highlight that the density function $p_X$ is intrinsic to the manifold. If $\phi$ is a chart of the manifold defined almost everywhere, we obtain a random vector $\phi(X)$. The expression of $p_X$ in this parametrization is

$$p^{\phi}_X(x) = p_X(\phi^{-1}(x)).$$

Let $f$ be a real function on $M$. We define the expectation of $f$ under $p_X$ as

$$\mathbb{E}_{p_X}[f] = \int_M f \, p_X \, d\mu_g.$$

We have to be careful when computing $\mathbb{E}_{p_X}[f]$ so that we do it independently of the parametrization. We have to use the fact that

$$d\mu_g(x) = \sqrt{|G(x)|} \, dx,$$

where $G(x)$ is the Fisher matrix at $x$ in the parametrization $\phi$. Hence,

$$\mathbb{E}_{p_X}[f] = \int f(\phi^{-1}(x)) \, \tilde{p}_X(x) \, dx, \qquad \text{with } \tilde{p}_X(x) = p^{\phi}_X(x) \sqrt{|G(x)|},$$

where $\tilde{p}_X$ is the expression of $p_X$ in the parametrization $\phi$ for integration purposes, that is, its expression with respect to the Lebesgue measure $dx$ instead of $\mu_g$. We note that $\tilde{p}_X$ depends on the chart used, whereas $p_X$ is intrinsic to the manifold.
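To make these definitions concrete, here is a minimal numeric sketch (an illustration added here, not part of the original derivation) on the manifold of Bernoulli distributions $\Delta_1$, in the expectation parametrization $\theta \in (0, 1)$, where the Fisher matrix is $G(\theta) = \frac{1}{\theta(1-\theta)}$:

```python
# Minimal numeric sketch on Delta_1 (Bernoulli manifold), assuming the
# expectation parametrization theta in (0, 1), where sqrt(|G(theta)|)
# equals 1 / sqrt(theta * (1 - theta)).
import numpy as np
from scipy.integrate import quad

sqrt_det_G = lambda t: 1.0 / np.sqrt(t * (1.0 - t))

# Volume of the manifold: integral of sqrt(|G|) dtheta over (0, 1) = pi.
vol, _ = quad(sqrt_det_G, 0.0, 1.0)

# Intrinsic uniform density p_X = 1 / Vol; its expression "for integration
# purposes" (w.r.t. the Lebesgue measure) is p_tilde = p_X * sqrt(|G|).
p_tilde = lambda t: (1.0 / vol) * sqrt_det_G(t)

mass, _ = quad(p_tilde, 0.0, 1.0)                     # integrates to 1
expect, _ = quad(lambda t: t * p_tilde(t), 0.0, 1.0)  # E[theta] = 0.5 by symmetry
print(vol, mass, expect)  # ~3.14159, ~1.0, ~0.5
```

Note how the intrinsic density $p_X$ is a constant, while its expression for integration purposes $\tilde{p}_X$ is not: the latter depends on the chart.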
2.2. Formalizing Remoteness and Closeness
Intuitively, the objective of this section is to create a probability distribution over pairs of probability distributions that assigns higher probability to those pairs of probability distributions which are “close”.
We assume that we measure how distant two points are in $M$ by means of a remoteness function $r: M \times M \to \mathbb{R}$, such that $r(x, y) \geq 0$ for each $x, y \in M$. Note that $r$ does not need to be transitive, symmetric, or reflexive.

As can be seen in Appendix A, $r$ induces a total order $\leq_r$ in $M \times M$. We say that two remoteness functions $r, r'$ are order-equivalent if $\leq_r$ and $\leq_{r'}$ coincide, that is, if $r(x, y) \leq r(x', y') \iff r'(x, y) \leq r'(x', y')$ for all pairs.
Proposition 1.
Let $r$ be a remoteness function and $\beta > 0$. Then, $\beta r$ is order-equivalent to $r$.
Proof.
$(x, y) \leq_{\beta r} (x', y')$ iff $\beta r(x, y) \leq \beta r(x', y')$ iff $r(x, y) \leq r(x', y')$ iff $(x, y) \leq_r (x', y')$. □
We say that a probability density function $p$ implements a remoteness function $r$ if the order induced by $-\log p$ coincides with the order induced by $r$. This is equivalent to stating that for each $(x, y), (x', y') \in M \times M$ we have that $p(x, y) \geq p(x', y')$ iff $r(x, y) \leq r(x', y')$. That is, a density function implements a remoteness function $r$ if it assigns higher probability density to those pairs of points which are closer according to $r$.
Once we have clarified what it means for a probability to implement a remoteness function, we introduce a specific way of creating probabilities that do so.
Definition 2.
Let $r$ be a remoteness function on $M$ and let $Z_r = \int_{M \times M} e^{-r(x, y)} \, d\mu_g(x) \, d\mu_g(y)$. If $Z_r$ is finite, we define the density function

$$p_r(x, y) = \frac{1}{Z_r} e^{-r(x, y)}. \tag{7}$$
We refer to the corresponding probability distribution as a closeness distribution.
Note that $p_r$ is defined intrinsically. Following the explanation in the previous section, let $\phi$ be a chart of $M \times M$ defined almost everywhere. The representation of this pdf in the parametrization is simply

$$p^{\phi}_r(u, v) = \frac{1}{Z_r} e^{-r(\phi^{-1}(u, v))},$$

and its representation for integration purposes is

$$\tilde{p}_r(u, v) = p^{\phi}_r(u, v) \sqrt{|G(u, v)|}.$$
Proposition 2.
If it exists, $p_r$ implements $r$.
Proof.
The exponential is a monotonic function, and the minus sign in the exponent reverts the order. □
Proposition 3.
If $r$ is measurable and $M$ has finite volume, then $Z_r$ is finite, and hence $p_r$ exists and implements $r$.
Proof.
Note that since $r(x, y) \geq 0$, we have that $e^{-r(x, y)} \leq 1$, and hence $e^{-r}$ is bounded. Furthermore, $e^{-r}$ is measurable since it is a composition of measurable functions. Now, since any bounded measurable function in a finite volume space is integrable, $Z_r$ is finite. □
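As a numeric sanity check of Proposition 3 (an illustration added here, not part of the original text), we can verify on $\Delta_1$ with $r = D$ (the KL divergence) that the normalization constant is finite. Using the isometry $\theta = \sin^2(u/2)$, the Riemannian measure on $\Delta_1$ becomes simply $du$ with $u \in (0, \pi)$:

```python
# Numeric check that Z = integral of exp(-KL) over Delta_1 x Delta_1 is finite.
# We use the angle coordinate u with theta = sin^2(u/2), under which the
# Riemannian measure is du on (0, pi); the tiny offsets avoid log(0).
import numpy as np
from scipy.integrate import dblquad

def kl(x, y):
    # KL divergence between Bernoulli(x) and Bernoulli(y).
    return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))

theta = lambda u: np.sin(u / 2.0) ** 2
eps = 1e-9

Z, _ = dblquad(lambda v, u: np.exp(-kl(theta(u), theta(v))),
               eps, np.pi - eps, eps, np.pi - eps)
print(Z)  # a finite number, as Proposition 3 guarantees
```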
Obviously, once we have established a closeness distribution we can define its marginal and conditional distributions in the usual way. We write $p_r(x)$ (resp. $p_r(y)$) for the marginal over $x$ (resp. $y$). We write $p_r(x \mid y)$ (resp. $p_r(y \mid x)$) for the conditional density of $x$ given $y$ (resp. $y$ given $x$).
3. KL-Closeness Distributions for Multinomials
In this section we study closeness distributions on $\Delta_n$ (the family of multinomial distributions of dimension $n$, or the family of finite discrete distributions over $n + 1$ atoms). To do that, first we need to establish the remoteness function. It is well known that there is an isometry between $\Delta_n$ and the positive orthant of the $n$-dimensional sphere of radius 2 (see Section 7.4.2 in [17]). This isometry allows us to compute the volume of the manifold as the area of the sphere of radius 2 on the positive orthant.
Proposition 4.
The volume of $\Delta_n$ is

$$\mathrm{Vol}(\Delta_n) = \frac{\pi^{\frac{n+1}{2}}}{\Gamma\left(\frac{n+1}{2}\right)}.$$
Proof.
The area of a sphere of radius $r$ in $\mathbb{R}^{n+1}$ is $\frac{2 \pi^{\frac{n+1}{2}}}{\Gamma\left(\frac{n+1}{2}\right)} r^n$. Taking $r = 2$, the area is $\frac{2^{n+1} \pi^{\frac{n+1}{2}}}{\Gamma\left(\frac{n+1}{2}\right)}$. Now, there are $2^{n+1}$ orthants, so the positive orthant accounts for $\frac{1}{2^{n+1}}$ of that area, as stated. □
Figure 3 shows that the volume of the space of multinomial distributions over $n + 1$ atoms reaches its maximum at $n = 6$. The main takeaway of Proposition 4 is that the volume of $\Delta_n$ is finite, because this allows us to prove the following result:
Figure 3.
Volume of the family of multinomial distributions as dimension increases.
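The closed form of Proposition 4 can be evaluated directly; the following snippet (an illustration added here, not part of the original text) reproduces the behavior shown in Figure 3:

```python
# Vol(Delta_n) = pi^((n+1)/2) / Gamma((n+1)/2); computed in log-space for stability.
import numpy as np
from scipy.special import gammaln

def volume(n):
    return np.exp(0.5 * (n + 1) * np.log(np.pi) - gammaln(0.5 * (n + 1)))

vols = [volume(n) for n in range(1, 15)]
print(int(np.argmax(vols)) + 1)   # 6: the volume peaks at n = 6 (seven atoms)
print(volume(1))                  # pi: the length of the Bernoulli family
```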
Proposition 5.
For any measurable remoteness function $r$ on $\Delta_n$ there is a closeness distribution implementing it.
Proof.
Directly from Proposition 3 and the fact that $\Delta_n$ has finite volume. □
A reasonable choice of remoteness function for a statistical manifold is the Kullback–Leibler (KL) divergence. The next section analyzes the closeness distributions that implement KL in $\Delta_n$.
3.1. Closeness Distributions for KL as Remoteness Function
Let $x \in \Delta_n$. Thus, $x$ is a discrete distribution over $n + 1$ atoms. We write $x_i$ to represent the probability that $x$ assigns to the $i$-th atom. Note that each $x_i$ is independent of the parametrization and is thus an intrinsic quantity of the distribution.

Let $x, y \in \Delta_n$. The KL divergence between $x$ and $y$ is

$$D(x \parallel y) = \sum_{i=1}^{n+1} x_i \log \frac{x_i}{y_i}.$$

We want to study the closeness distributions that implement KL in $\Delta_n$. The detailed derivation of these results can be found in Appendix B. The closeness pdf according to Equation (7) is

$$p(x, y) = \frac{1}{Z} e^{-D(x \parallel y)} = \frac{1}{Z} \prod_{i=1}^{n+1} \left(\frac{y_i}{x_i}\right)^{x_i}.$$
The marginal for $x$ is

$$p(x) = \frac{1}{Z} \left( \prod_{i=1}^{n+1} x_i^{-x_i} \right) B\left(x + \tfrac{1}{2}\right),$$

where $B(\alpha) = \frac{\prod_{i=1}^{n+1} \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^{n+1} \alpha_i\right)}$ is the multivariate Beta function.

The conditional for $y$ given $x$ is:

$$p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{\prod_{i=1}^{n+1} y_i^{x_i}}{B\left(x + \tfrac{1}{2}\right)}. \tag{10}$$

Equation (10) is very similar to the expression of a Dirichlet distribution. In fact, the expression of $p(y \mid x)$ for integration purposes in the expectation parameterization, namely $\tilde{p}(y \mid x)$, is that of a Dirichlet distribution:

$$\tilde{p}(y \mid x) = \frac{\prod_{i=1}^{n+1} y_i^{x_i - \frac{1}{2}}}{B\left(x + \tfrac{1}{2}\right)}, \qquad \text{that is,} \quad y \mid x \sim \mathrm{Dirichlet}\left(x + \tfrac{1}{2}\right). \tag{11}$$
Equation (11) deserves some attention. We have defined the joint density so that pairs of distributions that are close in terms of KL divergence are assigned a higher probability than pairs of distributions which are further away in terms of KL. Hence, the conditional $p(y \mid x)$ assigns a larger probability to those distributions which are close in terms of KL to $x$. This means that whenever we have a probabilistic model which encodes two multinomial distributions $x$ and $y$, and we are interested in introducing that $y$ should be close to $x$, we can introduce the assumption that $y \mid x \sim \mathrm{Dirichlet}\left(x + \frac{1}{2}\right)$.

Interesting as it is for modeling purposes, the use of Equation (11) however does not allow the modeler to convey information regarding the strength of the link. That is, $y$’s in the KL-surrounding of $x$ will be more probable, but there is no way to establish how much more probable. We know by Proposition 1 that for any remoteness function $r$, we can select $\beta > 0$, and $\beta r$ is order-equivalent to $r$. We can take advantage of that fact and use $\beta$ to encode the strength of the probabilistic link between $x$ and $y$. If instead of using the KL ($D$) as the remoteness function, we opt for $\beta D$, following a parallel development to the one above we will find that

$$y \mid x \sim \mathrm{Dirichlet}\left(\beta x + \tfrac{1}{2}\right). \tag{12}$$

Now, Equation (12) allows the modeler to fix a large value of $\beta$ to encode that it is extremely unlikely that $y$ separates from $x$, or a value of $\beta$ close to 0 to encode that the link between $x$ and $y$ is highly loose. Furthermore, it is important to realize that Equation (12) allows us to interpret any already existing model which incorporates Dirichlet (or Beta) distributions, with the only requirement that each of its concentration parameters is larger than $\frac{1}{2}$. Say we have a model in which $y \sim \mathrm{Dirichlet}(\alpha)$ with each $\alpha_i > \frac{1}{2}$. Then, taking $\beta = \sum_i \left(\alpha_i - \frac{1}{2}\right)$ and defining $x$ by coordinates as $x_i = \frac{\alpha_i - \frac{1}{2}}{\beta}$, we can interpret the model as imposing $y$ to be close to $x$ with intensity $\beta$. Note that, extending this interpretation a bit to the extreme, since the strength of the link reduces as $\beta$ goes to 0, a “free” Dirichlet will have all of its weights set to $\frac{1}{2}$. This coincides with the classical prior suggested by Jeffreys [18,19] for this very same problem. This is reasonable, since Jeffreys’ prior was constructed to be independent of the parametrization, that is, to be intrinsic to the manifold, similarly to what we are doing.
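The following snippet (an illustration under assumed parameter values, not taken from the paper) makes this reinterpretation operational: given any Dirichlet with concentration parameters above $\frac{1}{2}$, it recovers the attractor $x$ and the intensity $\beta$, and shows by sampling that larger $\beta$ concentrates $y$ around $x$:

```python
# Reading a Dirichlet prior as a closeness link, per Equation (12).
# The concentration parameters below are illustrative, not from the paper.
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([3.0, 1.5, 6.0])      # given Dirichlet, all alpha_i > 1/2
beta = np.sum(alpha - 0.5)             # intensity of the closeness link (here 9.0)
x = (alpha - 0.5) / beta               # the distribution y is pulled towards

# y ~ Dirichlet(beta * x + 1/2) is exactly the original Dirichlet(alpha) ...
y = dirichlet.rvs(beta * x + 0.5, size=100_000)
print(x, y.mean(axis=0))               # sample mean approaches x as beta grows

# ... while beta -> 0 degenerates to the "free" Jeffreys prior Dir(1/2, ..., 1/2).
print(dirichlet.rvs(10 * beta * x + 0.5, size=100_000).std(axis=0))  # tighter around x
```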
3.2. Visualizing the Distributions
In the previous section we have seen an expression for $p(y \mid x)$. Since the KL divergence is not symmetric, we have that $p(y \mid x)$ is different from $p(x \mid y)$. Unfortunately, we have not been able to provide a closed form expression for $p(x \mid y)$. However, it is possible to compute it numerically in order to compare both conditionals.
Figure 4 shows a comparison of $p(y \mid x)$ and $p(x \mid y)$. Following what is suggested in [20], for a proper interpretation of the densities we show the density function, which is intrinsic to the manifold, instead of its expression in the parametrization, as is commonly done. Note that from Equation (12), the value of $p(y \mid x)$ is 0 both at $y = 0$ and at $y = 1$. In Figure 4, we can see that this is not the case for $p(x \mid y)$, neither at $x = 0$ nor at $x = 1$. In fact we see that $p(y \mid x)$ always starts below $p(x \mid y)$ at $y = 0$ (resp. $x = 0$). Then, as $y$ (resp. $x$) grows, it is always the case that $p(y \mid x)$ goes over $p(x \mid y)$, to end up decreasing below it again when $y$ (resp. $x$) approaches 1.
Figure 4.
Comparison of $p(y \mid x)$ and $p(x \mid y)$ under four different settings of the conditioning value and of the intensity $\beta$ (panels (a–d)).
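The numeric computation behind this comparison can be sketched as follows (an illustration with assumed values of $\beta$ and of the conditioning points, not the paper's exact ones). Both conditionals are intrinsic densities; in the angle coordinate $u$ with $\theta = \sin^2(u/2)$ the Riemannian measure is $du$, so each normalizer is a one-dimensional integral:

```python
# Numeric comparison of p(y|x) (closed form) and p(x|y) (no closed form) on
# Delta_1, for remoteness beta * KL. Values of beta, x0, y0 are illustrative.
import numpy as np
from scipy.integrate import quad

def kl(x, y):
    return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))

theta = lambda u: np.sin(u / 2.0) ** 2
eps = 1e-9
beta, x0, y0 = 10.0, 0.3, 0.3

# p(y | x = x0) is proportional to exp(-beta * KL(x0 || y)).
Zy, _ = quad(lambda v: np.exp(-beta * kl(x0, theta(v))), eps, np.pi - eps)
p_y_given_x = lambda y: np.exp(-beta * kl(x0, y)) / Zy

# p(x | y = y0) is proportional to exp(-beta * KL(x || y0)).
Zx, _ = quad(lambda u: np.exp(-beta * kl(theta(u), y0)), eps, np.pi - eps)
p_x_given_y = lambda x: np.exp(-beta * kl(x, y0)) / Zx

for t in (0.01, 0.1, 0.3, 0.6, 0.99):
    print(t, p_y_given_x(t), p_x_given_y(t))
# p(y|x) vanishes at the extremes of the interval; p(x|y) does not.
```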
4. Reinterpreting the Beta-Binomial Model
We are now ready to go back to the rodents example provided in the introduction. The main idea we would like this hierarchical model to capture is that the $\theta_i$’s are somewhat similar. We do this by introducing a new random variable $\phi$ to which we would like each $\theta_i$ to be close (see Figure 5). Furthermore, we introduce another variable $\beta$ that controls how tightly coupled the $\theta_i$’s are to $\phi$. Now, $\phi$ represents a proportion, and priors for proportions have been well studied, including the “Bayes–Laplace rule” [21] which recommends the uniform prior $\mathrm{Beta}(1, 1)$, the Haldane prior [22] which is the improper prior $\mathrm{Beta}(0, 0)$, and the Jeffreys’ prior [18,19] $\mathrm{Beta}\left(\frac{1}{2}, \frac{1}{2}\right)$. Following the arguments in the previous section, here we stick with the Jeffreys’ prior. A more difficult problem is the selection of the prior for $\beta$, where we still do not have a well founded choice. Note that, taking a look at Equation (12), $\beta$’s role acts similarly (although not exactly equal) to an equivalent sample size. Thus, the prior over $\beta$ could be thought of as a prior over the equivalent sample size with which $\phi$ will be incorporated as prior into the determination of each of the $\theta_i$’s. In case the size of each sample ($n_i$) is large, there will not be much difference between a hierarchical model and modeling each of the 71 experiments as independent experiments. So, it makes sense for the prior over $\beta$ to concentrate on relatively small equivalent sample sizes. Following this line of thought we propose $\beta$ to follow a Gamma distribution concentrated on relatively small values.
Figure 5.
Reinterpreted hierarchical graphical model for the rodents example.
To summarize, the hierarchical model we obtain based on closeness probability distributions is:

$$\phi \sim \mathrm{Beta}\left(\tfrac{1}{2}, \tfrac{1}{2}\right), \qquad \beta \sim \mathrm{Gamma}, \qquad \theta_i \mid \phi, \beta \sim \mathrm{Beta}\left(\beta \phi + \tfrac{1}{2},\; \beta (1 - \phi) + \tfrac{1}{2}\right), \qquad y_i \mid \theta_i \sim \mathrm{Binomial}(n_i, \theta_i).$$
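A sketch of this model in PyMC could read as follows (the counts are synthetic placeholders rather than the data of Table 1, and the Gamma hyperparameters for $\beta$ are an assumed stand-in, since the exact choice is a modeling decision):

```python
# Reinterpreted rodents model. The counts are synthetic placeholders (the real
# data is in Table 1) and Gamma(2, 0.1) is an assumed stand-in prior for beta.
import numpy as np
import pymc as pm

y = np.array([0, 2, 1, 4, 5, 2, 4])          # synthetic tumor counts per group
n = np.array([20, 18, 19, 14, 25, 20, 17])   # synthetic group sizes

with pm.Model() as rodents:
    phi = pm.Beta("phi", alpha=0.5, beta=0.5)      # Jeffreys prior on the pooled proportion
    beta = pm.Gamma("beta", alpha=2.0, beta=0.1)   # assumed prior on the link intensity
    # Closeness link: theta_i ~ Beta(beta*phi + 1/2, beta*(1 - phi) + 1/2)
    theta = pm.Beta("theta",
                    alpha=beta * phi + 0.5,
                    beta=beta * (1.0 - phi) + 0.5,
                    shape=len(y))
    pm.Binomial("obs", n=n, p=theta, observed=y)
    idata = pm.sample()
```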
Figure 6 shows that the posteriors generated by both models are similar, and put the parameter $\phi$ (the pooled average) between 0.08 and 0.15, and the parameter $\beta$ (the intensity of the link between $\phi$ and each of the $\theta_i$’s) between 5 and 25. Furthermore, the model is relatively insensitive to the parameters of the prior for $\beta$, as long as they do create a sparse prior. Thus, we see that selecting a Gamma prior that is too concentrated on the low values of $\beta$ (that is, one that imposes a relatively mild closeness link between $\phi$ and each of the $\theta_i$’s) changes the estimation a lot. However, a more spread Gamma prior creates a posterior similar to the first one, despite being more spread.
Figure 6.
Comparison of posteriors between a closeness distribution model and that proposed by Gelman et al. in [2].
5. Hierarchical Dirichlet Multinomial Model
Recently, Azzimonti et al. [9,10,11] have proposed a hierarchical Dirichlet-Multinomial model to estimate conditional probability tables (CPTs) in Bayesian networks. Given two discrete finite random variables $X$ (over domain $\mathcal{X}$) and $Y$ (over domain $\mathcal{Y}$) which are part of a Bayesian network, and such that $Y$ is the only parent of $X$ in the network, the CPT for $X$ is responsible for storing $P(X \mid Y = y)$ for each $y \in \mathcal{Y}$. The usual CPT model (the so-called Multinomial-Dirichlet) adheres to parameter independence and stores different independent Dirichlet distributions over each of the $P(X \mid Y = y)$. Instead, Azzimonti et al. propose the hierarchical Multinomial-Dirichlet model, where “the parameters of different conditional distributions belonging to the same CPT are drawn from a common higher-level distribution”. Their model can be summarized equationally as

$$\alpha \sim \mathrm{Dirichlet}(\alpha_0), \qquad \theta_{X \mid y} \mid \alpha \sim \mathrm{Dirichlet}(s \cdot \alpha) \ \ \text{for each } y \in \mathcal{Y}, \qquad X \mid Y = y \sim \mathrm{Categorical}(\theta_{X \mid y}),$$
and graphically as shown in Figure 7.
Figure 7.
PGM for the hierarchical Dirichlet Multinomial model proposed in [10].
The fact that the Dirichlet distribution is the conditional of a closeness distribution allows us to think about this model as a generalization of the model presented for the rat example. Thus, the hierarchical Dirichlet-Multinomial model can be understood as introducing the assumption that there is a probability distribution with parameter $\phi$ that is close in terms of its KL divergence to each of the different conditional distributions, each of them parameterized by $\theta_{X \mid y}$. Thus, in equational terms, we have that the model can be rewritten as

$$\phi \sim \mathrm{Dirichlet}\left(\tfrac{1}{2}, \ldots, \tfrac{1}{2}\right), \qquad \beta \sim \mathrm{Gamma}, \qquad \theta_{X \mid y} \mid \phi, \beta \sim \mathrm{Dirichlet}\left(\beta \phi + \tfrac{1}{2}\right) \ \ \text{for each } y \in \mathcal{Y}, \qquad X \mid Y = y \sim \mathrm{Categorical}(\theta_{X \mid y}),$$

and depicted as shown in Figure 8. Note that in our reinterpreted model $\beta$ plays a role quite similar to the one that $s$ played in Azzimonti’s model. To maintain the parallel with the model developed for the rodents example, here we have also assumed a Gamma distribution as prior over $\beta$, instead of the point-mass distribution assumed in [10], but we could easily mimic their approach and specify a single value for $\beta$.
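A PyMC sketch of this reinterpreted CPT model could read as follows (the counts are synthetic, the priors are assumed as above, and the names `x_common` and `beta` are illustrative, not from [10]):

```python
# Reinterpreted hierarchical Dirichlet-Multinomial CPT model; counts synthetic.
import numpy as np
import pymc as pm

counts = np.array([[10, 3, 2],    # rows: values of Y; columns: counts of X
                   [4, 9, 1],
                   [2, 2, 11]])

with pm.Model() as cpt:
    x_common = pm.Dirichlet("x_common", a=np.full(3, 0.5))  # Jeffreys prior
    beta = pm.Gamma("beta", alpha=2.0, beta=0.1)            # assumed prior on link intensity
    # One conditional distribution per value of Y, all pulled towards x_common.
    theta = pm.Dirichlet("theta", a=beta * x_common + 0.5, shape=(3, 3))
    pm.Multinomial("obs", n=counts.sum(axis=1), p=theta, observed=counts)
    idata = pm.sample()
```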
Figure 8.
Reinterpreted PGM for the hierarchical Dirichlet Multinomial model.
Note that we are not claiming that we are improving the hierarchical Dirichlet-Multinomial model; we are just reinterpreting it in a way that is easier to understand conceptually.
6. Conclusions and Future Work
We have introduced the idea and formalization of remoteness functions and closeness distributions in Section 2. We have proven that any remoteness function induces a closeness distribution, provided that the volume of the space of distributions is finite. We have particularized closeness distributions for multinomials in Section 3, taking the remoteness function to be the KL divergence. By analyzing two examples, we have shown that closeness distributions can be a useful tool for the probabilistic model builder. We have seen that they can provide additional rationale and geometric intuitions for some commonly used hierarchical models.
In the Crowd4SDG and Humane-AI-net projects, these mathematical tools could prove useful in the understanding and development of consensus models for citizen science, improving the ones presented in [23]. Our plan is to study this in future work.
In this paper we have concentrated on discrete closeness distributions. The study of continuous closeness distributions remains for future work.
Funding
This work was partially supported by the projects Crowd4SDG and Humane-AI-net, which have received funding from the European Union’s Horizon 2020 research and innovation program under grant agreements No 872944 and No 952026, respectively. This work was also partially supported by Grant PID2019-104156GB-I00 funded by MCIN/AEI/10.13039/501100011033.
Acknowledgments
Thanks to Borja Sánchez López, Jerónimo Hernández-González and Mehmet Oğuz Mülâyim for discussions on preliminary versions.
Conflicts of Interest
The author declares no conflict of interest.
Appendix A. Total Order Induced by a Function
Definition A1.
Let $Z$ be a set and $f: Z \to \mathbb{R}$ a function. The binary relation $\leq_f$ (a subset of $Z \times Z$) is defined as

$$z \leq_f z' \iff f(z) \leq f(z').$$
Proposition A1.
$\leq_f$ is a total (or linear) order in $Z$.
Proof.
Reflexivity, transitivity, antisymmetry, and totality are inherited from the fact that $\leq$ is a total order in $\mathbb{R}$. □
Appendix B. Detailed Derivation of the KL Based Closeness Distributions for Multinomials
Closeness Distributions for KL as Remoteness Function
Let $x \in \Delta_n$. Thus, $x$ is a discrete distribution over $n + 1$ atoms. We write $x_i$ to represent the probability that $x$ assigns to the $i$-th atom. Note that each $x_i$ is independent of the parametrization and is thus an intrinsic quantity of the distribution.

Let $x, y \in \Delta_n$. The KL divergence between $x$ and $y$ is

$$D(x \parallel y) = \sum_{i=1}^{n+1} x_i \log \frac{x_i}{y_i}.$$

The closeness pdf according to Equation (7) is

$$p(x, y) = \frac{1}{Z} e^{-D(x \parallel y)} = \frac{1}{Z} \prod_{i=1}^{n+1} \left(\frac{y_i}{x_i}\right)^{x_i}.$$
Now, it is possible to assess the marginal for $x$:

$$p(x) = \int_{\Delta_n} p(x, y) \, d\mu(y) = \frac{1}{Z} \left( \prod_{i=1}^{n+1} x_i^{-x_i} \right) \int_{\Delta_n} \prod_{i=1}^{n+1} y_i^{x_i} \, d\mu(y), \tag{A2}$$

where we recall that $\mu$ is the measure induced by the Fisher metric and it is not to be confused with the Lebesgue measure. To continue, we need to compute $\int_{\Delta_n} \prod_i y_i^{x_i} \, d\mu(y)$ as an intrinsic quantity of the manifold, that is, invariant to changes in parametrization. We can parameterize the manifold using $y$ itself (the expectation parameters). In this parameterization, $\sqrt{|G(y)|} = \prod_{i=1}^{n+1} y_i^{-\frac{1}{2}}$, and the integral can be written as

$$\int_{\Delta_n} \prod_{i=1}^{n+1} y_i^{x_i} \, d\mu(y) = \int \prod_{i=1}^{n+1} y_i^{x_i - \frac{1}{2}} \, dy = B\left(x + \tfrac{1}{2}\right), \tag{A3}$$

where the last equality comes from identifying it as a Dirichlet integral of type 1 (see 15-08 in [24]), and $B(\alpha) = \frac{\prod_{i=1}^{n+1} \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^{n+1} \alpha_i\right)}$ is the multivariate Beta function. Combining Equation (A2) with Equation (A3) we get

$$p(x) = \frac{1}{Z} \left( \prod_{i=1}^{n+1} x_i^{-x_i} \right) B\left(x + \tfrac{1}{2}\right).$$

From here, we can compute the conditional for $y$ given $x$:

$$p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{\prod_{i=1}^{n+1} y_i^{x_i}}{B\left(x + \tfrac{1}{2}\right)}. \tag{A5}$$

Equation (A5) is very similar to the expression of a Dirichlet distribution. In fact, the expression of $p(y \mid x)$ for integration purposes in the expectation parameterization is that of a Dirichlet distribution:

$$\tilde{p}(y \mid x) = \frac{\prod_{i=1}^{n+1} y_i^{x_i - \frac{1}{2}}}{B\left(x + \tfrac{1}{2}\right)}, \qquad \text{that is,} \quad y \mid x \sim \mathrm{Dirichlet}\left(x + \tfrac{1}{2}\right).$$
References
- Van de Schoot, R.; Depaoli, S.; King, R.; Kramer, B.; Märtens, K.; Tadesse, M.G.; Vannucci, M.; Gelman, A.; Veen, D.; Willemsen, J.; et al. Bayesian statistics and modelling. Nat. Rev. Methods Prim. 2021, 1, 1.
- Gelman, A.; Carlin, J.B.; Stern, H.S.; Rubin, D.B. Bayesian Data Analysis; Chapman and Hall/CRC: London, UK, 2013.
- Allenby, G.M.; Rossi, P.E.; McCulloch, R. Hierarchical Bayes Models: A Practitioners Guide. SSRN Electron. J. 2005.
- Lee, S.Y.; Lei, B.; Mallick, B. Estimation of COVID-19 spread curves integrating global data and borrowing information. PLoS ONE 2020, 15, e0236860.
- Lee, S.Y.; Mallick, B.K. Bayesian Hierarchical Modeling: Application Towards Production Results in the Eagle Ford Shale of South Texas. Sankhya B 2021.
- Tarone, R.E. The Use of Historical Control Information in Testing for a Trend in Proportions. Biometrics 1982, 38, 215–220.
- Koller, D.; Friedman, N. Probabilistic Graphical Models: Principles and Techniques; MIT Press: Cambridge, MA, USA, 2009.
- Obermeyer, F.; Bingham, E.; Jankowiak, M.; Pradhan, N.; Chiu, J.; Rush, A.; Goodman, N. Tensor variable elimination for plated factor graphs. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 4871–4880.
- Azzimonti, L.; Corani, G.; Zaffalon, M. Hierarchical Multinomial-Dirichlet Model for the Estimation of Conditional Probability Tables. In Proceedings of the 2017 IEEE International Conference on Data Mining (ICDM), New Orleans, LA, USA, 18–21 November 2017; pp. 739–744.
- Azzimonti, L.; Corani, G.; Zaffalon, M. Hierarchical estimation of parameters in Bayesian networks. Comput. Stat. Data Anal. 2019, 137, 67–91.
- Azzimonti, L.; Corani, G.; Scutari, M. Structure Learning from Related Data Sets with a Hierarchical Bayesian Score. In Proceedings of the International Conference on Probabilistic Graphical Models, PMLR, Aalborg, Denmark, 23–25 September 2020; pp. 5–16.
- Amari, S.I. Information Geometry and Its Applications; Springer: Berlin/Heidelberg, Germany, 2016; Volume 194.
- Dudley, R.M. Real Analysis and Probability, 2nd ed.; Cambridge Studies in Advanced Mathematics; Cambridge University Press: Cambridge, UK, 2002.
- Jost, J. Riemannian Geometry and Geometric Analysis; Springer: Berlin/Heidelberg, Germany, 2011.
- Pennec, X. Probabilities and Statistics on Riemannian Manifolds: A Geometric Approach; Technical Report RR-5093; INRIA: Rocquencourt, France, 2004.
- Amann, H.; Escher, J. Analysis III; Birkhäuser: Basel, Switzerland, 2009.
- Kass, R.E.; Vos, P.W. Geometrical Foundations of Asymptotic Inference; Wiley-Interscience: Hoboken, NJ, USA, 1997.
- Jeffreys, H. An invariant form for the prior probability in estimation problems. Proc. R. Soc. Lond. Ser. A Math. Phys. Sci. 1946, 186, 453–461.
- Jeffreys, H. The Theory of Probability; Oxford University Press: Oxford, UK, 1998.
- Cerquides, J. Parametrization invariant interpretation of priors and posteriors. arXiv 2021, arXiv:2105.08304.
- Laplace, P.S.m.d. Essai Philosophique sur les Probabilités; Courcier: Le Mesnil-Saint-Denis, France, 1814.
- Haldane, J.B.S. A note on inverse probability. Math. Proc. Camb. Philos. Soc. 1932, 28, 55–61.
- Cerquides, J.; Mülâyim, M.O.; Hernández-González, J.; Ravi Shankar, A.; Fernandez-Marquez, J.L. A Conceptual Probabilistic Framework for Annotation Aggregation of Citizen Science Data. Mathematics 2021, 9, 875.
- Jeffreys, H.; Swirles Jeffreys, B. Methods of Mathematical Physics; Cambridge University Press: Cambridge, UK, 1950.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).