A first approach to closeness distributions

Probabilistic graphical models allow us to encode a large probability distribution as a composition of smaller ones. It is oftentimes the case that we are interested in incorporating in the model the idea that some of these smaller distributions are likely to be similar to one another. In this paper we provide an information geometric approach on how to incorporate this information, and see that it allows us to reinterpret some already existing models.


Introduction
We start by introducing a simple example to illustrate the kind of problems we are interested in solving. Consider the problem of estimating a parameter θ using data from a small experiment and a prior distribution constructed from similar previous experiments. The specific problem description is borrowed from [3]: Estimating the risk of tumor in a group of rats.
In the evaluation of drugs for possible clinical application, studies are routinely performed on rodents. For a particular study drawn from the statistical literature, suppose the immediate aim is to estimate θ, the probability of tumor in a population of female laboratory rats of type 'F344' that receive a zero dose of the drug (a control group). The data show that 4 out of 14 rats developed endometrial stromal polyps (a kind of tumor). Typically, the mean and standard deviation of underlying tumor risks are not available. Rather, historical data are available on previous experiments on similar groups of rats. In the rat tumor example, the historical data were in fact a set of observations of tumor incidence in 70 groups of rats (table 1). In the ith historical experiment, let the number of rats with tumors be y_i and the total number of rats be n_i. We model the y_i's as independent binomial data, given sample sizes n_i and study-specific means θ_i.

Table 1: Tumor incidence in 70 historical groups of rats and in the current group of rats (from [17]). The table displays the values of (number of rats with tumors)/(total number of rats).

We can depict our graphical model as shown in fig. 1, where current and historical experiments are a random sample from a common population, having h as hyperparameters. Equationally, our model can be described as

    h ∼ f
    θ_i | h ∼ g(· | h)                i = 1, …, 71
    y_i | θ_i ∼ Binomial(n_i, θ_i)    i = 1, …, 71

The model used for this problem in [3] is the Beta-Binomial model, where g is taken to be the Beta distribution, hence h = (α, β) (see fig. 2). Furthermore, in [3] the prior f over α, β is taken to be proportional to (α + β)^(−5/2).

Figure 2: PGM for the rodents example proposed in [3].
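Numerically, the conjugate Beta-Binomial update at the heart of this example is easy to sketch. The snippet below is a minimal illustration, not the full hierarchical analysis of [3]: the historical counts are hypothetical placeholders (the real values are those of table 1), and the population distribution g is fit by a simple method of moments.

```python
import numpy as np

# Hypothetical historical counts (placeholders; the real values are in table 1).
y_hist = np.array([0, 2, 1, 5, 2, 4, 1, 3])          # rats with tumors per group
n_hist = np.array([20, 13, 18, 22, 19, 14, 17, 20])  # group sizes

# Fit a Beta(alpha, beta) population distribution g to the historical
# tumor rates by method of moments.
rates = y_hist / n_hist
m, v = rates.mean(), rates.var(ddof=1)
common = m * (1 - m) / v - 1
alpha, beta = m * common, (1 - m) * common

# Conjugate update with the current experiment: 4 tumors out of 14 rats.
y_cur, n_cur = 4, 14
alpha_post, beta_post = alpha + y_cur, beta + (n_cur - y_cur)
post_mean = alpha_post / (alpha_post + beta_post)
print(f"posterior mean of theta: {post_mean:.3f}")
```

The posterior mean lands between the raw rate 4/14 and the historical average, which is exactly the shrinkage a hierarchical model is designed to produce.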
The presentation of the model in [3] simply introduces the assumption that "the Beta prior distribution with parameters (α, β) is a good description of the population distribution of the θ i 's in the historical experiments" without further justification. In this paper we would like to show that a large part of this model can be obtained from the intuitive idea that the probability distributions for rats with tumors in each group are similar. To do that, we develop a framework for encoding as a probability distribution the assumption that two probability distributions are close to each other, and rely on information geometric concepts to model the idea of closeness. We start by introducing the general concept of closeness distribution in section 2. Then, we analyze the particular case in which we choose to measure remoteness between distributions by means of the Kullback-Leibler divergence in the family of multinomial distributions in section 3. The results from section 3 are used in section 4 to reinterpret the Beta-Binomial model proposed in [3] for the rodents example, and in section 5 to reinterpret the Hierarchical Dirichlet Multinomial model proposed by Azzimonti et al. in [5,6,4]. We are convinced that closeness distributions could play a relevant role in probabilistic modelling, allowing for more explicitly geometrically inspired probabilistic models. This paper is just a first step towards a proper definition and understanding of closeness distributions.

Closeness distributions
We start by introducing the formal framework required to discuss probability distributions over probability distributions. Then, we formalize what we mean by remoteness through a remoteness function, and we introduce closeness distributions as those that implement a remoteness function.

Probabilities over probabilities
Information geometry [2] has shown us that most families of probability distributions can be understood as Riemannian manifolds. Thus, we can work with probabilities over probabilities by defining random variables which take values in a Riemannian manifold. Here, we only introduce some fundamental definitions. For a more detailed overview of measure and probability theory see [8]; for Riemannian manifolds see [13]. Finally, Pennec provides a good overview of probability on Riemannian manifolds in [16].
We start by noting that each manifold M has an associated σ-algebra, L_M, the Lebesgue σ-algebra of M (see section 1, chapter XII in [1]). Furthermore, the existence of a metric g induces a measure µ_g (see section 1.3 in [16]). The volume of M is defined as Vol(M) = ∫_M 1 dµ_g.

Definition 1. Let (Ω, F, P) be a probability space and (M, g) be a Riemannian manifold. A random variable x taking values in M (referred to as a random primitive in [16]) is a measurable function from Ω to M. Furthermore, we say that x has a probability density function (p.d.f.) p_x (a real, positive, integrable function) if

    P(x ∈ A) = ∫_A p_x dµ_g    for every A ∈ L_M,    and ∫_M p_x dµ_g = 1.
We would like to highlight that the density function p_x is intrinsic to the manifold. If x = π(x) is a chart of the manifold defined almost everywhere, we obtain a random vector x = π(x). The expression of p_x in this parametrization is

    ρ_x(x) = p_x(π⁻¹(x)) √|G(x)|,

where G(x) is the matrix expression of the metric g in the chart π.
Let f : M → R be a real function on M. We define the expectation of f under x as

    E[f(x)] = ∫_M f p_x dµ_g.

We have to be careful when computing E[f(x)] so that we do it independently of the parametrization. We have to use the fact that ρ_x = p_x √|G| is the expression of p_x in the parametrization for integration purposes, that is, its expression with respect to the Lebesgue measure dx instead of dµ_g:

    E[f(x)] = ∫ f(π⁻¹(x)) p_x(π⁻¹(x)) √|G(x)| dx.
We note that ρ x depends on the chart used whereas p x is intrinsic to the manifold.

Formalizing remoteness and closeness
Intuitively, the objective of this section is to create a probability distribution over pairs of probability distributions that assigns higher probability to those pairs of probability distributions which are "close".
We assume that we measure how distant two points in M are by means of a remoteness function r : M × M → R, such that r(x, y) ≥ 0 for each x, y ∈ M. Note that r does not need to be transitive, symmetric or reflexive.
As can be seen in appendix A, r induces a total order ≤_r in M × M. We say that two remoteness functions r, s are order-equivalent if ≤_r = ≤_s.

Proposition 2. Let γ, β ∈ R, γ, β > 0. Then, γ · r + β is order-equivalent to r.
We say that a probability density function p : M × M → R implements a remoteness function r if ≥ p =≤ r . This is equivalent to stating that for each x, y, z, t ∈ M we have that p(x, y) ≥ p(z, t) iff r(x, y) ≤ r(z, t). That is, a density function implements a remoteness function r if it assigns higher probability density to those pairs of points which are closer according to r.
Once we have clarified what it means for a probability distribution to implement a remoteness function, we introduce a specific way of creating probability distributions that do so.
Definition 3. Let r be a remoteness function on M. Define f_r(x, y) = exp(−r(x, y)) and Z_r = ∫_{M×M} f_r dµ_g. Whenever Z_r is finite, we can define the probability density function

    p_r(x, y) = exp(−r(x, y)) / Z_r.

We refer to the corresponding probability distribution as a closeness distribution.
Note that p_r is defined intrinsically. Following the explanation in the previous section, let π be a chart of M defined almost everywhere. The representation of this pdf in the parametrization (x, y) = (π(x), π(y)) is simply p_r(x, y) = exp(−r(x, y)) / Z_r, and its representation for integration purposes is

    ρ_r(x, y) = p_r(x, y) √|G(x)| √|G(y)|.

Proposition 4. If it exists, p_r implements r.
Proof. The exponential is a monotone function, and the minus sign in the exponent reverses the order.
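The order reversal can be checked numerically. Below is a toy sketch with a hypothetical (asymmetric) remoteness function on the unit interval; f_r stands for the unnormalized density exp(−r) of definition 3.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy remoteness function on [0, 1] x [0, 1]; note it is not symmetric.
def r(x, y):
    return (x - y) ** 2 + 0.1 * x

# Unnormalised closeness density f_r = exp(-r).
def f_r(x, y):
    return np.exp(-r(x, y))

# Implementation property: f_r(x, y) >= f_r(z, t)  iff  r(x, y) <= r(z, t).
pairs = rng.random((200, 2))
for (x, y), (z, t) in zip(pairs[:-1], pairs[1:]):
    assert (f_r(x, y) >= f_r(z, t)) == (r(x, y) <= r(z, t))
print("exp(-r) assigns higher density to closer pairs")
```

Since exp is strictly decreasing after the sign flip, ties in r map exactly to ties in f_r, so the two orders coincide.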
Proposition 5. If r is measurable and M has finite volume, then Z r is finite, and hence p r implements r.
Proof. Note that since r(x, y) ≥ 0, we have that f r (x, y) ≤ 1, and hence f r is bounded. Furthermore, f r is measurable since it is a composition of measurable functions. Now, since any bounded measurable function in a finite volume space is integrable, Z r is finite.
Obviously, once we have established a closeness distribution p r we can define its marginal and conditional distributions in the usual way. We note p r (x) (resp. p r (y)) as the marginal over x (resp. y). We note p r (x|y) (resp. p r (y|x)) as the conditional density of x given y (resp. y given x).

KL-closeness distributions for multinomials
In this section we study closeness distributions on M_n (the family of multinomial distributions of dimension n, or equivalently the family of finite discrete distributions over n + 1 atoms). To do that, we first need to establish the remoteness function. It is well known that there is an isometry between M_n and the positive orthant of the n-dimensional sphere S^n of radius 2 (see section 7.4.2 in [14]). This isometry allows us to compute the volume of the manifold as the area of the positive orthant of the sphere of radius 2.
Proposition 6. The volume of M_n is

    Vol(M_n) = π^((n+1)/2) / Γ((n+1)/2).

Proof. The area of a sphere S^n of radius r is A_{n,r} = (2 π^((n+1)/2) / Γ((n+1)/2)) r^n. Now, R^(n+1) has 2^(n+1) orthants, so the positive orthant accounts for 1/2^(n+1) of that area; taking r = 2 gives the result, as stated.

Figure 3 shows that the volume reaches its maximum at n = 7. The main takeaway of proposition 6 is that the volume of M_n is finite, because this allows us to prove the following result:

Proposition 7. For any measurable remoteness function r on M_n there is a closeness distribution p_r implementing it.
Proof. Directly from proposition 5 and the fact that M n has finite volume.
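The finite-volume claim is easy to verify numerically. The sketch below computes Vol(M_n) directly from the sphere-area formula used in the proof of proposition 6 (area of S^n of radius 2, divided by the 2^(n+1) orthants).

```python
import math

# Vol(M_n): area of the radius-2 sphere S^n restricted to the positive
# orthant, i.e. A(n, 2) / 2**(n+1) with
# A(n, r) = 2 * pi**((n+1)/2) / Gamma((n+1)/2) * r**n.
def volume_multinomial_manifold(n):
    area = 2 * math.pi ** ((n + 1) / 2) / math.gamma((n + 1) / 2) * 2 ** n
    return area / 2 ** (n + 1)

vols = {n: volume_multinomial_manifold(n) for n in range(1, 16)}
for n, v in vols.items():
    print(n, round(v, 3))

# Every volume is finite (and decays for large n), so by proposition 5
# any measurable remoteness function on M_n yields a closeness distribution.
assert all(math.isfinite(v) for v in vols.values())
```

A quick sanity check: for n = 1 (Bernoulli distributions) the formula returns π, the length of a quarter of a circle of radius 2.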
A reasonable choice of remoteness function for a statistical manifold is the Kullback-Leibler (KL) divergence. The next section analyzes the closeness distributions that implement KL in M n .

Closeness distributions for KL as remoteness function
Let θ ∈ M_n. Thus, θ is a discrete distribution over n + 1 atoms. We write θ_i to represent p(x = i | θ). Note that each θ_i is independent of the parametrization and thus is an intrinsic quantity of the distribution.
Let θ, η ∈ M_n. The KL divergence between θ and η is

    D(θ ‖ η) = Σ_{i=1}^{n+1} θ_i log(θ_i / η_i).

We want to study the closeness distributions that implement KL in M_n. The detailed derivation of these results can be found in appendix B. The closeness pdf according to eq. (7) is

    p_D(µ, θ) = exp(−D(µ ‖ θ)) / Z_D.

The marginal for µ is

    p_D(µ) = (1/Z_D) exp(−Σ_{i=1}^{n+1} µ_i log µ_i) ∫_{M_n} ∏_{i=1}^{n+1} θ_i^{µ_i} dµ_g(θ).

And the conditional for θ given µ:

    p_D(θ | µ) = ∏_{i=1}^{n+1} θ_i^{µ_i} / ∫_{M_n} ∏_{i=1}^{n+1} θ_i^{µ_i} dµ_g(θ).    (10)

Equation (10) is very similar to the expression of a Dirichlet distribution. In fact, the expression of p_D(θ | µ) for integration purposes in the expectation parameterization, namely ρ_D(θ | µ), is that of a Dirichlet distribution:

    ρ_D(θ | µ) = Dirichlet(θ; µ + 1/2).    (11)

Equation (11) deserves some attention. We have defined the joint density p_D(µ, θ) so that pairs of distributions (µ, θ) that are close in terms of KL divergence are assigned a higher probability than pairs of distributions (µ*, θ*) which are further away in terms of KL. Hence, the conditional p_D(θ | µ) assigns a larger probability to those distributions θ which are close in terms of KL to µ. This means that whenever we have a probabilistic model which encodes two multinomial distributions θ and µ, and we are interested in introducing the assumption that θ should be close to µ, we can assume that θ ∼ Dirichlet(µ + 1/2).

Interesting as it is for modeling purposes, the use of eq. (11) however does not allow the modeler to convey information regarding the strength of the link. That is, θ's in the KL-surrounding of µ will be more probable, but there is no way to establish how much more probable. We know by proposition 2 that for any remoteness function r, we can select γ, β > 0, and γ·r + β is order-equivalent to r. We can take advantage of that fact and use γ to encode the strength of the probabilistic link between θ and µ. If instead of using the KL divergence D as remoteness function we opt for γ·D, following a development parallel to the one above we find that

    ρ_{γ·D}(θ | µ) = Dirichlet(θ; γµ + 1/2).    (12)
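The effect of γ in eq. (12) can be visualized by sampling. The following sketch (with a hypothetical 3-atom µ, using NumPy's Dirichlet sampler) shows that larger γ concentrates θ around µ:

```python
import numpy as np

rng = np.random.default_rng(42)

# A target distribution mu over 3 atoms.
mu = np.array([0.6, 0.3, 0.1])

# Sample theta | mu ~ Dirichlet(gamma * mu + 1/2), as in eq. (12).
def sample_close(mu, gamma, size=5000):
    return rng.dirichlet(gamma * mu + 0.5, size=size)

spreads = {}
for gamma in (1.0, 10.0, 100.0):
    theta = sample_close(mu, gamma)
    spreads[gamma] = np.abs(theta - mu).sum(axis=1).mean()  # mean L1 distance
    print(f"gamma={gamma:>6}: mean |theta - mu|_1 = {spreads[gamma]:.3f}")
```

The mean distance to µ shrinks as γ grows, matching the reading of γ as the strength of the closeness link.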
Now, eq. (12) allows the modeler to fix a large value of γ to encode that it is extremely unlikely that θ separates from µ, or a value of γ close to 0 to encode that the link between θ and µ is very loose. Furthermore, it is important to realize that eq. (12) allows us to interpret any already existing model which incorporates Dirichlet (or Beta) distributions, with the only requirement that each of its concentration parameters is larger than 1/2. Say we have a model in which θ ∼ Dirichlet(α). Then, defining µ coordinate-wise as µ_i = (α_i − 1/2) / Σ_j (α_j − 1/2), we can interpret the model as imposing θ to be close to µ with intensity γ = (α_1 − 1/2)/µ_1 = Σ_j (α_j − 1/2). Note that, extending this interpretation a bit to the extreme, since the strength of the link vanishes as γ → 0, a "free" Dirichlet will have all of its weights set to 1/2. This coincides with the classical prior suggested by Jeffreys [10,11] for this very same problem. This is reasonable, since Jeffreys' prior was constructed to be independent of the parametrization, that is, to be intrinsic to the manifold, similarly to what we are doing.
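This reading of an existing Dirichlet prior can be made mechanical. The helper below (illustrative names, not from the paper) extracts (µ, γ) from a concentration vector α with all α_i > 1/2, and inverts the map:

```python
import numpy as np

# mu_i = (alpha_i - 1/2) / gamma, with gamma = sum_j (alpha_j - 1/2);
# requires every alpha_i > 1/2.
def dirichlet_to_closeness(alpha):
    alpha = np.asarray(alpha, dtype=float)
    assert np.all(alpha > 0.5), "interpretation requires alpha_i > 1/2"
    gamma = (alpha - 0.5).sum()
    return (alpha - 0.5) / gamma, gamma

def closeness_to_dirichlet(mu, gamma):
    return gamma * np.asarray(mu, dtype=float) + 0.5

alpha = np.array([2.0, 3.5, 1.0])
mu, gamma = dirichlet_to_closeness(alpha)
print("mu =", mu, "gamma =", gamma)  # mu = [0.3 0.6 0.1], gamma = 5.0
assert np.allclose(closeness_to_dirichlet(mu, gamma), alpha)
```

Note that µ sums to one by construction, so the decomposition is always a valid (distribution, intensity) pair.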

Reinterpreting the Beta-Binomial model
We are now ready to go back to the rodents example provided in the introduction. The main idea we would like this hierarchical model to capture is that the θ_i's are somewhat similar. We do this by introducing a new random variable µ to which we would like each θ_i to be close (see fig. 5). Furthermore, we introduce another variable γ that controls how tightly coupled the θ_i's are to µ.

Now, µ represents a proportion, and priors for proportions have been well studied, including the "Bayes-Laplace rule" [15], which recommends Beta(1, 1); the Haldane prior [9], which is the improper prior lim_{α→0+} Beta(α, α); and the Jeffreys prior [10,11], Beta(1/2, 1/2). Following the arguments in the previous section, here we stick with the Jeffreys prior. A more difficult problem is the selection of the prior for γ, where we still do not have a well-founded choice. Taking a look at eq. (12), γ acts similarly (although not exactly equal) to an equivalent sample size. Thus, the prior over γ can be thought of as a prior over the equivalent sample size with which µ will be incorporated as prior into the determination of each of the θ_i's. In case the size of each sample (n_i) is large, there will be not much difference between a hierarchical model and modeling each of the 71 experiments independently. So, it makes sense for the prior over γ to concentrate on relatively small equivalent sample sizes. Following this line of thought, we propose γ to follow a Gamma(α = 1, β = 0.1).
To summarize, the hierarchical model we obtain based on closeness distributions is:

    µ ∼ Beta(1/2, 1/2)
    γ ∼ Gamma(1, 0.1)
    θ_i | µ, γ ∼ Beta(γµ + 1/2, γ(1 − µ) + 1/2)    i = 1, …, 71
    y_i | θ_i ∼ Binomial(n_i, θ_i)                 i = 1, …, 71

Figure 6 shows that the posteriors generated by both models are similar, and put the parameter µ (the pooled average) between 0.08 and 0.15 and the parameter γ (the intensity of the link between µ and each of the θ_i's) between 5 and 25. Furthermore, the model is relatively insensitive to the parameters of the prior for γ, as long as the prior is not too concentrated on small values of γ. Thus, we see that selecting Γ(1, 0.5) as prior concentrates too much on low values of γ (that is, it imposes a relatively mild closeness link between µ and each of the θ_i's), which changes the estimation considerably. However, Γ(1, 0.01) creates a posterior similar to that of Γ(1, 0.1), despite being more spread out.
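A forward simulation of this generative model takes only a few lines. The sketch below uses hypothetical group sizes n_i, and reads Gamma(1, 0.1) with β as a rate (i.e. scale = 10), which is an assumption about the convention intended above:

```python
import numpy as np

rng = np.random.default_rng(7)

n_groups = 71
n_i = rng.integers(10, 50, size=n_groups)    # hypothetical group sizes

mu = rng.beta(0.5, 0.5)                      # Jeffreys prior on mu
gamma = rng.gamma(shape=1.0, scale=10.0)     # Gamma(1, 0.1), rate convention
theta = rng.beta(gamma * mu + 0.5, gamma * (1 - mu) + 0.5, size=n_groups)
y = rng.binomial(n_i, theta)                 # tumor counts per group

print(f"mu={mu:.3f}, gamma={gamma:.2f}, pooled rate={y.sum() / n_i.sum():.3f}")
```

Replaying this forward simulation with different γ draws makes the pooling behavior visible: large γ pulls all group rates towards µ, small γ lets them scatter.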

Hierarchical Dirichlet Multinomial model
Recently, Azzimonti et al. have proposed a hierarchical Dirichlet multinomial model to estimate conditional probability tables (CPTs) in Bayesian networks [5,6,4]. Given two discrete finite random variables X (over domain X) and Y (over domain Y) which are part of a Bayesian network, and such that Y is the only parent of X in the network, the CPT for X is responsible for storing p(X|Y). The usual CPT model (the so-called Multinomial-Dirichlet) adheres to parameter independence and stores |Y| different independent Dirichlet distributions, one over each of the θ_{X|y}. Instead, Azzimonti et al. propose the hierarchical Multinomial-Dirichlet model, where "the parameters of different conditional distributions belonging to the same CPT are drawn from a common higher-level distribution". Their model can be summarized equationally as

    α ∼ s · Dirichlet(α_0)
    θ_{X|y} | α ∼ Dirichlet(α)    for each y ∈ Y

and graphically as shown in fig. 7. The fact that the Dirichlet distribution is the conditional of a closeness distribution allows us to think about this model as a generalization of the model presented for the rats example. Thus, the Hierarchical Dirichlet Multinomial model can be understood as introducing the assumption that there is a probability distribution with parameter µ that is close in terms of its KL divergence to each of the |Y| different distributions parameterized by θ_{X|y}. Thus, in equational terms, the model can be rewritten as

    µ ∼ Dirichlet(1/2, …, 1/2)
    γ ∼ Gamma(1, 0.1)
    θ_{X|y} | µ, γ ∼ Dirichlet(γµ + 1/2)    for each y ∈ Y

Note that γ in our reinterpreted model plays a role quite similar to the one that s plays in Azzimonti's model. To maintain the parallel with the model developed for the rodents example, here we have also assumed a Gamma(1, 0.1) prior over γ, instead of the point distribution assumed in [6], but we could easily mimic their approach and specify a single value for γ.
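The reinterpreted CPT model admits the same kind of forward sketch as the rodents example. Domain sizes below are hypothetical; each row of the CPT is a draw of θ_{X|y} close in KL to a common µ:

```python
import numpy as np

rng = np.random.default_rng(3)

k_x, k_y = 4, 3                                  # |X| = 4, |Y| = 3 (hypothetical)

mu = rng.dirichlet(np.full(k_x, 0.5))            # Jeffreys prior on mu
gamma = rng.gamma(shape=1.0, scale=10.0)         # Gamma(1, 0.1), rate convention
cpt = rng.dirichlet(gamma * mu + 0.5, size=k_y)  # row y stores p(X | Y = y)

print(np.round(cpt, 3))
```

Each row is a proper conditional distribution, and all rows are shrunk towards the shared µ with intensity γ, which is the stated reading of the hierarchical model.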
Note that we are not claiming that we are improving the Hierarchical Dirichlet Multinomial model; we are just reinterpreting it in a way that is conceptually easier to understand.
Figure 7: PGM for the hierarchical Dirichlet Multinomial model proposed in [6].

Conclusions
We have introduced the idea of closeness distributions and we have shown that they can be a useful tool for the probabilistic model builder. We have seen that they can provide additional rationale and geometric intuitions for some commonly used hierarchical models. In this paper we have concentrated on closeness distributions over discrete distributions. The study of closeness distributions over continuous distributions remains as future work.

Acknowledgements
Thanks to Borja Sánchez López, Jerónimo Hernández-González and Oguz Mulayim for discussions on preliminary versions. This work was partially supported by the projects Crowd4SDG and Humane-AI-net, which have received funding from the European Union's Horizon 2020 research and innovation program under grant agreements No 872944 and No 952026, respectively. This work was also partially supported by Grant PID2019-104156GB-I00 funded by MCIN/AEI/10.13039/501100011033.

A Total order induced by a function
Definition 8. Let Z be a set and f : Z → R a function. The binary relation ≤_f (a subset of Z × Z) is defined as: x ≤_f y iff f(x) ≤ f(y).

Proposition 9. ≤_f is a total (or linear) order in Z.
Proof. Reflexivity, transitivity, antisymmetry and totality are inherited from the fact that ≤ is a total order on R.
B Detailed derivation of the KL-based closeness distributions for multinomials

B.1 Closeness distributions for KL as remoteness function

Let θ ∈ M_n. Thus, θ is a discrete distribution over n + 1 atoms. We write θ_i to represent p(x = i | θ). Note that each θ_i is independent of the parametrization and thus is an intrinsic quantity of the distribution. Let θ, η ∈ M_n. The KL divergence between θ and η is

    D(θ ‖ η) = Σ_{i=1}^{n+1} θ_i log(θ_i / η_i).

The closeness pdf according to eq. (7) is

    p_D(µ, θ) = exp(−D(µ ‖ θ)) / Z_D = (1/Z_D) exp(−Σ_{i=1}^{n+1} µ_i log µ_i) ∏_{i=1}^{n+1} θ_i^{µ_i}.

Now, it is possible to assess the marginal for µ:

    p_D(µ) = (1/Z_D) exp(−Σ_{i=1}^{n+1} µ_i log µ_i) ∫_{M_n} ∏_{i=1}^{n+1} θ_i^{µ_i} dµ_g(θ).

To continue, we need to compute ∫_{M_n} ∏_{i=1}^{n+1} θ_i^{µ_i} dµ_g as an intrinsic quantity of the manifold, that is, invariant to changes in parametrization. We are integrating f(θ) = ∏_{i=1}^{n+1} θ_i^{µ_i}. We can parameterize the manifold using θ itself (the expectation parameters). In this parametrization √|G(θ)| = ∏_{i=1}^{n+1} θ_i^{−1/2}, so the integral can be written

    ∫_{Δ_n} ∏_{i=1}^{n+1} θ_i^{µ_i − 1/2} dθ = B(µ + 1/2),

where B denotes the multivariate Beta function. From here, we can compute the conditional for θ given µ:

    p_D(θ | µ) = p_D(µ, θ) / p_D(µ) = ∏_{i=1}^{n+1} θ_i^{µ_i} / B(µ + 1/2).    (27)

Equation (27) is very similar to the expression of a Dirichlet distribution. In fact, the expression of ρ_D(θ | µ) in the expectation parameterization is that of a Dirichlet distribution:

    ρ_D(θ | µ) = ∏_{i=1}^{n+1} θ_i^{µ_i − 1/2} / B(µ + 1/2) = Dirichlet(θ; µ + 1/2).