Expansion of the Kullback-Leibler Divergence, and a new class of information metrics

Inferring and comparing complex, multivariable probability density functions is a fundamental problem in several fields, including probabilistic learning, network theory, and data analysis. Classification and prediction are the two faces of this class of problem. We take an approach here that simplifies many aspects of these problems by presenting a structured series expansion of the Kullback-Leibler divergence - a function central to information theory - and devise a distance metric based on this divergence. Using the Möbius inversion duality between multivariable entropies and multivariable interaction information, we express the divergence as an additive series in the number of interacting variables. Truncations of this series yield approximations based on the number of interacting variables. The first few terms of the expansion-truncation are illustrated and shown to lead naturally to familiar approximations, including the well-known Kirkwood superposition approximation. Truncation can also induce a simple relation between the multi-information and the interaction information. A measure of distance between distributions, based on Kullback-Leibler divergence, is then described and shown to be a true metric. Finally, the expansion is shown to generate a hierarchy of metrics and is connected to information geometry formalisms. These results provide a general approach for systematic approximations in numbers of interactions or connections, and a related quantitative metric.

general class of problem have also been explored [13]. While certainly not the only indication of complexity, the number of variables that interact or are functionally interdependent is a very important characteristic of the complexity of a system.
A central function of information theory, the Kullback-Leibler divergence, lies close to the heart of these problems. The goal of this paper is to describe an approach that simplifies some aspects of these problems in a different way, by focusing on interesting and useful symmetries of entropy, "relative entropy", and the Kullback-Leibler divergence (K-L, referred to as the "divergence" in the rest of this paper). Particularly important in practical applications is the divergence between the "true", multivariable, probability density function (pdf) and any approximation of it [1], as are specific metric measures of the distances between approximations.
The paper is structured around recognition and exploitation of several properties of this divergence. We first show that the divergence admits a simple series expansion with increasing numbers of variables in each successive term. We effect this expansion of the multivariable cross-entropy (or relative entropy) term of the divergence using the Möbius duality between multivariable entropies and multivariable interaction information [2-5]. This allows an expansion series in the number of interacting variables, which can be used as the approximation parameter: the more interactions considered, the more accurate the approximation. We then illustrate the derivation of some known factorizations of the pdf by truncating the expansion at small numbers of variables.
Well-known simple approximations emerge, including the Kirkwood superposition approximation at three variables, which is widely used in the theory of liquids [6-8]. Other approximations, like the seminal approximation of Chow and Liu [15], are closely connected to the expansion. We will not expand on this here, but will explore and extend this connection in future work.
The divergence expansion is entirely general, can be extended to any degree, and leads to a number of useful relationships with other information theory measures. In the next section we define a new simple metric between probability density functions and show that it meets all the requirements of a true metric.
Unlike the Jensen-Shannon divergence, which is a measure based on symmetrizing the K-L divergence [16], or approaches that use the Fisher metric to embed the functions in a Riemannian manifold [14], our metric provides a large class of information metrics that calculate distances directly and thereby easily measure the relations between approximations, among other applications. We examine a few cases of specific pdf classes (e.g., Gaussian, Poisson) and find explicit forms for the distances. Finally, we examine briefly the metric distances implied by different truncations of the divergence expansion.

Expanding the divergence.
Consider a set of variables $\nu = \{X_i\}$, for which we have many values constituting a data set.
The concepts of maximum entropy and minimum divergence have been used to devise approaches to the inference of the best estimate of the true probability density function from a data set. The relation between the "true" and an approximate probability density function (pdf) is best characterized by the Kullback-Leibler divergence. If the true pdf is $P(\nu)$ and an approximation to it is $P'(\nu)$, then the divergence is given by

$$D(P \| P') = \sum_{s} P(s) \log \frac{P(s)}{P'(s)} \qquad (1)$$

where $s$ traverses all possible states of $\nu$. The approximated entropy (called the cross-entropy) is defined as

$$H'(\nu) = -\sum_{s} P(s) \log P'(s) \qquad (2)$$

so the divergence is simply the difference between the cross-entropy and the true entropy:

$$D(P \| P') = H'(\nu) - H(\nu) \qquad (3)$$

In this form it is clear that the approximate joint entropy $H'(\nu)$ can never be smaller than $H(\nu)$, since the divergence is always non-negative [1], a consequence of the well-known Jensen's inequality. If $P'$ is an approximation to $P$, then as the approximation improves the divergence converges to zero. The approximate joint entropy is therefore the measure of the accuracy of the approximation, and minimizing $H'$ (under some set of constraints or assumptions) is optimal. Other information theory measures related to the joint entropies in Equation 3, however, can also be used to good effect.
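As a small numerical illustration of Equation 3, the following sketch evaluates the divergence both directly and as the cross-entropy minus the true entropy for a discrete case. The pmf `P` and its approximation `P_approx` are hypothetical values chosen only for illustration; they do not come from this paper.

```python
import math

def entropy(p):
    """Shannon entropy H = -sum_s P(s) log P(s), in nats."""
    return -sum(ps * math.log(ps) for ps in p.values() if ps > 0)

def cross_entropy(p, q):
    """Cross-entropy H' = -sum_s P(s) log P'(s); q must be positive wherever p is."""
    return -sum(ps * math.log(q[s]) for s, ps in p.items() if ps > 0)

def kl_divergence(p, q):
    """D(P||P') = sum_s P(s) log(P(s) / P'(s))."""
    return sum(ps * math.log(ps / q[s]) for s, ps in p.items() if ps > 0)

# Hypothetical "true" pmf over two binary variables (illustrative values).
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
# Approximation that treats the two variables as independent and uniform.
P_approx = {s: 0.25 for s in P}

D = kl_divergence(P, P_approx)
gap = cross_entropy(P, P_approx) - entropy(P)
print(D, gap)  # the two numbers agree up to rounding
```

The same three helper functions work for any finite state space, provided the approximation assigns non-zero probability to every state that the true pmf can reach.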
Specifically, we use the Möbius inversion relation between the entropy and interaction information [3-5]. This relationship can be written

$$H(\nu) = \sum_{\tau \subseteq \nu} (-1)^{|\tau|+1} I(\tau)$$

where the sum is over all subsets $\tau$ of $\nu$. $H$ and $I$ can be exchanged in this symmetric form of the relation and the equation still holds.
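The symmetric Möbius relation can be checked numerically. The sketch below is only an illustration under stated assumptions: the three-variable pmf is hypothetical, and the helper names (`marginal_entropy`, `subsets`, `interaction_information`) are introduced here rather than taken from the paper. It computes the interaction information of every subset from the marginal entropies, then recovers the joint entropy from the interaction informations using the same alternating-sign sum.

```python
import math
from itertools import combinations, product

def marginal_entropy(p, idx):
    """Entropy of the marginal of joint pmf p over the variable positions in idx."""
    marg = {}
    for s, ps in p.items():
        key = tuple(s[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + ps
    return -sum(v * math.log(v) for v in marg.values() if v > 0)

def subsets(idx):
    """All non-empty subsets of the index tuple idx."""
    return [c for r in range(1, len(idx) + 1) for c in combinations(idx, r)]

def interaction_information(p, idx):
    """I(tau) = sum over non-empty subsets sigma of tau of (-1)^(|sigma|+1) H(sigma)."""
    return sum((-1) ** (len(s) + 1) * marginal_entropy(p, s) for s in subsets(idx))

# Hypothetical joint pmf over three binary variables (arbitrary positive weights).
states = list(product((0, 1), repeat=3))
weights = [5, 1, 2, 4, 3, 2, 1, 6]
P = {s: w / sum(weights) for s, w in zip(states, weights)}

nu = (0, 1, 2)
H_joint = marginal_entropy(P, nu)
# Same alternating sum with the roles of H and I exchanged.
H_from_I = sum((-1) ** (len(t) + 1) * interaction_information(P, t) for t in subsets(nu))
print(H_joint, H_from_I)  # equal up to rounding
```

With this sign convention the two-variable interaction information is the ordinary mutual information.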
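The symmetry derives from the inherent structure of the subset lattice, which is a hypercube [9]. Inserting the joint entropy expression into Equation 3 gives a sum over all subsets of the variables:

$$D(P \| P') = \sum_{\tau \subseteq \nu} (-1)^{|\tau|+1} I'(\tau) - H(\nu)$$

where $I'$ denotes the interaction information formed from the cross entropies. Now if we group terms by the number of variables in the subset and introduce notation to indicate the size of each of the subsets, writing $\tau_m$ for a subset of $m$ variables, the sum is rearranged as an expansion:

$$D(P \| P') = \sum_{m=1}^{n} (-1)^{m+1} \sum_{\tau_m \subseteq \nu} I'(\tau_m) - H(\nu)$$

If the expansion is truncated at the first term, the physics implication would be simply that of independent particles, observables, etc., which leads to a simple Boltzmann distribution in equilibrium. The pdf becomes more complex, of course, if we truncate the expansion at a higher level.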

Truncation at m=2
This truncation requires that $I'(X_i, X_j, X_k) = 0$, which from Equation 4b implies

$$H'(X_i, X_j, X_k) = H'(X_i, X_j) + H'(X_i, X_k) + H'(X_j, X_k) - H'(X_i) - H'(X_j) - H'(X_k)$$

Let us denote the cross entropy term for this truncation as $A_2$. The cross entropy term $A_2$ is determined by the pdf $P'$, and from Equation 8b we can see that the minimization of the divergence is the same as truncation of the expansion. This is equivalent to the approximation made by Chow and Liu [15]. In physical terms this is the same as ignoring all but pairwise interaction terms in a Hamiltonian, and is precisely the probabilistic version of the Kirkwood superposition approximation [6-8]. This approximation is used in the physics of dense multiparticle systems, like liquids, where the resulting pair correlation function is used in deriving many of the thermodynamic properties. Singer [6] related this to more general theoretical constructs such as the Percus-Yevick approximation and the Bogoliubov-Born-Green-Kirkwood-Yvon (BBGKY) hierarchy.

Truncation at m=3
Parallel to the above, we can express the truncation approximation at the next level using three terms, with the corresponding cross entropy term denoted $A_3$. Requiring that the expression in Equation 9b be zero is the same as minimizing the divergence $D_3$; both imply the approximation to the pdf given in Equation 10. Note that $A_3$ is also expressed simply in terms of the deltas used in the analysis of dependency and as a partial measure of complexity [10]. For three variables this quantity is the same as the conditional mutual information, as can be seen from the recursion relation, Equation 12.
The approximation indicated on the right-hand side is based on the cross entropy approximation, that $H = H'$. The truncation of the expansion can, of course, be taken to higher levels, giving more complex representations of the variable interactions and, in turn, higher-level, more complex factorizations of the pdf. These factorizations are most simply seen by setting the cross interaction information for $m$ variables equal to zero and inferring the implied pdf factors.
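To make the link between a vanishing three-variable cross interaction information and a pdf factorization concrete, the sketch below builds the Kirkwood superposition form for three variables, that is, the product of the pair marginals divided by the product of the single-variable marginals. The pmf and the helper names are hypothetical and serve only as an illustration. The check is that the cross-entropy of this factorized form equals the sum of the pair entropies minus the sum of the single-variable entropies, which is the entropy relation obtained by setting the three-variable interaction term to zero.

```python
import math
from itertools import product

def marginal(p, idx):
    """Marginal pmf of joint pmf p over the variable positions in idx."""
    marg = {}
    for s, ps in p.items():
        key = tuple(s[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + ps
    return marg

def entropy(p):
    return -sum(v * math.log(v) for v in p.values() if v > 0)

# Hypothetical joint pmf over three binary variables (arbitrary positive weights).
states = list(product((0, 1), repeat=3))
weights = [5, 1, 2, 4, 3, 2, 1, 6]
P = {s: w / sum(weights) for s, w in zip(states, weights)}

singles = [marginal(P, (i,)) for i in range(3)]
pairs = {ij: marginal(P, ij) for ij in [(0, 1), (0, 2), (1, 2)]}

def kirkwood(s):
    """Product of pair marginals divided by product of single-variable marginals."""
    num = math.prod(pairs[(i, j)][(s[i], s[j])] for (i, j) in pairs)
    den = math.prod(singles[i][(s[i],)] for i in range(3))
    return num / den

P_k = {s: kirkwood(s) for s in states}

cross_H = -sum(P[s] * math.log(P_k[s]) for s in states)
expected = sum(entropy(m) for m in pairs.values()) - sum(entropy(m) for m in singles)
norm = sum(P_k.values())  # in general not exactly 1
print(cross_H, expected, norm)
```

Note that the superposition form is in general not normalized, so `norm` should be inspected before treating `P_k` as a probability distribution.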

A relation to the deltas.
The truncation relation implies another simple equivalence that has direct intuitive meaning, and connects in a simple way to the differential interaction information [10]. From the general recursion relation for the interaction information we can derive a set of simple equivalences.
For the set $\nu_n$ of $n$ variables the general, multi-variable recursion relation for the interaction information is

$$I(\nu_n) = I(\nu_{n-1}) - I(\nu_{n-1} \mid X_n) \qquad (12)$$

for all $n$ choices of $X_n$, where the set $\nu_{n-1}$ is the set missing $X_n$. Thus the truncation, setting the left side to zero, implies exactly $n$ relations, one for each choice of $i$:

$$I(\nu_n - X_i \mid X_i) = I(\nu_n - X_i) \qquad (13)$$

The implication of the truncation criterion for the divergence at $m=n$, then, is that the interaction information, conditioned on each variable of a set $\nu_n$, is the same as the interaction information of the remaining $n-1$ variables. Note that the conditional in Equation 12 is the same (within a sign) as the asymmetric delta function for $n$ variables [10], so the truncation of the divergence is seen to be equivalent to a simplification and truncation of the asymmetric delta.
For truncation at m=2 this would mean that all conditional mutual informations are equal to the mutual information itself: equivalent to specifying independence of the conditional variable.
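The recursion relation can be verified directly for three variables, where it states that the three-variable interaction information equals a mutual information minus the corresponding conditional mutual information, for each choice of conditioning variable. The following sketch assumes the alternating-sign entropy sum as the definition of the interaction information; the pmf and function names are hypothetical.

```python
import math
from itertools import combinations, product

def H(p, idx):
    """Entropy of the marginal of joint pmf p over the variable positions in idx."""
    marg = {}
    for s, ps in p.items():
        key = tuple(s[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + ps
    return -sum(v * math.log(v) for v in marg.values() if v > 0)

def interaction_information(p, idx):
    """Alternating sum of marginal entropies over all non-empty subsets of idx."""
    return sum((-1) ** (r + 1) * H(p, c)
               for r in range(1, len(idx) + 1) for c in combinations(idx, r))

def conditional_mutual_information(p, i, j, k):
    """I(X_i, X_j | X_k) = H(X_i,X_k) + H(X_j,X_k) - H(X_k) - H(X_i,X_j,X_k)."""
    return H(p, (i, k)) + H(p, (j, k)) - H(p, (k,)) - H(p, (i, j, k))

# Hypothetical joint pmf over three binary variables (arbitrary positive weights).
states = list(product((0, 1), repeat=3))
weights = [5, 1, 2, 4, 3, 2, 1, 6]
P = {s: w / sum(weights) for s, w in zip(states, weights)}

I3 = interaction_information(P, (0, 1, 2))
residuals = []
for k in range(3):
    i, j = (v for v in range(3) if v != k)  # the two variables other than X_k
    rhs = interaction_information(P, (i, j)) - conditional_mutual_information(P, i, j, k)
    residuals.append(abs(I3 - rhs))
print(residuals)  # all three residuals are zero up to rounding
```

If `I3` were zero, each conditional mutual information would equal the corresponding unconditioned mutual information, which is the set of relations described above.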

Multi--information.
It is easy to show that the truncation embodied in Equation 11 also implies a simple relation between the "multi-information" (called "total correlation" by Watanabe [11]) and the interaction information. The multi-information is defined as

$$\Omega(\nu_n) = \sum_{i} H(X_i) - H(\nu_n)$$

This quantity is often used as a measure of overall multivariable dependence, since it goes to zero if all variables are independent. It is always non-negative, but has several drawbacks: it does not distinguish at all the degrees of dependence (number of variables), and it is not a metric.
We will not show the elementary proof of the general case of truncation at $n$ variables here, but the relation is easy to see for small $n$ by direct calculation using the marginal entropies. For $n=3$, with the three-variable interaction information set to zero,

$$\Omega(X_1, X_2, X_3) = I(X_1, X_2) + I(X_1, X_3) + I(X_2, X_3) \qquad (14a)$$

For $n=4$, with the four-variable interaction information set to zero,

$$\Omega(X_1, X_2, X_3, X_4) = \sum_{i<j} I(X_i, X_j) - \sum_{i<j<k} I(X_i, X_j, X_k) \qquad (14b)$$

The relation 14a is strongly intuitive in the sense that if the 3-variable interaction information is zero, the multi-information is simply the sum of the mutual information for all three pairs. A similar, but less intuitive, relationship is embodied in the 4-variable case, Equation 14b, and the general case is suggested.
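The three-variable case can be checked by direct calculation from the marginal entropies. The sketch below (hypothetical pmf and helper names, with the interaction information taken as the alternating-sign entropy sum) verifies the untruncated identity that the multi-information equals the sum of the pairwise mutual informations minus the three-variable interaction information; setting the last term to zero gives the sum over pairs.

```python
import math
from itertools import combinations, product

def H(p, idx):
    """Entropy of the marginal of joint pmf p over the variable positions in idx."""
    marg = {}
    for s, ps in p.items():
        key = tuple(s[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + ps
    return -sum(v * math.log(v) for v in marg.values() if v > 0)

def interaction_information(p, idx):
    """Alternating sum of marginal entropies over all non-empty subsets of idx."""
    return sum((-1) ** (r + 1) * H(p, c)
               for r in range(1, len(idx) + 1) for c in combinations(idx, r))

# Hypothetical joint pmf over three binary variables (arbitrary positive weights).
states = list(product((0, 1), repeat=3))
weights = [5, 1, 2, 4, 3, 2, 1, 6]
P = {s: w / sum(weights) for s, w in zip(states, weights)}

n = 3
omega = sum(H(P, (i,)) for i in range(n)) - H(P, tuple(range(n)))
pair_sum = sum(interaction_information(P, c) for c in combinations(range(n), 2))
triple = interaction_information(P, (0, 1, 2))
print(omega, pair_sum - triple)  # equal up to rounding
```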
The divergence expansion can also be expressed using the multi-information in a limited number of variables, as well as a series of truncation-approximate probability density functions, in the following way. Consider a series of functions $\{P_m\}$ related to the true, untruncated, probability density function, such that $P_m$ is the pdf of $m$ variables that results from setting the interaction information equal to zero for subsets $\tau_m$. The divergence converges to zero for the series $\{P_m\}$ as the number of variables increases:

$$\lim_{m \to n} D(P \| P_m) = 0 \qquad (16)$$

The divergence therefore induces a topology on the series of functions. The proof of Equation 16 follows directly from the definitions.
Note that the multi-information is not a metric, whereas a metric specifically gives a distance between different pdf approximations, or between any different pdfs. Since measuring such distances is a problem that has received much attention, we complete this formalism around the K-L divergence and its approximations by devising a simple pdf metric.

Information Geometry and a simple metric.
Although it is sometimes thought of as a distance measure between probability distributions, the Kullback-Leibler divergence is not a true metric. Among the disqualifying properties is its asymmetry. There has, however, been much work devoted to the development of geometric measures of information, particularly in differential geometry [14], and symmetric divergences have been defined [16]. A derivative form, the Hessian, of the divergence does yield a metric tensor known as the Fisher information metric. This is a Riemannian metric tensor, and has been used extensively. While having a real metric is essential to a complete quantitative theory, it is even more useful if it is relatively simple and direct. Finite distances between functions in the differential manifold of the Fisher metric must be determined by integration along geodesics. Simpler metrics allow the direct calculation of the distance between probability density functions. We now describe such a simple information metric.
Consider the problem of comparing two approximate distributions, $R$ and $S$, using another pdf, $P$, as a reference function. We use the K-L divergence to define a metric simply as the absolute value of the difference between two K-L divergences using the same reference function. This definition is embodied in the following equation.

$$D_P(R \| S) \equiv \left| D(P \| R) - D(P \| S) \right| \qquad (17)$$
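We next establish that $D_P(R \| S)$ does indeed have the properties of a metric on a function space. A metric has the following four properties, which we show are fulfilled by our definition:
1. Non-negativity: $D_P(R \| S) \ge 0$ is assured because the absolute value in Equation 17 is non-negative.
2. Symmetry: $D_P(R \| S) = D_P(S \| R)$, since exchanging $R$ and $S$ only changes the sign of the difference inside the absolute value.
3. Identity: $D_P(R \| S) = 0$ when $R = S$, since the two divergences in Equation 17 are then equal.
4. Triangle inequality: $D_P(R \| S) \le D_P(R \| T) + D_P(T \| S)$ for any third function $T$, which follows from $|a - c| \le |a - b| + |b - c|$ applied to the three divergences.
$D_P$ is therefore a true metric on the function space of pdf's, which we can use directly as a measure of information distance. Since the metric is determined by a reference function, $D_P$ represents a class of metrics, each determined by the choice of reference function. We now examine some properties of these metrics.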
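A minimal numerical sketch of this definition for discrete distributions follows. The pmfs are random, hypothetical examples and the function names are illustrative; the code computes the reference-based distance as the absolute difference of two K-L divergences and spot-checks non-negativity, symmetry, and the triangle inequality.

```python
import math
import random

def kl(p, q):
    """D(P||Q) = sum_s P(s) log(P(s)/Q(s)) for pmfs given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def metric(ref, r, s):
    """Absolute difference of the divergences of r and s from the reference pmf ref."""
    return abs(kl(ref, r) - kl(ref, s))

def random_pmf(rng, k):
    """Random strictly positive pmf on k states (illustrative only)."""
    w = [rng.random() + 0.05 for _ in range(k)]  # offset keeps every entry positive
    total = sum(w)
    return [x / total for x in w]

rng = random.Random(0)  # fixed seed so the example is reproducible
P, R, S, T = (random_pmf(rng, 5) for _ in range(4))

d_rs, d_sr = metric(P, R, S), metric(P, S, R)
d_rt, d_ts = metric(P, R, T), metric(P, T, S)
print(d_rs, d_sr, d_rt + d_ts)
```

The check is a spot test on one random draw, not a proof, and it does not examine which pairs of distinct functions lie at zero distance from each other for a given reference.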
An intriguing parallel to this metric, in which the distance between two functions is defined by a third function, lies in Bayesian statistics. We could say that by defining the reference pdf, $P$, as a prior pdf, $D_P(R \| S)$ measures the distance between two posterior functions, $R$ and $S$.
By measuring the distance between successive posteriors, one can monitor the convergence of Bayesian updating to a steady state distribution. The distance measure can also be used to assess quantitatively how close different posterior models are to each other. We could define a Dirichlet metric, for example, if the reference pdf, or prior, were a Dirichlet distribution, or a uniform or Gaussian metric if the reference were uniform or Gaussian.

Special Metrics.
The fact that $D_P$ defines a metric on a function space inspires us to ask what specific functional forms yield metric spaces with particular properties. We could define a uniform probability density over the variable set $\nu$, which leads to the very simple expression for this metric,

$$D_U(R \| S) = \frac{1}{\mathbb{N}} \left| \sum_{s} \log \frac{S(s)}{R(s)} \right|$$

where $\mathbb{N}$ is the number of values that the total set of variables can take on (consider it a vector). An interesting class of metrics is provided by choosing a Gaussian reference.
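If the functions R and S are also Gaussian we can illustrate a particularly simple expression for distances for the case of a single variable. Let the reference function be defined as a normal distribution with variance $\sigma^2$ and mean $\mu$, designated

$$P(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2 \sigma^2} \right) \qquad (19a)$$

and let the functions to be measured be normal distributions $R$ and $S$ with means $\mu_1$, $\mu_2$ and variances $\sigma_1^2$, $\sigma_2^2$. The distance between R and S then is

$$D_P(R \| S) = \left| \int P(x) \log \frac{S(x)}{R(x)} \, dx \right|$$

which can easily be evaluated. Using the simple properties of Gaussians we have

$$D_P(R \| S) = \left| \log \frac{\sigma_1}{\sigma_2} + \frac{\sigma^2 + (\mu - \mu_1)^2}{2 \sigma_1^2} - \frac{\sigma^2 + (\mu - \mu_2)^2}{2 \sigma_2^2} \right| \qquad (20)$$

There is another special case worth mentioning. If the reference function is chosen to be a Dirac delta function, $\delta(x - x_0)$, which could be considered to be the limiting case of a Gaussian with vanishing standard deviation, the expression of Equation 20 simplifies further. The metric space is defined by the single parameter of the mean of the reference function, $x_0$. The distance expression, $D_{x_0}$, is then

$$D_{x_0}(R \| S) = \left| \log \frac{S(x_0)}{R(x_0)} \right| = \left| \log \frac{\sigma_1}{\sigma_2} + \frac{(x_0 - \mu_1)^2}{2 \sigma_1^2} - \frac{(x_0 - \mu_2)^2}{2 \sigma_2^2} \right|$$

Because the key property of the Dirac delta function is that its integral over $x$ with any function yields the value of that function at $x_0$, the distance between R and S depends on the squares of the distances of their means from the reference mean. If we set the reference mean at zero (without loss of generality) it is clear that the relevant measure of distance is just the squares of the ratios of the mean to standard deviation.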
This one--dimensional case has a simple geometric interpretation. Notice that with the exception of the single log term on the right--hand side of Equation 20, the expression is a quadratic form in the ratios of mean to standard deviation of R and S, and of the ratios of each of these standard deviations to the reference standard deviation.
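The one-dimensional Gaussian case can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the standard closed form for the K-L divergence between two univariate normal distributions, forms the reference-based distance as the absolute difference of two such divergences, and compares the result with a simple Riemann-sum estimate of the defining integral. The parameter values and function names are hypothetical.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log of the normal density; working in logs avoids log(0) in the tails."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def kl_gauss(mu, sigma, mu_q, sigma_q):
    """Closed-form D(P||Q) for P = N(mu, sigma^2) and Q = N(mu_q, sigma_q^2)."""
    return (math.log(sigma_q / sigma)
            + (sigma ** 2 + (mu - mu_q) ** 2) / (2.0 * sigma_q ** 2) - 0.5)

def gaussian_metric(ref, r, s):
    """Distance between Gaussians r and s relative to the Gaussian reference ref."""
    return abs(kl_gauss(*ref, *r) - kl_gauss(*ref, *s))

def numeric_metric(ref, r, s, lo=-12.0, hi=12.0, steps=24000):
    """Riemann-sum estimate of |integral of P(x) log(S(x)/R(x)) dx|."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * dx
        p = math.exp(log_normal_pdf(x, *ref))
        total += p * (log_normal_pdf(x, *s) - log_normal_pdf(x, *r)) * dx
    return abs(total)

# Hypothetical (mean, standard deviation) pairs for the reference and the two functions.
ref, R, S = (0.0, 1.0), (1.0, 2.0), (-0.5, 0.8)
closed = gaussian_metric(ref, R, S)
numeric = numeric_metric(ref, R, S)
print(closed, numeric)  # the two estimates agree closely
```

The integration range is chosen wide relative to the reference standard deviation; for a much broader reference the limits `lo` and `hi` would need to be widened accordingly.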
In general, the Dirac delta reference function metric does not carry much information about the functions themselves, but if the function class is restricted it becomes both more interesting and useful. These logs, whose difference is the metric in Equation 16, are often called "surprisals" in information theory. So in this case, the metric is essentially how much more surprising R is than S at any specific point. We should mention that if multiple delta function metrics are used, the distances between R and S at each surprisal point serve as coordinates, which leads naturally to a multi-dimensional space representation of the log ratios. A three-dimensional representation, for example, reflects the three chosen points where the functions are compared.
Another interesting metric space results from selecting all three functions, the reference and the measured functions, as Poisson distributions. These discrete valued functions, $P(k, \lambda) = \frac{\lambda^k}{k!} e^{-\lambda}$, yield a particularly simple metric distance. If the reference function has parameter $\lambda$, and the other two $\lambda_1$ and $\lambda_2$, the distance is simply

$$D_P(R \| S) = \left| \lambda \log \frac{\lambda_1}{\lambda_2} - (\lambda_1 - \lambda_2) \right|$$

Of course the distance vanishes when $\lambda_1$ goes to $\lambda_2$. If the reference $\lambda$ is much smaller than the other two, $\lambda \ll \lambda_1, \lambda_2$, the distance is linear in the difference between them, while if it is very much larger, $\lambda \gg \lambda_1, \lambda_2$, the distance is proportional to the difference of the logs of the $\lambda$'s.
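The Poisson distance can be checked against the defining sum. The sketch below (parameter values and function names are hypothetical) compares the closed form, the absolute value of the reference parameter times the log ratio of the other two parameters minus their difference, with a direct summation over the reference distribution truncated at a large count.

```python
import math

def log_poisson(k, lam):
    """Log of the Poisson pmf: k log(lam) - lam - log(k!)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def poisson_metric_closed(lam, lam1, lam2):
    """Closed-form distance between Poisson(lam1) and Poisson(lam2), reference Poisson(lam)."""
    return abs(lam * math.log(lam1 / lam2) - (lam1 - lam2))

def poisson_metric_sum(lam, lam1, lam2, kmax=200):
    """Same distance from the definition, truncating the sum over counts at kmax."""
    total = sum(math.exp(log_poisson(k, lam))
                * (log_poisson(k, lam2) - log_poisson(k, lam1))
                for k in range(kmax + 1))
    return abs(total)

# Hypothetical parameters: reference 3.0, compared distributions 2.0 and 5.0.
closed = poisson_metric_closed(3.0, 2.0, 5.0)
summed = poisson_metric_sum(3.0, 2.0, 5.0)
print(closed, summed)  # agree up to the (negligible) truncated tail
```

The truncation point `kmax` is an assumption suited to small reference parameters; a larger reference parameter would call for a larger cutoff.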
There is a very large number of possible special metrics based on the wide range of continuous distributions that could be used as reference functions, many of which lead to interesting functional expressions. To explore these further, see the comprehensive list of such functions in the "Field Guide to Continuous Probability Distributions", which is available from Gavin Crooks' website.

Measuring the Independence of variable subsets.
Next consider comparing the probability of a given single variable, $P(X_1)$, with the conditional probability of that variable given the remaining set $\nu - X_1$ of $n - 1$ variables, $P(X_1 \mid \nu - X_1)$. If the chosen variable is independent of the others, so that we have $P(X_1) = P(X_1 \mid \nu - X_1)$, then the distance is zero: $D_P\left(P(X_1) \,\|\, P(X_1 \mid \nu - X_1)\right) = 0$. Generalizing from a single variable to a subset $\nu'$, we obtain the corresponding distance between $P(\nu')$ and $P(\nu' \mid \nu - \nu')$, which is, of course, dependent on the reference function, $P$, except in the limit where the distance goes to zero.
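As a hedged illustration of this independence measure, the sketch below assumes that the reference function is the joint pmf itself; this is a choice made for the example, not one stated in the text. Under that assumption the distance between the marginal of one variable and its conditional given the rest reduces to the mutual information between that variable and the remaining ones, and it vanishes when the variable is independent. Both example pmfs and all function names are hypothetical.

```python
import math
from itertools import product

def marginal(p, idx):
    """Marginal pmf of joint pmf p over the variable positions in idx."""
    marg = {}
    for s, ps in p.items():
        key = tuple(s[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + ps
    return marg

def independence_distance(p, i):
    """|sum_s P(s) log(P(x_i | rest) / P(x_i))|, with the joint pmf p as reference."""
    n = len(next(iter(p)))
    rest = tuple(j for j in range(n) if j != i)
    p_i, p_rest = marginal(p, (i,)), marginal(p, rest)
    total = 0.0
    for s, ps in p.items():
        if ps > 0:
            cond = ps / p_rest[tuple(s[j] for j in rest)]  # P(x_i | rest)
            total += ps * math.log(cond / p_i[(s[i],)])
    return abs(total)

states = list(product((0, 1), repeat=3))

# Hypothetical dependent pmf (arbitrary positive weights).
weights = [5, 1, 2, 4, 3, 2, 1, 6]
P_dep = {s: w / sum(weights) for s, w in zip(states, weights)}

# Hypothetical pmf in which the first variable is independent of the other two.
a = {0: 0.3, 1: 0.7}
Q = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
P_ind = {s: a[s[0]] * Q[(s[1], s[2])] for s in states}

d_dep = independence_distance(P_dep, 0)
d_ind = independence_distance(P_ind, 0)
print(d_dep, d_ind)  # d_ind is zero up to rounding
```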

Comparing approximations from different truncated series.
Another application of these metric spaces is in statistical physics, for example, where reduced probability distribution functions are used to approximate the true distribution functions. Problems with a high degree of interaction among many variables, as well as non-equilibrium problems defined by trajectories, could be approached directly with this apparatus. These approximations have often involved physically motivated simplifying truncation relationships, like those discussed above. The formalism developed here can be used to calculate the distance between probability functions that are truncated at different levels of approximation. This allows an assessment of the convergence of higher level truncations, in terms of the distance converging to zero.
The approximate functions that result from truncations of the variable number expansion at different numbers of variables can now be directly compared with a quantitative metric. The distances are determined by the true, or reference, pdf.
Comparing distributions truncated at the first and second order, the probability functions $P_1'$ and $P_2'$ are determined by the factorizations of Equations 6c and 7a. The distance between these two truncation approximations, relative to the reference function, then is:

$$D_P(P_2' \| P_1') = \left| D(P \| P_2') - D(P \| P_1') \right|$$

Referring to Equations 6a and 8c this expression simplifies to

$$D_P(P_2' \| P_1') = \sum_{i<j} I'(X_i, X_j) \qquad (24b)$$

Recall that this is the sum of "cross" mutual information between all pairs of variables defined by the reference function. The distance of Equation 24b represents the distance between functions of pairwise dependence and independence. In general, the distance between two different truncation approximations can be seen easily from the expansion of Equation 4. The distance is simply the absolute value of the sum of the terms present in only one of the truncated series.
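The first-order versus second-order comparison can be illustrated numerically for three variables. The sketch below rests on two assumptions that are labeled in the code: the first-order truncation is taken to be the product of single-variable marginals, and the second-order truncation is taken to be the product of pair marginals divided by the single-variable marginals raised to the power n - 2. With the true pmf as the reference, the distance between the two should equal the sum of the pairwise mutual informations. The pmf and helper names are hypothetical.

```python
import math
from itertools import combinations, product

def marginal(p, idx):
    """Marginal pmf of joint pmf p over the variable positions in idx."""
    marg = {}
    for s, ps in p.items():
        key = tuple(s[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + ps
    return marg

def entropy(p):
    return -sum(v * math.log(v) for v in p.values() if v > 0)

# Hypothetical joint pmf over three binary variables (arbitrary positive weights).
n = 3
states = list(product((0, 1), repeat=n))
weights = [5, 1, 2, 4, 3, 2, 1, 6]
P = {s: w / sum(weights) for s, w in zip(states, weights)}

singles = {i: marginal(P, (i,)) for i in range(n)}
pairs = {c: marginal(P, c) for c in combinations(range(n), 2)}

def log_p1(s):
    """Assumed first-order form: product of single-variable marginals."""
    return sum(math.log(singles[i][(s[i],)]) for i in range(n))

def log_p2(s):
    """Assumed second-order form: pair marginals over single marginals^(n-2)."""
    pair_part = sum(math.log(pairs[(i, j)][(s[i], s[j])]) for (i, j) in pairs)
    return pair_part - (n - 2) * log_p1(s)

# |D(P||P1) - D(P||P2)| reduces to |sum_s P(s) (log P2(s) - log P1(s))|.
distance = abs(sum(P[s] * (log_p2(s) - log_p1(s)) for s in states))
mutual_info_sum = sum(entropy(singles[i]) + entropy(singles[j]) - entropy(pairs[(i, j)])
                      for (i, j) in pairs)
print(distance, mutual_info_sum)  # equal up to rounding
```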

Conclusions
The Kullback-Leibler divergence has played a central role in information theory and has been used for several practical purposes in data analysis, machine learning, and model inference. It has provided ways to explore some key ideas in fields from information theory to thermodynamics.
We show here that it can continue to yield new results. The divergence can be expanded in the number of interacting variables, yielding a systematic hierarchy of truncations, approximations to the probability density function, which is effectively a hierarchy of factorizations. The relationship between the set of entropies and the set of interaction informations through the Möbius inversion relation is a fundamental symmetry that is manifest here, but the full symmetry spectrum is deeper yet. It reflects a number of relationships with other information-related measures, that are based on this symmetry [9,10]. Since these relations can also be used to express the cross entropy differently, they should generate different expansions of the divergence, with different structures. This intriguing area may itself yield additional, useful applications, and has yet to be explored.
As we noted in the introduction there are several areas of potential application of these ideas.
The relation that we noted between one level of the truncation hierarchy and the Chow-Liu approximation [15] immediately suggests the extension of the Chow-Liu algorithm to higher levels, that is, to Chow-Liu-like hypergraphs. This remains to be explored fully and will be addressed in a future publication. Other applications to networks and network inference are suggested by the notion of the metric classes based on specific reference functions. It is interesting that the divergence provides the basis for a new finite difference metric that gives measures of distances between pdf's, real or estimated, continuous or discrete. It should be possible to simplify graph distance measures, given a set of specific constraints, by optimizing the choice of a reference function.
Tailoring the metric to specific classes of graphs, for example, should enable simplification of model inference in some cases. These ideas will be addressed in future work.
If we added the constraint of specific function forms for the pdf's (the exponential family of functions, like Gaussians, for example), the natural extension leads to a number of specific approximations and metric forms. The relationship of our metric (Eqn. 18) to the Fisher information metric can be obtained from the convergence to zero of this distance. There is a wide variety of metrics that can be derived by symmetrizing the divergence in various ways. The Jensen-Shannon divergence is one of these, but there are several others that use variations on the theme of averaging the cross entropy terms in various ways. The metric proposed here is the first, to our knowledge, to use a third probability density function to define the character of the metric space. This has the potential advantage, as indicated by the simple examples shown here, that the metric space can be tailored to the character of any function space. We suggest that Equation 18 defines what might be interpreted as a finite difference form of the Fisher metric. The metric can also be used directly to compare Bayesian estimators as the pdf is iteratively updated, to measure convergence.
The application of the general approach described here to a wide range of multivariable problems, including data analysis, model inference, multivariable physical problems, and problems involving complex biological systems, should be useful in providing new analysis methods and new insights.

Conflicts of Interest:
The authors declare no conflict of interest. The founding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.