# Expansion of the Kullback-Leibler Divergence, and a New Class of Information Metrics


## Abstract


## 1. Introduction

## 2. Expanding the Divergence

Here $\tau_m$ indicates a subset of variables of cardinality $m$ ($|\tau_m| = m$). This then becomes an expansion in degrees, $m$, the number of variables. The full expansion includes, and terminates with, the full set of variables, $\nu$.

## 3. Truncations of the Series

Note that $I(\tau_m) = 0$ does not imply that higher terms, $I(\tau_{m+1})$ etc., are also zero; the truncation approximation necessarily sets all higher terms to zero. This is a key result of the expansion of the divergence. Truncation, and a factorization of the probability density function, result from setting all the higher interaction informations to zero. Thus, the expansion represents a method for approximation and simplification that specifically limits the degree of variable dependencies.
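To make this point concrete, here is a small illustrative Python sketch (not from the paper; the function names and the inclusion-exclusion form over subset entropies, in McGill's sign convention, are ours):

```python
import numpy as np
from itertools import combinations

def entropy(p):
    """Shannon entropy (in bits) of a probability array."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def marginal(joint, keep):
    """Marginalize a joint pmf (ndarray, one axis per variable) onto `keep`."""
    drop = tuple(ax for ax in range(joint.ndim) if ax not in keep)
    return joint.sum(axis=drop)

def interaction_information(joint, subset):
    """Interaction information I(tau) of the variables indexed by `subset`,
    via the inclusion-exclusion sum of subset entropies (McGill convention)."""
    total = 0.0
    for k in range(1, len(subset) + 1):
        for sub in combinations(subset, k):
            total += (-1) ** (k + 1) * entropy(marginal(joint, sub))
    return total

# Example: Z = X XOR Y with X, Y fair and independent.
joint = np.zeros((2, 2, 2))
for x in range(2):
    for y in range(2):
        joint[x, y, x ^ y] = 0.25
```

For this XOR distribution every pairwise interaction information is zero while the three-variable term is $-1$ bit, illustrating that vanishing lower-order terms do not force the higher-order terms to vanish.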

#### 3.1. Truncation at m = 1

#### 3.2. Truncation at m = 2

Suppose we truncate the expansion at $m = 2$, setting the interaction informations of all subsets larger than $\tau_2$ to zero. Then we have

The truncation at $\tau_2$ is determined by the pdf P′, and from (8b) above it can be seen that the minimization of the divergence is the same as truncation of the expansion. This is equivalent to the approximation made by Chow-Liu [12]. In physical terms, this is the same as ignoring all but pairwise interaction terms in a Hamiltonian, and is precisely the probabilistic version of the Kirkwood superposition approximation [1,2,11]. This approximation is used in the physics of dense multiparticle systems, such as liquids. The resulting pair correlation function is used in deriving many of the thermodynamic properties of liquids. Singer [1] related this to more general theoretical constructs such as the Percus-Yevick approximation and the Bogoliubov-Born-Green-Kirkwood-Yvon (BBGKY) hierarchy.
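As an illustration of the pairwise truncation, the following sketch (our naming, not code from the paper) builds the Kirkwood superposition form for three discrete variables:

```python
import numpy as np

def kirkwood_pairwise(joint):
    """Pairwise (m = 2) truncation of a 3-variable pmf via the Kirkwood
    superposition form P'(x,y,z) = P(xy) P(xz) P(yz) / (P(x) P(y) P(z))."""
    pxy = joint.sum(axis=2)
    pxz = joint.sum(axis=1)
    pyz = joint.sum(axis=0)
    px = joint.sum(axis=(1, 2))
    py = joint.sum(axis=(0, 2))
    pz = joint.sum(axis=(0, 1))
    num = pxy[:, :, None] * pxz[:, None, :] * pyz[None, :, :]
    den = px[:, None, None] * py[None, :, None] * pz[None, None, :]
    # Avoid division by zero on impossible outcomes.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

# For a pmf with no three-way dependency (here: fully independent
# variables) the superposition form reproduces the joint exactly.
px, py, pz = np.array([0.3, 0.7]), np.array([0.5, 0.5]), np.array([0.2, 0.8])
joint = px[:, None, None] * py[None, :, None] * pz[None, None, :]
```

Note that for a general joint the Kirkwood form need not sum to one; the example above only checks the case where the approximation is exact.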


#### 3.3. Truncation at m = 3

The truncation at $m = 3$ sets to zero the interaction informations of all subsets larger than $\tau_3$. Setting these terms to zero implies that the approximation to the pdf factors over subsets of at most three variables. The interaction information of $\tau_3$ is also expressed simply in terms of the deltas used in the analysis of dependency and as a partial measure of complexity [16]. For three variables this quantity is the same as the conditional mutual information, as can be seen from the recursion relation, Equation (12).

## 4. A Relation to the Deltas

For the full set $\nu_n$ of $n$ variables, the general multi-variable recursion relation expresses the interaction information $I(\nu_n)$ in terms of $I(\nu_{n-1})$ and its conditional on $X_n$, where the set $\nu_{n-1}$ is the set missing $X_n$. Thus the truncation, setting the left side to zero, implies exactly $n$ relations, one for each choice of $i$: the interaction information of the $n-1$ variables conditioned on $X_i$ is the same as the interaction information of those remaining $n-1$ variables. Note that the conditional in Equation (12) is the same (within a sign) as the asymmetric delta function for $n$ variables [16], so the truncation of the divergence is seen to be equivalent to a simplification and truncation of the asymmetric delta. For truncation at $m = 2$, this would mean that all conditional mutual informations are equal to the mutual information itself: equivalent to specifying independence of the conditioning variable.
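The $n = 3$ case of the recursion can be checked numerically. A hedged sketch (names ours) computes both sides of $I(X;Y;Z) = I(X;Y) - I(X;Y|Z)$ from subset entropies:

```python
import numpy as np

def H(p):
    """Shannon entropy (bits) of a probability array."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def recursion_sides(joint):
    """Both sides of the n = 3 recursion I(X;Y;Z) = I(X;Y) - I(X;Y|Z)."""
    pxy, pxz, pyz = joint.sum(axis=2), joint.sum(axis=1), joint.sum(axis=0)
    px, py, pz = pxy.sum(axis=1), pxy.sum(axis=0), pxz.sum(axis=0)
    i_xy = H(px) + H(py) - H(pxy)                       # I(X;Y)
    i_xy_z = H(pxz) + H(pyz) - H(pz) - H(joint)         # I(X;Y|Z)
    i_xyz = (H(px) + H(py) + H(pz)
             - H(pxy) - H(pxz) - H(pyz) + H(joint))     # inclusion-exclusion
    return i_xyz, i_xy - i_xy_z

# Z = X XOR Y: both sides should agree (and equal -1 bit).
joint = np.zeros((2, 2, 2))
for x in range(2):
    for y in range(2):
        joint[x, y, x ^ y] = 0.25
lhs, rhs = recursion_sides(joint)
```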

## 5. Multi-Information

Consider the sequence of pdfs $\{P_m\}$ related to the true, untruncated probability density function, such that $P_m$ is the pdf that results from setting the interaction information equal to zero for subsets $\tau_m$. Then we have a sequence of successively refined approximations $\{P_m\}$ as the number of variables increases to $n$.
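The endpoint of such a sequence connects to the multi-information (total correlation). A minimal sketch (our naming) computes it as the summed marginal entropies minus the joint entropy, which equals the KL divergence from the joint to the product of its marginals:

```python
import numpy as np

def multi_information(joint):
    """Multi-information (total correlation): sum_i H(X_i) - H(X_1,...,X_n)."""
    def H(p):
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return float(-np.sum(p * np.log2(p)))
    n = joint.ndim
    marg = sum(H(joint.sum(axis=tuple(j for j in range(n) if j != i)))
               for i in range(n))
    return marg - H(joint)

# XOR example: each marginal is uniform (1 bit) but the joint has only
# 2 bits of entropy, so the multi-information is 1 bit.
joint = np.zeros((2, 2, 2))
for x in range(2):
    for y in range(2):
        joint[x, y, x ^ y] = 0.25
```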

## 6. Information Geometry and a Simple Metric

- (1) Non-negativity: ${\mathfrak{D}}_{P}\left(R\parallel S\right)\ge 0$ is assured because $P\left(\nu \right)\ge 0$ and the absolute value in Equation (17) assures a summation that is non-negative.
- (2) Identity of indiscernibles: ${\mathfrak{D}}_{P}\left(R\parallel S\right)=0$ when $R\left(\nu \right)=S\left(\nu \right)$, since ${\mathfrak{D}}_{P}\left(R\parallel S\right)=\left|{\sum}_{s}P\left(s\right)\left(log\left(\frac{R\left(s\right)}{P\left(s\right)}\right)-log\left(\frac{S\left(s\right)}{P\left(s\right)}\right)\right)\right|=0$. For a metric it must also be true that ${\mathfrak{D}}_{P}\left(R\parallel S\right)\ne 0$ unless $R\left(\nu \right)=S\left(\nu \right)$; otherwise the metric is a pseudometric. This condition does not hold for all choices of P, R and S, and therefore the metric property may apply only to specific spaces and must be examined in each case. We illustrate this later for some specific cases.
- (3) Symmetry: ${\mathfrak{D}}_{P}\left(R\parallel S\right)={\mathfrak{D}}_{P}\left(S\parallel R\right)$.$${\mathfrak{D}}_{P}\left(R\parallel S\right)=\left|{\displaystyle \sum}_{s}P\left(s\right)\left\{log\left(\frac{R\left(s\right)}{P\left(s\right)}\right)-log\left(\frac{S\left(s\right)}{P\left(s\right)}\right)\right\}\right|={\mathfrak{D}}_{P}\left(S\parallel R\right)$$
- (4) Subadditivity, obeying the triangle inequality:$${\mathfrak{D}}_{P}\left(R\parallel S\right)\le {\mathfrak{D}}_{P}\left(R\parallel Q\right)+{\mathfrak{D}}_{P}\left(Q\parallel S\right)$$$$\begin{array}{rl}{\mathfrak{D}}_{P}\left(R\parallel S\right)&=\left|{\displaystyle \sum _{s}}P\left(s\right)log\left(\frac{R\left(s\right)}{S\left(s\right)}\right)\right|=\left|{\displaystyle \sum _{s}}P\left(s\right)log\left(\frac{R\left(s\right)}{Q\left(s\right)}\cdot \frac{Q\left(s\right)}{S\left(s\right)}\right)\right|\\ &=\left|{\displaystyle \sum _{s}}P\left(s\right)log\left(\frac{R\left(s\right)}{Q\left(s\right)}\right)+{\displaystyle \sum _{s}}P\left(s\right)log\left(\frac{Q\left(s\right)}{S\left(s\right)}\right)\right|\\ &\le \left|{\displaystyle \sum _{s}}P\left(s\right)log\left(\frac{R\left(s\right)}{Q\left(s\right)}\right)\right|+\left|{\displaystyle \sum _{s}}P\left(s\right)log\left(\frac{Q\left(s\right)}{S\left(s\right)}\right)\right|={\mathfrak{D}}_{P}\left(R\parallel Q\right)+{\mathfrak{D}}_{P}\left(Q\parallel S\right)\end{array}$$
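The four properties can be spot-checked numerically. The following sketch (our naming, assuming strictly positive discrete distributions so all logarithms are finite) implements the distance and tests the properties on random pmfs:

```python
import numpy as np

def d_p(p, r, s):
    """Reference-based divergence D_P(R || S) = | sum_s P(s) log(R(s)/S(s)) |."""
    p, r, s = (np.asarray(a, dtype=float) for a in (p, r, s))
    return abs(float(np.sum(p * np.log(r / s))))

rng = np.random.default_rng(7)

def rand_pmf(k):
    """A random strictly positive pmf on k outcomes."""
    w = rng.random(k) + 1e-3
    return w / w.sum()

P, R, S, Q = (rand_pmf(6) for _ in range(4))
```

Symmetry and subadditivity hold for any choice of P, R, S, Q; as the text notes, only the "identity of indiscernibles" condition can fail, which is why the metric property must be examined space by space.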

## 7. Special Metrics

The defining property of the Dirac delta function, $\delta \left(x-{x}_{0}\right)$, is that the integral over $x$ with any function yields a specific value of the function, $\int \delta \left(x-{x}_{0}\right)f\left(x\right)dx=f\left({x}_{0}\right)$. The metric space is defined by the single parameter of the mean of the reference function, $\mu$. The distance expression, ${\mathfrak{D}}_{\delta}$, is then $${\mathfrak{D}}_{\delta}\left(R\parallel S\right)=\left|log\left(\frac{R\left(\mu \right)}{S\left(\mu \right)}\right)\right|$$ For two exponential distributions with parameters $\lambda_1$ and $\lambda_2$ the distance is simply $${\mathfrak{D}}_{\delta}\left(R\parallel S\right)=\left|log\left(\frac{{\lambda}_{1}}{{\lambda}_{2}}\right)-\mu \left({\lambda}_{1}-{\lambda}_{2}\right)\right|$$
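Combining the delta sifting property with the definition of the distance gives a closed form that is easy to evaluate. A minimal sketch, under the assumption (our reconstruction) that the distance reduces to the log-density ratio at $\mu$:

```python
import math

def d_delta_exp(lam1, lam2, mu):
    """Delta-reference distance between exponentials R(x) = lam1*exp(-lam1*x)
    and S(x) = lam2*exp(-lam2*x), with reference P = delta(x - mu):
    D = |log(R(mu)/S(mu))| = |log(lam1/lam2) - mu*(lam1 - lam2)|."""
    return abs(math.log(lam1 / lam2) - mu * (lam1 - lam2))

# At mu = 0 the distance depends only on the ratio of the rates.
```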

## 8. Measuring the Independence of Variable Subsets

## 9. Comparing Approximations from Different Truncated Series

## 10. Application to Networks

## 11. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References

- 1. Singer, A. Maximum entropy formulation of the Kirkwood superposition approximation. *J. Chem. Phys.* **2004**, *121*, 3657–3666.
- 2. Kirkwood, J.G. Statistical Mechanics of Fluid Mixtures. *J. Chem. Phys.* **1935**, *3*, 300.
- 3. Watanabe, S. Information theoretic analysis of multivariate correlation. *IBM J. Res. Dev.* **1960**, *4*, 66–82.
- 4. Nemenman, I.; Shafee, F.; Bialek, W. Entropy and Inference, Revisited. In *Advances in Neural Information Processing Systems 14*; The MIT Press: Cambridge, MA, USA, 2002.
- 5. Lin, H.; Tegmark, M. Why does deep and cheap learning work so well? *arXiv* **2016**, arXiv:1608.08225.
- 6. Kullback, S.; Leibler, R.A. On information and sufficiency. *Ann. Math. Stat.* **1951**, *22*, 79–86.
- 7. McGill, W.J. Multivariate information transmission. *Psychometrika* **1954**, *19*, 97–116.
- 8. Bell, A.J. The Co-Information Lattice; ICA: Nara, Japan, 2003.
- 9. Jakulin, A.; Bratko, I. Quantifying and visualizing attribute interactions: An approach based on entropy. *arXiv* **2004**, arXiv:cs/0308002.
- 10. Sakhanenko, N.A.; Galas, D.J. Biological Data Analysis as an Information Theory Problem: Multivariable Dependence Measures and the Shadows Algorithm. *J. Comput. Biol.* **2015**, *22*, 1–20.
- 11. Cochran, R.V.; Lund, L.H. On the Kirkwood Superposition Approximation. *J. Chem. Phys.* **1964**, *41*, 3499–3504.
- 12. Chow, C.K.; Liu, C.N. Approximating discrete probability distributions with dependence trees. *IEEE Trans. Inf. Theory* **1968**.
- 13. Nielsen, F. A family of statistical symmetric divergences based on Jensen's inequality. *arXiv* **2010**, arXiv:1009.4004.
- 14. Amari, S.; Nagaoka, H. *Methods of Information Geometry*; Volume 191 of Translations of Mathematical Monographs; American Mathematical Society: Providence, RI, USA, 2000.
- 15. Galas, D.J.; Sakhanenko, N.A. Multivariate information measures: A unification using Möbius operators on subset lattices. *arXiv* **2016**, arXiv:1601.06780v2.
- 16. Galas, D.J.; Sakhanenko, N.A.; Skupin, A.; Ignac, T. Describing the Complexity of Systems: Multivariable "Set Complexity" and the Information Basis of Systems Biology. *J. Comput. Biol.* **2014**, *21*, 118–140.
- 17. Cheng, Y.; Hua, X.; Wang, H.; Qin, Y.; Li, X. The geometry of signal detection with applications to radar signal processing. *Entropy* **2016**, *18*, 381.
- 18. Eguchi, S.; Copas, J. Interpreting Kullback-Leibler divergence with the Neyman-Pearson Lemma. *J. Multivar. Anal.* **2006**, *97*, 2034–2040.

**Figure 1.** (**a**) The metric distance between Gaussians R and S (Equation (21)) for the Dirac delta function reference metric with mean at zero can be represented as a hyperbolic function with a saddle point (distance is the vertical axis; for simplicity the metric distance is the deviation from the zero plane, i.e., the absolute value). (**b**) Another geometric metaphor for the distance: it is the area of the blue annular region divided by $\pi$, where $\frac{{\mu}_{1}}{{\sigma}_{1}}>\frac{{\mu}_{2}}{{\sigma}_{2}}$. The distance is then simply $\left(\frac{{\mu}_{1}}{{\sigma}_{1}}+\frac{{\mu}_{2}}{{\sigma}_{2}}\right)\left(\frac{{\mu}_{1}}{{\sigma}_{1}}-\frac{{\mu}_{2}}{{\sigma}_{2}}\right)$.

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Galas, D.J.; Dewey, G.; Kunert-Graf, J.; Sakhanenko, N.A. Expansion of the Kullback-Leibler Divergence, and a New Class of Information Metrics. *Axioms* **2017**, *6*, 8. https://doi.org/10.3390/axioms6020008