Abstract
We investigate the asymptotic properties of the plug-in estimator for the Jeffreys divergence, the symmetric variant of the Kullback–Leibler (KL) divergence. This study focuses specifically on the divergence between discrete distributions. Traditionally, estimators rely on two independent samples corresponding to two distinct conditions. However, we propose a one-sample estimator where the condition results from a random event. We establish the estimator’s asymptotic unbiasedness (law of large numbers) and asymptotic normality (central limit theorem). Although the results are expected, the proofs require additional technical work due to the randomness of the conditions.
MSC:
62F12; 62F05; 62G05
1. Introduction
The Kullback–Leibler (KL) divergence was developed as an extension of Shannon’s information; see [1]. This measure is widely used in various research fields where two (probability) measures need to be compared. Significant research has focused on generalizing this concept. The KL divergence is nowadays considered a special case of the f-divergence measures, corresponding to the generator $f(t) = t \ln t$. In turn, this class of divergences is part of a broader class known as information divergences. For a literature review of various classes of information divergence measures, we refer the reader to [2]. For the estimation of f-divergences, refer to [3] and the references therein.
In mathematical statistics, the Kullback–Leibler (KL) divergence, introduced in [1] and also called relative entropy in information theory, is a statistical measure used to quantify the dissimilarity between two probability measures. It has been widely used in various fields, such as variational inference [4,5], Bayesian inference [6,7], metric learning [8,9,10], machine learning [11,12], computer vision [13,14], physics [15], biology [16], and information geometry [17], among many others. It is also worth mentioning works on the application of various types of divergences to goodness-of-fit problems, where theoretical and empirical distributions are compared; for instance, see [18,19,20] and the references therein.
We consider the problem of estimating the symmetrized version of the KL divergence, known as the Jeffreys divergence. There is a vast body of literature dedicated to the statistical estimation of information-type divergences. Most of these works deal with continuous distributions because, in general, it is hard to calculate the definite integral in the definition of the divergence; see [21,22] and the references therein. Here, however, we focus on discrete (or categorical) random variables.
Let $\mathbb{X} = \{1, 2, \dots, r\}$ be a finite set of $r$ symbols. A random variable with values in $\mathbb{X}$ is characterized by its mass distribution $P = (p_1, \dots, p_r)$. Let $\mathcal{P}$ be the set of all (positive) distributions on $\mathbb{X}$,
$$\mathcal{P} = \Big\{ P = (p_1, \dots, p_r) : \ p_j > 0 \text{ for all } j, \ \sum_{j=1}^{r} p_j = 1 \Big\}.$$
For any two $P, Q \in \mathcal{P}$, the KL divergence between $P$ and $Q$ is defined as follows:
$$D_{KL}(P \,\|\, Q) = \sum_{j=1}^{r} p_j \ln \frac{p_j}{q_j}.$$
The KL divergence is asymmetric: in general, $D_{KL}(P \,\|\, Q)$ and $D_{KL}(Q \,\|\, P)$ take different values. In some specific contexts, this lack of symmetry can be a disadvantage when measuring similarity between probability measures; see, e.g., [17,23,24,25,26,27,28]. Thus, it is often quite useful to work with a symmetrization of the KL divergence, which is defined as
$$J(P, Q) = D_{KL}(P \,\|\, Q) + D_{KL}(Q \,\|\, P) = \sum_{j=1}^{r} (p_j - q_j) \ln \frac{p_j}{q_j}.$$
In [1], Kullback and Leibler consider this symmetrized version and refer to it as the divergence between the two measures. The concept was introduced and studied earlier by Harold Jeffreys in 1948, with some papers and Wikipedia citing the second edition of [29]. Today, this symmetrization is known as the Jeffreys divergence. For some advantages and applications of the symmetrized KL divergence, see, for example, [30,31,32,33,34,35]. Additionally, the Jeffreys divergence is related to the Population Stability Index (PSI) used in finance and serves as a foundation of the Cluster Validity Index (CVI), as discussed in [36].
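As a quick computational illustration (ours, not part of the original text), the following Python sketch evaluates the two KL divergences and the Jeffreys divergence for strictly positive distributions on a common finite alphabet; the function names and the example distributions are arbitrary choices.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(P || Q) between two strictly
    positive discrete distributions given as sequences of probabilities."""
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q))

def jeffreys_divergence(p, q):
    """Jeffreys divergence J(P, Q) = D_KL(P || Q) + D_KL(Q || P)
    = sum_j (p_j - q_j) * ln(p_j / q_j)."""
    return sum((pj - qj) * math.log(pj / qj) for pj, qj in zip(p, q))

# Two illustrative distributions on a three-symbol alphabet.
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]
print(kl_divergence(P, Q), kl_divergence(Q, P), jeffreys_divergence(P, Q))
```

The last printed value equals the sum of the first two, reflecting the symmetrization.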
The problem of estimating information divergences (and entropy) for discrete random variables is well established in the literature; see, for example, [37,38,39,40,41,42,43]. These references are not a comprehensive list of all relevant works; they merely illustrate the diversity of approaches in this field. In these works, the authors examine the convergence properties of estimators derived through various methods for discrete distributions. They analyze the asymptotic behavior and provide theoretical insights into the performance of these estimators, contributing to a deeper understanding of various types of divergence and related measures. All such studies known to us generally operate within the following framework: given two independent samples of sizes l and m drawn from two discrete distributions, the goal is to develop an estimator of the divergence and to analyze its convergence as l and m grow.
However, the following scenario often arises in practice. It involves paired observations $(X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)$, where $Y_i$ indicates a binary condition, and $P$ and $Q$ represent the conditional distributions of $X_i$ under the two conditions. In other words, the sample sizes l and m mentioned above are realizations of a random experiment and should be treated as random variables: l follows a binomial distribution based on n trials, with success probability equal to the probability of the first condition, and m = n - l. Below, in Section 2.1, we consider an example of this scenario.
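To make the sampling scheme concrete, here is a minimal simulation sketch (our illustration; the function name, the value 0.6 for the condition probability, and the conditional distributions are assumptions made for the example). Each pair $(X_i, Y_i)$ is generated by first drawing the binary condition and then drawing the symbol from the corresponding conditional distribution, so the number of observations under each condition is random.

```python
import random

def simulate_paired_sample(n, beta, p_cond, q_cond, seed=0):
    """Simulate n paired observations (X_i, Y_i): Y_i is a Bernoulli(beta)
    condition indicator, and X_i is drawn from p_cond when Y_i = 1 and
    from q_cond when Y_i = 0; symbols are labelled 0, ..., r-1."""
    rng = random.Random(seed)
    symbols = range(len(p_cond))
    sample = []
    for _ in range(n):
        y = 1 if rng.random() < beta else 0
        x = rng.choices(symbols, weights=p_cond if y == 1 else q_cond, k=1)[0]
        sample.append((x, y))
    return sample

sample = simulate_paired_sample(1000, beta=0.6,
                                p_cond=[0.5, 0.3, 0.2],
                                q_cond=[0.4, 0.4, 0.2])
l = sum(y for _, y in sample)  # number of condition-1 observations: Binomial(n, beta)
print(l, 1000 - l)             # the two "sample sizes" l and m = n - l are random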
This paper focuses specifically on this scenario and establishes the asymptotic properties of the plug-in estimator within this framework. To the best of our knowledge, there are few theoretical works on the asymptotic properties of estimators of the Jeffreys divergence, and none directly address the framework we consider. While the standard delta method can be employed to derive the central limit theorem (CLT), it typically yields a general and cumbersome variance formula involving products of large matrices. In contrast, our direct, probabilistic approach provides a more explicit and straightforward variance formula. An additional benefit of our approach is that the results obtained can be effectively integrated into undergraduate statistics courses.
2. Notations and Main Results
Let $(X, Y)$ be a bivariate random vector with values in $\mathbb{X} \times \{0, 1\}$, defined on the probability space $(\Omega, \mathcal{F}, \mathbb{P})$. Recall that $\mathbb{X} = \{1, \dots, r\}$ is a finite set. We denote by $\mathbb{E}$ the expectation with respect to $\mathbb{P}$. The (marginal) distribution of $Y$ is a Bernoulli distribution with success probability $\mathbb{P}(Y = 1) \in (0, 1)$. Let $P = (p_1, \dots, p_r)$ and $Q = (q_1, \dots, q_r)$ be the two conditional distributions of $X$,
$$p_j = \mathbb{P}(X = j \mid Y = 1), \qquad q_j = \mathbb{P}(X = j \mid Y = 0),$$
for any $j \in \mathbb{X}$. We will assume that all the probabilities above are positive,
and $P \neq Q$, i.e., there exists at least one $j$ such that $p_j \neq q_j$. Moreover, we do not assume that the number r is known. Consider the empirical probability measures $P_n = (p_{1,n}, \dots, p_{r,n})$ and $Q_n = (q_{1,n}, \dots, q_{r,n})$ generated by the sequence of i.i.d. random variables $(X_1, Y_1), \dots, (X_n, Y_n)$ from the bivariate distribution of $(X, Y)$, defined by
$$p_{j,n} = \frac{\sum_{i=1}^{n} \mathbb{1}(X_i = j, \, Y_i = 1)}{\sum_{i=1}^{n} \mathbb{1}(Y_i = 1)}, \qquad q_{j,n} = \frac{\sum_{i=1}^{n} \mathbb{1}(X_i = j, \, Y_i = 0)}{\sum_{i=1}^{n} \mathbb{1}(Y_i = 0)},$$
where $\mathbb{1}(A)$ denotes the indicator of an event $A$. We assume that the above fractions are zero whenever the corresponding denominator is zero. Additionally, consider the following notation:
Based on (1), we define the plug-in estimator for the Jeffreys divergence as follows:
$$\widehat{J}_n = \sum_{j=1}^{r} (p_{j,n} - q_{j,n}) \ln \frac{p_{j,n}}{q_{j,n}},$$
where the standard convention is adopted: if $p_{j,n} = 0$ or $q_{j,n} = 0$, then the corresponding term of the sum is set to zero. Note that the estimator appears to depend on r. However, for any symbol with $p_{j,n} = 0$ or $q_{j,n} = 0$, the corresponding term is not included in the sum. Thus, the sum contains only the terms corresponding to the symbols observed in the sample, and we do not need to know r.
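The plug-in estimator is straightforward to compute from the paired sample. The sketch below is our illustration of one way to do so: it forms the empirical conditional distributions and, following our reading of the convention described above, simply skips any symbol whose empirical probability vanishes under either condition; the helper name and the toy data are ours.

```python
import math
from collections import Counter

def plug_in_jeffreys(sample):
    """Plug-in estimate of the Jeffreys divergence from paired observations
    [(x_1, y_1), ..., (x_n, y_n)] with y_i in {0, 1} marking the condition.
    Symbols whose empirical probability vanishes under either condition are
    skipped (one reading of the zero convention described in the text)."""
    counts1 = Counter(x for x, y in sample if y == 1)
    counts0 = Counter(x for x, y in sample if y == 0)
    n1, n0 = sum(counts1.values()), sum(counts0.values())
    if n1 == 0 or n0 == 0:  # one of the conditions was never observed
        return 0.0
    estimate = 0.0
    for symbol in set(counts1) | set(counts0):
        p_hat, q_hat = counts1[symbol] / n1, counts0[symbol] / n0
        if p_hat > 0 and q_hat > 0:
            estimate += (p_hat - q_hat) * math.log(p_hat / q_hat)
    return estimate

toy = [(0, 1), (1, 1), (2, 1), (0, 1), (0, 0), (1, 0), (2, 0)]
print(plug_in_jeffreys(toy))
```

Note that the code never needs the alphabet size r: only observed symbols enter the sum, exactly as remarked above.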
In this paper, we study the convergence properties, specifically, the asymptotic unbiasedness, or law of large numbers (LLN), and asymptotic normality, or central limit theorem (CLT), of the plug-in estimator (3). To this end, denote
Theorem 1.
(Law of Large Numbers) The following equality holds
Proof.
Theorem 2.
(Central Limit Theorem) The following convergence holds true
where means the convergence in distribution,
Replacing the variance with its plug-in estimator yields the following corollary.
Corollary 1.
The following convergence holds true
where
Indeed, by the law of large numbers, the plug-in variance estimator converges in probability to the asymptotic variance, and hence the corollary follows directly from Theorem 2 and Slutsky’s theorem. Recall that Slutsky’s theorem states that if one sequence of random variables converges in distribution and another converges in probability to a constant, then their product (and sum) also converges in distribution; this is exactly what is needed to complete the proof of the corollary.
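A small Monte Carlo experiment can be used to visualize Theorem 2 and Corollary 1. The sketch below is our illustration: it reuses `simulate_paired_sample` and `plug_in_jeffreys` from the earlier sketches, all parameter values are assumed, and the draws are standardized by their empirical standard deviation rather than by the asymptotic variance of Theorem 2, which is not reproduced here. Under the CLT the standardized values should be approximately standard normal.

```python
import math
import statistics

# Reuses simulate_paired_sample() and plug_in_jeffreys() from the sketches above.
P_TRUE = [0.5, 0.3, 0.2]      # assumed conditional distribution P
Q_TRUE = [0.4, 0.4, 0.2]      # assumed conditional distribution Q
BETA, N, REPS = 0.6, 2000, 500

# True Jeffreys divergence computed directly from the chosen distributions.
true_j = sum((p - q) * math.log(p / q) for p, q in zip(P_TRUE, Q_TRUE))

draws = []
for rep in range(REPS):
    s = simulate_paired_sample(N, BETA, P_TRUE, Q_TRUE, seed=rep)
    draws.append(math.sqrt(N) * (plug_in_jeffreys(s) - true_j))

# Empirical check of approximate normality of sqrt(n) * (J_hat - J).
sd = statistics.pstdev(draws)
mean = statistics.mean(draws)
within = sum(abs(d - mean) <= 1.96 * sd for d in draws) / REPS
print(mean, sd, within)  # `within` should be close to 0.95 under approximate normality
```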
Let us now proceed with the proof of Theorem 2.
Proof.
To prove the theorem, we will show that the asymptotics of the sequence of interest match those of the sequence in Lemma 3, involving the differences and for . To achieve this, some technical work is needed: we isolate the relevant term from Lemma 3 and a term that converges to zero in probability. We rewrite the quantity of interest in the following way:
Denote
We will use the Taylor expansion for a function with a remainder term in Lagrange form (see e.g., [44] (p. 880)): for any x with , there exists a positive such that
Applying this to the function , we have
for , where . It is easy to see that the following upper bound holds when and :
Applying (5) and (6) with , we obtain
for any . Using (7) and upper bounds (17) and (18) from Lemma 1 in Section 3 (Auxiliary Results), we obtain the following: for any , and for some constant ,
Denote
For any , we have
Using the last bound (9) together with upper bounds (17) and (18) from Lemma 1, we obtain the following: for any , and for some constant ,
Once more, the inequalities (17) and (18) from Lemma 1 and Lemma 2 (see Section 3) give us the following: for any , and some constant ,
Denote
Thus, thanks to Slutsky’s theorem, the random sequences and have the same limit in distribution. Finally, Lemma 3 with
concludes the proof. □
2.1. Example
We show how to cluster catastrophic processes (see [45,46]) using the Jeffreys divergence of their characteristics. Suppose that we have two types of insurance claims. Let $Y$ be the type of a claim, taking the values 0 and 1, and let $X$ be the size of the damage, i.e., the payment associated with the claim. The conditional distributions of $X$ given the two claim types are $P$ and $Q$, as above. We assume that there is no practical difference between the conditional distributions of $X$ if the Jeffreys divergence satisfies inequality (12), that is, $J(P, Q) \le \epsilon_0$,
where $\epsilon_0 > 0$ is a small critical value that distinguishes the types of claims. Let $(X_1, Y_1), \dots, (X_n, Y_n)$ be the sample of size n, and let $P_n$ and $Q_n$ be the corresponding estimates. For sufficiently large n, we can estimate the probability (see Corollary 1)
where
These relations provide a way to calculate the probability that inequality (12) holds.
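As a rough numerical companion to this example (our sketch, not the authors' procedure), the code below estimates the Jeffreys divergence from simulated claim data and approximates the probability that the divergence does not exceed the critical value $\epsilon_0$ via a normal approximation. Since the asymptotic variance of Corollary 1 is not reproduced here, the standard error is replaced by a simple nonparametric bootstrap, and all numerical values are assumed.

```python
import random
from statistics import NormalDist

# Illustrative only: reuses simulate_paired_sample() and plug_in_jeffreys()
# from the earlier sketches; the bootstrap stands in for the plug-in variance
# of Corollary 1, which is not reproduced here, and all values are assumed.
EPSILON_0 = 0.02                         # assumed critical value epsilon_0
claims = simulate_paired_sample(5000, beta=0.5,
                                p_cond=[0.5, 0.3, 0.1, 0.1],
                                q_cond=[0.45, 0.35, 0.1, 0.1], seed=1)
j_hat = plug_in_jeffreys(claims)

rng = random.Random(2)
boot = []
for _ in range(300):                     # nonparametric bootstrap over claim pairs
    resample = [claims[rng.randrange(len(claims))] for _ in range(len(claims))]
    boot.append(plug_in_jeffreys(resample))
se = (sum((b - j_hat) ** 2 for b in boot) / len(boot)) ** 0.5

# Normal approximation to the probability that the true divergence does not
# exceed epsilon_0, treating j_hat as approximately Gaussian around J(P, Q).
prob_no_difference = NormalDist().cdf((EPSILON_0 - j_hat) / se)
print(j_hat, se, prob_no_difference)
```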
3. Auxiliary Results
The following lemma contains the proofs of inequalities (17) and (18) used in the proof of Theorem 2. The other inequalities of the lemma, (13)–(16), are used in the proofs in this section.
Lemma 1.
For any , the following inequalities hold
where , , , .
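For the reader's convenience, we recall the classical Hoeffding inequality used repeatedly in the proof below: if $Z_1, \dots, Z_n$ are i.i.d. random variables taking values in $[0, 1]$, then for any $\varepsilon > 0$,
$$\mathbb{P}\left( \left| \frac{1}{n}\sum_{i=1}^{n} Z_i - \mathbb{E} Z_1 \right| \ge \varepsilon \right) \le 2 \exp\left( -2 n \varepsilon^2 \right).$$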
Proof.
The Hoeffding inequality proves (13). Indeed,
Let us prove (17). We have
Denote
Let us find the upper bound for the probability . We have
Next, for any , inequality (13) provides the following upper bound for the first term on the right-hand side of inequality (20) above,
The upper bounds for the probabilities , , and are obtained by the Hoeffding inequality:
The next lemma provides a lengthy, multilevel upper bound that was used in the proof of Theorem 2.
Lemma 2.
For any , the following inequality holds
Proof.
For any and , we obtain
Let us bound it from above. Using inequality and Lemma 1, for any , we obtain
Next, we apply the upper bounds (17) and (18) for the probabilities in the last line from the inequality above. Thus,
We obtain the upper bound for in the same way:
Lemma 3.
For any given constants , , , such that , the following convergence takes place
with
Proof.
Denote
We have
Utilizing (27), for any , , we obtain
We obtain the upper bound for using inequalities from Lemma 1. Indeed, inequalities (13) and (15) imply that for any , there exists such that
We obtain the upper bound for the probability in the same way. Indeed, inequalities (13) and (15) imply that for any , there exists such that
Finally, the bound for the probability is provided by inequality (13):
for some . Relations (28)–(31) imply that for any , there exists such that
Denote
In a completely analogous manner, for any and some , we obtain
It is easy to see that
Therefore, by Slutsky’s theorem, the weak limit of the original sequence coincides with that of the sequence . It remains to note that
and to apply the central limit theorem. □
4. Conclusions
In this paper, we studied the symmetrized version of the KL divergence, today known as the Jeffreys divergence, which is popular in classification problems. In this context, the distributions $P$ and $Q$ represent the conditional probability distributions of a characteristic of interest under two different classes or conditions.
We established the asymptotic unbiasedness and asymptotic normality of the plug-in estimator of the Jeffreys divergence. We considered a one-sample estimator where, for a given sample size n, the number of observations in one condition is random and follows a binomial distribution. This differs from the traditional approach, in which the properties of estimators are studied as the given sample sizes of the two classes increase (see [38]). The results were expected, but additional technical work was required due to the randomness of the number of observations in one class. Moreover, we did not find detailed proofs for the Jeffreys divergence in the literature.
In this paper, we avoided relying on some well-known methods for proving normality, such as the delta method, and instead provided detailed proofs. We believe that such proofs are accessible to undergraduate students.
Author Contributions
Conceptualization, V.G., A.L., A.Y., H.R.; methodology, A.L., A.Y., H.R.; writing—original draft preparation, A.L., A.Y., H.R.; writing—review and editing, V.G., O.L., L.S.; project administration, L.S. All authors have read and agreed to the published version of the manuscript.
Funding
V. Glinskiy and A. Logachov thank RSCF for financial support via the grant 24-28-01047; A. Yambartsev thanks FAPESP for financial support via the grant 2023/13453-5.
Data Availability Statement
The data that support the findings of this study are available from the corresponding authors upon reasonable request.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
- Basseville, M. Divergence measures for statistical data processing—An annotated bibliography. Signal Process. 2013, 93, 621–633.
- Rubenstein, P.; Bousquet, O.; Djolonga, J.; Riquelme, C.; Tolstikhin, I.O. Practical and consistent estimation of f-divergences. Adv. Neural Inf. Process. Syst. 2019, 32, 4072–4082.
- Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877.
- Zhang, C.; Bütepage, J.; Kjellström, H.; Mandt, S. Advances in variational inference. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2008–2026.
- Tzikas, D.G.; Likas, A.C.; Galatsanos, N.P. The variational approximation for Bayesian inference. IEEE Signal Process. Mag. 2008, 25, 131–146.
- Jewson, J.; Smith, J.Q.; Holmes, C. Principles of Bayesian inference using general divergence criteria. Entropy 2018, 20, 442.
- Ji, S.; Zhang, Z.; Ying, S.; Wang, L.; Zhao, X.; Gao, Y. Kullback–Leibler divergence metric learning. IEEE Trans. Cybern. 2020, 52, 2047–2058.
- Noh, Y.K.; Sugiyama, M.; Liu, S.; Plessis, M.C.; Park, F.C.; Lee, D.D. Bias reduction and metric learning for nearest-neighbor estimation of Kullback–Leibler divergence. Artif. Intell. Stat. 2014, 1, 669–677.
- Suárez, J.L.; García, S.; Herrera, F. A tutorial on distance metric learning: Mathematical foundations, algorithms, experimental analysis, prospects and challenges. Neurocomputing 2021, 425, 300–322.
- Claici, S.; Yurochkin, M.; Ghosh, S.; Solomon, J. Model fusion with Kullback–Leibler divergence. Int. Conf. Mach. Learn. 2020, 1, 2038–2047.
- Póczos, B.; Xiong, L.; Schneider, J. Nonparametric divergence estimation with applications to machine learning on distributions. arXiv 2012, arXiv:1202.3758.
- Cui, S.; Luo, C. Feature-based non-parametric estimation of Kullback–Leibler divergence for SAR image change detection. Remote Sens. Lett. 2016, 7, 1102–1111.
- Deledalle, C.A. Estimation of Kullback–Leibler losses for noisy recovery problems within the exponential family. Electron. J. Stat. 2017, 11, 3141–3164.
- Granero-Belinchón, C.; Roux, S.G.; Garnier, N.B. Kullback–Leibler divergence measure of intermittency: Application to turbulence. Phys. Rev. 2018, 97, 013107.
- Charzyńska, A.; Gambin, A. Improvement of the k-NN entropy estimator with applications in systems biology. Entropy 2015, 18, 13.
- Belavkin, R.V. Asymmetric topologies on statistical manifolds. Int. Conf. Geom. Sci. Inf. 2015, 1, 203–210.
- Jager, L.; Wellner, J.A. Goodness-of-fit tests via phi-divergences. Ann. Stat. 2007, 35, 2018–2053.
- Vexler, A.; Gurevich, G. Empirical likelihood ratios applied to goodness-of-fit tests based on sample entropy. Comput. Stat. Data Anal. 2010, 54, 531–545.
- Evren, A.; Tuna, E. On some properties of goodness of fit measures based on statistical entropy. Int. J. Res. Rev. Appl. Sci. 2012, 13, 192–205.
- Bulinski, A.; Dimitrov, D. Statistical estimation of the Kullback–Leibler divergence. Mathematics 2021, 9, 544.
- Broniatowski, M. Estimation of the Kullback–Leibler divergence. Math. Methods Stat. 2003, 12, 391–409.
- Seghouane, A.K.; Amari, S.I. Variants of the Kullback–Leibler divergence and their role in model selection. IFAC Proc. Vol. 2006, 39, 826–831.
- Audenaert, K.M. On the asymmetry of the relative entropy. J. Math. Phys. 2013, 54, 073506.
- Pinski, F.J.; Simpson, G.; Stuart, A.M.; Weber, H. Kullback–Leibler approximation for probability measures on infinite dimensional spaces. SIAM J. Math. Anal. 2015, 27, 4091–4122.
- Zeng, J.; Xiao, F. A fractal belief KL divergence for decision fusion. Eng. Appl. Artif. Intell. 2023, 121, 106027.
- Kamiński, M. On the Symmetry Importance in a Relative Entropy Analysis for Some Engineering Problems. Symmetry 2022, 14, 1945.
- Johnson, D.H.; Sinanovic, S. Symmetrizing the Kullback–Leibler distance. IEEE Trans. Inf. Theory 2001, 1, 1–10.
- Jeffreys, H. The Theory of Probability; Oxford Classic Texts in the Physical Sciences: Oxford, UK, 1998.
- Chen, J.; Matzinger, H.; Zhai, H.; Zhou, M. Centroid estimation based on symmetric KL divergence for multinomial text classification problem. In Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 17–20 December 2018; pp. 1174–1177.
- Andriamanalimanana, B.; Tekeoglu, A.; Bekiroglu, K.; Sengupta, S.; Chiang, C.F.; Reale, M.; Novillo, J. Symmetric Kullback–Leibler divergence of softmaxed distributions for anomaly scores. In Proceedings of the 2019 IEEE Conference on Communications and Network Security (CNS), Washington, DC, USA, 10–12 June 2019; Volume 1, pp. 1–6.
- Domke, J. An easy to interpret diagnostic for approximate inference: Symmetric divergence over simulations. arXiv 2021, arXiv:2103.01030.
- Nguyen, B.; Morell, C.; De Baets, B. Supervised distance metric learning through maximization of the Jeffrey divergence. Pattern Recognit. 2017, 64, 215–225.
- Moreno, P.; Ho, P.; Vasconcelos, N. A Kullback–Leibler divergence based kernel for SVM classification in multimedia applications. Adv. Neural Inf. Process. Syst. 2003, 6.
- Yao, Z.; Lai, Z.; Liu, W. A symmetric KL divergence based spatiogram similarity measure. In Proceedings of the 2011 18th IEEE International Conference on Image Processing, Brussels, Belgium, 11–14 September 2011; Volume 1, pp. 193–196.
- Said, A.B.; Hadjidj, R.; Foufou, S. Cluster validity index based on Jeffrey divergence. Pattern Anal. Appl. 2017, 20, 21–31.
- Antos, A.; Kontoyiannis, I. Convergence properties of functional estimates for discrete distributions. Random Struct. Algorithms 2001, 19, 163–193.
- Zhang, Z.; Grabchak, M. Nonparametric estimation of Kullback–Leibler divergence. Neural Comput. 2014, 26, 2570–2593.
- Jiao, J.; Venkat, K.; Han, Y.; Weissman, T. Minimax estimation of functionals of discrete distributions. IEEE Trans. Inf. Theory 2015, 61, 2835–2885.
- Jiao, J.; Venkat, K.; Han, Y.; Weissman, T. Maximum likelihood estimation of functionals of discrete distributions. IEEE Trans. Inf. Theory 2017, 63, 6774–6798.
- Bulinski, A.; Dimitrov, D. Divergence measures estimation and its asymptotic normality theory in the discrete case. Eur. J. Pure Appl. Math. 2019, 12, 790–820.
- Yao, L.Q.; Liu, S.H. Symmetric KL-divergence by Stein’s method. arXiv 2024, arXiv:2401.11381.
- Bobkov, S.G.; Chistyakov, G.P.; Götze, F. Rényi divergence and the central limit theorem. Ann. Probab. 2019, 47, 270–323.
- Abramowitz, M.; Stegun, I.A. (Eds.) Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th ed.; Dover: New York, NY, USA, 1972.
- Rojas, H.; Logachov, A.; Yambartsev, A. Order Book Dynamics with Liquidity Fluctuations: Asymptotic Analysis of Highly Competitive Regime. Mathematics 2023, 11, 4235.
- Logachov, A.; Logachova, O.; Yambartsev, A. Processes with catastrophes: Large deviation point of view. Stoch. Process. Their Appl. 2024, 176, 104447.