Article

Asymptotic Properties of a Statistical Estimator of the Jeffreys Divergence: The Case of Discrete Distributions

by Vladimir Glinskiy 1,2, Artem Logachov 1,3, Olga Logachova 4, Helder Rojas 5,6, Lyudmila Serga 1,2,* and Anatoly Yambartsev 7

1 Department of Business Analytics, Siberian Institute of Management—Branch of the Russian Presidential Academy of National Economy and Public Administration, Novosibirsk State University of Economics and Management, 630102 Novosibirsk, Russia
2 Department of Statistics, Novosibirsk State University of Economics and Management, 630099 Novosibirsk, Russia
3 Department of Computer Science in Economics, Novosibirsk State Technical University (NSTU), 630087 Novosibirsk, Russia
4 Department of Higher Mathematics, Siberian State University of Geosystems and Technologies (SSUGT), 630108 Novosibirsk, Russia
5 Escuela Profesional de Ingeniería Estadística, Universidad Nacional de Ingeniería, Lima 00051, Peru
6 Department of Mathematics, Imperial College London, London SW7 2AZ, UK
7 Department of Statistics, Institute of Mathematics and Statistics, University of São Paulo (USP), São Paulo 05508-220, Brazil
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(21), 3319; https://doi.org/10.3390/math12213319
Submission received: 26 August 2024 / Revised: 15 October 2024 / Accepted: 21 October 2024 / Published: 23 October 2024
(This article belongs to the Special Issue Mathematical Modeling and Applications in Industrial Organization)

Abstract: We investigate the asymptotic properties of the plug-in estimator for the Jeffreys divergence, the symmetric variant of the Kullback–Leibler (KL) divergence. This study focuses specifically on the divergence between discrete distributions. Traditionally, estimators rely on two independent samples corresponding to two distinct conditions. However, we propose a one-sample estimator where the condition results from a random event. We establish the estimator’s asymptotic unbiasedness (law of large numbers) and asymptotic normality (central limit theorem). Although the results are expected, the proofs require additional technical work due to the randomness of the conditions.

1. Introduction

The Kullback–Leibler (KL) divergence was developed as an extension of Shannon’s information measure; see [1]. It is widely used across research fields whenever two (probability) measures need to be compared. Significant research has focused on generalizing this concept: the KL divergence is nowadays viewed as a special case of the f-divergences, obtained with $f(t) = t \ln t$, and this class is in turn part of a broader family of divergences known as information divergences. For a literature review of the various classes of information divergence measures, we refer the reader to [2]; for the estimation of f-divergences, see [3] and the references therein.
In mathematical statistics, the Kullback–Leibler (KL) divergence introduced in [1], also called relative entropy in information theory, is a type of statistical measure used to quantify the dissimilarity between two probability measures. This measure of divergence has been widely used in various fields, such as variational inference [4,5], Bayesian inference [6,7], metric learning [8,9,10], machine learning [11,12], computer vision [13,14], physics [15], biology [16], and information geometry [17], among many other application fields. It is worth mentioning works related to the application of various types of divergences for goodness-of-fit problems, where theoretical and empirical distributions are compared. For instance, see [18,19,20] and the references therein.
We consider the problem of estimating the symmetrized version of the KL divergence, known as the Jeffreys divergence. There is a vast body of literature dedicated to the statistical estimation of information-type divergences. Most works treat continuous distributions because, in general, the integral in the definition of the divergence is hard to compute; see [21,22] and the references therein. Here, however, we focus on discrete (or categorical) random variables.
Let $\mathcal{X} = \{a_1, a_2, \ldots, a_r\}$ be a finite set, where $r \geq 2$. A random variable with values in $\mathcal{X}$ is characterized by the mass distribution $\mathbf{p} = (p_a)_{a \in \mathcal{X}}$. Let $\mathcal{P}(\mathcal{X})$ be the set of all (positive) distributions on $\mathcal{X}$,
$$\mathcal{P}(\mathcal{X}) = \left\{ \mathbf{p} = (p_a)_{a \in \mathcal{X}} \in \mathbb{R}^r : p_a > 0 \ \text{and} \ \sum_{a \in \mathcal{X}} p_a = 1 \right\}.$$
For any two $\mathbf{p}, \mathbf{q} \in \mathcal{P}(\mathcal{X})$, the KL divergence between $\mathbf{p}$ and $\mathbf{q}$ is defined as follows:
$$D_{KL}(\mathbf{p}\,\|\,\mathbf{q}) = \sum_{a \in \mathcal{X}} p_a \ln \frac{p_a}{q_a}.$$
KL divergence is asymmetric, and the values of $D_{KL}(\mathbf{p}\,\|\,\mathbf{q})$ and $D_{KL}(\mathbf{q}\,\|\,\mathbf{p})$ are different in general. This lack of symmetry, in some specific contexts, can be a disadvantage when measuring similarities between probability measures; see, e.g., [17,23,24,25,26,27,28]. Thus, it is often quite useful to work with a symmetrization of the KL divergence, which is defined as
$$D_{KL}^{sym}(\mathbf{p}\,\|\,\mathbf{q}) := D_{KL}(\mathbf{p}\,\|\,\mathbf{q}) + D_{KL}(\mathbf{q}\,\|\,\mathbf{p}) = \sum_{a \in \mathcal{X}} (p_a - q_a) \ln \frac{p_a}{q_a}.$$
In [1], Kullback and Leibler consider this symmetrized version and refer to it as the divergence between the two measures. The concept was introduced and studied earlier by Harold Jeffreys; it is commonly attributed to the 1948 second edition of [29]. Today, this symmetrization is known as the Jeffreys divergence. For advantages and applications of symmetrizing the KL divergence, see, for example, [30,31,32,33,34,35]. Additionally, the Jeffreys divergence is related to the Population Stability Index (PSI) used in finance and serves as the foundation of a cluster validity index (CVI), as discussed in [36].
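As a concrete illustration of the two definitions above, the following short Python sketch (an illustration only; NumPy is assumed to be available and the two distributions are arbitrary examples) evaluates both directed KL divergences and their sum, the Jeffreys divergence.

```python
import numpy as np

def kl_divergence(p, q):
    """Directed KL divergence D_KL(p || q) for strictly positive discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def jeffreys_divergence(p, q):
    """Jeffreys (symmetrized KL) divergence: D_KL(p||q) + D_KL(q||p) = sum (p_a - q_a) ln(p_a/q_a)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum((p - q) * np.log(p / q)))

# Illustrative distributions on a three-symbol alphabet.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q), kl_divergence(q, p), jeffreys_divergence(p, q))
```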
The problem of estimating information divergences (and entropy) for discrete random variables is well established in the literature; see, for example, [37,38,39,40,41,42,43]. These references are not a comprehensive list of all relevant works; they merely illustrate the diversity of approaches in this field. In these works, the authors examine the convergence properties of estimators derived through various methods for discrete distributions; they analyze the asymptotic behavior and provide theoretical insights into the performance of these estimators, contributing to a deeper understanding of various types of divergence and related measures. All such studies known to us operate within the following framework: given two independent samples $X_1^{(1)}, X_2^{(1)}, \ldots, X_l^{(1)}$ and $X_1^{(2)}, X_2^{(2)}, \ldots, X_m^{(2)}$ drawn from two discrete distributions $\mathbf{p}$ and $\mathbf{q}$, respectively, the goal is to construct an estimator of the divergence from the samples of sizes $l$ and $m$ and to analyze its convergence.
However, in practice, the following scenario also arises. One observes a paired sample $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ of a bivariate vector $(X, Y)$, where $Y_i \in \{0, 1\}$ indicates a binary condition, and $\mathbf{p}$ and $\mathbf{q}$ represent the conditional distributions $P(\cdot \mid Y = 0)$ and $P(\cdot \mid Y = 1)$, respectively. In other words, the sample sizes $l$ and $m$ mentioned above are realizations of a random experiment and should be treated as random variables: $l$ has a binomial distribution with success probability $P(Y = 0)$ in a sequence of $n$ trials, and $m = n - l$. In Section 2.1 below, we consider an example of this scenario.
This paper specifically focuses on this scenario, establishing the asymptotic properties of the plug-in estimator within this framework. To the best of our knowledge, there are few theoretical works on the asymptotic properties of estimating the Jeffreys divergence, and none directly address the framework we consider. While the standard δ -method can be employed to derive the central limit theorem (CLT), it typically yields a general and cumbersome variance formula involving the product of large matrices. In contrast, our direct and probabilistic approach provides a more explicit and straightforward variance formula. An additional benefit of our approach is that the results obtained can be effectively integrated into undergraduate statistics courses.
The paper is organized as follows. In the next section, Section 2, we introduce definitions and formulate the asymptotic results with proofs. Auxiliary results are provided in Section 3.

2. Notations and Main Results

Let $(X, Y)$ be a bivariate random vector with values in $\mathcal{X} \times \{0, 1\}$, defined on a probability space $(\Omega, \mathcal{A}, P)$. Recall that $\mathcal{X} = \{a_1, a_2, \ldots, a_r\}$ is a finite set with $r \geq 2$. We denote by $E$ the expectation with respect to $P$. The (marginal) distribution of $Y$ is Bernoulli with success probability $p := P(Y = 1) \in (0, 1)$, and we set $q := 1 - p = P(Y = 0)$. Let $\mathbf{p} = (p_j)_{1 \leq j \leq r}$ and $\mathbf{q} = (q_j)_{1 \leq j \leq r}$ be the two conditional distributions of $X$,
$$p_j := P(X = a_j \mid Y = 1) \quad \text{and} \quad q_j := P(X = a_j \mid Y = 0),$$
for any $j = 1, \ldots, r$. We assume that all the probabilities above are positive,
$$\min_{1 \leq j \leq r} \{ p_j, q_j \} > 0,$$
and that $\mathbf{p} \neq \mathbf{q}$, i.e., there exists at least one $j$ such that $p_j \neq q_j$. Moreover, we do not assume that the number $r$ is known. Consider the empirical probability measures $\hat{\mathbf{p}}_n = (\hat{p}_{j,n})_{1 \leq j \leq r}$ and $\hat{\mathbf{q}}_n = (\hat{q}_{j,n})_{1 \leq j \leq r}$ generated by a sequence of i.i.d. random vectors $(X_1, Y_1), \ldots, (X_n, Y_n)$ with the same distribution as $(X, Y)$, defined by
$$\hat{p}_{j,n} := \frac{\sum_{i=1}^{n} I(X_i = a_j, Y_i = 1)}{\sum_{i=1}^{n} I(Y_i = 1)}, \qquad \hat{q}_{j,n} := \frac{\sum_{i=1}^{n} I(X_i = a_j, Y_i = 0)}{\sum_{i=1}^{n} I(Y_i = 0)}, \tag{1}$$
where I ( A ) is an indicator of an event A. We assume that the above fractions are zero when the corresponding denominator takes the zero value. Additionally, consider the following notations
$$\hat{p}_n := \frac{1}{n} \sum_{i=1}^{n} I(Y_i = 1), \qquad \hat{q}_n := \frac{1}{n} \sum_{i=1}^{n} I(Y_i = 0). \tag{2}$$
Based on (1), we define the plug-in estimator for the Jeffreys divergence $D_{KL}^{sym}(\mathbf{p}\,\|\,\mathbf{q})$ as follows:
$$D_{KL}^{sym}(\hat{\mathbf{p}}_n \,\|\, \hat{\mathbf{q}}_n) := \sum_{j=1}^{r} (\hat{p}_{j,n} - \hat{q}_{j,n}) \ln \frac{\hat{p}_{j,n}}{\hat{q}_{j,n}}, \tag{3}$$
where the standard convention is adopted: if $x \geq 0$, then $0 \ln(0/x) = 0$, and if $x > 0$, then $x \ln(x/0) = \infty$. Note that the estimator appears to depend on $r$. However, for any symbol $a_j \in \mathcal{X}$ with $\hat{p}_{j,n} = 0$ or $\hat{q}_{j,n} = 0$, the corresponding term is not included in the sum. Thus, the sum contains only the terms corresponding to the symbols observed in the sample, and we do not need to know $r$.
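The following Python sketch (an illustration, not part of the paper; NumPy is assumed, and the simulated distributions are arbitrary) implements the plug-in estimator (3) from a paired sample, dropping the terms for symbols that are unobserved under either condition, as described above.

```python
import numpy as np

def plug_in_jeffreys(x, y):
    """Plug-in Jeffreys divergence estimator from a paired sample (x_i, y_i), y_i in {0, 1}.

    Terms whose symbol is unobserved under either condition are dropped,
    following the convention described in the text.
    """
    x, y = np.asarray(x), np.asarray(y)
    x1, x0 = x[y == 1], x[y == 0]
    estimate = 0.0
    for a in np.unique(x):
        p_hat = np.mean(x1 == a) if x1.size > 0 else 0.0
        q_hat = np.mean(x0 == a) if x0.size > 0 else 0.0
        if p_hat > 0 and q_hat > 0:          # skip symbols unseen under either condition
            estimate += (p_hat - q_hat) * np.log(p_hat / q_hat)
    return estimate

# Illustrative simulated data: P(Y = 1) = 0.4, conditional laws p and q on {0, 1, 2}.
rng = np.random.default_rng(0)
n = 10_000
y = rng.binomial(1, 0.4, size=n)
p_true, q_true = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]
x = np.where(y == 1, rng.choice(3, size=n, p=p_true), rng.choice(3, size=n, p=q_true))
print(plug_in_jeffreys(x, y))
```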
In this paper, we study the convergence properties, specifically, the asymptotic unbiasedness, or law of large numbers (LLN), and asymptotic normality, or central limit theorem (CLT), of the plug-in estimator (3). To this end, denote
$$\eta_n := D_{KL}^{sym}(\hat{\mathbf{p}}_n \,\|\, \hat{\mathbf{q}}_n) - D_{KL}^{sym}(\mathbf{p} \,\|\, \mathbf{q}) = \sum_{j=1}^{r} \left[ (\hat{p}_{j,n} - \hat{q}_{j,n}) \ln \frac{\hat{p}_{j,n}}{\hat{q}_{j,n}} - (p_j - q_j) \ln \frac{p_j}{q_j} \right].$$
Theorem 1. 
(Law of Large Numbers) The following equality holds
$$\lim_{n \to \infty} \eta_n = 0 \quad \text{a.s.}$$
Proof. 
It follows directly from the strong law of large numbers for the sequences $\hat{p}_n$, $\hat{q}_n$, $\hat{p}_{j,n}$, $\hat{q}_{j,n}$, $1 \leq j \leq r$, defined in (1) and (2). □
Theorem 2. 
(Central Limit Theorem) The following convergence holds true
$$\sqrt{n}\, \eta_n \xrightarrow{d} \xi \sim N(0, \sigma^2), \quad \text{as } n \to \infty,$$
where $\xrightarrow{d}$ denotes convergence in distribution and
$$\sigma^2 = E\left( \sum_{j=1}^{r} \left[ \frac{1}{p} \Big( I(X = a_j, Y = 1) - p\, p_j - p_j \big( I(Y = 1) - p \big) \Big) \left( 1 + \ln \frac{p_j}{q_j} - \frac{q_j}{p_j} \right) + \frac{1}{q} \Big( I(X = a_j, Y = 0) - q\, q_j - q_j \big( I(Y = 0) - q \big) \Big) \left( 1 + \ln \frac{q_j}{p_j} - \frac{p_j}{q_j} \right) \right] \right)^2.$$
The plug-in estimator of variance yields the following corollary.
Corollary 1. 
The following convergence holds true
$$\frac{\sqrt{n}\, \eta_n}{\hat{\sigma}_n} \xrightarrow{d} \tilde{\xi} \sim N(0, 1), \quad \text{as } n \to \infty,$$
where
$$\hat{\sigma}_n^2 = \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j=1}^{r} \left[ \frac{1}{\hat{p}_n} \Big( I(X_i = a_j, Y_i = 1) - \hat{p}_n\, \hat{p}_{j,n} - \hat{p}_{j,n} \big( I(Y_i = 1) - \hat{p}_n \big) \Big) \left( 1 + \ln \frac{\hat{p}_{j,n}}{\hat{q}_{j,n}} - \frac{\hat{q}_{j,n}}{\hat{p}_{j,n}} \right) + \frac{1}{\hat{q}_n} \Big( I(X_i = a_j, Y_i = 0) - \hat{q}_n\, \hat{q}_{j,n} - \hat{q}_{j,n} \big( I(Y_i = 0) - \hat{q}_n \big) \Big) \left( 1 + \ln \frac{\hat{q}_{j,n}}{\hat{p}_{j,n}} - \frac{\hat{p}_{j,n}}{\hat{q}_{j,n}} \right) \right] \right)^2.$$
Indeed, by the law of large numbers, the sequence $\hat{\sigma}_n^2$ converges in probability to $\sigma^2$, and hence the corollary follows directly from Theorem 2 and Slutsky’s theorem. Recall that Slutsky’s theorem states that if one sequence of random variables converges in distribution and another converges in probability to a constant, then their product (or sum) also converges in distribution; this is exactly the situation here.
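A possible implementation of the variance estimator $\hat{\sigma}_n^2$ from Corollary 1 is sketched below in Python (again only an illustration, under the assumption that every retained symbol is observed under both conditions, so all ratios and logarithms are finite); together with the plug-in estimator sketched above, it yields the studentized statistic $\sqrt{n}\,\eta_n/\hat{\sigma}_n$.

```python
import numpy as np

def jeffreys_sigma_hat(x, y):
    """Plug-in estimate of the standard deviation sigma_hat_n from Corollary 1.

    Sketch assuming every retained symbol is observed under both conditions.
    """
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    p_hat, q_hat = np.mean(y == 1), np.mean(y == 0)
    x1, x0 = x[y == 1], x[y == 0]
    # keep only symbols seen under both conditions, as in the plug-in estimator
    symbols = [a for a in np.unique(x) if np.any(x1 == a) and np.any(x0 == a)]
    total = np.zeros(n)
    for a in symbols:
        pj, qj = np.mean(x1 == a), np.mean(x0 == a)
        bj = 1.0 + np.log(pj / qj) - qj / pj
        cj = 1.0 + np.log(qj / pj) - pj / qj
        i1 = ((x == a) & (y == 1)).astype(float)
        i0 = ((x == a) & (y == 0)).astype(float)
        iy1, iy0 = (y == 1).astype(float), (y == 0).astype(float)
        total += (i1 - p_hat * pj - pj * (iy1 - p_hat)) / p_hat * bj \
               + (i0 - q_hat * qj - qj * (iy0 - q_hat)) / q_hat * cj
    return float(np.sqrt(np.mean(total ** 2)))
```

By Corollary 1, an approximate 95% confidence interval for the Jeffreys divergence is then $D_{KL}^{sym}(\hat{\mathbf{p}}_n\,\|\,\hat{\mathbf{q}}_n) \pm 1.96\,\hat{\sigma}_n/\sqrt{n}$.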
Let us now proceed with the proof of Theorem 2.
Proof. 
To prove the theorem, we show that the asymptotic behavior of the sequence of interest coincides with that of the sequence in Lemma 3, which involves the differences $\hat{p}_{j,n} - p_j$ and $\hat{q}_{j,n} - q_j$ for $j = 1, \ldots, r$. This requires some technical work: we isolate the term handled by Lemma 3 and the terms that converge to zero in probability. We rewrite $\sqrt{n}\, \eta_n$ in the following way:
$$
\begin{aligned}
\sqrt{n}\, \eta_n &= \sqrt{n} \sum_{j=1}^{r} \left[ (\hat{p}_{j,n} - \hat{q}_{j,n}) \ln \frac{\hat{p}_{j,n}}{\hat{q}_{j,n}} - (p_j - q_j) \ln \frac{p_j}{q_j} \right] \\
&= \sqrt{n} \sum_{j=1}^{r} \left[ (\hat{p}_{j,n} - \hat{q}_{j,n}) \ln \frac{\hat{p}_{j,n}}{\hat{q}_{j,n}} \pm (p_j - q_j) \ln \frac{\hat{p}_{j,n}}{\hat{q}_{j,n}} - (p_j - q_j) \ln \frac{p_j}{q_j} \right] \\
&= \sqrt{n} \sum_{j=1}^{r} \big( (\hat{p}_{j,n} - p_j) - (\hat{q}_{j,n} - q_j) \big) \ln \frac{\hat{p}_{j,n}}{\hat{q}_{j,n}} + \sqrt{n} \sum_{j=1}^{r} (p_j - q_j) \ln \frac{\hat{p}_{j,n}\, q_j}{p_j\, \hat{q}_{j,n}} \\
&= \sqrt{n} \sum_{j=1}^{r} \big( (\hat{p}_{j,n} - p_j) - (\hat{q}_{j,n} - q_j) \big) \ln \frac{\hat{p}_{j,n}}{\hat{q}_{j,n}} + \sqrt{n} \sum_{j=1}^{r} (p_j - q_j) \ln \left( 1 + \frac{\hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n}}{p_j\, \hat{q}_{j,n}} \right).
\end{aligned} \tag{4}
$$
Denote
$$y_{j,n} := \ln \left( 1 + \frac{\hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n}}{p_j\, \hat{q}_{j,n}} \right).$$
We will use the Taylor expansion of a function $f(x)$ with the remainder term in Lagrange form (see, e.g., [44] (p. 880)): for any $x$ with $|x| < 1$, there exists $K_x \in (0, 1)$ such that
$$f(x) = f(0) + \frac{f'(0)}{1!} x + \frac{f''(K_x x)}{2!} x^2.$$
Applying this to the function $f(x) = \ln(1 + x)$, we have
$$\ln(1 + x) = x - \frac{x^2}{2 (1 + K_x x)^2}, \tag{5}$$
for $|x| < 1$, where $K_x \in (0, 1)$. It is easy to see that the following upper bound holds when $|x| \leq \tfrac12$ and $K_x \in (0, 1)$ (indeed, then $1 + K_x x \geq 1 - |x| \geq \tfrac12$, so $(1 + K_x x)^2 \geq \tfrac14$):
$$\frac{x^2}{2 (1 + K_x x)^2} \leq 2 x^2. \tag{6}$$
Applying (5) and (6) with $x = \frac{\hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n}}{p_j\, \hat{q}_{j,n}}$, we obtain
$$\left| y_{j,n} - \frac{\hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n}}{p_j\, \hat{q}_{j,n}} \right| \leq 2 \left( \frac{\hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n}}{p_j\, \hat{q}_{j,n}} \right)^2 I\left( \left| \frac{\hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n}}{p_j\, \hat{q}_{j,n}} \right| \leq \frac12 \right) + \left| y_{j,n} - \frac{\hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n}}{p_j\, \hat{q}_{j,n}} \right| I\left( \left| \frac{\hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n}}{p_j\, \hat{q}_{j,n}} \right| > \frac12 \right) \quad \text{a.s.} \tag{7}$$
for any $1 \leq j \leq r$. Using (7) and the upper bounds (17) and (18) from Lemma 1 in Section 3 (Auxiliary Results), we obtain the following: for any $\varepsilon > 0$, $1 \leq j \leq r$, and some constant $C > 0$,
$$
\begin{aligned}
P\left( \sqrt{n} \left| y_{j,n} - \frac{\hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n}}{p_j\, \hat{q}_{j,n}} \right| > \varepsilon \right)
&\leq P\left( 2 \left( \frac{\hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n}}{p_j\, \hat{q}_{j,n}} \right)^2 I\left( \left| \frac{\hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n}}{p_j\, \hat{q}_{j,n}} \right| \leq \frac12 \right) > \frac{\varepsilon}{2 \sqrt{n}} \right) \\
&\quad + P\left( \sqrt{n} \left| y_{j,n} - \frac{\hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n}}{p_j\, \hat{q}_{j,n}} \right| I\left( \left| \frac{\hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n}}{p_j\, \hat{q}_{j,n}} \right| > \frac12 \right) > \frac{\varepsilon}{2} \right) \\
&\leq P\left( \left( \frac{\hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n}}{p_j\, \hat{q}_{j,n}} \right)^2 > \frac{\varepsilon}{4 \sqrt{n}} \right) + P\left( \left| \frac{\hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n}}{p_j\, \hat{q}_{j,n}} \right| > \frac12 \right) \\
&\leq 2 P\left( \left| \frac{\hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n}}{p_j\, \hat{q}_{j,n}} \right| > \frac{\min(\varepsilon, 1)}{2 n^{1/4}} \right) \\
&\leq 2 P\left( q_j |\hat{p}_{j,n} - p_j| + p_j |\hat{q}_{j,n} - q_j| > \frac{\min(\varepsilon, 1)\, p_j q_j}{4 n^{1/4}},\ \hat{q}_{j,n} \geq \frac{q_j}{2} \right) + 2 P\left( \hat{q}_{j,n} < \frac{q_j}{2} \right) \\
&\leq 2 P\left( |\hat{p}_{j,n} - p_j| > \frac{\min(\varepsilon, 1)\, p_j}{8 n^{1/4}} \right) + 2 P\left( |\hat{q}_{j,n} - q_j| > \frac{\min(\varepsilon, 1)\, q_j}{8 n^{1/4}} \right) + 2 P\left( |q_j - \hat{q}_{j,n}| > \frac{q_j}{2} \right) \\
&\leq 4 \exp\{ -C \sqrt{n} \} + 2 \exp\{ -C \sqrt{n} \}.
\end{aligned} \tag{8}
$$
Denote
$$z_{j,n} := \frac{\hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n}}{p_j\, \hat{q}_{j,n}}, \qquad v_{j,n} := \frac{\hat{q}_{j,n} - q_j}{q_j}.$$
For any $1 \leq j \leq r$, we have
$$
\begin{aligned}
\left| z_{j,n} - \frac{\hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n}}{p_j\, q_j} \right|
&= \left| \frac{1}{p_j q_j} \left( \frac{\hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n}}{1 + \frac{\hat{q}_{j,n} - q_j}{q_j}} - \big( \hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n} \big) \right) \right| \\
&= \left| \frac{1}{p_j q_j} \big( \hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n} \big) \left( - v_{j,n} + \sum_{k=2}^{\infty} (-v_{j,n})^k \right) \right| I\left( |v_{j,n}| \leq \frac12 \right) + \left| z_{j,n} - \frac{\hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n}}{p_j\, q_j} \right| I\left( |v_{j,n}| > \frac12 \right) \\
&\leq \frac{2}{p_j} |\hat{p}_{j,n} - p_j| \cdot |v_{j,n}|\, I\left( |v_{j,n}| \leq \frac12 \right) + \frac{2}{q_j} |\hat{q}_{j,n} - q_j| \cdot |v_{j,n}|\, I\left( |v_{j,n}| \leq \frac12 \right) + \left| z_{j,n} - \frac{\hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n}}{p_j\, q_j} \right| I\left( |v_{j,n}| > \frac12 \right) \quad \text{a.s.}
\end{aligned} \tag{9}
$$
Using the last bound (9) together with the upper bounds (17) and (18) from Lemma 1, we obtain the following: for any $\varepsilon > 0$, $1 \leq j \leq r$, and some constant $C > 0$,
$$
\begin{aligned}
P\left( \sqrt{n} \left| z_{j,n} - \frac{\hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n}}{p_j\, q_j} \right| > \varepsilon \right)
&\leq P\left( \sqrt{n} \left| z_{j,n} - \frac{\hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n}}{p_j\, q_j} \right| I\left( |v_{j,n}| > \frac12 \right) > \frac{\varepsilon}{3} \right) \\
&\quad + P\left( |\hat{p}_{j,n} - p_j| \cdot |v_{j,n}|\, I\left( |v_{j,n}| \leq \frac12 \right) > \frac{\varepsilon\, p_j}{6 \sqrt{n}} \right) + P\left( |\hat{q}_{j,n} - q_j| \cdot |v_{j,n}|\, I\left( |v_{j,n}| \leq \frac12 \right) > \frac{\varepsilon\, q_j}{6 \sqrt{n}} \right) \\
&\leq P\left( |\hat{p}_{j,n} - p_j| > \sqrt{\frac{\varepsilon\, p_j}{6}}\, n^{-1/4} \right) + P\left( |v_{j,n}| > \sqrt{\frac{\varepsilon\, p_j}{6}}\, n^{-1/4} \right) + P\left( |\hat{q}_{j,n} - q_j| > \sqrt{\frac{\varepsilon\, q_j}{6}}\, n^{-1/4} \right) \\
&\quad + P\left( |v_{j,n}| > \sqrt{\frac{\varepsilon\, q_j}{6}}\, n^{-1/4} \right) + P\left( |v_{j,n}| > \frac12 \right) \\
&\leq 4 \exp\{ -C \sqrt{n} \} + \exp\{ -C \sqrt{n} \}.
\end{aligned} \tag{10}
$$
Once more, the inequalities (17) and (18) from Lemma 1 and Lemma 2 (see Section 3) give us the following: for any $\varepsilon > 0$, $1 \leq j \leq r$, and some constant $C > 0$,
$$
\begin{aligned}
P\Big( \sqrt{n} \Big| \big( (\hat{p}_{j,n} - p_j) - (\hat{q}_{j,n} - q_j) \big) &\ln \frac{\hat{p}_{j,n}}{\hat{q}_{j,n}} - \big( (\hat{p}_{j,n} - p_j) - (\hat{q}_{j,n} - q_j) \big) \ln \frac{p_j}{q_j} \Big| \geq \varepsilon \Big) \\
&\leq P\left( |\hat{p}_{j,n} - p_j| \cdot \left| \ln \frac{\hat{p}_{j,n}\, q_j}{p_j\, \hat{q}_{j,n}} \right| \geq \frac{\varepsilon}{2 \sqrt{n}} \right) + P\left( |\hat{q}_{j,n} - q_j| \cdot \left| \ln \frac{\hat{p}_{j,n}\, q_j}{p_j\, \hat{q}_{j,n}} \right| \geq \frac{\varepsilon}{2 \sqrt{n}} \right) \\
&\leq P\left( |\hat{p}_{j,n} - p_j| \geq \sqrt{\frac{\varepsilon}{2}}\, n^{-1/4} \right) + 2 P\left( \left| \ln \frac{\hat{p}_{j,n}\, q_j}{p_j\, \hat{q}_{j,n}} \right| \geq \sqrt{\frac{\varepsilon}{2}}\, n^{-1/4} \right) + P\left( |\hat{q}_{j,n} - q_j| \geq \sqrt{\frac{\varepsilon}{2}}\, n^{-1/4} \right) \\
&\leq 4 \exp\{ -C \sqrt{n} \}.
\end{aligned} \tag{11}
$$
Denote
$$\xi_n := \sqrt{n} \sum_{j=1}^{r} \left[ (\hat{p}_{j,n} - p_j) \left( 1 + \ln \frac{p_j}{q_j} - \frac{q_j}{p_j} \right) + (\hat{q}_{j,n} - q_j) \left( 1 + \ln \frac{q_j}{p_j} - \frac{p_j}{q_j} \right) \right].$$
Relations (4), (8), (10), and (11) imply that for any $\varepsilon > 0$ there exist constants $C_1 > 0$ and $C > 0$ such that
$$P\big( |\sqrt{n}\, \eta_n - \xi_n| > \varepsilon \big) \leq C_1 \exp\{ -C \sqrt{n} \}.$$
Thus, thanks to Slutsky’s theorem, the random sequences $\sqrt{n}\, \eta_n$ and $\xi_n$ have the same limit in distribution. Finally, Lemma 3 with
$$b_j = 1 + \ln \frac{p_j}{q_j} - \frac{q_j}{p_j}, \qquad c_j = 1 + \ln \frac{q_j}{p_j} - \frac{p_j}{q_j},$$
concludes the proof. □

2.1. Example

We show how the Jeffreys divergence can be used to cluster catastrophic processes (see [45,46]) by their characteristics. Suppose that we have two types of insurance claims. Let $Y \in \{0, 1\}$ be the type of claim, with $P(Y = 1) = p$ and $P(Y = 0) = q$, and let $X$ be the size of the damage, or the payment associated with the claim. The conditional distributions of $X$ are $P(X = a_j \mid Y = 1) = p_j$ and $P(X = a_j \mid Y = 0) = q_j$, where $a_j \in \mathcal{X}$, $1 \leq j \leq r$. We assume that there is no difference between the conditional distributions of $X$ if
$$D_{KL}^{sym}(\mathbf{p}\,\|\,\mathbf{q}) = \sum_{j=1}^{r} (p_j - q_j) \ln \frac{p_j}{q_j} < \alpha, \tag{12}$$
where $\alpha$ is a small critical value that distinguishes the types of claims. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be a sample of size $n$, and let $D_{KL}^{sym}(\hat{\mathbf{p}}_n \,\|\, \hat{\mathbf{q}}_n)$ and $\hat{\sigma}_n$ be the corresponding estimates. For sufficiently large $n$, we can estimate the probability (see Corollary 1)
$$P\big( D_{KL}^{sym}(\mathbf{p}\,\|\,\mathbf{q}) < \alpha \big) = P\big( \eta_n > D_{KL}^{sym}(\hat{\mathbf{p}}_n\,\|\,\hat{\mathbf{q}}_n) - \alpha \big) = P\left( \frac{\sqrt{n}\, \eta_n}{\hat{\sigma}_n} > \frac{\sqrt{n}\, \big( D_{KL}^{sym}(\hat{\mathbf{p}}_n\,\|\,\hat{\mathbf{q}}_n) - \alpha \big)}{\hat{\sigma}_n} \right) \approx \frac{1}{\sqrt{2\pi}} \int_{b_n}^{\infty} e^{-t^2/2}\, dt,$$
where
$$b_n = \frac{\sqrt{n}\, \big( D_{KL}^{sym}(\hat{\mathbf{p}}_n\,\|\,\hat{\mathbf{q}}_n) - \alpha \big)}{\hat{\sigma}_n}.$$
These relations provide a way to estimate the probability that inequality (12) holds.
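Numerically, the tail probability above is just the standard normal survival function evaluated at $b_n$. A minimal Python sketch (using SciPy; all numbers are hypothetical placeholders) is:

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical numbers: a sample of n = 5000 claims gave a plug-in Jeffreys
# divergence of 0.012 with estimated standard deviation sigma_hat = 0.35,
# and the critical value alpha = 0.02 separates "same" from "different" claim types.
n, d_hat, sigma_hat, alpha = 5000, 0.012, 0.35, 0.02

b_n = sqrt(n) * (d_hat - alpha) / sigma_hat
prob_same_type = norm.sf(b_n)   # integral of the standard normal density from b_n to infinity
print(b_n, prob_same_type)
```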

3. Auxiliary Results

The following lemma establishes inequalities (17) and (18), which are used in the proof of Theorem 2. The remaining inequalities of the lemma, (13)–(16), are used in the proofs within this section.
Lemma 1. 
For any g > 0 , the following inequalities hold
$$P\big( |\hat{p}_n - p| > g \big) \leq 2 \exp\left\{ -\frac{n g^2}{2 (\max(p, q))^2} \right\}, \tag{13}$$
$$P\big( |\hat{q}_n - q| > g \big) \leq 2 \exp\left\{ -\frac{n g^2}{2 (\max(p, q))^2} \right\}, \tag{14}$$
$$\max_{1 \leq j \leq r} P\left( \left| \frac{1}{n} \sum_{i=1}^{n} \big( I(X_i = a_j, Y_i = 1) - p\, p_j \big) \right| > g \right) \leq 2 \exp\left\{ -\frac{n g^2}{2 (\max(p\, p_{\max},\, 1 - p\, p_{\min}))^2} \right\}, \tag{15}$$
$$\max_{1 \leq j \leq r} P\left( \left| \frac{1}{n} \sum_{i=1}^{n} \big( I(X_i = a_j, Y_i = 0) - q\, q_j \big) \right| > g \right) \leq 2 \exp\left\{ -\frac{n g^2}{2 (\max(q\, q_{\max},\, 1 - q\, q_{\min}))^2} \right\}, \tag{16}$$
$$\max_{1 \leq j \leq r} P\big( |\hat{p}_{j,n} - p_j| > g \big) \leq 2 \exp\left\{ -\frac{n g^2 p^2}{2^{7} (\max(p, q))^2 p_{\max}^2} \right\} + 2 \exp\left\{ -\frac{n g^2 p^2}{8 (\max(p\, p_{\max},\, 1 - p\, p_{\min}))^2} \right\} + \exp\left\{ -\frac{n p^2 p_{\min}^2}{2 (\max(p\, p_{\max},\, 1 - p\, p_{\min}))^2} \right\} + \exp\left\{ -\frac{n p^2}{8 (\max(p, q))^2} \right\}, \tag{17}$$
$$\max_{1 \leq j \leq r} P\big( |\hat{q}_{j,n} - q_j| > g \big) \leq 2 \exp\left\{ -\frac{n g^2 q^2}{2^{7} (\max(p, q))^2 q_{\max}^2} \right\} + 2 \exp\left\{ -\frac{n g^2 q^2}{8 (\max(q\, q_{\max},\, 1 - q\, q_{\min}))^2} \right\} + \exp\left\{ -\frac{n q^2 q_{\min}^2}{2 (\max(q\, q_{\max},\, 1 - q\, q_{\min}))^2} \right\} + \exp\left\{ -\frac{n q^2}{8 (\max(p, q))^2} \right\}, \tag{18}$$
where $p_{\min} := \min_{1 \leq j \leq r} p_j$, $p_{\max} := \max_{1 \leq j \leq r} p_j$, $q_{\min} := \min_{1 \leq j \leq r} q_j$, and $q_{\max} := \max_{1 \leq j \leq r} q_j$.
Proof. 
The Hoeffding inequality proves (13). Indeed,
$$P\big( |\hat{p}_n - p| > g \big) = P\left( \left| \frac{1}{n} \sum_{i=1}^{n} \big( I(Y_i = 1) - p \big) \right| > g \right) \leq P\left( \sum_{i=1}^{n} \big( I(Y_i = 1) - p \big) > n g \right) + P\left( -\sum_{i=1}^{n} \big( I(Y_i = 1) - p \big) > n g \right) \leq 2 \exp\left\{ -\frac{n g^2}{2 (\max(p, q))^2} \right\}.$$
Inequalities (14)–(16) are obtained in the same way, and therefore we omit their proofs.
Let us prove (17). We have
$$
\begin{aligned}
P\big( |\hat{p}_{j,n} - p_j| > g \big)
&= P\left( \left| \frac{\frac{1}{n} \sum_{i=1}^{n} I(X_i = a_j, Y_i = 1)}{\frac{1}{n} \sum_{i=1}^{n} I(Y_i = 1)} \pm \frac{1}{n p} \sum_{i=1}^{n} I(X_i = a_j, Y_i = 1) - p_j \right| > g \right) \\
&\leq P\left( \left| \frac{\frac{1}{n} \sum_{i=1}^{n} I(X_i = a_j, Y_i = 1)}{\frac{1}{n} \sum_{i=1}^{n} I(Y_i = 1)} - \frac{1}{n p} \sum_{i=1}^{n} I(X_i = a_j, Y_i = 1) \right| > \frac{g}{2} \right) + P\left( \left| \frac{1}{n p} \sum_{i=1}^{n} I(X_i = a_j, Y_i = 1) - p_j \right| > \frac{g}{2} \right) \\
&=: P_1 + P_2.
\end{aligned} \tag{19}
$$
Denote
$$A := \left\{ \frac{1}{n} \sum_{i=1}^{n} I(X_i = a_j, Y_i = 1) > 2 p\, p_j \right\}, \qquad B := \left\{ \frac{1}{n} \sum_{i=1}^{n} I(Y_i = 1) < \frac{p}{2} \right\}.$$
Let us find an upper bound for the probability $P_1$. We have
$$
\begin{aligned}
P_1 &= P\left( \left| \frac{\big( p - \frac{1}{n} \sum_{i=1}^{n} I(Y_i = 1) \big)\, \frac{1}{n} \sum_{i=1}^{n} I(X_i = a_j, Y_i = 1)}{p\, \frac{1}{n} \sum_{i=1}^{n} I(Y_i = 1)} \right| > \frac{g}{2} \right) \\
&\leq P\left( \left| p - \frac{1}{n} \sum_{i=1}^{n} I(Y_i = 1) \right| > \frac{g\, p\, \frac{1}{n} \sum_{i=1}^{n} I(Y_i = 1)}{2\, \frac{1}{n} \sum_{i=1}^{n} I(X_i = a_j, Y_i = 1)},\ \bar{A},\ \bar{B} \right) + P(A) + P(B) \\
&\leq P\left( \left| \frac{1}{n} \sum_{i=1}^{n} \big( I(Y_i = 1) - p \big) \right| > \frac{g\, p}{8\, p_j} \right) + P(A) + P(B).
\end{aligned} \tag{20}
$$
Next, for any $1 \leq j \leq r$, inequality (13) provides the following upper bound for the first term on the right-hand side of the last inequality in (20):
$$P\left( \left| \frac{1}{n} \sum_{i=1}^{n} \big( I(Y_i = 1) - p \big) \right| > \frac{g\, p}{8\, p_j} \right) \leq 2 \exp\left\{ -\frac{n g^2 p^2}{2^{7} (\max(p, q))^2 \max_{1 \leq j \leq r} p_j^2} \right\}. \tag{21}$$
The upper bounds for the probabilities $P(A)$, $P(B)$, and $P_2$ are obtained by the Hoeffding inequality:
$$
\begin{aligned}
P(A) &= P\left( \sum_{i=1}^{n} \big( I(X_i = a_j, Y_i = 1) - p\, p_j \big) > n\, p\, p_j \right) \leq \exp\left\{ -\frac{n p^2 \min_{1 \leq j \leq r} p_j^2}{2 \max_{1 \leq j \leq r} (\max(p\, p_j,\, 1 - p\, p_j))^2} \right\}, \\
P(B) &= P\left( \sum_{i=1}^{n} \big( p - I(Y_i = 1) \big) > \frac{p\, n}{2} \right) \leq \exp\left\{ -\frac{n p^2}{8 (\max(p, q))^2} \right\}, \\
P_2 &\leq 2 \exp\left\{ -\frac{n g^2 p^2}{8 \max_{1 \leq j \leq r} (\max(p\, p_j,\, 1 - p\, p_j))^2} \right\}.
\end{aligned} \tag{22}
$$
Finally, inequalities (19)–(22) yield relation (17).
The proof of inequality (18) has a similar structure as the proof of (17). □
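Although not needed for the proofs, inequality (13) is easy to check numerically. The following simulation sketch (the values of $p$, $g$, $n$, and the number of trials are arbitrary) compares the empirical frequency of the event $|\hat{p}_n - p| > g$ with the right-hand side of (13).

```python
import numpy as np

rng = np.random.default_rng(1)
p, g, n, trials = 0.3, 0.05, 2000, 20_000
q = 1.0 - p

# Empirical frequency of |p_hat_n - p| > g over repeated samples of Y_1, ..., Y_n.
p_hat = rng.binomial(n, p, size=trials) / n
empirical = np.mean(np.abs(p_hat - p) > g)

# Right-hand side of inequality (13).
bound = 2.0 * np.exp(-n * g**2 / (2.0 * max(p, q)**2))
print(empirical, bound)   # the empirical frequency should not exceed the bound
```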
The next lemma provides an exponential upper bound that was used in the proof of Theorem 2.
Lemma 2. 
For any g > 0 , the following inequality holds
$$
\begin{aligned}
\max_{1 \leq j \leq r} P\left( \left| \ln \frac{\hat{p}_{j,n}\, q_j}{p_j\, \hat{q}_{j,n}} \right| > g \right)
&\leq 4 \exp\left\{ -\frac{n g^2 p^2 p_{\min}^2}{2^{11} (\max(p, q))^2 p_{\max}^2} \right\} + 4 \exp\left\{ -\frac{n g^2 p^2 p_{\min}^2}{2^{7} (\max(p\, p_{\max},\, 1 - p\, p_{\min}))^2} \right\} \\
&\quad + 2 \exp\left\{ -\frac{n p^2 p_{\min}^2}{2^{9} (\max(p, q))^2 p_{\max}^2} \right\} + 2 \exp\left\{ -\frac{n p^2 p_{\min}^2}{2^{5} (\max(p\, p_{\max},\, 1 - p\, p_{\min}))^2} \right\} \\
&\quad + 3 \exp\left\{ -\frac{n p^2 p_{\min}^2}{2 (\max(p\, p_{\max},\, 1 - p\, p_{\min}))^2} \right\} + 3 \exp\left\{ -\frac{n p^2}{8 (\max(p, q))^2} \right\} \\
&\quad + 4 \exp\left\{ -\frac{n g^2 q^2 q_{\min}^2}{2^{11} (\max(p, q))^2 q_{\max}^2} \right\} + 4 \exp\left\{ -\frac{n g^2 q^2 q_{\min}^2}{2^{7} (\max(q\, q_{\max},\, 1 - q\, q_{\min}))^2} \right\} \\
&\quad + 2 \exp\left\{ -\frac{n q^2 q_{\min}^2}{2^{9} (\max(p, q))^2 q_{\max}^2} \right\} + 2 \exp\left\{ -\frac{n q^2 q_{\min}^2}{2^{5} (\max(q\, q_{\max},\, 1 - q\, q_{\min}))^2} \right\} \\
&\quad + 3 \exp\left\{ -\frac{n q^2 q_{\min}^2}{2 (\max(q\, q_{\max},\, 1 - q\, q_{\min}))^2} \right\} + 3 \exp\left\{ -\frac{n q^2}{8 (\max(p, q))^2} \right\}.
\end{aligned} \tag{23}
$$
Proof. 
For any $g > 0$ and $1 \leq j \leq r$, we obtain
$$P\left( \left| \ln \frac{\hat{p}_{j,n}\, q_j}{p_j\, \hat{q}_{j,n}} \right| > g \right) = P\left( \ln \frac{\hat{p}_{j,n}\, q_j}{p_j\, \hat{q}_{j,n}} > g \right) + P\left( -\ln \frac{\hat{p}_{j,n}\, q_j}{p_j\, \hat{q}_{j,n}} > g \right) =: P_1 + P_2.$$
Let us bound $P_1$ from above. Using the inequality $e^g - 1 \geq g$ and Lemma 1, for any $1 \leq j \leq r$, we obtain
$$
\begin{aligned}
P_1 &= P\left( \frac{\hat{p}_{j,n}\, q_j}{p_j\, \hat{q}_{j,n}} > e^{g} \right) = P\left( \frac{\hat{p}_{j,n}\, q_j}{p_j\, \hat{q}_{j,n}} - 1 > e^{g} - 1 \right) \leq P\left( \frac{\hat{p}_{j,n}\, q_j}{p_j\, \hat{q}_{j,n}} - 1 > g \right) \\
&\leq P\left( \frac{\hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n}}{p_j\, \hat{q}_{j,n}} > g,\ \hat{q}_{j,n} \geq \frac{q_j}{2} \right) + P\left( \hat{q}_{j,n} < \frac{q_j}{2} \right) \\
&\leq P\left( \hat{p}_{j,n}\, q_j - p_j\, \hat{q}_{j,n} > \frac{g\, p_j q_j}{2} \right) + P\left( \hat{q}_{j,n} < \frac{q_j}{2} \right) \\
&= P\left( q_j (\hat{p}_{j,n} - p_j) - p_j (\hat{q}_{j,n} - q_j) > \frac{g\, p_j q_j}{2} \right) + P\left( \hat{q}_{j,n} < \frac{q_j}{2} \right) \\
&\leq P\left( q_j |\hat{p}_{j,n} - p_j| + p_j |\hat{q}_{j,n} - q_j| > \frac{g\, p_j q_j}{2} \right) + P\left( \hat{q}_{j,n} < \frac{q_j}{2} \right) \\
&\leq P\left( |\hat{p}_{j,n} - p_j| > \frac{g\, p_j}{4} \right) + P\left( |\hat{q}_{j,n} - q_j| > \frac{g\, q_j}{4} \right) + P\left( \hat{q}_{j,n} < \frac{q_j}{2} \right) \\
&= P\left( |\hat{p}_{j,n} - p_j| > \frac{g\, p_j}{4} \right) + P\left( |\hat{q}_{j,n} - q_j| > \frac{g\, q_j}{4} \right) + P\left( q_j - \hat{q}_{j,n} > \frac{q_j}{2} \right).
\end{aligned} \tag{24}
$$
Next, we apply the upper bounds (17) and (18) to the probabilities in the last line of the inequality above. Thus,
$$
\begin{aligned}
P_1 &\leq 2 \exp\left\{ -\frac{n g^2 p^2 p_{\min}^2}{2^{11} (\max(p, q))^2 p_{\max}^2} \right\} + 2 \exp\left\{ -\frac{n g^2 p^2 p_{\min}^2}{2^{7} (\max(p\, p_{\max},\, 1 - p\, p_{\min}))^2} \right\} + \exp\left\{ -\frac{n p^2 p_{\min}^2}{2 (\max(p\, p_{\max},\, 1 - p\, p_{\min}))^2} \right\} + \exp\left\{ -\frac{n p^2}{8 (\max(p, q))^2} \right\} \\
&\quad + 2 \exp\left\{ -\frac{n g^2 q^2 q_{\min}^2}{2^{11} (\max(p, q))^2 q_{\max}^2} \right\} + 2 \exp\left\{ -\frac{n g^2 q^2 q_{\min}^2}{2^{7} (\max(q\, q_{\max},\, 1 - q\, q_{\min}))^2} \right\} + 2 \exp\left\{ -\frac{n q^2 q_{\min}^2}{2^{9} (\max(p, q))^2 q_{\max}^2} \right\} \\
&\quad + 2 \exp\left\{ -\frac{n q^2 q_{\min}^2}{2^{5} (\max(q\, q_{\max},\, 1 - q\, q_{\min}))^2} \right\} + 2 \exp\left\{ -\frac{n q^2 q_{\min}^2}{2 (\max(q\, q_{\max},\, 1 - q\, q_{\min}))^2} \right\} + 2 \exp\left\{ -\frac{n q^2}{8 (\max(p, q))^2} \right\}.
\end{aligned} \tag{25}
$$
We obtain the upper bound for $P_2$ in the same way:
$$
\begin{aligned}
P_2 &\leq 2 \exp\left\{ -\frac{n g^2 p^2 p_{\min}^2}{2^{11} (\max(p, q))^2 p_{\max}^2} \right\} + 2 \exp\left\{ -\frac{n g^2 p^2 p_{\min}^2}{2^{7} (\max(p\, p_{\max},\, 1 - p\, p_{\min}))^2} \right\} + 2 \exp\left\{ -\frac{n p^2 p_{\min}^2}{2^{9} (\max(p, q))^2 p_{\max}^2} \right\} \\
&\quad + 2 \exp\left\{ -\frac{n p^2 p_{\min}^2}{2^{5} (\max(p\, p_{\max},\, 1 - p\, p_{\min}))^2} \right\} + 2 \exp\left\{ -\frac{n p^2 p_{\min}^2}{2 (\max(p\, p_{\max},\, 1 - p\, p_{\min}))^2} \right\} + 2 \exp\left\{ -\frac{n p^2}{8 (\max(p, q))^2} \right\} \\
&\quad + 2 \exp\left\{ -\frac{n g^2 q^2 q_{\min}^2}{2^{11} (\max(p, q))^2 q_{\max}^2} \right\} + 2 \exp\left\{ -\frac{n g^2 q^2 q_{\min}^2}{2^{7} (\max(q\, q_{\max},\, 1 - q\, q_{\min}))^2} \right\} + \exp\left\{ -\frac{n q^2 q_{\min}^2}{2 (\max(q\, q_{\max},\, 1 - q\, q_{\min}))^2} \right\} + \exp\left\{ -\frac{n q^2}{8 (\max(p, q))^2} \right\}.
\end{aligned} \tag{26}
$$
Inequalities (24)–(26) imply (23). □
Lemma 3. 
For any given constants $b_j, c_j$, $1 \leq j \leq r$, such that $\sum_{j=1}^{r} (|b_j| + |c_j|) > 0$, the following convergence takes place:
$$\sqrt{n} \sum_{j=1}^{r} \big[ (\hat{p}_{j,n} - p_j)\, b_j + (\hat{q}_{j,n} - q_j)\, c_j \big] \xrightarrow{d} \xi \sim N(0, \sigma^2), \quad \text{as } n \to \infty,$$
with
$$\sigma^2 := E\left( \sum_{j=1}^{r} \left[ \frac{1}{p} \Big( I(X = a_j, Y = 1) - p\, p_j - p_j \big( I(Y = 1) - p \big) \Big) b_j + \frac{1}{q} \Big( I(X = a_j, Y = 0) - q\, q_j - q_j \big( I(Y = 0) - q \big) \Big) c_j \right] \right)^2.$$
Proof. 
Denote
$$y_{j,n} := \frac{1}{\sqrt{n}\, p} \sum_{i=1}^{n} \big( I(X_i = a_j, Y_i = 1) - p\, p_j \big) - \frac{p_j}{\sqrt{n}\, p} \sum_{i=1}^{n} \big( I(Y_i = 1) - p \big), \qquad z_n := \frac{1}{n p} \sum_{i=1}^{n} \big( I(Y_i = 1) - p \big), \qquad v_{j,n} := \frac{1}{\sqrt{n}\, p} \sum_{i=1}^{n} \big( I(X_i = a_j, Y_i = 1) - p\, p_j \big).$$
We have
$$
\begin{aligned}
\big| \sqrt{n} (\hat{p}_{j,n} - p_j) - y_{j,n} \big|
&= \left| \frac{\frac{1}{\sqrt{n}} \sum_{i=1}^{n} I(X_i = a_j, Y_i = 1)}{p \left( 1 + \frac{1}{n p} \sum_{i=1}^{n} \big( I(Y_i = 1) - p \big) \right)} - \sqrt{n}\, p_j - y_{j,n} \right| \\
&= \left| \frac{1}{\sqrt{n}\, p} \sum_{i=1}^{n} I(X_i = a_j, Y_i = 1) \left( 1 - z_n + \sum_{k=2}^{\infty} (-z_n)^k \right) - \sqrt{n}\, p_j - y_{j,n} \right| I\left( |z_n| \leq \frac12 \right) + \big| \sqrt{n} (\hat{p}_{j,n} - p_j) - y_{j,n} \big|\, I\left( |z_n| > \frac12 \right) \\
&= \left| (v_{j,n} + \sqrt{n}\, p_j) \left( 1 - z_n + \sum_{k=2}^{\infty} (-z_n)^k \right) - \sqrt{n}\, p_j - y_{j,n} \right| I\left( |z_n| \leq \frac12 \right) + \big| \sqrt{n} (\hat{p}_{j,n} - p_j) - y_{j,n} \big|\, I\left( |z_n| > \frac12 \right) \\
&= \left| - v_{j,n} z_n + \frac{1}{\sqrt{n}\, p} \sum_{i=1}^{n} I(X_i = a_j, Y_i = 1) \sum_{k=2}^{\infty} (-z_n)^k \right| I\left( |z_n| \leq \frac12 \right) + \big| \sqrt{n} (\hat{p}_{j,n} - p_j) - y_{j,n} \big|\, I\left( |z_n| > \frac12 \right) \\
&\leq |v_{j,n} z_n|\, I\left( |z_n| \leq \frac12 \right) + \frac{2}{\sqrt{n}\, p} \sum_{i=1}^{n} I(X_i = a_j, Y_i = 1)\, z_n^2\, I\left( |z_n| \leq \frac12 \right) + \big| \sqrt{n} (\hat{p}_{j,n} - p_j) - y_{j,n} \big|\, I\left( |z_n| > \frac12 \right) \quad \text{a.s.}
\end{aligned} \tag{27}
$$
Utilizing (27), for any $1 \leq j \leq r$ and $\varepsilon > 0$, we obtain
$$
\begin{aligned}
P\big( | \sqrt{n} (\hat{p}_{j,n} - p_j) - y_{j,n} | > \varepsilon \big)
&\leq P\left( |v_{j,n} z_n|\, I\left( |z_n| \leq \tfrac12 \right) > \frac{\varepsilon}{3} \right) + P\left( \frac{2}{\sqrt{n}\, p} \sum_{i=1}^{n} I(X_i = a_j, Y_i = 1)\, z_n^2\, I\left( |z_n| \leq \tfrac12 \right) > \frac{\varepsilon}{3} \right) \\
&\quad + P\left( \big| \sqrt{n} (\hat{p}_{j,n} - p_j) - y_{j,n} \big|\, I\left( |z_n| > \tfrac12 \right) > \frac{\varepsilon}{3} \right) \\
&\leq P\left( |v_{j,n} z_n| > \frac{\varepsilon}{3} \right) + P\left( \frac{2}{\sqrt{n}\, p} \sum_{i=1}^{n} I(X_i = a_j, Y_i = 1)\, z_n^2 > \frac{\varepsilon}{3} \right) + P\left( |z_n| > \frac12 \right) =: P_1 + P_2 + P_3.
\end{aligned} \tag{28}
$$
We obtain the upper bound for $P_1$ using the inequalities from Lemma 1. Indeed, inequalities (13) and (15) imply that for any $\varepsilon > 0$, there exists $C > 0$ such that
$$
\begin{aligned}
P_1 &= P\left( \left| \frac{1}{n} \sum_{i=1}^{n} \big( I(X_i = a_j, Y_i = 1) - p\, p_j \big) \right| \cdot \left| \frac{1}{n} \sum_{i=1}^{n} \big( I(Y_i = 1) - p \big) \right| > \frac{p^2 \varepsilon}{3 \sqrt{n}} \right) \\
&\leq P\left( \left| \frac{1}{n} \sum_{i=1}^{n} \big( I(X_i = a_j, Y_i = 1) - p\, p_j \big) \right| > \frac{p \sqrt{\varepsilon/3}}{n^{1/4}} \right) + P\left( \left| \frac{1}{n} \sum_{i=1}^{n} \big( I(Y_i = 1) - p \big) \right| > \frac{p \sqrt{\varepsilon/3}}{n^{1/4}} \right) \leq 2 \exp\{ -C \sqrt{n} \}.
\end{aligned} \tag{29}
$$
We obtain the upper bound for the probability $P_2$ in the same way. Indeed, inequalities (13) and (15) imply that for any $\varepsilon > 0$, there exists $C > 0$ such that
$$
\begin{aligned}
P_2 &= P\left( \frac{1}{n} \sum_{i=1}^{n} I(X_i = a_j, Y_i = 1) \left( \frac{1}{n} \sum_{i=1}^{n} \big( I(Y_i = 1) - p \big) \right)^2 > \frac{p^3 \varepsilon}{6 \sqrt{n}} \right) \\
&\leq P\left( \left( \frac{1}{n} \sum_{i=1}^{n} \big( I(Y_i = 1) - p \big) \right)^2 > \frac{p^2 \varepsilon}{3 \sqrt{n}\, p_j},\ \frac{1}{n} \sum_{i=1}^{n} I(X_i = a_j, Y_i = 1) \geq \frac{p\, p_j}{2} \right) + P\left( \frac{1}{n} \sum_{i=1}^{n} I(X_i = a_j, Y_i = 1) < \frac{p\, p_j}{2} \right) \\
&\leq P\left( \left| \frac{1}{n} \sum_{i=1}^{n} \big( I(Y_i = 1) - p \big) \right| > \frac{p \sqrt{\varepsilon / (3 p_j)}}{n^{1/4}} \right) + P\left( \frac{1}{n} \sum_{i=1}^{n} \big( p\, p_j - I(X_i = a_j, Y_i = 1) \big) > \frac{p\, p_j}{2} \right) \leq \exp\{ -C \sqrt{n} \} + \exp\{ -C \sqrt{n} \}.
\end{aligned} \tag{30}
$$
Finally, the bound for the probability $P_3$ is provided by inequality (13):
$$P_3 = P\left( \left| \frac{1}{n p} \sum_{i=1}^{n} \big( I(Y_i = 1) - p \big) \right| > \frac12 \right) \leq \exp\{ -C n \}, \tag{31}$$
for some C > 0 . Relations (28)–(31) imply that for any ε > 0 , there exists C > 0 such that
$$\max_{1 \leq j \leq r} P\big( | \sqrt{n} (\hat{p}_{j,n} - p_j) - y_{j,n} | > \varepsilon \big) \leq 4 \exp\{ -C \sqrt{n} \}. \tag{32}$$
Denote
$$\tilde{y}_{j,n} := \frac{1}{\sqrt{n}\, q} \sum_{i=1}^{n} \big( I(X_i = a_j, Y_i = 0) - q\, q_j \big) - \frac{q_j}{\sqrt{n}\, q} \sum_{i=1}^{n} \big( I(Y_i = 0) - q \big).$$
In a completely similar manner to the above, for any ε > 0 and some C > 0 , we obtain
$$\max_{1 \leq j \leq r} P\big( | \sqrt{n} (\hat{q}_{j,n} - q_j) - \tilde{y}_{j,n} | > \varepsilon \big) \leq 4 \exp\{ -C \sqrt{n} \}. \tag{33}$$
It is easy to see that
$$\sqrt{n} \sum_{j=1}^{r} \big[ (\hat{p}_{j,n} - p_j)\, b_j + (\hat{q}_{j,n} - q_j)\, c_j \big] = \sum_{j=1}^{r} \big[ y_{j,n}\, b_j + \tilde{y}_{j,n}\, c_j \big] + \sum_{j=1}^{r} \big( \sqrt{n} (\hat{p}_{j,n} - p_j) - y_{j,n} \big) b_j + \sum_{j=1}^{r} \big( \sqrt{n} (\hat{q}_{j,n} - q_j) - \tilde{y}_{j,n} \big) c_j =: \xi_n + \phi_n + \zeta_n.$$
By (32) and (33), for any ε > 0 , we obtain
$$\lim_{n \to \infty} P\big( |\phi_n + \zeta_n| > \varepsilon \big) = 0.$$
Therefore, by Slutsky’s theorem, the weak limit of the original sequence coincides with that of the sequence $\xi_n$. It remains to note that
$$\xi_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \sum_{j=1}^{r} \left[ \frac{1}{p} \Big( I(X_i = a_j, Y_i = 1) - p\, p_j - p_j \big( I(Y_i = 1) - p \big) \Big) b_j + \frac{1}{q} \Big( I(X_i = a_j, Y_i = 0) - q\, q_j - q_j \big( I(Y_i = 0) - q \big) \Big) c_j \right]$$
and to apply the central limit theorem. □

4. Conclusions

In this paper, we studied the symmetrized version of the KL divergence, $D_{KL}^{sym}(\mathbf{p}\,\|\,\mathbf{q}) := D_{KL}(\mathbf{p}\,\|\,\mathbf{q}) + D_{KL}(\mathbf{q}\,\|\,\mathbf{p})$. Today, it is known as the Jeffreys divergence and is popular in classification problems. In this context, the distributions $\mathbf{p}$ and $\mathbf{q}$ represent the conditional probability distributions of a characteristic of interest under two different classes or conditions.
We established the asymptotic unbiasedness and normality of the plug-in estimator for the Jeffreys divergence. We considered a one-sample estimator, where, for a given sample size n, the number of observations in one condition is random and follows a binomial distribution. This differs from the traditional approach, where the properties of estimators are studied as the given sample sizes n and m of the two classes increase (see [38]). The results were expected, but additional technical work was required due to the randomness of the number of observations in one class. Moreover, we did not find detailed proofs for the Jeffreys divergence in the literature.
In this paper, we avoided referencing some known methods for proving normality, such as the δ -method, and provided detailed proofs instead. We believe that such proofs are accessible to undergraduate students.

Author Contributions

Conceptualization, V.G., A.L., A.Y., H.R.; methodology, A.L., A.Y., H.R.; writing—original draft preparation, A.L., A.Y., H.R.; writing—review and editing, V.G., O.L., L.S.; project administration, L.S. All authors have read and agreed to the published version of the manuscript.

Funding

V. Glinskiy and A. Logachov thank RSCF for financial support via the grant 24-28-01047; A. Yambartsev thanks FAPESP for financial support via the grant 2023/13453-5.

Data Availability Statement

The data that support the findings of this study are available from the corresponding authors upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  2. Basseville, M. Divergence measures for statistical data processing—An annotated bibliography. Signal Process. 2013, 93, 621–633. [Google Scholar] [CrossRef]
  3. Rubenstein, P.; Bousquet, O.; Djolonga, J.; Riquelme, C.; Tolstikhin, I.O. Practical and consistent estimation of f-divergences. Adv. Neural Inf. Process. Syst. 2019, 32, 4072–4082. [Google Scholar]
  4. Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877. [Google Scholar] [CrossRef]
  5. Zhang, C.; Bütepage, J.; Kjellström, H.; Mandt, S. Advances in variational inference. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2008–2026. [Google Scholar] [CrossRef]
  6. Tzikas, D.G.; Likas, A.C.; Galatsanos, N.P. The variational approximation for Bayesian inference. IEEE Signal Process. Mag. 2008, 25, 131–146. [Google Scholar] [CrossRef]
  7. Jewson, J.; Smith, J.Q.; Holmes, C. Principles of Bayesian inference using general divergence criteria. Entropy 2018, 20, 442. [Google Scholar] [CrossRef]
  8. Ji, S.; Zhang, Z.; Ying, S.; Wang, L.; Zhao, X.; Gao, Y. Kullback–Leibler divergence metric learning. IEEE Trans. Cybern. 2020, 52, 2047–2058. [Google Scholar] [CrossRef]
  9. Noh, Y.K.; Sugiyama, M.; Liu, S.; Plessis, M.C.; Park, F.C.; Lee, D.D. Bias reduction and metric learning for nearest-neighbor estimation of Kullback-Leibler divergence. Artif. Intell. Stat. 2014, 1, 669–677. [Google Scholar] [CrossRef]
  10. Suárez, J.L.; García, S.; Herrera, F. A tutorial on distance metric learning: Mathematical foundations, algorithms, experimental analysis, prospects and challenges. Neurocomputing 2021, 425, 300–322. [Google Scholar] [CrossRef]
  11. Claici, S.; Yurochkin, M.; Ghosh, S.; Solomon, J. Model fusion with Kullback-Leibler divergence. Int. Conf. Mach. Learn. 2020, 1, 2038–2047. [Google Scholar]
  12. Póczos, B.; Xiong, L.; Schneider, J. Nonparametric divergence estimation with applications to machine learning on distributions. arXiv 2012, arXiv:1202.3758. [Google Scholar]
  13. Cui, S.; Luo, C. Feature-based non-parametric estimation of Kullback–Leibler divergence for SAR image change detection. Remote. Sens. Lett. 2016, 7, 1102–1111. [Google Scholar] [CrossRef]
  14. Deledalle, C.A. Estimation of Kullback-Leibler losses for noisy recovery problems within the exponential family. Electron. J. Stat. 2017, 11, 3141–3164. [Google Scholar] [CrossRef]
  15. Granero-Belinchón, C.; Roux, S.G.; Garnier, N.B. Kullback-Leibler divergence measure of intermittency: Application to turbulence. Phys. Rev. E 2018, 97, 013107. [Google Scholar] [CrossRef]
  16. Charzyńska, A.; Gambin, A. Improvement of the k-NN entropy estimator with applications in systems biology. Entropy 2015, 18, 13. [Google Scholar] [CrossRef]
  17. Belavkin, R.V. Asymmetric topologies on statistical manifolds. Int. Conf. Geom. Sci. Inf. 2015, 1, 203–210. [Google Scholar]
  18. Jager, L.; Wellner, J.A. Goodness-of-fit tests via phi-divergences. Ann. Statist. 2007, 35, 2018–2053. [Google Scholar] [CrossRef]
  19. Vexler, A.; Gurevich, G. Empirical likelihood ratios applied to goodness-of-fit tests based on sample entropy. Comput. Stat. Data Anal. 2010, 54, 531–545. [Google Scholar] [CrossRef]
  20. Evren, A.; Tuna, E. On some properties of goodness of fit measures based on statistical entropy. Int. J. Res. Rev. Appl. Sci. 2012, 13, 192–205. [Google Scholar]
  21. Bulinski, A.; Dimitrov, D. Statistical estimation of the Kullback–Leibler divergence. Mathematics 2021, 9, 544. [Google Scholar] [CrossRef]
  22. Broniatowski, M. Estimation of the Kullback-Leibler divergence. Math. Methods Stat. 2003, 12, 391–409. [Google Scholar]
  23. Seghouane, A.K.; Amari, S.I. Variants of the Kullback-Leibler divergence and their role in model selection. Ifac Proc. Vol. 2006, 39, 826–831. [Google Scholar] [CrossRef]
  24. Audenaert, K.M. On the asymmetry of the relative entropy. J. Math. Phys. 2013, 54, 073506. [Google Scholar] [CrossRef]
  25. Pinski, F.J.; Simpson, G.; Stuart, A.M.; Weber, H. Kullback–Leibler approximation for probability measures on infinite dimensional spaces. SIAM J. Math. Anal. 2015, 27, 4091–4122. [Google Scholar] [CrossRef]
  26. Zeng, J.; Xiao, F. A fractal belief KL divergence for decision fusion. Eng. Appl. Artif. Intell. 2023, 121, 106027. [Google Scholar] [CrossRef]
  27. Kamiński, M. On the Symmetry Importance in a Relative Entropy Analysis for Some Engineering Problems. Symmetry 2022, 14, 1945. [Google Scholar] [CrossRef]
  28. Johnson, D.H.; Sinanovic, S. Symmetrizing the Kullback-Leibler distance. IEEE Trans. Inf. Theory 2001, 1, 1–10. [Google Scholar]
  29. Jeffreys, H. The Theory of Probability; Oxford Classic Texts in the Physical Sciences: Oxford, UK, 1998. [Google Scholar]
  30. Chen, J.; Matzinger, H.; Zhai, H.; Zhou, M. Centroid estimation based on symmetric KL divergence for multinomial text classification problem. In Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 17–20 December 2018; pp. 1174–1177. [Google Scholar]
  31. Andriamanalimanana, B.; Tekeoglu, A.; Bekiroglu, K.; Sengupta, S.; Chiang, C.F.; Reale, M.; Novillo, J. Symmetric Kullback-Leibler divergence of softmaxed distributions for anomaly scores. In Proceedings of the 2019 IEEE Conference on Communications and Network Security (CNS), Washington, DC, USA, 10–12 June 2019; Volume 1, pp. 1–6. [Google Scholar]
  32. Domke, J. An easy to interpret diagnostic for approximate inference: Symmetric divergence over simulations. arXiv 2021, arXiv:2103.01030. [Google Scholar]
  33. Nguyen, B.; Morell, C.; De Baets, B. Supervised distance metric learning through maximization of the Jeffrey divergence. Pattern Recognit. 2017, 64, 215–225. [Google Scholar] [CrossRef]
  34. Moreno, P.; Ho, P.; Vasconcelos, N. A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications. Adv. Neural Inf. Process. Syst. 2003, 6. [Google Scholar]
  35. Yao, Z.; Lai, Z.; Liu, W. A symmetric KL divergence based spatiogram similarity measure. In Proceedings of the 2011 18th IEEE International Conference on Image Processing, Brussels, Belgium, 11–14 September 2011; Volume 1, pp. 193–196. [Google Scholar]
  36. Said, A.B.; Hadjidj, R.; Foufou, S. Cluster validity index based on Jeffrey divergence. Pattern Anal. Appl. 2017, 20, 21–31. [Google Scholar] [CrossRef]
  37. Antos, A.; Kontoyiannis, I. Convergence properties of functional estimates for discrete distributions. Random Struct. Algorithms 2001, 19, 163–193. [Google Scholar] [CrossRef]
  38. Zhang, Z.; Grabchak, M. Nonparametric estimation of Küllback-Leibler divergence. Neural Comput. 2014, 26, 2570–2593. [Google Scholar] [CrossRef]
  39. Jiao, J.; Venkat, K.; Han, Y.; Weissman, T. Minimax estimation of functionals of discrete distributions. IEEE Trans. Inf. Theory 2015, 61, 2835–2885. [Google Scholar] [CrossRef]
  40. Jiao, J.; Venkat, K.; Han, Y.; Weissman, T. Maximum likelihood estimation of functionals of discrete distributions. IEEE Trans. Inf. Theory 2017, 63, 6774–6798. [Google Scholar] [CrossRef]
  41. Bulinski, A.; Dimitrov, D. Divergence Measures Estimation and Its Asymptotic Normality Theory in the discrete case. Eur. J. Pure Appl. Math. 2019, 12, 790–820. [Google Scholar]
  42. Yao, L.Q.; Liu, S.H. Symmetric KL-divergence by Stein’s Method. arXiv 2024, arXiv:2401.11381. [Google Scholar]
  43. Bobkov, S.G.; Chistyakov, G.P.; Götze, F. Rényi divergence and the central limit theorem. Ann. Probab. 2019, 47, 270–323. [Google Scholar] [CrossRef]
  44. Abramowitz, M.; Stegun, I.A. (Eds.) Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th ed.; Dover: New York, NY, USA, 1972. [Google Scholar]
  45. Rojas, H.; Logachov, A.; Yambartsev, A. Order Book Dynamics with Liquidity Fluctuations: Asymptotic Analysis of Highly Competitive Regime. Mathematics 2023, 11, 4235. [Google Scholar] [CrossRef]
  46. Logachov, A.; Logachova, O.; Yambartsev, A. Processes with catastrophes: Large deviation point of view. Stoch. Process. Their Appl. 2024, 176, 104447. [Google Scholar] [CrossRef]