Abstract
Information-theoretic divergences are widely used in information theory, statistics, and other application areas. To meet the requirement of metric properties, we introduce a class of new bounded metrics based on the triangular discrimination. Moreover, we obtain some sharp inequalities for the triangular discrimination and other information-theoretic divergences. Their asymptotic approximation properties are also discussed.
1. Introduction
In many applications such as pattern recognition, machine learning, statistics, optimization and other applied branches of mathematics, it is beneficial to use information-theoretic divergences rather than the squared Euclidean distance to estimate the (dis)similarity of two probability distributions or positive arrays [,,,,,,,,]. Among them, the Kullback–Leibler divergence (relative entropy), the triangular discrimination, the variation distance, the Hellinger distance, the Jensen–Shannon divergence, the symmetric Chi-square divergence, the J-divergence and other important measures often play a critical role. Unfortunately, most of these divergences do not satisfy the metric properties and are unbounded []. As we know, metric properties are preconditions for numerous convergence properties of iterative algorithms []. Moreover, boundedness is also of great concern in numerical computations and simulations. In [], Endres and Schindelin proved that the square root of twice the Jensen–Shannon divergence is a metric. The triangular discrimination, presented by Topsøe in [], is a non-logarithmic measure and is simple to compute. Inspired by [], we study the triangular discrimination. The main result of this paper is the introduction of a class of new metrics derived from the triangular discrimination. Finally, some new relationships among the triangular discrimination, the Jensen–Shannon divergence, the square of the Hellinger distance and the variation distance are also obtained.
2. Definition and Auxiliary Results
Definition 1. Let
$$\Gamma_n = \Big\{ P = (p_1, p_2, \ldots, p_n) \;\Big|\; p_i \geq 0,\ \sum_{i=1}^{n} p_i = 1 \Big\}, \quad n \geq 2,$$
be the set of all complete finite discrete probability distributions. For all $P, Q \in \Gamma_n$, the triangular discrimination is defined by
$$\Delta(P, Q) = \sum_{i=1}^{n} \frac{(p_i - q_i)^2}{p_i + q_i}.$$
In the above definition, we use the convention, based on the limit property, that $\frac{0}{0} = 0$.
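As a quick computational illustration, the following minimal Python sketch evaluates the triangular discrimination under the standard definition above; the helper name `triangular_discrimination` is purely illustrative.

```python
# A minimal sketch of the triangular discrimination, assuming the standard
# definition Delta(P, Q) = sum_i (p_i - q_i)^2 / (p_i + q_i) with 0/0 := 0.
def triangular_discrimination(P, Q):
    total = 0.0
    for p, q in zip(P, Q):
        if p + q > 0:  # convention: a term with p_i = q_i = 0 contributes 0
            total += (p - q) ** 2 / (p + q)
    return total

# Example on three outcomes.
print(triangular_discrimination([0.5, 0.5, 0.0], [0.25, 0.25, 0.5]))  # ~0.6667
```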
The triangular discrimination is obviously symmetric, nonnegative and vanishes for , but it does not fulfill the triangle inequality. In view of the foregoing, the concept of triangular discrimination should be generalized. If , we study the function:
where .
In the following, the power of the summand in  with all  is discussed.
Definition 2. Let the function be defined by
It is easy to see that  and . For all , the issue of whether  satisfies the triangle inequality is considered in the following.
Lemma 1. If the function is defined by
with , then
Proof. As
we can get
Lemma 2. If the function is defined by  with , then h is monotonically increasing in  and monotonically decreasing in .
Proof. A straightforward computation of the derivative shows
in and in . Thus the lemma holds.
Assuming , we introduce the function  defined by
Lemma 3. The function has two minima, one at and the other at .
Proof. The derivative of the function is
So  for  and  for . This shows that  is monotonically decreasing in  and monotonically increasing in .
Next consider the monotonicity of in the open interval .
From Lemma 3, we have
From Lemma 2, we have
Using (5) and (6),
Let
then
Equality holds if and only if . Hence, with respect to the variable r in the open interval ,  and  are both monotonically decreasing, so  is also monotonically decreasing. Using (4),
this shows . Hence  has only one zero in the open interval  with respect to the variable r. As a consequence,  has only one zero in the open interval  with respect to the variable r. This means  in the interval , and  in the interval . From the above we know  has only one maximum and no minimum in the open interval .
As a result, the conclusion in the lemma is obtained. ☐
Theorem 1. Let , then
Proof. If , then . The triangle inequality (7) obviously holds.
If  and one of  is equal to 0, it is easy to verify that (7) holds.
Next we assume  without loss of generality. Note that the following identity holds:
From Lemma 3 the triangle inequality (7) can be easily proved for any number . ☐
Corollary 1. Let . If , then
Proof. Let  and ; then , which follows from the concavity of . Now a γ satisfying  can be found. Thus, from Theorem 1,
This is the triangle inequality (8) for the function . ☐
Theorem 2. Let . If , then the triangle inequality (8) does not hold.
Proof. Assuming , let . First, the following identity holds:
The derivative of the function l is
When , let
Using l’Hôpital’s rule,
So
According to the definition of the derivative, there exists a  such that for any ,
This shows the triangle inequality (8) does not hold. ☐
Summing up the theorems and the corollary above, we obtain the main theorem:
Theorem 3. The function satisfies the triangle inequality (8) if and only if .
3. Metric Properties of
In this section, we mainly prove the following theorem:
Theorem 4. The function is a metric on the space if and only if .
Proof. From (2) we can get . It is easy to see that  with equality only for  and . So the question is whether the triangle inequality
holds for any .
When , , the triangle inequality (9) holds trivially. So we assume  in the following.
Next we consider the value of α in two cases:
(i) :
From Theorem 3, the inequality holds. Applying Minkowski’s inequality we have
So the triangle inequality (9) holds.
(ii) :
Let
where
Then .
Next we prove that  and  are not extreme points of the function . By symmetry, we only need to prove that  is not an extreme point.
By partial derivative,
Since , we may assume without loss of generality that  and .
Then substituting (11) and (12) into (10), we have
Therefore,  is not an extreme point of the function . For the same reason,  is also not an extreme point.
By the definition of an extreme point, there exists a point  such that . Since , we have . This inequality is inconsistent with the triangle inequality (9).
From what has been discussed above, the conclusion of the theorem is obtained. ☐
The generalization of this result to continuous probability distributions is straightforward. Consider a measurable space , and let P, Q be probability distributions with Radon–Nikodym densities  with respect to a dominating σ-finite measure μ. Then
is a metric if and only if .
Next we discuss the maximum and minimum of . It is obvious that  is the minimum, attained if and only if . Because  can be rewritten in the form
it attains the maximum 2 when  are two distinct deterministic distributions, namely . The metric  then achieves its maximum value .
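These extreme values can be checked numerically; the sketch below is illustrative only and assumes the standard form of the triangular discrimination from Definition 1.

```python
# Numerical check of the extremes stated above: the triangular discrimination
# vanishes for identical distributions and attains its maximum 2 for two
# distinct deterministic distributions.
def triangular_discrimination(P, Q):
    return sum((p - q) ** 2 / (p + q) for p, q in zip(P, Q) if p + q > 0)

print(triangular_discrimination([1.0, 0.0], [1.0, 0.0]))  # 0.0
print(triangular_discrimination([1.0, 0.0], [0.0, 1.0]))  # 2.0
```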
4. Some Inequalities among the Information-Theoretic Divergences
Definition 3. For all $P, Q \in \Gamma_n$, the Jensen–Shannon divergence is defined by
$$JS(P, Q) = \frac{1}{2} \sum_{i=1}^{n} \left[ p_i \ln \frac{2 p_i}{p_i + q_i} + q_i \ln \frac{2 q_i}{p_i + q_i} \right].$$
The square of the Hellinger distance is defined by
$$h^2(P, Q) = \frac{1}{2} \sum_{i=1}^{n} \left( \sqrt{p_i} - \sqrt{q_i} \right)^2.$$
The variation distance is defined by
$$V(P, Q) = \sum_{i=1}^{n} \left| p_i - q_i \right|.$$
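For concreteness, the sketch below implements the three measures of Definition 3 under the normalizations written above, which follow the conventions common in this literature; constant factors may differ from the authors' original displays, and the function names are illustrative.

```python
import math

def jensen_shannon(P, Q):
    """JS(P, Q) = 1/2 * sum_i [p_i ln(2p_i/(p_i+q_i)) + q_i ln(2q_i/(p_i+q_i))]."""
    js = 0.0
    for p, q in zip(P, Q):
        m = (p + q) / 2
        if p > 0:
            js += 0.5 * p * math.log(p / m)
        if q > 0:
            js += 0.5 * q * math.log(q / m)
    return js

def hellinger_squared(P, Q):
    """h^2(P, Q) = 1/2 * sum_i (sqrt(p_i) - sqrt(q_i))^2."""
    return 0.5 * sum((math.sqrt(p) - math.sqrt(q)) ** 2 for p, q in zip(P, Q))

def variation_distance(P, Q):
    """V(P, Q) = sum_i |p_i - q_i|."""
    return sum(abs(p - q) for p, q in zip(P, Q))
```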
Next we introduce Csiszár’s f-divergence [].
Definition 4. Let $f : (0, \infty) \rightarrow \mathbb{R}$ be a convex function satisfying $f(1) = 0$. The f-divergence measure introduced by Csiszár is defined as
$$C_f(P, Q) = \sum_{i=1}^{n} q_i\, f\!\left( \frac{p_i}{q_i} \right) \qquad (15)$$
for all $P, Q \in \Gamma_n$.
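A generic sketch of this definition, restricted to strictly positive $q_i$ for simplicity (the zero-probability cases are covered by the usual limiting conventions), could look as follows; the name `f_divergence` is illustrative.

```python
# Sketch of Csiszar's f-divergence C_f(P, Q) = sum_i q_i * f(p_i / q_i),
# assuming all q_i > 0; zero-probability terms would need the limiting
# conventions used in the text.
def f_divergence(f, P, Q):
    return sum(q * f(p / q) for p, q in zip(P, Q) if q > 0)
```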
The triangular discrimination, the Jensen–Shannon divergence, the square of the Hellinger distance and the variation distance are all f-divergences.
Example 1. (Triangular Discrimination) Let us consider
in (15). Then we can verify is convex because , , and .
Example 2. (Jensen–Shannon divergence) Let us consider
in (15). Then we can verify is convex because , and . By standard inequality , holds.
Example 3. (Square of Hellinger distance) Let us consider
in (15). Then we can verify is convex because , , and .
Example 4. (Variation distance) Let us consider
in (15). Then we can easily get is convex, , and .
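The generating functions of Examples 1 and 4 can be verified numerically. The sketch below assumes the standard generators $f(u) = (u-1)^2/(u+1)$ for the triangular discrimination and $f(u) = |u-1|$ for the variation distance.

```python
# Check numerically that f(u) = (u-1)^2/(u+1) generates the triangular
# discrimination and f(u) = |u-1| generates the variation distance.
def f_divergence(f, P, Q):
    return sum(q * f(p / q) for p, q in zip(P, Q) if q > 0)

P, Q = [0.2, 0.3, 0.5], [0.4, 0.4, 0.2]

delta_via_f = f_divergence(lambda u: (u - 1) ** 2 / (u + 1), P, Q)
delta_direct = sum((p - q) ** 2 / (p + q) for p, q in zip(P, Q))
print(abs(delta_via_f - delta_direct) < 1e-12)  # True

v_via_f = f_divergence(lambda u: abs(u - 1), P, Q)
v_direct = sum(abs(p - q) for p, q in zip(P, Q))
print(abs(v_via_f - v_direct) < 1e-12)  # True
```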
Theorem 5. Let  be two nonnegative generating functions and suppose there exist real constants  such that  and if  then
if , then . We have the inequalities:
Proof. The conditions can be rewritten as . So from the formula (15),
and
☐
We have shown that , , ,  are all nonnegative. In the following we derive some inequalities.
Theorem 6.
Proof. When , neither  nor  is equal to 0. We consider the function:
The derivative of the function is
Let
A straightforward computation of the derivative shows
So  is a concave function when  and . This means  attains its maximum 0 at the point . Accordingly,  when . From (16), we find
and
Using l’Hôpital’s rule (differentiating twice),
Using l’Hôpital’s rule (differentiating once),
Thus
When , . As a consequence of Theorem 5, we obtain the result
Thus the theorem is proved. ☐
Theorem 7.
Proof. When , neither  nor  is equal to 0. We consider the function:
The derivative of the function is
By the standard inequality ,
So
and
Using l’Hôpital’s rule (differentiating twice),
Using l’Hôpital’s rule (differentiating once),
Thus
or
When , . As a consequence of Theorem 5, we obtain the result
Thus the theorem is proved. ☐
Theorem 8.
Proof. When , neither  nor  is equal to 0. We consider the function:
When , . As a consequence of Theorem 5, we obtain the result . This means . Next,
Thus the theorem is proved. ☐
From the above theorems, inequalities among these measures are given by
These inequalities are sharper than the inequalities in [] Theorem 2 and [] (Section 3.1).
5. Asymptotic Approximation
Definition 5. For all $P, Q \in \Gamma_n$, the Chi-square divergence is defined by
$$\chi^2(P, Q) = \sum_{i=1}^{n} \frac{(p_i - q_i)^2}{q_i}.$$
In [],
In this section, we discuss the asymptotic approximation of  and  when  in the  norm.
Theorem 9. If , then
Proof. From the Taylor series expansion at q, we have
Hence
☐
Equivalently,  when . So in some cases, one information-theoretic divergence can be substituted for another. The asymptotic property also explains the boundedness of the triangular discrimination and, in turn, of the new metrics.
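The approximation can also be observed numerically. The sketch below assumes that the intended relation is the standard second-order approximation $\Delta(P, Q) \approx \tfrac{1}{2}\chi^2(P, Q)$ as $P \to Q$, which follows from the Taylor argument above; the helper names are illustrative.

```python
# Numerical illustration of the asymptotic behaviour: the ratio of the
# triangular discrimination to half the Chi-square divergence tends to 1
# as P approaches Q (assumed standard second-order approximation).
def triangular_discrimination(P, Q):
    return sum((p - q) ** 2 / (p + q) for p, q in zip(P, Q) if p + q > 0)

def chi_square(P, Q):
    return sum((p - q) ** 2 / q for p, q in zip(P, Q) if q > 0)

Q = [0.2, 0.3, 0.5]
for eps in (0.1, 0.01, 0.001):
    P = [0.2 + eps, 0.3 - eps, 0.5]  # small perturbation of Q, still a distribution
    ratio = triangular_discrimination(P, Q) / (chi_square(P, Q) / 2)
    print(eps, ratio)  # ratio tends to 1 as eps -> 0
```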
Acknowledgments
The authors would like to thank the editor and referees for their helpful suggestions and comments on the manuscript. This work was supported by the China Postdoctoral Science Foundation (2015M571255), the National Science Foundation of China (NSFC) Grant No. 71171119, the Fundamental Research Funds for the Central Universities (FRF-CU) Grant No. 2722013JC082, and the Fundamental Research Funds for the Central Universities under Grant No. NKZXTD1403.
Author Contributions
Wrote the paper: Guoxiang Lu and Bingqing Li. Both authors have read and approved the final manuscript.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Basseville, M. Divergence measures for statistical data processing—An annotated bibliography. Signal Process. 2013, 93, 621–633.
- Csiszár, I.; Shields, P.C. Information theory and statistics: A tutorial. Found. Trends Commun. Inf. Theory 2004, 1, 417–528.
- Dragomir, S.S.; Gluščević, V. Some inequalities for the Kullback–Leibler and χ2-distances in information theory and applications. Tamsui Oxf. J. Math. Sci. 2001, 17, 97–111.
- Reid, M.D.; Williamson, R.C. Information, divergence and risk for binary experiments. J. Mach. Learn. Res. 2011, 12, 731–817.
- Liese, F.; Vajda, I. On divergences and informations in statistics and information theory. IEEE Trans. Inf. Theory 2006, 52, 4394–4412.
- Vajda, I. Theory of Statistical Inference and Information; Kluwer Academic Press: London, UK, 1989.
- Csiszár, I. Axiomatic characterizations of information measures. Entropy 2008, 10, 261–273.
- Cichocki, A.; Cruces, S.; Amari, S. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy 2011, 13, 134–170.
- Taneja, I.J. Seven means, generalized triangular discrimination, and generating divergence measures. Information 2013, 4, 198–239.
- Arndt, C. Information Measures: Information and its Description in Science and Engineering; Springer Verlag: Berlin, Germany, 2004.
- Brown, R.F. A Topological Introduction to Nonlinear Analysis; Birkhäuser: Basel, Switzerland, 1993.
- Endres, D.M.; Schindelin, J.E. A new metric for probability distributions. IEEE Trans. Inf. Theory 2003, 49, 1858–1860.
- Topsøe, F. Some inequalities for information divergence and related measures of discrimination. IEEE Trans. Inf. Theory 2000, 46, 1602–1609.
- Csiszár, I. Information type measures of differences of probability distribution and indirect observations. Studia Sci. Math. Hungar. 1967, 2, 299–318.
- Taneja, I.J. Refinement inequalities among symmetric divergence measures. Austr. J. Math. Anal. Appl. 2005, 2. Available online: http://ajmaa.org/cgi-bin/paper.pl?string=v2n1/V2I1P8.tex (accessed on 14 July 2015).
© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).