Families of Alpha- Beta- and Gamma- Divergences: Flexible and Robust Measures of Similarities
Abstract
1. Introduction
- $D(P\,\|\,Q) \ge 0$, with equality if and only if $P = Q$ (nonnegativity and positive definiteness),
- $D(P\,\|\,Q) = D(Q\,\|\,P)$ (symmetry),
- $D(P\,\|\,Z) \le D(P\,\|\,Q) + D(Q\,\|\,Z)$ (subadditivity/triangle inequality).
2. Family of Alpha-Divergences
2.1. Asymmetric Alpha-Divergences
- Nonnegativity: The Csiszár–Morimoto f-divergence is always nonnegative, and equal to zero if and only if the probability densities $p(x)$ and $q(x)$ coincide. This follows immediately from Jensen's inequality (for normalized densities): since $f$ is convex with $f(1) = 0$, we have $D_f(P\,\|\,Q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx \;\ge\; f\!\left(\int p(x)\, dx\right) = f(1) = 0$.
- Generalized entropy: It corresponds to a generalized f-entropy of the form $H_f(P) = -\int f(p(x))\, dx$, for which the Shannon entropy is obtained as a special case for $f(u) = u \ln u$. Note that $H_f$ is concave while $f$ is convex.
- Convexity: The f-divergence is jointly convex: for any $0 \le \lambda \le 1$, $D_f(\lambda P_1 + (1-\lambda)P_2 \,\|\, \lambda Q_1 + (1-\lambda)Q_2) \le \lambda D_f(P_1\,\|\,Q_1) + (1-\lambda) D_f(P_2\,\|\,Q_2)$.
- Scaling: For any positive constant $c > 0$ we have $D_{cf}(P\,\|\,Q) = c\, D_f(P\,\|\,Q)$.
- Symmetricity: For an arbitrary Csiszár–Morimoto f-divergence, it is possible to construct a symmetric divergence by taking the generating function $\tilde f(u) = f(u) + u f(1/u)$, for which $D_{\tilde f}(P\,\|\,Q) = D_f(P\,\|\,Q) + D_f(Q\,\|\,P)$.
- Convexity: $D_A^{(\alpha)}(P\,\|\,Q)$ is convex with respect to both $P$ and $Q$.
- Strict Positivity: $D_A^{(\alpha)}(P\,\|\,Q) \ge 0$, and $D_A^{(\alpha)}(P\,\|\,Q) = 0$ if and only if $P = Q$.
- Continuity: The Alpha-divergence is a continuous function of the real variable α in the whole range, including the singularities.
- Duality: $D_A^{(\alpha)}(P\,\|\,Q) = D_A^{(1-\alpha)}(Q\,\|\,P)$ (a numerical sketch of these properties is given after this list).
- Exclusive/Inclusive Properties: [42]
- For α ≤ 0, the estimation of $\hat{Q}(x)$ that approximates $P(x)$ is exclusive, that is, $\hat{Q}(x) \le P(x)$ for all $x$. This means that the minimization of $D_A^{(\alpha)}(P\,\|\,Q)$ with respect to $Q$ will force $\hat{Q}(x)$ to be an exclusive approximation, i.e., the mass of $\hat{Q}(x)$ will lie within $P(x)$ (see details and graphical illustrations in [42]).
- For α ≥ 1, the estimation of $\hat{Q}(x)$ that approximates $P(x)$ is inclusive, that is, $\hat{Q}(x) \ge P(x)$ for all $x$. In other words, the mass of $\hat{Q}(x)$ includes all the mass of $P(x)$.
- Zero-forcing and zero-avoiding properties [42]: Here, we treat the case where $P$ and $Q$ are not necessarily mutually absolutely continuous. In such a case the divergence may diverge to ∞. However, the following two properties hold:
- For α ≤ 0, the estimation of $\hat{Q}(x)$ that approximates $P(x)$ is zero-forcing (coercive), that is, $P(x) = 0$ forces $\hat{Q}(x) = 0$.
- For α ≥ 1, the estimation of $\hat{Q}(x)$ that approximates $P(x)$ is zero-avoiding, that is, $P(x) > 0$ implies $\hat{Q}(x) > 0$.
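To make these properties concrete, the following minimal Python sketch computes the asymmetric Alpha-divergence for positive measures, assuming the commonly used form $D_A^{(\alpha)}(P\,\|\,Q) = \frac{1}{\alpha(\alpha-1)}\sum_i \big(p_i^{\alpha} q_i^{1-\alpha} - \alpha p_i + (\alpha-1)q_i\big)$, with the generalized Kullback–Leibler divergences recovered in the limits $\alpha \to 1$ and $\alpha \to 0$; the function name and test vectors are illustrative only.

```python
import numpy as np

def alpha_divergence(p, q, alpha, eps=1e-12):
    """Asymmetric Alpha-divergence for positive measures (illustrative sketch)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    if np.isclose(alpha, 1.0):   # limit: generalized KL(p || q)
        return np.sum(p * np.log(p / q) - p + q)
    if np.isclose(alpha, 0.0):   # limit: generalized KL(q || p)
        return np.sum(q * np.log(q / p) - q + p)
    return np.sum(p**alpha * q**(1.0 - alpha) - alpha * p
                  + (alpha - 1.0) * q) / (alpha * (alpha - 1.0))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])
for a in (-1.0, 0.0, 0.5, 1.0, 2.0):
    d = alpha_divergence(p, q, a)            # nonnegativity
    d_dual = alpha_divergence(q, p, 1 - a)   # duality: D^(alpha)(P||Q) = D^(1-alpha)(Q||P)
    print(f"alpha={a:+.1f}  D(p||q)={d:.6f}  dual check={d_dual:.6f}")
```

All printed values are nonnegative, and the two columns agree for every α, illustrating the duality property.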
2.2. Alpha-Rényi Divergence
2.3. Extended Family of Alpha-Divergences
| Divergence | Csiszár function |
| --- | --- |
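For reference, a few well-known members of this extended family and their Csiszár generating functions $f(u)$ are listed below, using the convention $D_f(P\,\|\,Q) = \int q(x)\, f\!\big(p(x)/q(x)\big)\,dx$ with convex $f$ and $f(1) = 0$; this is a standard selection given for illustration, not necessarily the complete table of the paper:

$$
\begin{aligned}
\text{Kullback–Leibler } D_{KL}(P\,\|\,Q) &: \quad f(u) = u\ln u - u + 1,\\
\text{reverse Kullback–Leibler } D_{KL}(Q\,\|\,P) &: \quad f(u) = -\ln u + u - 1,\\
\text{Pearson Chi-square} &: \quad f(u) = \tfrac{1}{2}(u-1)^2,\\
\text{squared Hellinger} &: \quad f(u) = (\sqrt{u}-1)^2,\\
\text{total variation} &: \quad f(u) = |u-1|.
\end{aligned}
$$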
2.4. Symmetrized Alpha-Divergences
- Symmetric Jensen–Shannon divergence [62,64]: It is worth mentioning that the Jensen–Shannon divergence is a symmetrized and smoothed variant of the Kullback–Leibler divergence, i.e., it can be interpreted as the average of the Kullback–Leibler divergences of $P$ and $Q$ to the average distribution $M = (P+Q)/2$. For normalized probability densities the Jensen–Shannon divergence is related to the Shannon entropy via $\mathrm{JS}(P,Q) = H\big(\tfrac{P+Q}{2}\big) - \tfrac{1}{2}\big(H(P)+H(Q)\big)$, where $H(\cdot)$ denotes the Shannon entropy (a numerical sketch is given after this list).
- Arithmetic-Geometric divergence [39]
- Symmetric Chi-square divergence [54]
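As a complement to the list above, here is a minimal Python sketch of the symmetric Jensen–Shannon divergence, assuming its standard definition as the average Kullback–Leibler divergence to the mid-point density $M = (P+Q)/2$; it also checks numerically the entropy identity mentioned above (function names are illustrative).

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence for discrete densities."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    return np.sum(p * np.log(p / q))

def shannon_entropy(p, eps=1e-12):
    p = np.asarray(p, float) + eps
    return -np.sum(p * np.log(p))

def jensen_shannon(p, q):
    """Average KL divergence of p and q to the mid-point density m = (p + q)/2."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])
js_direct = jensen_shannon(p, q)
# Entropy identity for normalized densities: JS(p, q) = H((p+q)/2) - (H(p) + H(q))/2
js_entropy = shannon_entropy(0.5 * (p + q)) - 0.5 * (shannon_entropy(p) + shannon_entropy(q))
print(js_direct, js_entropy)   # the two values agree (up to the small eps regularization)
```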
3. Family of Beta-Divergences
- Convexity: The Bregman divergence $D_\Phi(P\,\|\,Q)$ is always convex in its first argument $P$, but often not in its second argument $Q$ (a minimal numerical sketch follows this list).
- Nonnegativity: The Bregman divergence is nonnegative, $D_\Phi(P\,\|\,Q) \ge 0$, with zero attained if and only if $P = Q$ (for strictly convex $\Phi$).
- Linearity: Any positive linear combination of Bregman divergences is also a Bregman divergence, i.e., $c_1 D_{\Phi_1}(P\,\|\,Q) + c_2 D_{\Phi_2}(P\,\|\,Q) = D_{c_1\Phi_1 + c_2\Phi_2}(P\,\|\,Q)$, where $c_1, c_2$ are positive constants and $\Phi_1, \Phi_2$ are strictly convex functions.
- The three-point property generalizes the “Law of Cosines”: $D_\Phi(P\,\|\,Q) = D_\Phi(P\,\|\,R) + D_\Phi(R\,\|\,Q) + \langle P - R,\, \nabla\Phi(R) - \nabla\Phi(Q)\rangle$.
- Generalized Pythagoras Theorem: $D_\Phi(P\,\|\,Q) \ge D_\Phi(P\,\|\,P_\Omega(Q)) + D_\Phi(P_\Omega(Q)\,\|\,Q)$, where $P_\Omega(Q) = \arg\min_{\omega \in \Omega} D_\Phi(\omega\,\|\,Q)$ is the Bregman projection onto the convex set Ω and $P \in \Omega$. When Ω is an affine set it holds with equality. This is proved to be the generalized Pythagorean relation in terms of information geometry.
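The following minimal sketch, assuming the standard definition $D_\Phi(P\,\|\,Q) = \Phi(P) - \Phi(Q) - \langle \nabla\Phi(Q),\, P - Q\rangle$ for a strictly convex generating function $\Phi$, illustrates how different choices of $\Phi$ recover familiar divergences; the helper names are illustrative.

```python
import numpy as np

def bregman_divergence(p, q, phi, grad_phi):
    """D_Phi(p || q) = Phi(p) - Phi(q) - <grad Phi(q), p - q>."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return phi(p) - phi(q) - np.dot(grad_phi(q), p - q)

# Phi(x) = sum(x*log(x) - x): the Bregman divergence is the generalized KL divergence.
def neg_entropy(x):
    return np.sum(x * np.log(x) - x)

def neg_entropy_grad(x):
    return np.log(x)

# Phi(x) = 0.5*||x||^2: the Bregman divergence is the squared Euclidean distance.
def half_sq_norm(x):
    return 0.5 * np.dot(x, x)

def half_sq_norm_grad(x):
    return np.asarray(x, float)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])
print(bregman_divergence(p, q, neg_entropy, neg_entropy_grad))    # = sum p*log(p/q) - p + q
print(bregman_divergence(p, q, half_sq_norm, half_sq_norm_grad))  # = 0.5*||p - q||^2
```

With $\Phi(\mathbf{x}) = \sum_i (x_i \ln x_i - x_i)$ the Bregman divergence is the generalized Kullback–Leibler divergence, and with $\Phi(\mathbf{x}) = \tfrac{1}{2}\|\mathbf{x}\|^2$ it is the squared Euclidean distance.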
3.1. Generation of Family of Beta-Divergences Directly from Family of Alpha-Divergences
| Alpha-divergence | Beta-divergence |
| --- | --- |
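As a numerical reference for the Beta-divergence family discussed in this section, the sketch below assumes the common parametrization in which $\beta \to 1$ yields the generalized Kullback–Leibler divergence, $\beta \to 0$ the Itakura–Saito divergence, and $\beta = 2$ the squared Euclidean distance; parametrizations differ slightly across the literature, so the exact form used here is an assumption.

```python
import numpy as np

def beta_divergence(p, q, beta, eps=1e-12):
    """Beta-divergence (illustrative sketch) with the generalized KL divergence
    at beta -> 1 and the Itakura-Saito divergence at beta -> 0."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    if np.isclose(beta, 1.0):    # limit: generalized KL divergence
        return np.sum(p * np.log(p / q) - p + q)
    if np.isclose(beta, 0.0):    # limit: Itakura-Saito divergence
        return np.sum(p / q - np.log(p / q) - 1.0)
    return np.sum(p**beta + (beta - 1.0) * q**beta
                  - beta * p * q**(beta - 1.0)) / (beta * (beta - 1.0))

p = np.array([1.0, 2.0, 3.0])
q = np.array([1.5, 1.5, 3.5])
for b in (0.0, 0.5, 1.0, 2.0):
    print(f"beta={b:.1f}  D_B(p||q)={beta_divergence(p, q, b):.6f}")
```

For $\beta = 2$ the output equals $\tfrac{1}{2}\|\mathbf{p}-\mathbf{q}\|^2$, which is one quick sanity check on the parametrization.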
4. Family of Gamma-Divergences Generated from Beta- and Alpha-Divergences
- The Alpha-Gamma divergence is nonnegative; the equality to zero holds if and only if $P = cQ$ for a positive constant $c$.
- It is scale invariant for any value of γ, that is, it is unchanged under the rescalings $P \to c_1 P$ and $Q \to c_2 Q$ for arbitrary positive scaling constants $c_1, c_2$.
- The Alpha-Gamma divergence is equivalent to the normalized Alpha-Rényi divergence (25) for normalized densities $p$ and $q$.
- It can be expressed via a suitably defined generalized weighted mean.
- In a suitable limit, the Alpha-Gamma-divergence becomes the Kullback–Leibler divergence.
- In the dual limit, the Alpha-Gamma-divergence can be expressed by the reverse Kullback–Leibler divergence.
| Alpha-divergence | Robust Alpha-Gamma-divergence |
| --- | --- |

| Divergence | Gamma-divergence |
| --- | --- |
- The Gamma-divergence is nonnegative; the equality to zero holds if and only if $P = cQ$ for a positive constant $c$.
- It is scale invariant, that is, it is unchanged under the rescalings $P \to c_1 P$ and $Q \to c_2 Q$ for arbitrary positive scaling constants $c_1, c_2$ (verified numerically in the sketch at the end of this section).
- As γ → 0, the Gamma-divergence becomes the Kullback–Leibler divergence of the normalized densities $\bar{p} = p/\int p\,dx$ and $\bar{q} = q/\int q\,dx$.
- For a particular value of γ, the Gamma-divergence can be expressed in a simplified closed form, and for the discrete Gamma-divergence we have the corresponding formula.
- The symmetric Gamma-divergence is nonnegative; the equality to zero holds if and only if $P = cQ$ for a positive constant $c$, in particular for $P = Q$.
- It is scale invariant, that is, it is unchanged under the rescalings $P \to c_1 P$ and $Q \to c_2 Q$ for arbitrary positive scaling constants $c_1, c_2$.
- For γ → 0, it is reduced to a special form of the symmetric Kullback–Leibler divergence (also called the J-divergence) of the normalized densities $\bar{p}$ and $\bar{q}$.
- For a particular choice of γ, we obtain a simple divergence expressed by weighted arithmetic means, where the weight function is normalized to sum to one. For the discrete Beta-Gamma divergence (or simply the Gamma-divergence) we obtain the corresponding discrete formula. It is interesting to note that in this case the discrete symmetric Gamma-divergence can be expressed by expectation functions.
- For γ = 1, the asymmetric Gamma-divergences (which then coincide with the symmetric Gamma-divergence) reduce to the Cauchy–Schwarz divergence introduced by Principe [83].
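To illustrate the properties listed in this section, the following minimal Python sketch assumes the standard Fujisawa–Eguchi-type form of the asymmetric Gamma-divergence, $D^{(\gamma)}(P\,\|\,Q) = \frac{1}{\gamma(\gamma+1)} \ln\!\left[\frac{\left(\sum_i p_i^{1+\gamma}\right)\left(\sum_i q_i^{1+\gamma}\right)^{\gamma}}{\left(\sum_i p_i q_i^{\gamma}\right)^{1+\gamma}}\right]$, and verifies the scale-invariance property numerically as well as the reduction to the Cauchy–Schwarz divergence for γ = 1; the names and test vectors are illustrative.

```python
import numpy as np

def gamma_divergence(p, q, gamma, eps=1e-12):
    """Asymmetric Gamma-divergence (illustrative sketch, Fujisawa-Eguchi-type form).

    gamma -> 0 recovers the KL divergence of the normalized densities,
    and gamma = 1 gives the Cauchy-Schwarz divergence."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    if np.isclose(gamma, 0.0):
        pn, qn = p / p.sum(), q / q.sum()
        return np.sum(pn * np.log(pn / qn))
    return (np.log(np.sum(p**(1.0 + gamma)))
            + gamma * np.log(np.sum(q**(1.0 + gamma)))
            - (1.0 + gamma) * np.log(np.sum(p * q**gamma))) / (gamma * (1.0 + gamma))

p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 2.5, 2.5])
for g in (0.0, 0.5, 1.0):
    d = gamma_divergence(p, q, g)
    d_scaled = gamma_divergence(5.0 * p, 0.1 * q, g)   # scale-invariance check
    print(f"gamma={g:.1f}  D={d:.6f}  D(scaled)={d_scaled:.6f}")
```

Because every term enters through the logarithm of a power sum, the scaling constants cancel exactly, which is what the printed `D(scaled)` column confirms.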
5. Relationships for Asymmetric Divergences and their Unified Representation
| Divergence name | Formula |
| --- | --- |
| Alpha | |
| Beta | |
| Gamma | |
| Alpha-Rényi | |
| Bregman | (see Equation (57)) |
| Csiszár–Morimoto | (see Equation (57)) |
Duality
5.1. Conclusions and Discussion
References
- Amari, S. Differential-Geometrical Methods in Statistics; Springer Verlag: Berlin, Germany, 1985.
- Amari, S. Dualistic geometry of the manifold of higher-order neurons. Neural Netw. 1991, 4, 443–451.
- Amari, S.; Nagaoka, H. Methods of Information Geometry; Oxford University Press: New York, NY, USA, 2000.
- Amari, S. Integration of stochastic models by minimizing alpha-divergence. Neural Comput. 2007, 19, 2780–2796.
- Amari, S. Information geometry and its applications: Convex function and dually flat manifold. In Emerging Trends in Visual Computing; Nielsen, F., Ed.; Springer: New York, NY, USA; pp. 75–102.
- Amari, S. Alpha-divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Trans. Inf. Theory 2009, 55, 4925–4931.
- Amari, S.; Cichocki, A. Information geometry of divergence functions. Bull. Pol. Acad. Sci. 2010. (in print).
- Murata, N.; Takenouchi, T.; Kanamori, T.; Eguchi, S. Information geometry of U-Boost and Bregman divergence. Neural Comput. 2004, 16, 1437–1481.
- Fujimoto, Y.; Murata, N. A modified EM algorithm for mixture models based on Bregman divergence. Ann. Inst. Stat. Math. 2007, 59, 57–75.
- Zhu, H.; Rohwer, R. Bayesian invariant measurements of generalization. Neural Process. Lett. 1995, 2, 28–31.
- Zhu, H.; Rohwer, R. Measurements of generalisation based on information geometry. In Mathematics of Neural Networks: Model Algorithms and Applications; Ellacott, S.W., Mason, J.C., Anderson, I.J., Eds.; Kluwer: Norwell, MA, USA, 1997; pp. 394–398.
- Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 56, 2882–2903.
- Boissonnat, J.D.; Nielsen, F.; Nock, R. Bregman Voronoi diagrams. Discrete Comput. Geom. 2010. (in print).
- Yamano, T. A generalization of the Kullback-Leibler divergence and its properties. J. Math. Phys. 2009, 50, 85–95.
- Minami, M.; Eguchi, S. Robust blind source separation by Beta-divergence. Neural Comput. 2002, 14, 1859–1886.
- Bregman, L. The relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 200–217.
- Csiszár, I. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Magyar Tud. Akad. Mat. Kutató Int. Közl. 1963, 8, 85–108.
- Csiszár, I. Axiomatic characterizations of information measures. Entropy 2008, 10, 261–273.
- Csiszár, I. Information measures: A critical survey. In Transactions of the 7th Prague Conference, Prague, Czech Republic, 18–23 August 1974; Reidel: Dordrecht, The Netherlands, 1977; pp. 83–86.
- Ali, M.; Silvey, S. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B 1966, 131–142.
- Hein, M.; Bousquet, O. Hilbertian metrics and positive definite kernels on probability measures. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS), Barbados, 6–8 January 2005; Ghahramani, Z., Cowell, R., Eds.; 2005; Volume 10, pp. 136–143.
- Zhang, J. Divergence function, duality, and convex analysis. Neural Comput. 2004, 16, 159–195.
- Zhang, J. Referential duality and representational duality on statistical manifolds. In Proceedings of the Second International Symposium on Information Geometry and its Applications, University of Tokyo, Tokyo, Japan, 12–16 December 2005; University of Tokyo: Tokyo, Japan, 2006; pp. 58–67.
- Zhang, J. A note on curvature of α-connections of a statistical manifold. Ann. Inst. Stat. Math. 2007, 59, 161–170.
- Zhang, J.; Matsuzoe, H. Dualistic differential geometry associated with a convex function. In Springer Series of Advances in Mechanics and Mathematics; Springer: New York, NY, USA, 2008; pp. 58–67.
- Lafferty, J. Additive models, boosting, and inference for generalized divergences. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, Santa Cruz, CA, USA, 7–9 July 1999; ACM: New York, NY, USA, 1999.
- Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749.
- Villmann, T.; Haase, S. Divergence based vector quantization using Fréchet derivatives. Neural Comput. 2010. (submitted for publication).
- Villmann, T.; Haase, S.; Schleif, F.M.; Hammer, B. Divergence based online learning in vector quantization. In Proceedings of the International Conference on Artificial Intelligence and Soft Computing (ICAISC 2010), LNAI, Zakopane, Poland, 13–17 June 2010.
- Cichocki, A.; Zdunek, R.; Phan, A.H.; Amari, S. Nonnegative Matrix and Tensor Factorizations; John Wiley & Sons Ltd: Chichester, UK, 2009.
- Cichocki, A.; Zdunek, R.; Amari, S. Csiszár’s divergences for nonnegative matrix factorization: Family of new algorithms. Springer LNCS 2006, 3889, 32–39.
- Cichocki, A.; Amari, S.; Zdunek, R.; Kompass, R.; Hori, G.; He, Z. Extended SMART algorithms for nonnegative matrix factorization. Springer LNAI 2006, 4029, 548–562.
- Cichocki, A.; Zdunek, R.; Choi, S.; Plemmons, R.; Amari, S. Nonnegative tensor factorization using Alpha and Beta divergences. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Toulouse, France, May 2007; Volume III, pp. 1393–1396.
- Cichocki, A.; Zdunek, R.; Choi, S.; Plemmons, R.; Amari, S.I. Novel multi-layer nonnegative tensor factorization with sparsity constraints. Springer LNCS 2007, 4432, 271–280.
- Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. 2008, 99, 2053–2081.
- Liese, F.; Vajda, I. Convex Statistical Distances. Teubner-Texte zur Mathematik (Teubner Texts in Mathematics) 1987, 95, 1–85.
- Eguchi, S.; Kato, S. Entropy and divergence associated with power function and the statistical application. Entropy 2010, 12, 262–274.
- Taneja, I. On generalized entropies with applications. In Lectures in Applied Mathematics and Informatics; Ricciardi, L., Ed.; Manchester University Press: Manchester, UK, 1990; pp. 107–169.
- Taneja, I. New developments in generalized information measures. In Advances in Imaging and Electron Physics; Hawkes, P., Ed.; Elsevier: Amsterdam, The Netherlands, 1995; Volume 91, pp. 37–135.
- Gorban, A.N.; Gorban, P.A.; Judge, G. Entropy: The Markov ordering approach. Entropy 2010, 12, 1145–1193.
- Chernoff, H. A measure of asymptotic efficiency for tests of a hypothesis based on a sum of observations. Ann. Math. Statist. 1952, 23, 493–507.
- Minka, T. Divergence measures and message passing. Microsoft Research Technical Report MSR-TR-2005, 2005.
- Taneja, I. On measures of information and inaccuracy. J. Statist. Phys. 1976, 14, 203–270.
- Cressie, N.; Read, T. Goodness-of-Fit Statistics for Discrete Multivariate Data; Springer: New York, NY, USA, 1988.
- Cichocki, A.; Lee, H.; Kim, Y.D.; Choi, S. Nonnegative matrix factorization with Alpha-divergence. Pattern Recognit. Lett. 2008, 29, 1433–1440.
- Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Statist. Phys. 1988, 52, 479–487.
- Havrda, J.; Charvát, F. Quantification method of classification processes: Concept of structural α-entropy. Kybernetika 1967, 3, 30–35.
- Cressie, N.; Read, T. Multinomial goodness-of-fit tests. J. R. Stat. Soc. Ser. B 1984, 46, 440–464.
- Vajda, I. Theory of Statistical Inference and Information; Kluwer Academic Press: Amsterdam, The Netherlands, 1989.
- Hellinger, E. Neue Begründung der Theorie quadratischer Formen von unendlich vielen Veränderlichen. J. Reine Angew. Math. 1909, 136, 210–271.
- Morimoto, T. Markov processes and the H-theorem. J. Phys. Soc. Jpn. 1963, 12, 328–331.
- Österreicher, F. Csiszár’s f-divergences: Basic properties. Technical report; In Research Report Collection; Victoria University: Melbourne, Australia, 2002.
- Harremoës, P.; Vajda, I. Joint range of f-divergences. Accepted for presentation at ISIT 2010, Austin, TX, USA, 13–18 June 2010.
- Dragomir, S. Inequalities for Csiszár f-Divergence in Information Theory; Victoria University: Melbourne, Australia, 2000; (edited monograph).
- Rényi, A. On the foundation of information theory. Rev. Inst. Int. Stat. 1965, 33, 1–4.
- Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; University of California Press: Berkeley, CA, USA; Volume 1, pp. 547–561.
- Rényi, A. Probability Theory; North-Holland: Amsterdam, The Netherlands, 1970.
- Harremoës, P. Interpretations of Rényi entropies and divergences. Physica A 2006, 365, 57–62.
- Harremoës, P. Joint range of Rényi entropies. Kybernetika 2009, 45, 901–911.
- Hero, A.; Ma, B.; Michel, O.; Gorman, J. Applications of entropic spanning graphs. IEEE Signal Process. Mag. 2002, 19, 85–95.
- Topsøe, F. Some inequalities for information divergence and related measures of discrimination. IEEE Trans. Inf. Theory 2000, 46, 1602–1609.
- Burbea, J.; Rao, C. Entropy differential metric, distance and divergence measures in probability spaces: A unified approach. J. Multivar. Anal. 1982, 12, 575–596.
- Burbea, J.; Rao, C. On the convexity of some divergence measures based on entropy functions. IEEE Trans. Inf. Theory 1982, IT-28, 489–495.
- Sibson, R. Information radius. Probability Theory and Related Fields 1969, 14, 149–160.
- Jeffreys, H. An invariant form for the prior probability in estimation problems. Proc. Roy. Soc. Lond. Ser. A 1946, 186, 453–461.
- Kullback, S.; Leibler, R. On information and sufficiency. Ann. Math. Statist. 1951, 22, 79–86.
- Basu, A.; Harris, I.R.; Hjort, N.; Jones, M. Robust and efficient estimation by minimising a density power divergence. Biometrika 1998, 85, 549–559.
- Mollah, M.; Minami, M.; Eguchi, S. Exploring latent structure of mixture ICA models by the minimum Beta-divergence method. Neural Comput. 2006, 16, 166–190.
- Mollah, M.; Eguchi, S.; Minami, M. Robust prewhitening for ICA by minimizing Beta-divergence and its application to FastICA. Neural Process. Lett. 2007, 25, 91–110.
- Kompass, R. A generalized divergence measure for nonnegative matrix factorization. Neural Comput. 2006, 19, 780–791.
- Mollah, M.; Sultana, N.; Minami, M.; Eguchi, S. Robust extraction of local structures by the minimum of Beta-divergence method. Neural Netw. 2010, 23, 226–238.
- Nielsen, F.; Nock, R. The dual Voronoi diagrams with respect to representational Bregman divergences. In Proceedings of the International Symposium on Voronoi Diagrams (ISVD), Copenhagen, Denmark, 23–26 June 2009.
- Cichocki, A.; Phan, A. Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE (invited paper) 2009, E92-A(3), 708–721.
- Cichocki, A.; Phan, A.; Caiafa, C. Flexible HALS algorithms for sparse non-negative matrix/tensor factorization. In Proceedings of the 18th IEEE Workshop on Machine Learning for Signal Processing, Cancun, Mexico, 16–19 October 2008.
- Dhillon, I.; Sra, S. Generalized nonnegative matrix approximations with Bregman divergences. In Neural Information Processing Systems; MIT Press: Vancouver, Canada, 2005; pp. 283–290.
- Févotte, C.; Bertin, N.; Durrieu, J.L. Nonnegative matrix factorization with the Itakura-Saito divergence with application to music analysis. Neural Comput. 2009, 21, 793–830.
- Itakura, F.; Saito, F. Analysis synthesis telephony based on the maximum likelihood method. In Proceedings of the 6th International Congress on Acoustics, Tokyo, Japan, 1968; pp. 17–20.
- Eggermont, P.; LaRiccia, V. On EM-like algorithms for minimum distance estimation. Technical report; Mathematical Sciences, University of Delaware: Newark, DE, USA, 1998.
- Févotte, C.; Cemgil, A.T. Nonnegative matrix factorizations as probabilistic inference in composite models. In Proceedings of the 17th European Signal Processing Conference (EUSIPCO-09), Glasgow, Scotland, UK, 24–28 August 2009.
- Banerjee, A.; Dhillon, I.; Ghosh, J.; Merugu, S.; Modha, D. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, 22–25 August 2004; ACM Press: New York, NY, USA, 2004; pp. 509–514.
- Lafferty, J. Additive models, boosting, and inference for generalized divergences. In Proceedings of the 12th Annual Conference on Computational Learning Theory, Santa Cruz, CA, USA, 7–9 July 1999; ACM Press: New York, NY, USA, 1999; pp. 125–133.
- Frigyik, B.A.; Srivastan, S.; Gupta, M. Functional Bregman divergence and Bayesian estimation of distributions. IEEE Trans. Inf. Theory 2008, 54, 5130–5139.
- Principe, J. Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives; Springer: Berlin, Germany, 2010.
- Choi, H.; Choi, S.; Katake, A.; Choe, Y. Learning alpha-integration with partially-labeled data. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2010), Dallas, TX, USA, 14–19 March 2010.
- Jones, M.; Hjort, N.; Harris, I.R.; Basu, A. A comparison of related density-based minimum divergence estimators. Biometrika 1998, 85, 865–873.
Appendix A. Divergences Derived from Tsallis α-entropy
Appendix B. Entropies and Divergences
Appendix C. Tsallis and Rényi Entropies
© 2010 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).