# Implications of the Cressie-Read Family of Additive Divergences for Information Recovery


## Abstract


## 1. Introduction

## 2. Minimum Power Divergence
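As defined by Cressie and Read [1] (up to the sign and scaling conventions, which vary slightly across the literature), the power divergence between two discrete probability vectors is

$$I\left(\mathbf{p},\mathbf{q},\gamma \right)=\frac{1}{\gamma \left(\gamma +1\right)}\sum_{i=1}^{n}{p}_{i}\left[{\left(\frac{{p}_{i}}{{q}_{i}}\right)}^{\gamma }-1\right],\qquad (1)$$

with the γ = 0 and γ = −1 members defined by continuity,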

where the ${q}_{i}$'s are reference probabilities, and **p** and **q** are $n\times 1$ vectors of the ${p}_{i}$'s and ${q}_{i}$'s, respectively. The usual probability distribution characteristics, ${p}_{i},{q}_{i}\in [0,1]\ \forall i$, ${\sum}_{i=1}^{n}{p}_{i}=1$, and ${\sum}_{i=1}^{n}{q}_{i}=1$, are assumed to hold. The CR family of power divergences is defined through a class of additive convex functions that encompasses a broad family of test statistics and represents a broad family of likelihood functional relationships within a moments-based estimation context, as discussed in Section 2.3. In addition, the CR measure exhibits proper convexity in **p** for all values of γ and **q**, and embodies the required probability system characteristics, such as additivity and invariance with respect to monotonic transformations of the divergence measure. In the context of extremum metrics, the general CR family of power divergence statistics represents a flexible family of pseudo-distance measures from which to derive empirical probabilities. The resulting divergence measure, $I\left(\mathbf{p},\mathbf{q},\gamma \right)$, is one basis for representing a range of data sampling processes and likelihood function values.

#### 2.1. The CR Family and Minimum Power Divergence Estimation

Consider the problem of recovering the unknown parameter vector **β**, where **q** is taken as given and $\mathbf{B}$ denotes the appropriate parameter space for **β**. Note in (2) that ${\mathbf{X}}_{i\cdot}$ and ${\mathbf{Z}}_{i\cdot}$ denote the ${i}^{th}$ rows of $\mathbf{X}$ and $\mathbf{Z}$, respectively. This class of estimation procedures is referred to as Minimum Power Divergence (MPD) estimation, and additional details of the solution to this stochastic inverse problem are provided in the sections ahead. The MPD optimization problem may be represented as a two-step process: one can first optimize with respect to the choice of the sample probabilities, **p**, and then optimize with respect to the structural parameters, **β**, for any member of the CR family of divergence measures identified by the choice of γ, given **q**.
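As a sketch of this two-step structure, assuming the linear structural model and instrument-based moment conditions typical of this literature (the precise constraint set is the one given in (2)), the inner problem can be written as

$$\widehat{\mathbf{p}}\left(\mathsf{\beta},\gamma \right)=\underset{\mathbf{p}}{\arg\min}\left\{I\left(\mathbf{p},\mathbf{q},\gamma \right)\ \middle|\ \sum_{i=1}^{n}{p}_{i}{\mathbf{Z}}_{i\cdot}^{\prime}\left({Y}_{i}-{\mathbf{X}}_{i\cdot}\mathsf{\beta}\right)=\mathbf{0},\ \sum_{i=1}^{n}{p}_{i}=1,\ {p}_{i}\ge 0\right\},$$

with the outer step $\widehat{\mathsf{\beta}}\left(\gamma \right)={\arg\min}_{\mathsf{\beta}\in \mathbf{B}}\ I\left(\widehat{\mathbf{p}}\left(\mathsf{\beta},\gamma \right),\mathbf{q},\gamma \right)$.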

Whether one uses $I\left(\mathbf{p},\mathbf{q},\gamma \right)$ or $I\left(\mathbf{q},\mathbf{p},\gamma \right)$, the same collection of members of the family of divergence measures is ultimately spanned when considering all of the possibilities for γ ∈ (−∞, ∞).
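This span equivalence can be made explicit. Assuming the standard CR form $I\left(\mathbf{p},\mathbf{q},\gamma \right)=\frac{1}{\gamma \left(\gamma +1\right)}{\sum}_{i=1}^{n}{p}_{i}\left[{\left({p}_{i}/{q}_{i}\right)}^{\gamma }-1\right]$, a direct calculation using ${\sum}_{i}{p}_{i}={\sum}_{i}{q}_{i}=1$ gives

$$I\left(\mathbf{q},\mathbf{p},\gamma \right)=I\left(\mathbf{p},\mathbf{q},-\left(\gamma +1\right)\right),$$

so reversing the argument order merely relabels the family members via γ ↦ −(γ + 1), and sweeping γ over (−∞, ∞) traverses the same collection either way.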

#### 2.2. Popular Variants of $I\left(\mathbf{p},\mathbf{q},\gamma \right)$

The estimates of **β** obtained by optimizing the MPD objective function are consistent and asymptotically normally distributed. They are also asymptotically efficient relative to the optimal estimating function (OptEF) estimator [14] when a uniform distribution, or equivalently the empirical distribution function (EDF), is used as the reference distribution. The solution to the constrained optimization problem yields optimal estimates, $\widehat{\mathbf{p}}\left(\mathsf{\gamma}\right)$ and $\widehat{\mathsf{\beta}}\left(\mathsf{\gamma}\right)$, that cannot in general be expressed in closed form and thus must be obtained using numerical methods.
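As a numerical sketch of the family and its two most popular limiting members, the following is a minimal implementation (the function name `cr_divergence` and the explicit handling of the γ → 0 and γ → −1 limits via their logarithmic forms are illustrative assumptions, not the authors' code):

```python
import numpy as np

def cr_divergence(p, q, gamma):
    """Cressie-Read power divergence I(p, q, gamma) between two strictly
    positive discrete probability vectors. The gamma -> 0 and gamma -> -1
    members are the continuity limits of the general formula."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if np.isclose(gamma, 0.0):
        # limit gamma -> 0: sum_i p_i log(p_i / q_i), i.e., KL divergence I(p||q)
        return float(np.sum(p * np.log(p / q)))
    if np.isclose(gamma, -1.0):
        # limit gamma -> -1: sum_i q_i log(q_i / p_i), i.e., reverse KL I(q||p)
        return float(np.sum(q * np.log(q / p)))
    # general member of the family
    return float(np.sum(p * ((p / q) ** gamma - 1.0)) / (gamma * (gamma + 1.0)))
```

For every γ the divergence is zero when **p** = **q**, and the general formula approaches the two logarithmic forms continuously as γ nears 0 or −1, which is why the limit members can be treated as ordinary points of the family.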

#### 2.3. Relating Minimum Power Divergence to Maximum Likelihood

## 3. Identifying the Probability Space

#### 3.1. Distance–Divergence Measures

When **q** is taken to be a uniform probability density function (PDF), we arrive at a family of additive convex functions. In this context, one is effectively considering the convex combination of the maximum empirical likelihood (MEL) and maximum empirical exponential likelihood (MEEL) measures. From the standpoint of extremum-minimization with respect to **p**, the generalized divergence family reduces to a convex combination indexed by a weight α. As α → 0, the Kullback-Leibler (KL) divergence I(**p**||**q**) of the probability mass function **p** with respect to **q** is recovered. As α → 1, the **q**-weighted MEL stochastic inverse problem I(**q**||**p**) results. This generalized family of divergence measures permits a broadening of the canonical distribution functions and provides a framework for developing a loss-minimizing estimation rule. In an extremum estimation context, when α = 1/2, the result is what is known in the literature as Jeffreys' J-divergence [19]. In this case, the full objective function, the J-divergence J(**p**||**q**) = I(**p**||**q**) + I(**q**||**p**), is a convex combination of the KL divergence I(**p**||**q**) and the reverse KL divergence I(**q**||**p**). In line with the complex nature of the problem, in the sections to follow we demonstrate a convex estimation rule that seeks to choose among MPD-type estimators so as to minimize quadratic risk (QR).

#### 3.2. A Minimum Quadratic Risk (QR) Estimation Rule

#### 3.3. The Case of Two CR Alternatives

#### 3.4. Empirical Calculation of α
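The quadratic-risk logic behind the simulation results in Table 1 can be sketched as follows. Assume, as a labeled hypothesis consistent with the table's column headings, the convex combination $\overline{\mathsf{\beta}}\left(\mathsf{\alpha}\right)=\mathsf{\alpha}\widehat{\mathsf{\beta}}\left(-1\right)+\left(1-\mathsf{\alpha}\right)\widehat{\mathsf{\beta}}\left(0\right)$; minimizing the Monte Carlo estimate of quadratic risk over α then yields a ratio-form weight (function and variable names below are illustrative, not the authors' code):

```python
import numpy as np

def qr_optimal_alpha(b_minus1, b_zero, beta_true):
    """Weight alpha minimizing the empirical quadratic risk of
    alpha * b_minus1 + (1 - alpha) * b_zero over Monte Carlo draws.

    b_minus1, b_zero: (draws, k) arrays of estimates from the two rules.
    beta_true: length-k true parameter vector (known only in simulation).
    """
    d = b_minus1 - b_zero   # difference between the two estimators
    e = b_zero - beta_true  # error of the gamma = 0 estimator
    # d/d(alpha) E||alpha*d + e||^2 = 0  =>  alpha* = -E[d'e] / E[d'd]
    return float(-np.mean(np.sum(d * e, axis=1)) / np.mean(np.sum(d * d, axis=1)))
```

In applications β is unknown, so this oracle weight must be replaced by a data-based estimate α̂ (the subject of this subsection); the sketch only illustrates why the risk-minimizing weight takes the ratio form.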

## 4. Finite Sample Performance

**Table 1.**MSE Results for Convex Combinations of $\widehat{\mathsf{\beta}}\left(-1\right)$ and $\widehat{\mathsf{\beta}}\left(0\right)$.

| Scenario $\left\{n,\mathsf{\tau},{\mathbf{R}}^{2}\right\}$ | $\mathrm{MSE}\left(\widehat{\mathsf{\beta}}\left(-1\right)\right)$ | $\mathrm{MSE}\left(\widehat{\mathsf{\beta}}\left(0\right)\right)$ | $\overline{\widehat{\mathsf{\alpha}}}\left(\mathsf{\gamma}=-1\right)$ | std$(\widehat{\mathsf{\alpha}})$ | $\mathrm{MSE}\left(\overline{\mathsf{\beta}}(\widehat{\mathsf{\alpha}})\right)$ |
|---|---|---|---|---|---|
| 100, 0.25, 0.75 | 0.00343 | 0.00364 | 0.49712 | 0.29082 | 0.00180 |
| 100, 0.5, 0.5 | 0.01129 | 0.01113 | 0.49996 | 0.07340 | 0.00528 |
| 100, 0.75, 0.25 | 0.03801 | 0.03159 | 0.48670 | 0.30753 | 0.02105 |
| 250, 0.25, 0.75 | 0.00122 | 0.00136 | 0.49978 | 0.01591 | 0.00062 |
| 250, 0.5, 0.5 | 0.00437 | 0.00452 | 0.50158 | 0.02813 | 0.00219 |
| 250, 0.75, 0.25 | 0.01309 | 0.01323 | 0.50031 | 0.07018 | 0.00639 |

## 5. Concluding Remarks

## Acknowledgements

## References

1. Cressie, N.; Read, T. Multinomial goodness of fit tests. J. R. Stat. Soc. Ser. B **1984**, 46, 440–464.
2. Read, T.R.; Cressie, N.A. Goodness of Fit Statistics for Discrete Multivariate Data; Springer-Verlag: New York, NY, USA, 1988.
3. Bjelakovic, I.; Dueschel, J.; Kruger, T.; Seiler, R.; Schultze, R.; Szkola, A. Typical Support and Sanov Large Deviations of Correlated States. Commun. Math. Phys. **2008**, 279, 559–584.
4. Ojima, I.; Okamura, K. Large Deviation Strategy for Inverse Problems; Kyoto Institute, Kyoto University: Kyoto, Japan, 2011.
5. Hanel, R.; Thurner, S. A Comprehensive Classification of Complex Statistical Systems and Distribution Functions; Santa Fe Institute: Santa Fe, NM, USA, 2007.
6. Gorban, A.; Gorban, P.; Judge, G. Entropy: The Markov Ordering Approach. Entropy **2010**, 12, 1145–1193.
7. Judge, G.G.; Mittelhammer, R.C. An Information Theoretic Approach to Econometrics; Cambridge University Press: Cambridge, UK, 2012.
8. Renyi, A. On measures of entropy and information. In Proceedings of the 4th Berkeley Symposium on Mathematics, Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; University of California Press: Berkeley, CA, USA, 1961; pp. 547–561.
9. Renyi, A. Probability Theory; North-Holland: Amsterdam, The Netherlands, 1970.
10. Tsallis, C. Possible Generalization of Boltzmann-Gibbs Statistics. J. Stat. Phys. **1988**, 52, 479–487.
11. Osterreicher, F. Csiszar's f-Divergences: Basic Properties; Institute of Mathematics, University of Salzburg: Salzburg, Austria, 2002. Available online: http://www.unisalzburg.at/pls/portal/docs/1/246178.PDF (accessed on 21 October 2012).
12. Osterreicher, F.; Vajda, I. A New Class of Metric Divergences on Probability Spaces and its Applicability in Statistics. Ann. Inst. Stat. Math. **2003**, 55, 639–653.
13. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. **1948**, 27, 379–423.
14. Baggerly, K.A. Empirical likelihood as a goodness of fit measure. Biometrika **1998**, 85, 535–547.
15. Mittelhammer, R.C.; Judge, G.G.; Miller, D.J. Econometric Foundations; Cambridge University Press: New York, NY, USA, 2000.
16. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. **1951**, 22, 79–86.
17. Gorban, A. Equilibrium Encircling: Equations of Chemical Kinetics and Their Thermodynamic Limit; Nauka: Novosibirsk, Russia, 1984.
18. Gorban, A.; Karlin, I.V. Family of Additive Entropy Functions Out of the Thermodynamic Limit. Phys. Rev. E **2003**, 67, 016104.
19. Grendar, M.; Grendar, M. On the Probabilistic Rationale of I-Divergence and J-Divergence Minimization. **2000**.
20. James, W.; Stein, C. Estimation with Quadratic Loss. In Proceedings of the Fourth Berkeley Symposium on Statistics and Probability; University of California Press: Berkeley, CA, USA, 1961; pp. 361–379.
21. Judge, G.; Bock, M.E. The Statistical Implications of Pre-Test and Stein-Rule Estimators; North-Holland: Amsterdam, The Netherlands, 1978.
22. Hahn, J.; Hausman, J. Notes on Bias in Estimators for Simultaneous Equation Models. Econ. Lett. **2002**, 75, 237–241.

© 2012 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

## Share and Cite

**MDPI and ACS Style**

Judge, G.G.; Mittelhammer, R.C. Implications of the Cressie-Read Family of Additive Divergences for Information Recovery. *Entropy* **2012**, *14*, 2427-2438.
https://doi.org/10.3390/e14122427
