# An Elementary Introduction to Information Geometry

## Abstract


## 1. Introduction

#### 1.1. Overview of Information Geometry

#### 1.2. Rationale and Outline of the Survey

## 2. Prerequisite: Basics of Differential Geometry

#### 2.1. Overview of Differential Geometry: Manifold $(M,g,\nabla )$

- A metric tensor g, and
- An affine connection ∇.

- The covariant derivative operator which provides a way to calculate differentials of a vector field Y with respect to another vector field X: namely, the covariant derivative ${\nabla}_{X}Y$,
- The parallel transport ${\prod}_{c}^{\nabla}$ which defines a way to transport vectors between tangent planes along any smooth curve c,
- The notion of ∇-geodesics ${\gamma}_{\nabla}$ which are defined as autoparallel curves, thus extending the ordinary notion of Euclidean straightness,
- The intrinsic curvature and torsion of the manifold.
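In coordinates, the covariant derivative is determined by the connection coefficients ${\mathsf{\Gamma}}_{ij}^{k}$. As a hands-on illustration (a minimal sketch, not tied to any specific formula of this survey), the snippet below estimates the Levi–Civita coefficients of a user-supplied metric tensor field by central finite differences, and checks them on the Poincaré upper half-plane metric $g=\mathrm{diag}(1/{y}^{2},1/{y}^{2})$, whose nonzero coefficients are known in closed form:

```python
import numpy as np

def christoffel(metric, p, h=1e-5):
    """Levi-Civita connection coefficients Gamma[k, i, j] at point p, from
    Gamma^k_ij = 1/2 g^{kl} (d_i g_{jl} + d_j g_{il} - d_l g_{ij}),
    with the partial derivatives of g taken by central finite differences."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    g_inv = np.linalg.inv(metric(p))
    dg = np.empty((n, n, n))           # dg[l, i, j] = d_l g_{ij}
    for l in range(n):
        e = np.zeros(n)
        e[l] = h
        dg[l] = (metric(p + e) - metric(p - e)) / (2 * h)
    Gamma = np.empty((n, n, n))
    for k in range(n):
        for i in range(n):
            for j in range(n):
                Gamma[k, i, j] = 0.5 * sum(
                    g_inv[k, l] * (dg[i, j, l] + dg[j, i, l] - dg[l, i, j])
                    for l in range(n))
    return Gamma

# Sanity check on the Poincare upper half-plane metric g = I / y^2, whose
# nonzero coefficients are Gamma^x_xy = Gamma^y_yy = -1/y and Gamma^y_xx = 1/y.
G = christoffel(lambda p: np.eye(2) / p[1] ** 2, [0.3, 2.0])
print(G[0, 0, 1], G[1, 0, 0], G[1, 1, 1])  # ~ -0.5, 0.5, -0.5 at y = 2
```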

#### 2.2. Metric Tensor Fields g

#### 2.3. Affine Connections ∇

#### 2.3.1. Covariant Derivatives ${\nabla}_{X}Y$ of Vector Fields

#### 2.3.2. Parallel Transport ${\prod}_{c}^{\nabla}$ along a Smooth Curve c

#### 2.3.3. ∇-Geodesics ${\gamma}_{\nabla}$: Autoparallel Curves

- Initial Value Problem (IVP): fix the conditions $\gamma \left(0\right)=p$ and $\dot{\gamma}\left(0\right)=v$ for some vector $v\in {T}_{p}M$.
- Boundary Value Problem (BVP): fix the geodesic extremities $\gamma \left(0\right)=p$ and $\gamma \left(1\right)=q$.
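For a concrete feel of the IVP formulation, here is a minimal sketch (assuming the closed-form Christoffel symbols of the hyperbolic upper half-plane) that integrates the geodesic equation ${\ddot{\gamma}}^{k}+{\mathsf{\Gamma}}_{ij}^{k}{\dot{\gamma}}^{i}{\dot{\gamma}}^{j}=0$ with small explicit Euler steps; a vertical initial velocity at $(0,1)$ should trace the vertical geodesic $\gamma \left(t\right)=(0,{e}^{t})$:

```python
import math

def halfplane_accel(x, y, vx, vy):
    # Geodesic equation ddgamma^k = -Gamma^k_ij dgamma^i dgamma^j for the
    # Poincare upper half-plane, using its closed-form Christoffel symbols.
    ax = 2.0 * vx * vy / y
    ay = (vy * vy - vx * vx) / y
    return ax, ay

def geodesic(p, v, t=1.0, n=10000):
    """Integrate the geodesic IVP gamma(0) = p, gamma'(0) = v with n small
    explicit Euler steps on the state (position, velocity) up to time t."""
    (x, y), (vx, vy) = p, v
    dt = t / n
    for _ in range(n):
        ax, ay = halfplane_accel(x, y, vx, vy)
        x, y, vx, vy = x + dt * vx, y + dt * vy, vx + dt * ax, vy + dt * ay
    return x, y

# Vertical initial velocity: the geodesic is gamma(t) = (0, e^t).
print(geodesic((0.0, 1.0), (0.0, 1.0), t=1.0))  # ~ (0.0, e)
```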

#### 2.3.4. Curvature and Torsion of a Manifold

#### 2.4. The Fundamental Theorem of Riemannian Geometry: The Levi–Civita Metric Connection

**Theorem 1.**

#### 2.5. Preview: Information Geometry versus Riemannian Geometry

## 3. Information Manifolds

#### 3.1. Overview

#### 3.2. Conjugate Connection Manifolds: $(M,g,\nabla ,{\nabla}^{*})$

**Definition 1.**

**Definition 2.**

**Property 1.**

**Property 2.**

**Property 3.**

#### 3.3. Statistical Manifolds: $(M,g,C)$

**Definition 3.**

#### 3.4. A Family ${\left\{(M,g,{\nabla}^{-\alpha},{\nabla}^{\alpha}={\left({\nabla}^{-\alpha}\right)}^{*})\right\}}_{\alpha \in \mathbb{R}}$ of Conjugate Connection Manifolds

**Theorem 2.**

#### 3.5. The Fundamental Theorem of Information Geometry: ∇ $\kappa $-Curved ⇔ ${\nabla}^{*}$ $\kappa $-Curved

**Theorem 3.**

**Corollary 1.**

**Corollary 2.**

**Definition 4.**

#### 3.6. Conjugate Connections from Divergences: $(M,D)\equiv (M,{}^{D}g,{}^{D}\nabla ,{}^{D}{\nabla}^{*}={}^{{D}^{*}}\nabla )$

**Definition 5** (Divergence).

- $D(\theta :{\theta}^{\prime})\ge 0$ for all $\theta ,{\theta}^{\prime}\in \mathsf{\Theta}$, with equality holding iff $\theta ={\theta}^{\prime}$ (law of the indiscernibles),
- ${\partial}_{i,\cdot}D(\theta :{\theta}^{\prime}){|}_{\theta ={\theta}^{\prime}}={\partial}_{\cdot,j}D(\theta :{\theta}^{\prime}){|}_{\theta ={\theta}^{\prime}}=0$ for all $i,j\in \left[D\right]$,
- $-{\partial}_{i,\cdot}{\partial}_{\cdot,j}D(\theta :{\theta}^{\prime}){|}_{\theta ={\theta}^{\prime}}$ is positive-definite.
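These three axioms can be probed numerically for a candidate contrast function. The sketch below (using the half squared Euclidean distance as a hypothetical divergence chosen only for illustration) checks non-negativity with equality on the diagonal, the vanishing of both first-order partials, and the positive-definiteness of the induced bilinear form $-{\partial}_{i,\cdot}{\partial}_{\cdot,j}D{|}_{\theta ={\theta}^{\prime}}$ via finite differences:

```python
import numpy as np

def D(t1, t2):
    # Hypothetical divergence for illustration: half squared Euclidean distance.
    d = np.asarray(t1) - np.asarray(t2)
    return 0.5 * float(d @ d)

def check_divergence(D, theta, h=1e-4):
    """Numerically check the three divergence axioms at the point theta."""
    theta = np.asarray(theta, dtype=float)
    n = len(theta)
    E = np.eye(n) * h
    # (1) equality on the diagonal theta = theta'
    assert D(theta, theta) == 0.0
    # (2) both first-order partials vanish on the diagonal
    for i in range(n):
        g1 = (D(theta + E[i], theta) - D(theta - E[i], theta)) / (2 * h)
        g2 = (D(theta, theta + E[i]) - D(theta, theta - E[i])) / (2 * h)
        assert abs(g1) < 1e-8 and abs(g2) < 1e-8
    # (3) g_ij = -d_{i,.} d_{.,j} D on the diagonal is positive-definite
    g = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            g[i, j] = -(D(theta + E[i], theta + E[j]) - D(theta + E[i], theta - E[j])
                        - D(theta - E[i], theta + E[j]) + D(theta - E[i], theta - E[j])) / (4 * h * h)
    assert np.all(np.linalg.eigvalsh(g) > 0)
    return g

g = check_divergence(D, [0.3, -1.2])
print(g)  # ~ identity matrix for this divergence
```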

**Theorem 4.**

#### 3.7. Dually Flat Manifolds (Bregman Geometry): $(M,F)\equiv (M,{}^{{B}_{F}}g,{}^{{B}_{F}}\nabla ,{}^{{B}_{F}}{\nabla}^{*}={}^{{B}_{{F}^{*}}}\nabla )$

**Theorem 5.**

**Theorem 6.**

**Theorem 7.**

**Property 4.**

#### 3.8. Hessian $\alpha $-Geometry: $(M,F,\alpha )\equiv (M,{}^{F}g,{}^{F}{\nabla}^{-\alpha},{}^{F}{\nabla}^{\alpha})$

#### 3.9. Expected $\alpha $-Manifolds of a Family of Parametric Probability Distributions: $(\mathcal{P},{}_{\mathcal{P}}g,{}_{\mathcal{P}}{\nabla}^{-\alpha},{}_{\mathcal{P}}{\nabla}^{\alpha})$

**Example 1.**

**Example 2.**

**Example 3.**

#### 3.10. Criteria for Statistical Invariance

- Which metric tensors g make sense in statistics?
- Which affine connections ∇ make sense in statistics?
- Which statistical divergences make sense in statistics (from which we can get the metric tensor and dual connections)?

**Theorem 8.**

- The family of $\alpha $-divergences for $\alpha \ne \pm 1$:
  $${I}_{\alpha}[p:q]:=\frac{4}{1-{\alpha}^{2}}\left(1-\int {p}^{\frac{1-\alpha}{2}}\left(x\right){q}^{\frac{1+\alpha}{2}}\left(x\right)\mathrm{d}\mu \left(x\right)\right),$$
  which includes in the limit:
  - The Kullback–Leibler divergence when $\alpha \to -1$:
    $$\mathrm{KL}[p:q]=\int p\left(x\right)\log \frac{p\left(x\right)}{q\left(x\right)}\mathrm{d}\mu \left(x\right),$$
  - The reverse Kullback–Leibler divergence when $\alpha \to 1$:
    $${\mathrm{KL}}^{*}[p:q]:=\int q\left(x\right)\log \frac{q\left(x\right)}{p\left(x\right)}\mathrm{d}\mu \left(x\right)=\mathrm{KL}[q:p],$$
  - The symmetric squared Hellinger divergence (with ${I}_{0}=2{H}^{2}$ at $\alpha =0$):
    $${H}^{2}[p:q]:=\int {\left(\sqrt{p\left(x\right)}-\sqrt{q\left(x\right)}\right)}^{2}\mathrm{d}\mu \left(x\right),$$
  - The Pearson and Neyman chi-squared divergences [62], etc.

- The Jensen–Shannon divergence:
  $$\mathrm{JS}[p:q]:=\frac{1}{2}\int \left(p\left(x\right)\log \frac{2p\left(x\right)}{p\left(x\right)+q\left(x\right)}+q\left(x\right)\log \frac{2q\left(x\right)}{p\left(x\right)+q\left(x\right)}\right)\mathrm{d}\mu \left(x\right),$$
- The total variation distance:
  $$\mathrm{TV}[p:q]:=\frac{1}{2}\int \left|p\left(x\right)-q\left(x\right)\right|\mathrm{d}\mu \left(x\right).$$
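On a finite sample space the integrals become sums, and the limit behavior of the $\alpha $-divergences can be checked numerically. A minimal sketch, with discrete distributions p and q chosen arbitrarily for illustration:

```python
import math

p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

def alpha_div(p, q, a):
    # I_alpha[p:q] = 4/(1-a^2) (1 - sum p^((1-a)/2) q^((1+a)/2)), a != +/-1
    s = sum(pi ** ((1 - a) / 2) * qi ** ((1 + a) / 2) for pi, qi in zip(p, q))
    return 4.0 / (1.0 - a * a) * (1.0 - s)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def hellinger2(p, q):
    return sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

def tv(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# alpha -> -1 recovers KL[p:q], alpha -> +1 recovers the reverse KL[q:p],
# and alpha = 0 gives twice the squared Hellinger distance.
print(alpha_div(p, q, -0.9999), kl(p, q))
print(alpha_div(p, q, 0.9999), kl(q, p))
print(alpha_div(p, q, 0.0), 2 * hellinger2(p, q))
```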

**Theorem 9.**

**Example 4.**

#### 3.11. Fisher–Rao Expected Riemannian Manifolds: $(\mathcal{P},{}_{\mathcal{P}}g)$

**Definition 6.**

- The Fisher–Riemannian manifold of the family of bivariate location-scale families amounts to hyperbolic geometry (hyperbolic manifold).
- The Fisher–Riemannian manifold of the family of location families amounts to Euclidean geometry (Euclidean manifold).
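For the univariate normal family, the Fisher metric reads $\mathrm{d}{s}^{2}=(\mathrm{d}{\mu}^{2}+2\mathrm{d}{\sigma}^{2})/{\sigma}^{2}$, so the map $(\mu ,\sigma )\mapsto (\mu /\sqrt{2},\sigma )$ identifies it with $\sqrt{2}$ times the Poincaré upper half-plane metric. A sketch of the resulting closed-form Fisher–Rao distance, assuming this standard identification:

```python
import math

def fisher_rao_normal(mu1, s1, mu2, s2):
    """Fisher-Rao distance between N(mu1, s1^2) and N(mu2, s2^2), via the
    hyperbolic distance of the Poincare upper half-plane scaled by sqrt(2)."""
    a = ((mu1 - mu2) ** 2 / 2 + (s1 - s2) ** 2) / (2 * s1 * s2)
    return math.sqrt(2) * math.acosh(1 + a)

# A pure scale displacement reduces to sqrt(2) |log(s1/s2)|.
print(fisher_rao_normal(0.0, 1.0, 0.0, math.e))  # ~ sqrt(2)
```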

**Example 5.**

#### 3.12. The Monotone $\alpha $-Embeddings and the Gauge Freedom of the Metric

#### 3.13. Dually Flat Spaces and Canonical Bregman Divergences

**Proof.**

- Consider an exponential family $\mathcal{E}$ of order $D$ with densities defined according to a dominating measure $\mu $:
$$\mathcal{E}=\left\{{p}_{\theta}\left(x\right)=\exp \left({\theta}^{\top}t\left(x\right)-F\left(\theta \right)\right)\phantom{\rule{4pt}{0ex}}:\phantom{\rule{4pt}{0ex}}\theta \in \mathsf{\Theta}\right\},$$
with log-normalizer
$$F\left(\theta \right)={F}_{\mathcal{E}}\left({p}_{\theta}\right)=\log \left(\int \exp \left({\theta}^{\top}t\left(x\right)\right)\mathrm{d}\mu \left(x\right)\right),$$
and convex conjugate given by the negentropy
$${F}^{*}\left(\eta \right)=-h\left({p}_{\theta}\right)=\int {p}_{\theta}\left(x\right)\log {p}_{\theta}\left(x\right)\,\mathrm{d}\mu \left(x\right).$$
Let $\lambda \left(i\right)$ denote the $i$-th coordinate of a vector $\lambda $, and let us calculate the inner product ${\theta}_{1}^{\top}{\eta}_{2}={\sum}_{i}{\theta}_{1}\left(i\right){\eta}_{2}\left(i\right)$ of the Legendre–Fenchel divergence. We have ${\eta}_{2}\left(i\right)={E}_{{p}_{{\theta}_{2}}}\left[{t}_{i}\left(x\right)\right]$. Using the linearity of the expectation $E[\cdot]$, we find that ${\sum}_{i}{\theta}_{1}\left(i\right){\eta}_{2}\left(i\right)={E}_{{p}_{{\theta}_{2}}}\left[{\sum}_{i}{\theta}_{1}\left(i\right){t}_{i}\left(x\right)\right]$. Moreover, we have ${\sum}_{i}{\theta}_{1}\left(i\right){t}_{i}\left(x\right)=\log {p}_{{\theta}_{1}}\left(x\right)+F\left({\theta}_{1}\right)$. Thus:
$${\theta}_{1}^{\top}{\eta}_{2}={E}_{{p}_{{\theta}_{2}}}\left[\log {p}_{{\theta}_{1}}+F\left({\theta}_{1}\right)\right]=F\left({\theta}_{1}\right)+{E}_{{p}_{{\theta}_{2}}}\left[\log {p}_{{\theta}_{1}}\right].$$
It follows that
$$\begin{aligned}
{B}_{F,\mathcal{E}}[{p}_{{\theta}_{1}}:{p}_{{\theta}_{2}}] &= F\left({\theta}_{1}\right)+{F}^{*}\left({\eta}_{2}\right)-{\theta}_{1}^{\top}{\eta}_{2},\\
&= F\left({\theta}_{1}\right)-h\left({p}_{{\theta}_{2}}\right)-{E}_{{p}_{{\theta}_{2}}}\left[\log {p}_{{\theta}_{1}}\right]-F\left({\theta}_{1}\right),\\
&= {E}_{{p}_{{\theta}_{2}}}\left[\log \frac{{p}_{{\theta}_{2}}}{{p}_{{\theta}_{1}}}\right]=:{D}_{{\mathrm{KL}}^{*}}[{p}_{{\theta}_{1}}:{p}_{{\theta}_{2}}].
\end{aligned}$$
By relaxing the exponential family densities ${p}_{{\theta}_{1}}$ and ${p}_{{\theta}_{2}}$ to arbitrary densities ${p}_{1}$ and ${p}_{2}$, we obtain the reverse KL divergence between ${p}_{1}$ and ${p}_{2}$ from the dually flat structure induced by the integral-based log-normalizer of an exponential family:
$$\begin{aligned}
{D}_{{\mathrm{KL}}^{*}}[{p}_{1}:{p}_{2}] &= {E}_{{p}_{2}}\left[\log \frac{{p}_{2}}{{p}_{1}}\right]=\int {p}_{2}\left(x\right)\log \frac{{p}_{2}\left(x\right)}{{p}_{1}\left(x\right)}\,\mathrm{d}\mu \left(x\right),\\
&= {D}_{\mathrm{KL}}[{p}_{2}:{p}_{1}].
\end{aligned}$$
The dual divergence ${D}^{*}[{p}_{1}:{p}_{2}]:=D[{p}_{2}:{p}_{1}]$ is obtained by swapping the distribution argument order. We have:
$${D}_{{\mathrm{KL}}^{*}}^{*}[{p}_{1}:{p}_{2}]:={D}_{{\mathrm{KL}}^{*}}[{p}_{2}:{p}_{1}]={E}_{{p}_{1}}\left[\log \frac{{p}_{1}}{{p}_{2}}\right]=:{D}_{\mathrm{KL}}[{p}_{1}:{p}_{2}].$$
To summarize, the canonical Legendre–Fenchel divergence associated with the log-normalizer of an exponential family amounts to the statistical reverse Kullback–Leibler divergence between ${p}_{{\theta}_{1}}$ and ${p}_{{\theta}_{2}}$ (or the KL divergence between the swapped densities): ${D}_{\mathrm{KL}}[{p}_{{\theta}_{1}}:{p}_{{\theta}_{2}}]={B}_{F}({\theta}_{2}:{\theta}_{1})={A}_{F,{F}^{*}}({\theta}_{2}:{\eta}_{1})$. Notice that it is easy to check directly that ${D}_{\mathrm{KL}}[{p}_{{\theta}_{1}}:{p}_{{\theta}_{2}}]={B}_{F}({\theta}_{2}:{\theta}_{1})$ [74,75]; here, we took the opposite direction by constructing ${D}_{\mathrm{KL}}$ from ${B}_{F}$.
We may consider an auxiliary carrier term $k\left(x\right)$ so that the densities write ${p}_{\theta}\left(x\right)=\exp ({\theta}^{\top}t\left(x\right)-F\left(\theta \right)+k\left(x\right))$. Then the dual convex conjugate writes [76] as ${F}^{*}\left(\eta \right)=-h\left({p}_{\theta}\right)+{E}_{{p}_{\theta}}\left[k\left(x\right)\right]$. Notice that since a Bregman generator is defined up to an affine term, we may consider the equivalent generator $F\left(\theta \right)=-\log {p}_{\theta}\left(\omega \right)$ (for a fixed $\omega $ in the support) instead of the integral-based generator. This approach yields formulas bypassing the explicit use of the log-normalizer for calculating various statistical distances [77].
- In this second example, we consider a mixture family
$$\mathcal{M}=\left\{{m}_{\theta}=\sum _{i=1}^{D}{\theta}_{i}{p}_{i}\left(x\right)+\Big(1-\sum _{i=1}^{D}{\theta}_{i}\Big){p}_{0}\left(x\right)\right\},$$
with convex generator the negentropy
$$F\left(\theta \right)={F}_{\mathcal{M}}\left({m}_{\theta}\right)=-h\left({m}_{\theta}\right)=\int {m}_{\theta}\left(x\right)\log {m}_{\theta}\left(x\right)\,\mathrm{d}\mu \left(x\right).$$
We have
$${\eta}_{i}={[\nabla F\left(\theta \right)]}_{i}=\int ({p}_{i}\left(x\right)-{p}_{0}\left(x\right))\log {m}_{\theta}\left(x\right)\,\mathrm{d}\mu \left(x\right),$$
$${F}^{*}\left(\eta \right)=-\int {p}_{0}\left(x\right)\log {m}_{\theta}\left(x\right)\,\mathrm{d}\mu \left(x\right)={h}^{\times}({p}_{0}:{m}_{\theta}),$$
the cross-entropy between ${p}_{0}$ and ${m}_{\theta}$. The inner product expands as
$$\sum _{i}{\theta}_{1}\left(i\right)\int ({p}_{i}\left(x\right)-{p}_{0}\left(x\right))\log {m}_{{\theta}_{2}}\left(x\right)\,\mathrm{d}\mu \left(x\right)=\int \sum _{i}{\theta}_{1}\left(i\right){p}_{i}\left(x\right)\log {m}_{{\theta}_{2}}\left(x\right)\,\mathrm{d}\mu \left(x\right)-\int \sum _{i}{\theta}_{1}\left(i\right){p}_{0}\left(x\right)\log {m}_{{\theta}_{2}}\left(x\right)\,\mathrm{d}\mu \left(x\right).$$
That is,
$${\theta}_{1}^{\top}{\eta}_{2}=\int \sum _{i}{\theta}_{1}\left(i\right){p}_{i}\log {m}_{{\theta}_{2}}\,\mathrm{d}\mu -\int \sum _{i}{\theta}_{1}\left(i\right){p}_{0}\log {m}_{{\theta}_{2}}\,\mathrm{d}\mu .$$
Thus it follows that we have the following statistical distance:
$$\begin{aligned}
{B}_{F,\mathcal{M}}[{m}_{{\theta}_{1}}:{m}_{{\theta}_{2}}] &:= F\left({\theta}_{1}\right)+{F}^{*}\left({\eta}_{2}\right)-{\theta}_{1}^{\top}{\eta}_{2},\\
&= -h\left({m}_{{\theta}_{1}}\right)-\int {p}_{0}\left(x\right)\log {m}_{{\theta}_{2}}\left(x\right)\,\mathrm{d}\mu \left(x\right)-\int \sum _{i}{\theta}_{1}\left(i\right){p}_{i}\left(x\right)\log {m}_{{\theta}_{2}}\left(x\right)\,\mathrm{d}\mu \left(x\right)+\int \sum _{i}{\theta}_{1}\left(i\right){p}_{0}\left(x\right)\log {m}_{{\theta}_{2}}\left(x\right)\,\mathrm{d}\mu \left(x\right),\\
&= -h\left({m}_{{\theta}_{1}}\right)-\int \Big(\big(1-\sum _{i}{\theta}_{1}\left(i\right)\big){p}_{0}\left(x\right)+\sum _{i}{\theta}_{1}\left(i\right){p}_{i}\left(x\right)\Big)\log {m}_{{\theta}_{2}}\left(x\right)\,\mathrm{d}\mu \left(x\right),\\
&= -h\left({m}_{{\theta}_{1}}\right)-\int {m}_{{\theta}_{1}}\left(x\right)\log {m}_{{\theta}_{2}}\left(x\right)\,\mathrm{d}\mu \left(x\right),\\
&= \int {m}_{{\theta}_{1}}\left(x\right)\log \frac{{m}_{{\theta}_{1}}\left(x\right)}{{m}_{{\theta}_{2}}\left(x\right)}\,\mathrm{d}\mu \left(x\right),\\
&= {D}_{\mathrm{KL}}[{m}_{{\theta}_{1}}:{m}_{{\theta}_{2}}].
\end{aligned}$$
Thus we have ${D}_{\mathrm{KL}}[{m}_{{\theta}_{1}}:{m}_{{\theta}_{2}}]={B}_{F}({\theta}_{1}:{\theta}_{2})$. By relaxing the mixture densities ${m}_{{\theta}_{1}}$ and ${m}_{{\theta}_{2}}$ to arbitrary densities ${m}_{1}$ and ${m}_{2}$, we find that the dually flat geometry induced by the negentropy of the densities of a mixture family yields a statistical distance which corresponds to the (forward) KL divergence. That is, we have recovered the statistical distance ${D}_{\mathrm{KL}}$ from ${B}_{F,\mathcal{M}}$. Note that in general the entropy of a mixture is not available in closed form (because of the log-sum term), except when the component distributions have pairwise disjoint supports. This latter case includes the case of Dirac distributions, whose mixtures represent the categorical distributions.
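Both sides of this identity can be verified numerically in the simplest closed-form case: a mixture of Dirac distributions (the categorical family), where the negentropy generator is available exactly. The sketch below compares the Bregman divergence of the negentropy (with a finite-difference gradient) against the KL divergence between the corresponding categorical distributions:

```python
import math

def mix(theta):
    # m_theta = theta1 * delta_1 + theta2 * delta_2 + (1 - theta1 - theta2) * delta_0
    t1, t2 = theta
    return [1 - t1 - t2, t1, t2]

def F(theta):
    # Negentropy of the mixture: closed form for Dirac components.
    return sum(m * math.log(m) for m in mix(theta))

def grad_F(theta, h=1e-6):
    t1, t2 = theta
    return [(F((t1 + h, t2)) - F((t1 - h, t2))) / (2 * h),
            (F((t1, t2 + h)) - F((t1, t2 - h))) / (2 * h)]

def bregman(t_a, t_b):
    # B_F(t_a : t_b) = F(t_a) - F(t_b) - <t_a - t_b, grad F(t_b)>
    g = grad_F(t_b)
    return F(t_a) - F(t_b) - sum((a - b) * gi for a, b, gi in zip(t_a, t_b, g))

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

ta, tb = (0.2, 0.5), (0.4, 0.3)
print(bregman(ta, tb), kl(mix(ta), mix(tb)))  # the two values agree
```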

**Example 6.**

- Base measure: $\nu \left(x\right)=\frac{\mu \left(x\right)}{x!}={e}^{k\left(x\right)}\mu \left(x\right)$ where μ is the counting measure and $k\left(x\right)=-log(x!)$ represents an auxiliary measure carrier term for defining the base measure ν,
- Sufficient statistics: $t\left(x\right)=x$,
- Natural parameter: $\theta =\theta \left(\lambda \right)=log\left(\lambda \right)\in \mathsf{\Theta}=\mathbb{R}$,
- Log-normalizer: $F\left(\theta \right)=\exp \left(\theta \right)$ since $F\left(\theta \left(\lambda \right)\right)=\lambda $.
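With these elements, the identity ${D}_{\mathrm{KL}}[{p}_{{\theta}_{1}}:{p}_{{\theta}_{2}}]={B}_{F}({\theta}_{2}:{\theta}_{1})$ can be checked in closed form for the Poisson family, whose KL divergence is $ {\lambda}_{1}\log ({\lambda}_{1}/{\lambda}_{2})+{\lambda}_{2}-{\lambda}_{1}$. A minimal sketch:

```python
import math

def F(theta):
    # Poisson log-normalizer in the natural parameter: F(theta) = exp(theta)
    return math.exp(theta)

def bregman_F(t_a, t_b):
    # B_F(t_a : t_b) = F(t_a) - F(t_b) - (t_a - t_b) F'(t_b), with F' = exp
    return F(t_a) - F(t_b) - (t_a - t_b) * math.exp(t_b)

def kl_poisson(l1, l2):
    # Closed-form KL[Poisson(l1) : Poisson(l2)]
    return l1 * math.log(l1 / l2) + l2 - l1

l1, l2 = 2.5, 4.0
t1, t2 = math.log(l1), math.log(l2)
# KL[p_theta1 : p_theta2] = B_F(theta2 : theta1): note the swapped arguments.
print(kl_poisson(l1, l2), bregman_F(t2, t1))
```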

**Example 7.**

- Natural parameters: $\theta =({\theta}_{1},{\theta}_{2})$ with $\theta \left(\lambda \right)=(-\beta ,\alpha -1)$ for source parameter $\lambda =(\alpha ,\beta )$,
- Sufficient statistics: $t\left(x\right)=(x,log(x\left)\right)$,
- Log-normalizer: $F\left(\theta \right)=-\left({\theta}_{2}+1\right)log\left(-{\theta}_{1}\right)+log\mathsf{\Gamma}\left({\theta}_{2}+1\right)$,
- Dual parameterization: $\eta =({\eta}_{1},{\eta}_{2})={E}_{{p}_{\theta}}\left[t\left(x\right)\right]=\nabla F\left(\theta \right)=\left(\frac{{\theta}_{2}+1}{-{\theta}_{1}},-log\left(-{\theta}_{1}\right)+\psi \left({\theta}_{2}+1\right)\right)$, where $\psi \left(x\right)=\frac{d}{dx}ln\left(\mathsf{\Gamma}\left(x\right)\right)=\frac{{\mathsf{\Gamma}}^{\prime}\left(x\right)}{\mathsf{\Gamma}\left(x\right)}$ denotes the digamma function.
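The dual parameterization $\eta ={E}_{{p}_{\theta}}\left[t\left(x\right)\right]$ can be sanity-checked by Monte Carlo sampling. The sketch below (digamma obtained by differencing `math.lgamma`, sampling via `random.gammavariate` whose second argument is the scale $1/\beta $) compares the closed-form $\eta $ with empirical averages of $t\left(x\right)=(x,\log x)$:

```python
import math, random

def digamma(x, h=1e-5):
    # psi(x) = d/dx log Gamma(x), by central difference of math.lgamma
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

alpha, beta = 3.0, 2.0                  # source parameters (shape, rate)
theta = (-beta, alpha - 1.0)            # natural parameters
# Dual parameters eta = grad F(theta) = E[t(x)] with t(x) = (x, log x):
eta = ((theta[1] + 1) / -theta[0],
       -math.log(-theta[0]) + digamma(theta[1] + 1))

# Monte Carlo check of eta = E[(x, log x)] under the Gamma density.
random.seed(7)
xs = [random.gammavariate(alpha, 1.0 / beta) for _ in range(200000)]
print(eta, (sum(xs) / len(xs), sum(map(math.log, xs)) / len(xs)))
```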

## 4. Some Applications of Information Geometry

- Statistics: Asymptotic inference, Expectation–Maximization (EM, and the novel information-geometric em algorithm), time series models (AutoRegressive Moving Average, ARMA),
- Signal processing: Principal Component Analysis (PCA), Independent Component Analysis (ICA), Non-negative Matrix Factorization (NMF),
- Mathematical programming: Barrier function of interior point methods,
- Game theory: Score functions.

#### 4.1. Natural Gradient in Riemannian Space

#### 4.1.1. The Vanilla Gradient Descent Method

#### 4.1.2. Natural Gradient and Its Connection with the Riemannian Gradient

**Property 5.**
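The natural gradient preconditions the ordinary gradient by the inverse metric, $\tilde{\nabla}L(\theta )={g}^{-1}\left(\theta \right)\nabla L\left(\theta \right)$. A toy sketch (assuming the closed-form Fisher information of the univariate normal in the $(\mu ,\sigma )$ chart, and minimizing the illustrative objective $L(\mu ,\sigma )=\mathrm{KL}[N(\mu ,{\sigma}^{2}):N(0,1)]$):

```python
def loss_grad(mu, s):
    # Gradient of L = KL[N(mu, s^2) : N(0,1)] = (mu^2 + s^2 - 1)/2 - log s
    return (mu, s - 1.0 / s)

def fisher(mu, s):
    # Fisher information of N(mu, s^2) in the (mu, sigma) chart: diag(1/s^2, 2/s^2)
    return (1.0 / s ** 2, 2.0 / s ** 2)

mu, s = 3.0, 0.5
for _ in range(100):
    g_mu, g_s = loss_grad(mu, s)
    f_mu, f_s = fisher(mu, s)
    # Natural-gradient update: precondition by the inverse diagonal metric.
    mu -= 0.1 * g_mu / f_mu
    s -= 0.1 * g_s / f_s
print(mu, s)  # converges toward the target (0, 1)
```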

#### 4.1.3. Natural Gradient in Dually Flat Spaces: Connections to Bregman Mirror Descent and Ordinary Gradient

**Property 6.**

**Property 7.**

**Proof.**

#### 4.1.4. An Application of the Natural Gradient: Natural Evolution Strategies (NESs)

#### 4.2. Some Illustrating Applications of Dually Flat Manifolds

#### 4.3. Hypothesis Testing in the Dually Flat Exponential Family Manifold $(\mathcal{E},{\mathrm{KL}}^{*})$

**Theorem 10.**

#### 4.4. Clustering Mixtures in the Dually Flat Mixture Family Manifold $(\mathcal{M},\mathrm{KL})$

## 5. Conclusions: Summary, Historical Background, and Perspectives

#### 5.1. Summary

#### 5.2. A Brief Historical Review of Information Geometry

| Notation | Description |
| --- | --- |
| $(M,g)$ | Riemannian manifold |
| $(\mathcal{P},{}_{\mathcal{P}}g)$ | Fisher–Riemannian (expected) Riemannian manifold |
| $(M,g,\nabla )$ | Riemannian manifold $(M,g)$ with affine connection $\nabla $ |
| $(\mathcal{P},{}_{\mathcal{P}}g,{}_{\mathcal{P}}{}^{e}{\nabla}^{\alpha})$ | Chentsov’s manifold with affine exponential $\alpha $-connection |
| $(M,g,\nabla ,{\nabla}^{*})$ | Amari’s dualistic information manifold |
| $(\mathcal{P},{}_{\mathcal{P}}g,{}_{\mathcal{P}}{\nabla}^{-\alpha},{}_{\mathcal{P}}{\nabla}^{\alpha})$ | Amari’s (expected) information $\alpha $-manifold, $\alpha $-geometry |
| $(M,g,C)$ | Lauritzen’s statistical manifold [29] |
| $(M,{}^{D}g,{}^{D}\nabla ,{}^{{D}^{*}}\nabla )$ | Eguchi’s conjugate connection manifold induced by divergence $D$ |
| $(M,{}^{F}g,{}^{F}C)$ | Chentsov/Amari’s dually flat manifold induced by convex potential $F$ |

#### 5.3. Perspectives

## Funding

## Acknowledgments

## Conflicts of Interest

## Appendix A. Monte Carlo Estimations of f-Divergences

**Definition A1.**
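An $f$-divergence ${I}_{f}[p:q]=\int p\left(x\right)f\left(\frac{q\left(x\right)}{p\left(x\right)}\right)\mathrm{d}\mu \left(x\right)$ can be estimated by averaging $f(q/p)$ over i.i.d. samples drawn from $p$. A minimal sketch, using the generator $f\left(u\right)=-\log u$ of the KL divergence on a Gaussian example whose answer is known in closed form:

```python
import math, random

def mc_f_divergence(f, log_p, log_q, sample_p, n=100000):
    """Monte Carlo estimate of I_f[p:q] = E_p[ f(q(x)/p(x)) ],
    drawing n i.i.d. samples from p."""
    total = 0.0
    for _ in range(n):
        x = sample_p()
        total += f(math.exp(log_q(x) - log_p(x)))
    return total / n

# KL divergence is the f-divergence generated by f(u) = -log u.
random.seed(0)
log_p = lambda x: -0.5 * x * x - 0.5 * math.log(2 * math.pi)
log_q = lambda x: -0.5 * (x - 1) ** 2 - 0.5 * math.log(2 * math.pi)
est = mc_f_divergence(lambda u: -math.log(u), log_p, log_q,
                      lambda: random.gauss(0.0, 1.0))
print(est)  # closed form: KL[N(0,1) : N(1,1)] = 1/2
```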

## Appendix B. The Multivariate Gaussian Family: An Exponential Family

## Appendix C. Skew Jensen Divergences and Bregman Divergences

**Lemma A1** (Chordal slope lemma).

**Theorem A1.**

**Proof.**