# On Maximum Entropy and Inference

## Abstract

## 1. Introduction

## 2. Background

#### 2.1. Spin Models with Interactions of Arbitrary Order

#### 2.2. Bayesian Model Selection on Mixture Models

## 3. Mapping Mixture Models into Spin Models

## 4. Illustrative Examples

#### 4.1. Recovering the Generating Hamiltonian from Symmetries: Two and Four Spins

#### 4.2. Exchangeable Spin Models

#### 4.3. The Deep Under-Sampling Limit

#### 4.4. A Real-World Example

## 5. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## Appendix A. The Completeness Relation (8)

## Appendix B. The Posterior Distribution of ϱ

## Appendix C. The Covariance Matrix of the g^{μ} Parameters

## Appendix D. Sufficient Statistics

## References and Notes

1. Jaynes, E.T. Information Theory and Statistical Mechanics. Phys. Rev. **1957**, 106, 620–630.
2. Pitman, E.J.G. Sufficient statistics and intrinsic accuracy. In Mathematical Proceedings of the Cambridge Philosophical Society; Cambridge University Press: Cambridge, UK, 1936; Volume 32, pp. 567–579.
3. Darmois, G. Sur les lois de probabilité à estimation exhaustive. C. R. Acad. Sci. Paris **1935**, 200, 1265–1266. (In French)
4. Koopman, B.O. On distributions admitting a sufficient statistic. Trans. Am. Math. Soc. **1936**, 39, 399–409.
5. Ackley, D.H.; Hinton, G.E.; Sejnowski, T.J. A Learning Algorithm for Boltzmann Machines. Cogn. Sci. **1985**, 9, 147–169.
6. Schneidman, E.; Berry, M.J., II; Segev, R.; Bialek, W. Weak pairwise correlations imply strongly correlated network states in a neural population. Nature **2006**, 440, 1007–1012.
7. Nguyen, H.C.; Zecchina, R.; Berg, J. Inverse statistical problems: From the inverse Ising problem to data science. arXiv, 2017; arXiv:1702.01522.
8. Lee, E.; Broedersz, C.; Bialek, W. Statistical mechanics of the US Supreme Court. J. Stat. Phys. **2015**, 160, 275–301.
9. Wainwright, M.J.; Jordan, M.I. Variational inference in graphical models: The view from the marginal polytope. In Proceedings of the Annual Allerton Conference on Communication Control and Computing, Allerton, IL, USA, 23–25 September 1998; Volume 41, pp. 961–971.
10. Sejnowski, T.J. Higher-order Boltzmann machines. AIP Conf. Proc. **1986**, 151, 398–403.
11. Amari, S. Information Geometry on Hierarchy of Probability Distributions; IEEE: Hoboken, NJ, USA, 2001; Volume 47, pp. 1701–1711.
12. Margolin, A.; Wang, K.; Califano, A.; Nemenman, I. Multivariate dependence and genetic networks inference. IET Syst. Biol. **2010**, 4, 428–440.
13. Merchan, L.; Nemenman, I. On the Sufficiency of Pairwise Interactions in Maximum Entropy Models of Networks. J. Stat. Phys. **2016**, 162, 1294–1308.
14. Limiting inference schemes to pairwise interactions is non-trivial when variables take more than two values (e.g., Potts spins). A notable example is the inference of protein contacts from amino acid sequences. There, each variable can take 20 possible values; hence, there are 200 parameters for each pair of positions. Sequences are typically $n\sim 100$ amino acids long, so a pairwise model contains $200\,{n}^{2}/2\sim {10}^{6}$ parameters. Although the number of available sequences is much smaller than that (i.e., $N\sim {10}^{3}$ to ${10}^{4}$), learning Potts model parameters has been found to be an effective means to predict structural properties of proteins [7]. However, we will not enter into details related to the Potts model in the present work.
15. As already pointed out in [5], any higher order interaction can be reduced to pairwise interactions by introducing hidden variables. Conversely, higher order interactions may signal the presence of hidden variables.
16. Haimovici, A.; Marsili, M. Criticality of mostly informative samples: A Bayesian model selection approach. J. Stat. Mech. Theory Exp. **2015**, 2015, P10013.
17. Collins, M.; Dasgupta, S.; Schapire, R.E. A Generalization of Principal Component Analysis to the Exponential Family. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2001; pp. 617–624.
18. Beretta, A.; Battistin, C.; Mulatier, C.; Mastromatteo, I.; Marsili, M. The Stochastic complexity of spin models: How simple are simple spin models? arXiv, 2017; arXiv:1702.07549.
19. Transtrum, M.K.; Machta, B.B.; Brown, K.S.; Daniels, B.C.; Myers, C.R.; Sethna, J.P. Perspective: Sloppiness and emergent theories in physics, biology, and beyond. J. Chem. Phys. **2015**, 143, 010901.
20. Tkačik, G.; Marre, O.; Mora, T.; Amodei, D.; Berry, M.J., II; Bialek, W. The simplest maximum entropy model for collective behavior in a neural network. J. Stat. Mech. Theory Exp. **2013**, 2013, P03011.
21. Notice that other inference methods may infer non-zero interactions in this case [7]. Note also that the statistics of the frequencies can be very different if one takes a subset of ${n}^{\prime}<n$ spins, so the present approach may predict ${g}^{\mu}\ne 0$ when the same dataset is restricted to a subset of spins.
22. A conservative estimate of the number of significant interactions is given by the number of independent parameters ${g}_{\lambda}$ in our data. These are 18 in the U.S. Supreme Court data and 12 in the synthetic data.
23. Reference [8] remarks that the definitions of “yes” and “no” are somewhat arbitrary and do not carry any information on the political orientation associated with a given vote, since they are decided in lower courts; it also shows that, even when a “left-wing/right-wing” label is attached to the “yes/no” votes, the fields alone do not explain the data well.
24. Gelman, A.; Carlin, J.B.; Stern, H.S.; Dunson, D.B.; Vehtari, A.; Rubin, D.B. Bayesian Data Analysis; Chapman and Hall/CRC Press: Boca Raton, FL, USA, 2014; Volume 2.
25. Box, G.E.P.; Tiao, G.C. Bayesian Inference in Statistical Analysis; Addison-Wesley Publishing Company: Boston, MA, USA, 1973.

**Figure 1.** A representation of the two models consistent with the ${\mathcal{Q}}^{\ast}$ partition in Equation (20). Blue links represent pairwise interactions and the shaded area represents a four-body interaction.

**Figure 2.** Inferred parameters of a mean field Ising model (Equation (22)) with $n=10$ (bottom) and $n=20$ (top) spins (only interactions up to $m=10$ spins are shown for $n=20$). (left) Couplings ${g}^{m}$ of m-th order interactions as a function of $\beta $ ($N={10}^{3}$ samples for $n=10$ and $N={10}^{4}$ for $n=20$). (right) ${g}^{m}$ as a function of N for $\beta =1$. The correct value ${g}^{2}=\beta /n$ is shown as a full line.
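As a rough illustration of the kind of synthetic data behind Figure 2, the sketch below draws samples from a fully connected Ising model by exact enumeration. Equation (22) is not reproduced in this section, so the functional form used here, $P(s)\propto \exp \left[(\beta /n)\sum_{i<j}{s}_{i}{s}_{j}\right]$ with every pair sharing the coupling ${g}^{2}=\beta /n$ quoted in the caption, is an assumption. The function names `mean_field_ising_distribution` and `sample_dataset` are hypothetical, and enumeration is only practical for small $n$ such as $n=10$.

```python
import itertools
import numpy as np


def mean_field_ising_distribution(n, beta):
    """Exact distribution of a fully connected Ising model with n spins.

    ASSUMPTION: P(s) is proportional to exp((beta / n) * sum_{i<j} s_i s_j),
    i.e. every pair of spins shares the coupling g^2 = beta / n.
    """
    # All 2^n configurations of +/-1 spins, shape (2^n, n).
    states = np.array(list(itertools.product([-1, 1], repeat=n)))
    total = states.sum(axis=1)
    # For +/-1 spins: sum_{i<j} s_i s_j = ((sum_i s_i)^2 - n) / 2.
    pair_sum = (total ** 2 - n) / 2.0
    log_weights = (beta / n) * pair_sum
    log_weights -= log_weights.max()  # stabilise the exponential
    probs = np.exp(log_weights)
    probs /= probs.sum()
    return states, probs


def sample_dataset(n, beta, num_samples, seed=0):
    """Draw num_samples independent configurations from the exact distribution."""
    states, probs = mean_field_ising_distribution(n, beta)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(states), size=num_samples, p=probs)
    return states[idx]


# Settings matching one panel of the caption: n = 10, beta = 1, N = 10^3.
samples = sample_dataset(n=10, beta=1.0, num_samples=1000)
```

Each row of `samples` is one configuration of $\pm 1$ spins. Exact enumeration is used instead of Monte Carlo sampling so that the draws are independent, at the cost of memory that grows as ${2}^{n}$.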

**Figure 3.** Inference of a system of $n=9$ spins from a dataset of $N=895$ samples. (left) Data generated from a pairwise Ising model (Equation (22)) with $\beta =2.28$. (right) Data from the U.S. Supreme Court [8]. The upper panels report the estimated values of the parameters ${\widehat{g}}^{\mu}$ as a function of the order m of the interaction. Different colors refer to inference limited to the largest $\ell =2,3,5,7$ singular values or to the case when all singular values are considered. The lower panels report the change in log likelihood (per sample point) when a single parameter ${g}^{\mu}$ is set to zero, as a function of the order $m=\left|\mu \right|$ of the interaction.

**Figure 4.** Hypergraph of the top 15 interactions between the nine judges of the second Rehnquist Court. Judges are represented as nodes with labels referring to their initials (as in [8]). Two-body interactions are represented by (red) links whose width increases with $|{\Delta}_{\mu}|$, whereas four-body interactions are represented by (green) shapes joining the four nodes. The shade of the nodes represents the ideological orientation, as reported in [8], from liberal (black) to conservative (white).

**Table 1.** Singular values and estimated parameters for the U.S. Supreme Court data. The parameters ${\widehat{g}}_{\lambda}^{\left(\ell \right)}$ refer to maximum entropy estimates of the model that considers only the top $\ell$ singular values (i.e., $\lambda \le \ell $), whereas ${\widehat{g}}_{\lambda}$ in the last column refers to estimated parameters using all singular values.

| $\lambda$ | ${\Lambda}_{\lambda}$ | ${\widehat{g}}_{\lambda}^{(2)}$ | ${\widehat{g}}_{\lambda}^{(3)}$ | ${\widehat{g}}_{\lambda}^{(4)}$ | ${\widehat{g}}_{\lambda}^{(5)}$ | ${\widehat{g}}_{\lambda}^{(7)}$ | ${\widehat{g}}_{\lambda}$ |
|---|---|---|---|---|---|---|---|
| 1 | 0.528 | 0.946 | 1.023 | 1.347 | 1.512 | 1.510 | 3.680 |
| 2 | 0.250 | −0.506 | −0.573 | −0.688 | −0.722 | −0.722 | −1.213 |
| 3 | 0.159 | 0 | 0.256 | 0.358 | 0.378 | 0.377 | 0.519 |
| 4 | 0.102 | 0 | 0 | −0.436 | −0.492 | −0.491 | −0.601 |
| 5 | 0.073 | 0 | 0 | 0 | −0.178 | −0.131 | −0.152 |
| 6 | 0.062 | 0 | 0 | 0 | 0 | 0.018 | 0.087 |
| 7 | 0.062 | 0 | 0 | 0 | 0 | −0.010 | −0.041 |
| 8 | 0.055 | 0 | 0 | 0 | 0 | 0 | −0.222 |

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Gresele, L.; Marsili, M.
On Maximum Entropy and Inference. *Entropy* **2017**, *19*, 642.
https://doi.org/10.3390/e19120642
