# Optimal Encoding in Stochastic Latent-Variable Models

## Abstract


## 1. Introduction

## 2. Results

#### 2.1. RBMs as a Statistical Machine-Learning Analogue of Stochastic Spiking Communication

#### 2.2. RBMs Provide an Energy-Based Interpretation of Spiking Population Codes

#### 2.3. Stimulus-Dependent Variability Suppression Is a Key Feature of Optimal Encoding

#### 2.4. Optimal Codes Exhibit Statistical Criticality

#### 2.5. Statistical Correlates of the Size-Accuracy Trade-Off

## 3. Discussion

## 4. Materials and Methods

#### 4.1. Datasets

#### 4.2. Restricted Boltzmann Machines

#### 4.3. Energy and Entropy

#### 4.4. Fisher Information

#### 4.5. Free Energy in RBMs

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Abbreviations

| Abbreviation / Symbol | Meaning |
|---|---|
| RBM | Restricted Boltzmann Machine |
| FIM | Fisher Information Matrix |
| $v$ | 'Visible stimulus' pattern, the input to a neural sensory encoder |
| $h$ | 'Hidden activation' pattern of stimulus-driven binary neural activity (interpreted as spiking) |
| $W$ | The weight matrix of an RBM, mapping visible activations to hidden-unit drive |
| $B_h$ | The biases on the hidden units of an RBM |
| $B_v$ | The biases on the visible units of an RBM |
| $\varphi$ | Parameters $\{W, B_h, B_v\}$ associated with an RBM model |
| $E$ | 'Energy', defined here as negative log-probability |
| $H$ | 'Entropy', in the Shannon sense |
| $E_{h,v}$ | The negative log-probability of simultaneously observing stimulus $v$ and neural pattern $h$ |
| $H_{h\mid v}$ | The entropy of the distribution of neural patterns $h$ evoked by stimulus $v$ |
| $\mathcal{V}_E$ | Set of input stimuli with similar energy (negative log-probability, i.e., bit-rate) |
| $T_c$ | The critical temperature of an RBM interpreted as an Ising spin model |
| $\beta$ | Inverse temperature |
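The energy notation in this glossary follows the standard binary-RBM formulation, in which the joint probability is $\Pr(v,h) \propto e^{-E(v,h)}$ and the hidden units can be summed out in closed form. As a minimal sketch (Python/NumPy; function and variable names are ours, chosen for illustration, not taken from the paper):

```python
import numpy as np

def rbm_energy(v, h, W, B_v, B_h):
    """Joint energy E(v, h) of a binary RBM: E = -v·W·h - B_v·v - B_h·h.

    In the notation of the glossary above, Pr(v, h) is proportional
    to exp(-E(v, h))."""
    return -(v @ W @ h) - B_v @ v - B_h @ h

def rbm_free_energy(v, W, B_v, B_h):
    """Free energy F(v) = -log Σ_h exp(-E(v, h)).

    Because hidden units are conditionally independent given v, the sum
    over all 2^n hidden patterns factorizes: each unit contributes
    log(1 + exp(drive)) where drive is its total input."""
    drive = v @ W + B_h                          # per-hidden-unit input
    return -(B_v @ v) - np.sum(np.logaddexp(0.0, drive))
```

The closed-form free energy can be checked against a brute-force sum over all hidden patterns for small models.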

## References

1. Barlow, H.B. Single units and sensation: A neuron doctrine for perceptual psychology? *Perception* **1972**, 1, 371–394.
2. Shannon, C.E. A mathematical theory of communication, Part I, Part II. *Bell Syst. Tech. J.* **1948**, 27, 623–656.
3. Field, D.J. Relations between the statistics of natural images and the response properties of cortical cells. *J. Opt. Soc. Am. A* **1987**, 4, 2379–2394.
4. Bell, A.J.; Sejnowski, T.J. An information-maximization approach to blind separation and blind deconvolution. *Neural Comput.* **1995**, 7, 1129–1159.
5. Vinje, W.E.; Gallant, J.L. Sparse coding and decorrelation in primary visual cortex during natural vision. *Science* **2000**, 287, 1273–1276.
6. Churchland, M.M.; Byron, M.Y.; Cunningham, J.P.; Sugrue, L.P.; Cohen, M.R.; Corrado, G.S.; Newsome, W.T.; Clark, A.M.; Hosseini, P.; Scott, B.B.; et al. Stimulus onset quenches neural variability: A widespread cortical phenomenon. *Nat. Neurosci.* **2010**, 13, 369–378.
7. Orbán, G.; Berkes, P.; Fiser, J.; Lengyel, M. Neural variability and sampling-based probabilistic representations in the visual cortex. *Neuron* **2016**, 92, 530–543.
8. Prentice, J.S.; Marre, O.; Ioffe, M.L.; Loback, A.R.; Tkačik, G.; Berry, M.J. Error-robust modes of the retinal population code. *PLoS Comput. Biol.* **2016**, 12, e1005148.
9. Loback, A.; Prentice, J.; Ioffe, M.; Berry, M., II. Noise-robust modes of the retinal population code have the geometry of "ridges" and correspond to neuronal communities. *Neural Comput.* **2017**, 29, 3119–3180.
10. Destexhe, A.; Sejnowski, T.J. The Wilson–Cowan model, 36 years later. *Biol. Cybern.* **2009**, 101, 1–2.
11. Schneidman, E.; Berry, M.J.; Segev, R.; Bialek, W. Weak pairwise correlations imply strongly correlated network states in a neural population. *Nature* **2006**, 440, 1007–1012.
12. Shlens, J.; Field, G.D.; Gauthier, J.L.; Grivich, M.I.; Petrusca, D.; Sher, A.; Litke, A.M.; Chichilnisky, E. The structure of multi-neuron firing patterns in primate retina. *J. Neurosci.* **2006**, 26, 8254–8266.
13. Köster, U.; Sohl-Dickstein, J.; Gray, C.M.; Olshausen, B.A. Modeling higher-order correlations within cortical microcolumns. *PLoS Comput. Biol.* **2014**, 10, e1003684.
14. Tkačik, G.; Mora, T.; Marre, O.; Amodei, D.; Palmer, S.E.; Berry, M.J.; Bialek, W. Thermodynamics and signatures of criticality in a network of neurons. *Proc. Natl. Acad. Sci. USA* **2015**, 112, 11508–11513.
15. Hinton, G.E.; Brown, A.D. Spiking Boltzmann machines. In *Advances in Neural Information Processing Systems*; The MIT Press: Cambridge, MA, USA, 2000; pp. 122–128.
16. Nasser, H.; Marre, O.; Cessac, B. Spatio-temporal spike train analysis for large scale networks using the maximum entropy principle and Monte Carlo method. *J. Stat. Mech. Theory Exp.* **2013**, 2013, P03006.
17. Zanotto, M.; Volpi, R.; Maccione, A.; Berdondini, L.; Sona, D.; Murino, V. Modeling retinal ganglion cell population activity with restricted Boltzmann machines. *arXiv* **2017**, arXiv:1701.02898.
18. Gardella, C.; Marre, O.; Mora, T. Blindfold learning of an accurate neural metric. *Proc. Natl. Acad. Sci. USA* **2018**, 115, 3267–3272.
19. Turcsany, D.; Bargiela, A.; Maul, T. Modelling retinal feature detection with deep belief networks in a simulated environment. In Proceedings of the 28th European Conference on Modelling and Simulation (ECMS), Brescia, Italy, 27–30 May 2014; pp. 364–370.
20. Shao, L.Y. Linear-nonlinear-Poisson neurons can do inference on deep Boltzmann machines. In Proceedings of the 1st International Conference on Learning Representations (ICLR 2013), Scottsdale, AZ, USA, 2–4 May 2013.
21. Schwab, D.J.; Nemenman, I.; Mehta, P. Zipf's law and criticality in multivariate data without fine-tuning. *Phys. Rev. Lett.* **2014**, 113, 068102.
22. Mastromatteo, I.; Marsili, M. On the criticality of inferred models. *J. Stat. Mech. Theory Exp.* **2011**, 2011, P10012.
23. Beggs, J.M.; Timme, N. Being critical of criticality in the brain. *Front. Physiol.* **2012**, 3, 163.
24. Aitchison, L.; Corradi, N.; Latham, P.E. Zipf's law arises naturally when there are underlying, unobserved variables. *PLoS Comput. Biol.* **2016**, 12, e1005110.
25. Touboul, J.; Destexhe, A. Power-law statistics and universal scaling in the absence of criticality. *Phys. Rev. E* **2017**, 95, 012413.
26. Hinton, G.E. Training products of experts by minimizing contrastive divergence. *Neural Comput.* **2002**, 14, 1771–1800.
27. Hinton, G.E. A practical guide to training restricted Boltzmann machines. In *Neural Networks: Tricks of the Trade*; Springer: Berlin, Germany, 2012; pp. 599–619.
28. Hinton, G.E.; Dayan, P.; Frey, B.J.; Neal, R.M. The "wake-sleep" algorithm for unsupervised neural networks. *Science* **1995**, 268, 1158–1161.
29. Dayan, P.; Hinton, G.E.; Neal, R.M.; Zemel, R.S. The Helmholtz machine. *Neural Comput.* **1995**, 7, 889–904.
30. Mora, T.; Bialek, W. Are biological systems poised at criticality? *J. Stat. Phys.* **2011**, 144, 268–302.
31. Sorbaro, M.; Herrmann, J.M.; Hennig, M. Statistical models of neural activity, criticality, and Zipf's law. In *The Functional Role of Critical Dynamics in Neural Systems*; Springer: Berlin, Germany, 2019; pp. 265–287.
32. Bradde, S.; Bialek, W. PCA meets RG. *J. Stat. Phys.* **2017**, 167, 462–475.
33. Meshulam, L.; Gauthier, J.L.; Brody, C.D.; Tank, D.W.; Bialek, W. Coarse graining, fixed points, and scaling in a large population of neurons. *Phys. Rev. Lett.* **2019**, 123, 178103.
34. Stringer, C.; Pachitariu, M.; Steinmetz, N.; Carandini, M.; Harris, K.D. High-dimensional geometry of population responses in visual cortex. *Nature* **2019**, 571, 361–365.
35. Ioffe, M.L.; Berry, M.J., II. The structured 'low temperature' phase of the retinal population code. *PLoS Comput. Biol.* **2017**, 13, e1005792.
36. Tyrcha, J.; Roudi, Y.; Marsili, M.; Hertz, J. The effect of nonstationarity on models inferred from neural data. *J. Stat. Mech. Theory Exp.* **2013**, 2013, P03005.
37. Nonnenmacher, M.; Behrens, C.; Berens, P.; Bethge, M.; Macke, J.H. Signatures of criticality arise from random subsampling in simple population models. *PLoS Comput. Biol.* **2017**, 13, e1005718.
38. Saremi, S.; Sejnowski, T.J. On criticality in high-dimensional data. *Neural Comput.* **2014**, 26, 1329–1339.
39. Swendsen, R.H.; Wang, J.S. Nonuniversal critical dynamics in Monte Carlo simulations. *Phys. Rev. Lett.* **1987**, 58, 86.
40. Stephens, G.J.; Mora, T.; Tkačik, G.; Bialek, W. Statistical thermodynamics of natural images. *Phys. Rev. Lett.* **2013**, 110, 018701.
41. Bedard, C.; Kroeger, H.; Destexhe, A. Does the 1/f frequency scaling of brain signals reflect self-organized critical states? *Phys. Rev. Lett.* **2006**, 97, 118102.
42. Prokopenko, M.; Lizier, J.T.; Obst, O.; Wang, X.R. Relating Fisher information to order parameters. *Phys. Rev. E* **2011**, 84, 041116.
43. Daniels, B.C.; Chen, Y.J.; Sethna, J.P.; Gutenkunst, R.N.; Myers, C.R. Sloppiness, robustness, and evolvability in systems biology. *Curr. Opin. Biotechnol.* **2008**, 19, 389–395.
44. Gutenkunst, R.N.; Waterfall, J.J.; Casey, F.P.; Brown, K.S.; Myers, C.R.; Sethna, J.P. Universally sloppy parameter sensitivities in systems biology models. *PLoS Comput. Biol.* **2007**, 3, e189.
45. Panas, D.; Amin, H.; Maccione, A.; Muthmann, O.; van Rossum, M.; Berdondini, L.; Hennig, M.H. Sloppiness in spontaneously active neuronal networks. *J. Neurosci.* **2015**, 35, 8480–8492.
46. Schneidman, E.; Puchalla, J.L.; Segev, R.; Harris, R.A.; Bialek, W.; Berry, M.J. Synergy from silence in a combinatorial neural code. *J. Neurosci.* **2011**, 31, 15732–15741.
47. White, B.; Abbott, L.F.; Fiser, J. Suppression of cortical neural variability is stimulus- and state-dependent. *J. Neurophysiol.* **2012**, 108, 2383–2392.
48. Festa, D.; Aschner, A.; Davila, A.; Kohn, A.; Coen-Cagli, R. Neuronal variability reflects probabilistic inference tuned to natural image statistics. *bioRxiv* **2020**.
49. Friston, K. The free-energy principle: A unified brain theory? *Nat. Rev. Neurosci.* **2010**, 11, 127–138.
50. LaMont, C.H.; Wiggins, P.A. Correspondence between thermodynamics and inference. *Phys. Rev. E* **2019**, 99, 052140.
51. Aitchison, L.; Hennequin, G.; Lengyel, M. Sampling-based probabilistic inference emerges from learning in neural circuits with a cost on reliability. *arXiv* **2018**, arXiv:1807.08952.
52. Song, J.; Marsili, M.; Jo, J. Resolution and relevance trade-offs in deep learning. *J. Stat. Mech. Theory Exp.* **2018**, 2018, 123406.
53. Cubero, R.J.; Jo, J.; Marsili, M.; Roudi, Y.; Song, J. Statistical criticality arises in most informative representations. *J. Stat. Mech. Theory Exp.* **2019**, 2019, 063402.
54. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images. Master's Thesis, University of Toronto, Toronto, ON, Canada, 2009.
55. Bengio, Y. Learning deep architectures for AI. *Found. Trends Mach. Learn.* **2009**, 2, 1–127.

**Figure 1.** Effect of channel size on the encoding of stimulus statistics: (**a**) We trained RBMs to model local regions of (binarized) CIFAR-10 images, interpreting the number of hidden units as the size of a sensory communication channel. (**b**) A minimum number of hidden units is required to faithfully capture stimulus statistics. We quantified model accuracy by the Kullback-Leibler divergence between model samples and held-out training data; accuracy improves as the hidden-layer size increases, up to a point. Results are shown for three stimulus-patch sizes (13, 21, and 37 pixels). (**c**) Comparison of actual and predicted pattern probabilities for four hidden-layer sizes. We express probability as the negative log-probability (in bits), abbreviated as energy $E=-\log_2 \Pr(\cdot)$. Larger models capture the stimulus distribution better. (**d**,**e**) As model size increases, hidden-layer activation becomes sparser (**d**) and more decorrelated (**e**). Panels (**c**–**e**) use 13 visible units.
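The RBMs described in the captions are typically trained with the contrastive-divergence recipe of references [26,27]. The paper's own training code is not reproduced here; a minimal CD-1 sketch under standard assumptions (Python/NumPy; all names are illustrative, binarized image patches supplied as a batch of binary rows):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(V, W, B_v, B_h, rng, lr=0.05):
    """One contrastive-divergence (CD-1) update on a batch of binary
    visible patterns V (shape: batch × n_visible)."""
    # Positive phase: hidden probabilities driven by the data.
    ph = sigmoid(V @ W + B_h)
    h = (rng.random(ph.shape) < ph).astype(float)
    # Negative phase: one Gibbs step back to a reconstruction.
    pv = sigmoid(h @ W.T + B_v)
    v_neg = (rng.random(pv.shape) < pv).astype(float)
    ph_neg = sigmoid(v_neg @ W + B_h)
    # Parameter updates from the difference of data and model correlations.
    n = V.shape[0]
    W += lr * (V.T @ ph - v_neg.T @ ph_neg) / n
    B_v += lr * (V - v_neg).mean(axis=0)
    B_h += lr * (ph - ph_neg).mean(axis=0)
    return W, B_v, B_h
```

In practice one would iterate this step over mini-batches of binarized CIFAR-10 patches until the model samples match the stimulus statistics, as quantified in panel (**b**).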

**Figure 2.** Informative stimuli suppress variability in stochastic spiking communication channels: We trained RBMs to encode 13 visible units from circular patches of binarized CIFAR-10 images. The left three plots show how the statistics of the evoked activity in hidden units (vertical axes) vary as a function of stimulus information content (horizontal axes), for various model sizes. Larger 'energies' ($E_v$) represent stimuli (blue dots) that require more bits to communicate. All units are in bits. The rightmost plots show the slope of the corresponding quantity with respect to $E_v$ as a function of model size. (**a**) Sufficiently large models learn to reduce channel entropy (variability) for stimuli that require more information to encode. (**b**) To communicate more information, neural codes can either reduce the stimulus-conditioned entropy $H_{h|v}$, or use rarer code-words, i.e., increase $\langle E_h \rangle_{h|v}$. In sufficiently large models, we find that both energy and entropy decrease for stimuli that require more information to communicate (gray bars; dots = mean, bars = inter-quartile range).
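The stimulus-conditioned entropy $H_{h|v}$ plotted above has a closed form in an RBM: given $v$, the hidden units are conditionally independent Bernoulli variables, so the entropy is a sum of per-unit binary entropies. A minimal sketch (Python/NumPy; names are ours):

```python
import numpy as np

def conditional_entropy_bits(v, W, B_h):
    """Stimulus-conditioned entropy H_{h|v} in bits. Because RBM hidden
    units are conditionally independent given v, the entropy is the sum
    of the Bernoulli entropies of the per-unit firing probabilities."""
    p = 1.0 / (1.0 + np.exp(-(v @ W + B_h)))   # P(h_j = 1 | v)
    p = np.clip(p, 1e-12, 1.0 - 1e-12)         # guard against log(0)
    return float(np.sum(-p * np.log2(p) - (1 - p) * np.log2(1 - p)))
```

For example, a stimulus that drives every hidden unit to fire with probability 1/2 yields exactly one bit of entropy per hidden unit, the maximum-variability case.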

**Figure 3.** Stimulus information content predicts the energy and entropy of evoked activity in latent units: Each plot shows the average stimulus-evoked entropy ($H_{h|v}$) plus a constant ($I_{\mathrm{enc}}$) on the vertical axis, against the information content of the code-words evoked by a given stimulus ($\langle E_h \rangle_{h|v}$, horizontal axis). Here, $I_{\mathrm{enc}}=\langle D_{KL}(Q_{h|v}\parallel Q_{h})\rangle$ is the average energy-entropy relationship over all stimuli, which becomes approximately constant above a critical model size (Figure 2b). Color indicates the stimulus bit-rate $E_v$. Points reflect the average energy and entropy of hidden patterns evoked for a given $E_v$. In too-small models (n = 10), low-variability states are used to represent common (low-information) stimuli. This relationship shifts as the encoding capacity increases (n = 20, 25). Above a critical model size (n ≥ 35), an inverse relationship between visible energies and the entropy of latent representations emerges: high-energy visible patterns suppress variability. A 1:1 trade-off between using energy and entropy to modulate bit-rate also emerges (red lines), and persists in larger models (n = 60, 120). This 1:1 trade-off reflects the emergence of a $1/f$ power-law in the statistics of hidden-unit activity, which gives rise to statistical criticality. Models were trained to encode 13 visible units from circular patches of binarized CIFAR-10 images.

**Figure 4.** Learned encoding strategies do not depend on the statistics of the stimulus distribution: In natural visual stimuli, the visible samples themselves display $1/f$ power-law statistics, which might encourage similar statistics in the activations of hidden units and thereby explain the 1:1 trade-off between modulating entropy and energy that we observed. Here we show the energy-entropy balance as a function of stimulus information content (i.e., bit-rate, $E_v$) for RBMs fit to two-dimensional lattice Ising models, sampled at a range of temperatures above and below the critical temperature $T_c = 2/\ln(1+\sqrt{2}) \approx 2.269$. The energy-entropy balance converges to identity regardless of the data temperature (right column). However, the critical hidden-layer size (N) does decrease with temperature, illustrated here (middle column) by the increasing hidden-layer size displaying intermediate energy-entropy statistics. Small models (left column) exhibit a correlation between visible energy and entropy for training-data temperatures above $T_c$. Ising models were simulated on a 10 × 10 grid and sampled via the Swendsen-Wang algorithm [39], with 10 k steps of burn-in and 100 k training patterns drawn every 100 samples. 13-unit patches were presented to the RBM for training. All units are in bits.
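The Swendsen-Wang cluster update [39] used to sample the Ising training data can be sketched as follows. This is our illustrative implementation under standard assumptions (coupling $J = 1$, periodic boundaries), not the paper's code: bonds between aligned neighbours are activated with probability $1-e^{-2/T}$, and each resulting cluster is flipped with probability 1/2.

```python
from collections import deque
import numpy as np

# Onsager critical temperature of the 2D Ising model, ≈ 2.269.
T_C = 2.0 / np.log(1.0 + np.sqrt(2.0))

def swendsen_wang_step(spins, T, rng):
    """One Swendsen-Wang cluster update of a 2D ±1 Ising lattice
    (square array `spins`) with periodic boundary conditions."""
    L = spins.shape[0]
    p_bond = 1.0 - np.exp(-2.0 / T)
    seen = np.zeros_like(spins, dtype=bool)
    for i in range(L):
        for j in range(L):
            if seen[i, j]:
                continue
            # Grow one cluster by breadth-first search over active bonds,
            # flipping the whole cluster with probability 1/2.
            flip = rng.random() < 0.5
            queue = deque([(i, j)])
            seen[i, j] = True
            while queue:
                x, y = queue.popleft()
                s = spins[x, y]
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nx, ny = (x + dx) % L, (y + dy) % L
                    if (not seen[nx, ny] and spins[nx, ny] == s
                            and rng.random() < p_bond):
                        seen[nx, ny] = True
                        queue.append((nx, ny))
                if flip:
                    spins[x, y] = -s
    return spins
```

Cluster updates like this mix rapidly even near $T_c$, where single-spin Metropolis dynamics suffer from critical slowing down; this is presumably why the caption's sampling protocol (burn-in, then widely spaced draws) uses Swendsen-Wang.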

**Figure 5.** Analyses of parameter sensitivity suggest an optimal model size for encoding sensory statistics: (**a**) Analysis of the Fisher Information Matrix (FIM) over a range of hidden-layer sizes (top to bottom; 13 visible units). From left to right: (1) FIM eigenvalue spectra $\lambda_i$ (y-axis) over a range of inverse temperatures $\beta$ indicate that model fits ($\beta$ = 1) past a certain size lie at a peak in their generalized susceptibility, a correlate of criticality in Ising spin models. Eigenvalues below $10^{-5}$ are truncated; the largest and smallest eigenvalues are shown in red. (2) Important parameters in the leading FIM eigenvector align with individual hidden units, and become sparse for larger hidden layers. The eigenvector is displayed separately for the weights (matrix) and for the visible (vertical) and hidden (horizontal) biases. (3) The average sensitivity of each parameter over all FIM eigenvectors, shown here as the square root of the FIM diagonal, also shows sparsity, indicating that beyond a certain size additional hidden units contribute little to model accuracy. Data are shown as in column 2. (4) Variance of the hidden-unit activation as a function of stimulus energy. In larger models, units with sensitive parameters contribute to encoding low-energy, less informative patterns. (**b**) The average sensitivity of each parameter, measured by the trace of the FIM normalized by hidden-layer size, decreases as the hidden layer grows. (**c**) Hidden-unit projective fields from a model with 37 visible and 60 hidden units, ordered by relative sensitivity (rank indicated above each image). More important units (ranks 1–8) encode spatially simple features such as localized patches, while the least important ones (ranks 53–60) have complex features.
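For an exponential-family model such as an RBM, the FIM with respect to the natural parameters $\{W, B_v, B_h\}$ equals the covariance, under the model distribution, of the sufficient statistics $(v_i h_j, v_i, h_j)$. A Monte-Carlo sketch of this estimate (Python/NumPy; the Gibbs-sampling step that produces the $(v, h)$ pairs is omitted, and all names are ours):

```python
import numpy as np

def fim_from_samples(V, H):
    """Monte-Carlo estimate of the RBM Fisher Information Matrix from
    model samples V (n × n_visible) and H (n × n_hidden). The FIM of an
    exponential-family model is the covariance of its sufficient
    statistics: here (v_i * h_j, v_i, h_j) for each sampled pair."""
    n = V.shape[0]
    # Per-sample feature vector: flattened outer product, then both biases.
    G = np.hstack([np.einsum('ni,nj->nij', V, H).reshape(n, -1), V, H])
    return np.cov(G, rowvar=False)
```

Eigendecomposing this matrix yields the spectra and leading eigenvectors analysed in panel (**a**); its diagonal gives the per-parameter sensitivities of column 3, and its trace the summary measure of panel (**b**).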

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Rule, M.E.; Sorbaro, M.; Hennig, M.H. Optimal Encoding in Stochastic Latent-Variable Models. *Entropy* **2020**, *22*, 714.
https://doi.org/10.3390/e22070714
