Optimal Encoding in Stochastic Latent-Variable Models

In this work we explore encoding strategies learned by statistical models of sensory coding in noisy spiking networks. Early stages of sensory communication in neural systems can be viewed as encoding channels in the information-theoretic sense. However, neural populations face constraints not commonly considered in communications theory. Using restricted Boltzmann machines as a model of sensory encoding, we find that networks with sufficient capacity learn to balance precision and noise-robustness in order to adaptively communicate stimuli with varying information content. Mirroring variability suppression observed in sensory systems, informative stimuli are encoded with high precision, at the cost of more variable responses to frequent, hence less informative stimuli. Curiously, we also find that statistical criticality in the neural population code emerges at model sizes where the input statistics are well captured. These phenomena have well-defined thermodynamic interpretations, and we discuss their connection to prevailing theories of coding and statistical criticality in neural populations.


Introduction
Latent-variable statistical models, such as the Restricted Boltzmann Machine (RBM), learn sparse, efficient representations of the hidden causes of their inputs. For example, highly interpretable compressed representations can emerge in auto-encoders and deep neural networks (Kramer, 1991; Tishby and Zaslavsky, 2015). In this sense, latent-variable models may also be interpreted as encoders that extract and communicate statistical features of their inputs, an interpretation reminiscent of the problem solved by neuronal sensory systems. In both real and artificial neural networks, one would like to communicate information about stimuli while minimizing the overall number of units needed. In what sense, then, can latent-variable models be interpreted as encoders that learn an optimal communication strategy, and what characteristics emerge from optimizing the latent layer size versus representational accuracy? The mathematical theory of communication, developed seventy years ago by Claude Shannon (Shannon, 1948), outlines the constraints of both deterministic and noisy optimal communication channels. However, neural networks face additional constraints: the mechanisms of computation and encoding in a neural network are constrained in their functional form. Each input must be conveyed by a single fixed-size activation pattern, unlike channels commonly considered in communications theory, which allocate more transmission time to higher-information inputs. Moreover, sensory encoders are typically stochastic, with noise and variability constrained by the implementation, and dependent on the stimuli being encoded. These sources of noise differ from, e.g., line noise in an electrical channel.
The interpretation of latent-variable models as communication channels was explored by Dayan, Hinton, and colleagues (Hinton et al., 1995; Dayan et al., 1995; Dayan and Hinton, 1996). They developed the concept of free energy for products-of-experts models, in which the latent encoding variables are independent conditioned on the data patterns (and vice versa). Minimizing free energy also minimizes the cost of communicating patterns in the data. However, fitting such models and identifying optimal model fits remains challenging. In particular, it is unclear how the optimal model size for a given problem should be chosen. To address these questions, we explore models over a range of sizes, look for statistical signatures of the "optimal" model that achieves asymptotic accuracy with minimal size, and characterize the encoding strategies that emerge in such optimal models.

Results
This paper is organized conceptually into two parts. The first half outlines the problem of representation and communication of stimuli in a restricted Boltzmann machine, and empirically explores the size-accuracy trade-off. The latter half develops an interpretation of the encoding strategy that emerges in such latent-variable models at the optimal model size.

Sparse latent-variable models
Consider the problem of describing patterns in data v ∈ V as arising from underlying hidden factors h ∈ H. From the perspective of communication, this is equivalent to learning a latent-variable encoder ("latent encoder") that represents patterns from V using representations in H. Such models can be fit by minimizing so-called free energy (Hinton et al., 1995). In this work, we explore the connection between visible data pattern energy E_v and information or representational cost, as introduced by Hinton et al. (1995). In the derivations that follow, energy and information are synonymous with the negative log-probability, measured in bits:

E_v = −log₂ P(v).

The goal is to learn an approximating distribution Q(h, v|ϕ) with parameterization ϕ, for which the marginal distribution Q(v|ϕ) closely matches the distribution in the training data. Here and in the following, we abbreviate distributions conditioned on the parameters, e.g. Q(v|ϕ), as Q(v). Hinton et al. (1995) introduced an approximating distribution for binary latent-variable models, in which the conditional distribution of the hidden variables given the visible patterns is independent. The free-energy equation equates the energy of a visible pattern v under model parameters ϕ to the expected energy ⟨E_ϕ(v, h)⟩ under Q(h|v), minus the entropy of the conditional distribution Q(h|v):

E_ϕ(v) = ⟨E_ϕ(v, h)⟩_{Q(h|v)} − H[Q(h|v)].     (1)

A derivation of this equation is given in Appendix 1. In general the stimulus energy under the model, E_ϕ(v), need not equal the true energy E_v. However, minimizing free energy in expectation over the training data causes the model energy to approximate the data energy (Hinton et al., 1995).
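As a quick illustration of Equation 1, the following sketch (a toy enumeration over a small joint distribution, not the RBM itself; variable names are illustrative) verifies that when Q(h|v) is the exact conditional, the expected energy minus the conditional entropy recovers the visible pattern energy −log₂ Q(v):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution Q(v, h) over 4 visible and 3 hidden patterns
# (rows: v, columns: h), standing in for Q(h, v | phi).
Q_vh = rng.random((4, 3))
Q_vh /= Q_vh.sum()

Q_v = Q_vh.sum(axis=1)                 # marginal over visible patterns
Q_h_given_v = Q_vh / Q_v[:, None]      # conditional Q(h | v)
E_vh = -np.log2(Q_vh)                  # joint energies, in bits

# Free energy of each visible pattern (Eq. 1): expected energy minus conditional entropy.
expected_E = (Q_h_given_v * E_vh).sum(axis=1)
H_h_given_v = -(Q_h_given_v * np.log2(Q_h_given_v)).sum(axis=1)
F_v = expected_E - H_h_given_v

# With the exact conditional, the free energy equals the pattern energy -log2 Q(v).
assert np.allclose(F_v, -np.log2(Q_v))
```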
It is this free-energy term that is minimized, in expectation over the data, by the contrastive divergence algorithm (Hinton, 2002).

Optimal size of the latent layer
In RBMs, units representing the data vectors v ∈ V are connected to hidden, latent units with activities h ∈ H through a weight matrix W. This structure enforces conditional independence between the units in each layer: hidden units can be interpreted as latent factors. The joint activation of data and hidden units follows a Gibbs equilibrium distribution, −log P(v, h) = βE_ϕ(v, h) + log Z, where E_ϕ(v, h) is the energy of a configuration of visible and hidden unit activations, ϕ = (W, B_v, B_h) are the parameters, including the weight matrix W, visible biases B_v, and hidden biases B_h, Z is the partition function, and β is an inverse temperature that is usually absorbed into the parameters. The RBM energy, then, (up to a normalization constant) is:

E_ϕ(v, h) = −v^⊤ W h − B_v^⊤ v − B_h^⊤ h.     (2)

When the model consists of binary units, it is equivalent to a spin model from statistical physics. We trained RBMs with a range of hidden-layer sizes on small natural image patches (CIFAR-10, Krizhevsky and Hinton, 2009), to mimic the encoding of visual stimuli by retinal ganglion cells with spatially restricted receptive fields. Images were normalized and quantized into a binary representation, a procedure that retains essential statistical properties (Stephens et al., 2013). The RBMs then approximate the probabilities of observing various combinations of black and white pixels (Fig. 1A). As expected, models with only a few hidden units produce imperfect fits, and the quality of the fit improves for larger models and then stabilizes (Fig. 1B,C). As the model size increases, the hidden units change from very active to sparse and from strongly to weakly correlated (Fig. 1D,E). Correlations decrease, but do not disappear, indicating that sufficiently large models provide sparse representations. Sparse and weakly correlated activity is a hallmark of population encoding in sensory neurons, and has previously been related to efficient encoding and information transmission (e.g. Barlow 1972; Field 1987; Vinje and Gallant 2000).
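For reference, the RBM energy (Eq. 2) and the factorized conditionals used for block Gibbs sampling can be written compactly; the sketch below is illustrative NumPy code with hypothetical variable names, and the dimensions merely mirror the 13-pixel patches used here:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbm_energy(v, h, W, b_v, b_h):
    """Binary RBM energy E(v, h) = -v.W.h - b_v.v - b_h.h (Eq. 2, up to a constant)."""
    return -(v @ W @ h) - (b_v @ v) - (b_h @ h)

def sample_hidden(v, W, b_h):
    """Hidden units are conditionally independent given v: P(h_j = 1 | v) = sigma((W^T v)_j + b_h_j)."""
    p = 1.0 / (1.0 + np.exp(-(v @ W + b_h)))
    return (rng.random(p.shape) < p).astype(float), p

def sample_visible(h, W, b_v):
    """Visible units are conditionally independent given h."""
    p = 1.0 / (1.0 + np.exp(-(W @ h + b_v)))
    return (rng.random(p.shape) < p).astype(float), p

# One block-Gibbs sweep v -> h -> v', the basic step used both for training and for
# drawing samples from the model distribution.
n_v, n_h = 13, 30
W = 0.01 * rng.standard_normal((n_v, n_h))
b_v, b_h = np.zeros(n_v), np.zeros(n_h)
v = rng.integers(0, 2, n_v).astype(float)
h, _ = sample_hidden(v, W, b_h)
v_new, _ = sample_visible(h, W, b_v)
print(rbm_energy(v, h, W, b_v, b_h))
```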
Beyond a certain size, saturating model accuracy indicates that the models have more parameters than required to encode the data (Fig. 1B). Are these extra parameters nevertheless used to support encoding? The importance of each parameter can be assessed by computing the curvature of the energy landscape with respect to small parameter changes, which yields the Fisher Information Matrix (FIM) of the model:

F_{ij}(ϕ) = ⟨∂_{ϕ_i}E_ϕ(v, h) ∂_{ϕ_j}E_ϕ(v, h)⟩ − ⟨∂_{ϕ_i}E_ϕ(v, h)⟩⟨∂_{ϕ_j}E_ϕ(v, h)⟩.     (3)

This matrix becomes increasingly sparse for larger models (Fig. 2A), indicating that an increasing number of parameters are irrelevant or 'sloppy' (Machta et al., 2013), and can vary with minimal effect on the model distribution.
FIM analysis yields three key insights. First, there is a significant gap between the first and higher FIM eigenvalues for all model sizes except the smallest (Fig. 2A, first column). Hence larger models are particularly sensitive along a single direction in parameter space, with a weaker effect of the remaining, orthogonal directions. Second, in large models the corresponding first eigenvector aligns with a subset of hidden units (Fig. 2A, second column). This sensitivity may be present in the weights or biases associated with individual hidden units. In contrast, small models exhibit high sensitivity in all parameters. In other words, the importance of individual latent units varies substantially in large models. Third, the average parameter sensitivity (see Methods) is heterogeneous, and mirrors the separation into important and less relevant hidden units seen in the first eigenvectors (Fig. 2A, third column). Consistent with this observation, the average sensitivity, defined as Diag(F(ϕ))^{1/2}, decreases with model size once a good fit is achieved (Fig. 2B). The diagonal entries of the FIM can be computed locally (Methods) from the variances of unit activations (biases) and from the variances of products v_i h_j (weights). Consequently, information regarding importance is available locally to each single unit, a quantity that sensory neurons could also compute and utilize.
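As a concrete illustration of this locality, here is a small sketch (assuming V and H are arrays of visible and hidden states sampled from the model, e.g. by Gibbs sampling; function names are illustrative) that estimates the FIM diagonal from per-unit variances:

```python
import numpy as np

def fim_diagonal(V, H):
    """Estimate the FIM diagonal from joint samples V (n, n_v) and H (n, n_h) of the model.

    For a Gibbs distribution, F_kk = Var(dE/dphi_k); for RBM parameters the energy
    derivatives are -v_i (visible biases), -h_j (hidden biases) and -v_i*h_j (weights),
    so the diagonal reduces to variances of locally available quantities.
    """
    fim_b_v = V.var(axis=0)                                # visible biases: Var(v_i)
    fim_b_h = H.var(axis=0)                                # hidden biases:  Var(h_j)
    fim_W = (V[:, :, None] * H[:, None, :]).var(axis=0)    # weights:        Var(v_i h_j)
    return fim_b_v, fim_b_h, fim_W

# Per-parameter sensitivity (Methods) is then S_k = sqrt(F_kk), e.g. np.sqrt(fim_W).
```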
In sum, FIM analysis reveals that large latent encoding models exhibit a highly anisotropic parameter space.
This seems an inefficient solution, as not all available parameters appear to be fully exploited. Interestingly, the projective fields of hidden units with the most sensitive parameters encode 'simple' features such as local patches in an image (Fig. 2C), and resemble those of retinal ganglion cells. Complex projective fields resembling those in visual cortex emerge only in relatively 'sloppy' and low-importance hidden units.

Statistical criticality in well-fit models
The analysis of the FIMs reveals that well-fit statistical models show, as proposed before (Mastromatteo and Marsili, 2011), signs of statistical criticality. In the large-N limit, spin models exhibit a continuous phase transition from the ordered to the disordered phase at a critical inverse temperature β_c. Although well-defined only for systems approaching N → ∞, the notion of a transition from ordered states at low temperatures to disordered ones at high temperatures applies also in finite systems. The FIM is a generalized susceptibility measure, which diverges in the large-N limit at this point. This behavior is evident for well-fit models, where the first FIM eigenvalue as a function of the inverse temperature peaks at β = 1 (the temperature at which the model was originally fit; Fig. 2A, first column). At lower temperatures, the eigenvalue spectrum spreads out, while it is increasingly concentrated for high temperatures as the state distribution approaches uniform in the limit β → 0. In contrast, models that fail to fit the data are always located in the 'hot', disordered phase.
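One way to probe this numerically is sketched below (an illustrative Monte-Carlo estimate under stated assumptions, not the exact procedure used for the figures): the parameters are scaled by β, joint samples are drawn by Gibbs sampling, and the FIM is estimated as the covariance of the energy gradients (Eq. 3); scanning β around 1 should then reveal the susceptibility peak for well-fit models.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def gibbs_samples(W, b_v, b_h, n_samples=2000, burn_in=200, thin=5):
    """Draw (v, h) samples from the RBM joint distribution by block Gibbs sampling."""
    n_v, n_h = W.shape
    v = rng.integers(0, 2, n_v).astype(float)
    V, H = [], []
    for t in range(burn_in + n_samples * thin):
        h = (rng.random(n_h) < sigmoid(v @ W + b_h)).astype(float)
        v = (rng.random(n_v) < sigmoid(W @ h + b_v)).astype(float)
        if t >= burn_in and (t - burn_in) % thin == 0:
            V.append(v.copy())
            H.append(h.copy())
    return np.array(V), np.array(H)

def leading_fim_eigenvalue(W, b_v, b_h, beta):
    """Largest FIM eigenvalue of the beta-scaled model, a generalized susceptibility.

    The FIM of a Gibbs distribution is the covariance of the energy gradients (Eq. 3);
    here dE/dW_ij = -v_i h_j, dE/db_v = -v, dE/db_h = -h are evaluated on samples
    drawn at inverse temperature beta.
    """
    V, H = gibbs_samples(beta * W, beta * b_v, beta * b_h)
    grads = np.hstack([-(V[:, :, None] * H[:, None, :]).reshape(len(V), -1), -V, -H])
    fim = np.cov(grads, rowvar=False)
    return np.linalg.eigvalsh(fim)[-1]

# Scanning beta over e.g. [0.5, 0.8, 1.0, 1.2, 2.0] should show a peak near beta = 1
# for models that fit the data well.
```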
Complementing this observation, we find that the ranked probabilities of the hidden unit states, as well as those of the joint model, closely follow Zipf's law in models large enough to encode the data well (Fig. 2A, fourth column). Zipf's law states that Pr(x) ∝ 1/r(x), where r(x) is the frequency rank of pattern x. The presence of Zipf's law in spin models like the RBM indicates that the system is poised near the point of a phase transition.
This is a direct consequence of the divergence of relevant thermodynamic quantities, which follows from a vanishing curvature of the energy-entropy relationship (Mora and Bialek, 2011; Tkačik et al., 2015).
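A ranked-probability check of this kind can be sketched as follows (assuming H is an array of sampled hidden state vectors; helper names are illustrative):

```python
import numpy as np
from collections import Counter

def ranked_probabilities(H):
    """Rank-ordered empirical probabilities of hidden code words (rows of H)."""
    counts = Counter(map(tuple, H.astype(int)))
    p = np.array(sorted(counts.values(), reverse=True), dtype=float)
    return p / p.sum()

def zipf_slope(p):
    """Least-squares slope of log p versus log rank; Zipf's law corresponds to a slope near -1."""
    ranks = np.arange(1, len(p) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(p), 1)
    return slope
```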
Recently it has been shown that Zipf's law emerges in latent-variable models under rather general conditions (Schwab et al., 2014; Aitchison et al., 2016). Schwab et al. (2014) show that Zipf's law emerges in the limit of a large hidden layer in exponential-family latent-variable models, and Aitchison et al. (2016) outline conditions under which unobserved variables lead to Zipf's law without invoking statistical criticality. A natural question, then, is whether such an unobserved variable emerges in statistically critical RBM fits. One possible variable, closely related to the unobserved variables explored by Aitchison et al. (2016), is the amount of information, or energy, being encoded, be it words in text or pixel patterns in images. We elaborate on this in the following.

Stimulus energy as an unobserved variable
Latent-variable models compress and convey information about the correlations in their inputs via their hidden-layer representations, and in this sense can be viewed as communication channels.
This follows the description-length perspective first outlined by Hinton et al. (1995), which connected pattern energies to communication cost. In the RBM, the hidden layer encodes information about interdependencies between visible units, and the visible biases suffice to model independent aspects. We would like to know (1) how individual stimuli are conveyed in such a channel and (2) whether information-theoretic arguments can connect the emergence of statistical criticality to an encoding strategy.
To address these questions, we explore the free-energy equation of Hinton et al. (1995), and its relationship to the information content of individual stimulus code-words. We denote the free energy E_v for pattern v (Equation 1) in a compact form by making the model parameterization ϕ, as well as expectations with respect to Q_ϕ(h|v), implicit where unambiguous:

E_v = ⟨E_{v,h}⟩_{h|v} − H_{h|v}.     (4)

We hypothesize that under free-energy minimization, the visible-pattern energy (information) emerges as an unobserved variable that explains the statistics of the latent-variable model as an encoder. This hypothesis is motivated by the speculation that the varying bit-rate of the encoded information is an unobserved variable in all three scenarios explored by Aitchison et al. (2016): natural languages, data with variable sequence length, and neural population spiking.
To illustrate this, we explore the energy-entropy trade-off in encoding strategy conditioned on the stimulus energy E_v. First, we organize the free energy into two terms. One term reflects the difference between the expected stimulus-evoked energy ⟨E_h⟩_{h|v} and the stimulus-evoked entropy H_{h|v} in the latent patterns.
is is the KL-divergence from the marginal distribution Q h to the conditional distribution Q h | .Another term, E |h h | , re ects the negative-log-likelihood (NLL) of the pa ern , conditioned on hidden pa erns induced by : We observe that, in critical models, D K L (Q h | Q h ) approaches a constant value if averaged over the set stimuli V E within range ∆ of energy E, de ned as V E := ∈V |E −E|<∆ (Fig. 3A, e.g.35, 90 hidden units).We refer to V E as an "energy shell", re ecting the subset of V that require approximately E bits of information to communicate, and conditioning on a stimulus energy shell is analogous to conditioning on a bit-rate of E bits/sample for an encoding channel.Since the expectation within each shell does not depend on stimulus energy, the conditional energy equals the entropy up to a constant over a range of stimulus energies: is one-to-one scaling is exactly the exponent of Zipf's law.If it can also be shown that H h | V E varies with stimulus energy E, then we expect encounter a 1/f Zipf power law distribution over a range of energy scales in the latent units, arising as a mixture distribution over varied stimulus energies.

High-information stimuli suppress hidden-layer variability
The latent variables h communicate information about observed data v, which exhibit a range of energies. The free-energy equation (Eq. 1) implies that more information can be conveyed either by reducing the entropy H_{h|v}, a measure of stimulus-evoked variability, or by increasing the energy ⟨E_h⟩_{h|v}, i.e. using rarer latent code-words. Empirically, we find that an encoding strategy in which higher-energy stimuli suppress hidden-layer entropy emerges around the critical model size (Fig. 3B). This emergence is illustrated for a range of hidden-layer sizes in Fig. 4. This strategy leads to a broad range of stimulus-evoked latent entropies H_{h|v}. Combined with the 1:1 scaling between latent energy and entropy (Eq. 6, Fig. 3A), this implies a broad range of energies and Zipf's law for the ranked code-word distributions (Fig. 2). Thus, the stimulus pattern energy shells act as an unobserved variable that explains statistical criticality via the mechanism discussed before (Schwab et al., 2014; Aitchison et al., 2016). Schwab et al. (2014) predicted this phenomenon in the limit of models with large hidden layers. We illustrate here that it emerges even in small models, provided those models are large enough to model the stimulus distribution. This effect is independent of the entropy of the stimulus, as we illustrate by fitting samples from Ising models at different temperatures (Fig. S1). We also found that almost any regularizer prevents entropy suppression, and moves models away from the critical point towards the disordered phase (e.g. Fig. S2).

Discussion
A central result of this study is that above a critical model size, the energy-entropy trade-off in latent representations does not depend on stimulus energy. Despite this, conditional entropy is suppressed by stimuli that require more information to describe, balanced by a proportionate decrease in energy. The combined effects of entropy suppression and fixed energy-entropy balance suggest an encoding interpretation of the emergence of Zipf power-law statistics in the codeword frequencies, which is connected to the sparse and stochastic nature of the RBM encoding. Due to the stochastic 'spiking' of a sparse RBM model, the conditional entropy grows linearly with the expected hidden-layer activation. It is surprising that this energy-entropy balance is retained despite large variations in the conditional entropy used to encode stimuli with varying information content. We interpret the emergent encoding strategy as a solution for handling varying information content in stimuli in a stochastic channel with fixed bandwidth.
If this strategy is employed in sensory systems, we expect response variability to depend on the information content of the individual stimuli within the full ensemble. Selective suppression of variability has indeed been reported in neural populations (van Steveninck et al., 1997; Jones et al., 2007; Butts et al., 2007; Churchland et al., 2010; White et al., 2012). Particularly well established is variability suppression at stimulus onset (Churchland et al., 2010). Here we predict that stimulus frequency should correlate with response variability at the population level, an analysis that, so far, has not been carried out in this form. Our results also imply that the silence of a neuron can be informative, since low activation probabilities reduce variability both in RBMs and neurons. Therefore, rare stimuli can be reliably conveyed by an informative pattern of silences in encoding units, a phenomenon that may relate to the synergistic silence observed in retinal codes (Schneidman et al., 2011).
The observed Zipf's law in the joint model distribution is a signature of criticality in a statistical physics model, where the system is poised near the transition between an ordered and a disordered regime (Mora and Bialek, 2011). Extending the finding that criticality is generally expected in the large-system limit in models with latent variables (Mastromatteo and Marsili, 2011; Schwab et al., 2014), we show that an optimal encoding strategy under a sparsity constraint yields this behavior already in small systems. Stimulus information then acts as a latent factor that not only affects the average information encoded, but also its entropy, which is a special case of the energy broadening described by Aitchison et al. (2016). Zipf's law has been observed in retinal population activity under a range of conditions (Tkačik et al., 2015), supporting the hypothesis that neural systems operate in this regime. However, the models investigated here would deviate from Zipf's law when driven by a stimulus ensemble with statistics that differ from the training data, as they lack any form of adaptation. In real neurons, noise correlations have been observed to adapt to stimulus statistics (Gutnisky and Dragoi, 2008). We hypothesize that this may be linked to an adjustment of population variability to stimulus statistics, which can be investigated directly by comparing code-word variability.
In this work, even rare events are faithfully encoded by the latent variables. However, biological systems filter out behaviorally irrelevant information, and rare events may be uninformative outliers. For example, appropriate regularization can discourage the modeling of rare events to reduce over-fitting. In such scenarios, parameter anisotropy decreases as the encoding entropy for rare stimuli increases. In this case, Zipf's law is still obtained near the optimal model size, which may be connected to the observation of criticality in intermediate layers of a deep network (Song et al., 2017). Yet more generally, any unobserved variable that leads to a broadening of latent code-word energies can give rise to Zipf's law (Aitchison et al., 2016); therefore it is plausible that additional constraints on learning could lead to different encoding solutions that also exhibit these statistics.
The specific form of parameter-space anisotropy we encounter is unexpected. Sensitive and insensitive directions in parameter space align closely with the latent units, and are not randomly distributed over the available parameters. The FIM and, in particular, the readily computable parameter axis intersections, therefore signal whether model size can be reduced without penalizing likelihood. Since parameter axis intersections depend on locally available correlations, a neuron can in principle evaluate its own importance in an encoding network, which in turn could trigger apoptotic pruning of unneeded afferents during nervous system development. Equally, entire latent units in artificial neural networks can be pruned using this strategy, extending a method of using the FIM diagonal as an approximate measure of the importance of single parameters (Le Cun et al., 1990). Moreover, intersections also identify parameter null-spaces, which can be exploited to bias the encoding of novel stimuli in an already trained network to minimize forgetting of previously learned information (Kirkpatrick et al., 2017).
Overall, this work demonstrates that an optimal encoding strategy is linked to the emergence of statistical criticality in latent encoders. Model optimality can be assessed by investigating the energy-entropy relationship of the encoding variables, or more directly via the sparseness of the FIM. Moreover, the specific structure of the FIM enables optimization of model size, and may yield biological insights into developmental pruning. On the other hand, deviations from the predicted energy-entropy relationship for optimal models may point towards a regularizing effect of additional variables in neural systems, such as metabolic or physiological constraints.

Methods
Datasets. Images from the CIFAR-10 data set (Krizhevsky and Hinton, 2009) were converted to gray scale and binarized around the median pixel intensity. 90,000 randomly selected circular patches of different radii were used as training data (Fig. 1A).
The learning rate was reduced in stages: 0.2, 0.1, 0.05, 0.01, 5e-3, 1e-3. 8 epochs were trained at each rate with mini-batch size 4. To estimate model energies, 350,000 states were sampled via 500 chains of Gibbs sampling, keeping one sample every 150 steps.
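For completeness, a minimal sketch of this staged-learning-rate training loop is given below. It assumes single-step contrastive divergence (CD-1) as the update rule, which is one common choice (the number of Gibbs steps per update used for the reported fits is an assumption here), and all function names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_update(batch, W, b_v, b_h, lr):
    """One CD-1 gradient step (Hinton, 2002) on a mini-batch of binary patterns."""
    ph = sigmoid(batch @ W + b_h)                                        # positive phase
    h = (rng.random(ph.shape) < ph).astype(float)
    v_neg = (rng.random(batch.shape) < sigmoid(h @ W.T + b_v)).astype(float)
    ph_neg = sigmoid(v_neg @ W + b_h)                                    # negative phase, one Gibbs step
    n = len(batch)
    W += lr * (batch.T @ ph - v_neg.T @ ph_neg) / n
    b_v += lr * (batch - v_neg).mean(axis=0)
    b_h += lr * (ph - ph_neg).mean(axis=0)

def train(data, n_hidden, rates=(0.2, 0.1, 0.05, 0.01, 5e-3, 1e-3),
          epochs_per_rate=8, batch_size=4):
    """Staged learning-rate schedule: 8 epochs per rate, mini-batches of 4 patterns."""
    n_v = data.shape[1]
    W = 0.01 * rng.standard_normal((n_v, n_hidden))
    b_v, b_h = np.zeros(n_v), np.zeros(n_hidden)
    for lr in rates:
        for _ in range(epochs_per_rate):
            perm = rng.permutation(len(data))
            for start in range(0, len(data), batch_size):
                cd1_update(data[perm[start:start + batch_size]], W, b_v, b_h, lr)
    return W, b_v, b_h
```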

Fisher Information
The Fisher information matrix (FIM, Eq. 3) is a positive semidefinite matrix that defines the curvature of a metric on the manifold of parameters, and indicates the sensitivity of the model to parameter changes. Divergence of an eigenvalue of the FIM indicates an abrupt change in the model distribution, i.e. a phase transition.
The FIM generalizes susceptibility and specific heat, physical quantities that diverge at critical points. For a vector w in parameter space, we define the sensitivity as

S(w) = (w^⊤ F(ϕ) w)^{1/2}.

The distribution of parameter sensitivity has in itself attracted interest (Daniels et al., 2008; Gutenkunst et al., 2007). For directions corresponding to eigenvectors of the Fisher information, the sensitivity is the square root of the corresponding eigenvalue. For changes in the k-th parameter, S_k = √F_kk. In the case of RBMs (Eq. 2), we can consider the definition of the FIM (Eq. 3) with the biases and weights being possible values of ϕ. Expanding the derivatives, one gets FIM entries of the form

F_{W_ij, W_kl} = ⟨v_i h_j v_k h_l⟩ − ⟨v_i h_j⟩⟨v_k h_l⟩,   F_{B_v,i, B_v,j} = ⟨v_i v_j⟩ − ⟨v_i⟩⟨v_j⟩,   F_{B_h,i, B_h,j} = ⟨h_i h_j⟩ − ⟨h_i⟩⟨h_j⟩,

where the brackets indicate averaging over the distribution Pr(v, h); these can be computed by sampling. The FIM diagonal summarizes the importance of individual units, and can be computed from locally available variances and covariances:

F_{W_ij, W_ij} = Var(v_i h_j),   F_{B_v,i, B_v,i} = Var(v_i),   F_{B_h,j, B_h,j} = Var(h_j).

Energy and entropy. In the RBM, the hidden-layer entropy conditioned on the stimulus can be calculated in closed form as

H_{h|v} = Σ_j [ g(a_{h|v,j}) − a_{h|v,j} f(a_{h|v,j}) ],

where a_{h|v} = W^⊤ v + B_h is the stimulus-conditioned hidden-layer activation vector, which depends on the visible pattern v as well as the weight matrix W and hidden biases B_h, f(x) = 1/(1+e^{−x}) is the sigmoid function, and g(x) = log(1+e^x). The expected conditional energy ⟨E_h⟩_{h|v} is computed via sampling, where each individual E_h is computed, up to a constant, as

E_h = −B_h^⊤ h − Σ_i g(B_{v,i} + W_i h),

where B_v is the vector of visible biases and W_i is the row of the weight matrix associated with the i-th visible unit. Energies are normalized using the energy of the lowest-energy pattern, estimated by sampling.
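As a sanity check on the closed-form conditional entropy, the following sketch (illustrative, with dimensions small enough that all hidden states can be enumerated) compares it against a brute-force computation:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(4)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
softplus = lambda x: np.log1p(np.exp(x))

n_v, n_h = 5, 4
W = rng.standard_normal((n_v, n_h))
b_h = rng.standard_normal(n_h)
v = rng.integers(0, 2, n_v).astype(float)

# Closed form: H(h|v) = sum_j [g(a_j) - a_j * f(a_j)], with a = W^T v + B_h (in nats).
a = v @ W + b_h
H_closed = np.sum(softplus(a) - a * sigmoid(a))

# Brute-force check by enumerating all 2^n_h hidden patterns of the factorized conditional.
p1 = sigmoid(a)
H_enum = 0.0
for h in product([0.0, 1.0], repeat=n_h):
    h = np.array(h)
    p = np.prod(np.where(h == 1, p1, 1 - p1))
    H_enum -= p * np.log(p)

assert np.allclose(H_closed, H_enum)
```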
Appendix 1: Background on free energy
We review the derivation of free energy in the context of RBMs (Hinton et al., 1995). Consider the problem of approximating a data distribution P(v) with a model distribution Q_ϕ(v) parameterized by ϕ.
In a latent-variable model, one identifies a distribution over latent factors, Q_ϕ(h), as well as a mapping from latent factors to data patterns, Q_ϕ(v|h). The latent variables approximate the distribution over the data, i.e.

Q_ϕ(v) = Σ_h Q_ϕ(v|h) Q_ϕ(h) ≈ P(v).
Such a model can be optimized by minimizing the negative log-likelihood of the data given the model parameters:

⟨−log Q_ϕ(v)⟩_{v∼P}.

Jensen's inequality provides an upper bound on the negative log-likelihood that can be easier to minimize. This minimization is equivalent to minimizing the KL divergence from the model to the data distribution:

D_KL(P ‖ Q_ϕ) = ⟨−log Q_ϕ(v)⟩_{v∼P} − H[P],

since the data entropy H[P] does not depend on ϕ. This connects to the free-energy equation derived by Hinton et al. (1995), which highlights the relationship between the conditional distributions Q_ϕ(h|v) and the visible pattern energies E_v = −log P(v). When free energy is minimized over the data distribution, the model energies E_ϕ(v) approximate the data energies and

E_v ≈ E_ϕ(v) = ⟨E_ϕ(v, h)⟩_{Q_ϕ(h|v)} − H[Q_ϕ(h|v)].

This relation is derived by Hinton et al. (1995), equation 5, from the perspective of minimizing communication cost, and in analogy to the Helmholtz free energy from thermodynamics. This brief derivation illustrates the free-energy relationship in the context of minimizing an upper bound on the negative log-likelihood of a latent-variable model.

Figure 1:
Figure 1: Effects of the model size on latent encoding. A Model schematic and example training data. B Model accuracy, quantified by the Kullback-Leibler divergence between model samples and held-out training data, improves as the hidden-layer size increases, up to a point. Results for three different sizes of stimulus patches (13, 21, 37 pixels) are shown. C Comparison of actual and predicted pattern probabilities for four hidden-layer sizes. Consistent with the increasing model accuracy (B), larger models predict the true distribution better. D,E Hidden-layer activation becomes sparser (D) as model size increases, and more decorrelated (E). 13 visible units were used for C-E.

Figure 2:
Figure 2: Parameter space anisotropy and statistical criticality in well-fit models. A Analysis of the Fisher Information Matrix (FIM) over a range of hidden-layer sizes (top to bottom; 13 visible units). From left to right, (1) FIM eigenvalue spectra λ_i (y-axis) over a range of inverse temperatures β indicate that model fits (β=1) past a certain size have a peak in their generalized susceptibility, indicating statistical criticality. Eigenvalues below 10^{-5} are truncated, and the largest and smallest eigenvalues are shown in red; (2) Important parameters in the leading FIM eigenvector align with individual hidden units, and become sparser for larger hidden layers. The eigenvector is displayed separately for the weights (matrix), and the visible (vertical) and hidden (horizontal) biases; (3) The average sensitivity of each parameter over all FIM eigenvectors, shown here as the square root of the FIM diagonal (Methods), also shows sparsity, indicating that beyond a certain size additional hidden units contribute little to model accuracy. Data are shown as in column 2; (4) Zipf laws emerge around the critical hidden-layer size of 30 for the ranked code-word probabilities in the hidden layer, and for the full model (joint). The dashed line indicates a slope of −1. B The average sensitivity of each parameter, measured by the trace of the FIM normalized by hidden-layer size, decreases as hidden-layer size grows. C Hidden unit projective fields from a model with 37 visible and 60 hidden units, ordered by relative sensitivity (rank indicated above each image). More important units (ranks 1-8) encode spatially simple features such as localized patches, while the least important ones (ranks 53-60) have complex features.

Figure 3:
Figure 3: Decomposition of free energy reveals an encoding strategy. A The trade-off between minimizing stimulus-evoked energy ⟨E_h⟩_{h|v} and maximizing stimulus-evoked entropy H_{h|v} varies for individual stimuli (blue dots). For models too small to represent the data (10 hidden units), the energy-entropy trade-off correlates with the stimulus energy. Above the critical model size (35, 90 hidden units), the energy-entropy trade-off varies little when averaged within a stimulus energy shell (gray bars; dots = mean, bars = interquartile range). B An encoding strategy of reduced entropy (variability) for higher-energy (higher-information) codewords emerges at the critical model size. Models shown use 13 visible units.

Figure 4:
Figure 4: Visible pattern energy emerges as an unobserved variable that predicts energy and entropy in stimulus-evoked latent patterns. Each plot shows the average entropy plus I_enc (y-axis) against the average energy ⟨E_h⟩_{h|v} (x-axis). I_enc = D_KL(Q_{h|v} ‖ Q_h), averaged over v ∈ V, is the average energy-entropy relationship for all stimuli, which becomes approximately constant above the critical model size. Color indicates the stimulus energy shell (color bar) and reflects the average energy and entropy of hidden patterns evoked by visible patterns near E_v. In too-small models (n=10), a subset of low-energy visible patterns map to low-energy states in the latent units. The relationship between visible and latent pattern energies shifts approaching the critical model size (n=20, 25). At the critical model size (n=35), an inverse relationship between visible energies and the entropy of latent representations emerges: high-energy visible patterns suppress variability in the latent units. The latent energy distribution is a mixture, parameterized by visible pattern energy, each component with the property that the sum of latent energies and entropies is constant. This relationship persists in larger models (n=60, 120). This gives rise to a 1/f power law in latent activation frequencies, a signature of statistical criticality. Models were fit to 13 visible units.

Figure S1:
Figure S1: Emergence of Zipf power-law statistics does not depend on stimulus statistics. At the critical temperature, the visible samples themselves display 1/f Zipf power-law statistics, and it is natural to ask whether the RBM fits inherit their power-law structure from the encoded stimuli. Here we illustrate the energy-entropy balance within stimulus energy shells for RBMs fit to two-dimensional lattice Ising models, sampled at a range of temperatures above and below the critical temperature of T_c = 2/ln(1+√2) ≈ 2.269. The energy-entropy balance converges on 1/f power-law statistics regardless of the data temperature (right column). However, the critical hidden-layer size (N) does decrease with temperature, illustrated here (middle column) by the increasing hidden-layer size displaying intermediate energy-entropy statistics. Too-small models (left column) exhibit a correlation between visible energy and entropy for training-data temperatures above T_c. Ising models were simulated on a 10×10 grid, and sampled via the Swendsen-Wang algorithm with 10k steps of burn-in and 100k training patterns drawn every 100 samples. 13-unit patches were presented to the RBM for training. All units are in bits.

Figure S2:
Figure S2: Regularization reduces sparsity and prevents emergence of variability suppression as an encoding strategy. A weak L2 penalty per weight (normalised by the size of the weight matrix for each model) promotes a higher-entropy hidden layer, and prevents the emergence of criticality in large models. The figure shows parameter sensitivity (left column), pattern distributions for the hidden layer (hidden) and the full model (joint) as a function of rank (middle column; both as in Figure 2, main text), and the entropy of the hidden layer conditioned on the stimulus, plotted as a function of the stimulus energy (right column; as in Figure 3, main text). This analysis shows that the FIM is less sparse than for unregularized models, with a more uniform utilization of the parameters also in large models. As a result, the state probability distributions for large models tend more towards uniform, and the model remains in the disordered phase, except close to the optimal model size (around 30 hidden units), where Zipf's law is still observed. However, regularization consistently prevents the suppression of encoding entropy at all model sizes investigated (up to 120 hidden units). Notably, the effect shown here is very similar to the regularizing influence of a large batch size during optimization. As in Figures 2 and 3 in the main text, the models had 13 visible units.