Minimum Description Length Codes Are Critical

In the Minimum Description Length (MDL) principle, learning from the data is equivalent to an optimal coding problem. We show that the codes that achieve optimal compression in MDL are critical in a very precise sense. First, when they are taken as generative models of samples, they generate samples with broad empirical distributions and with a high value of the relevance, defined as the entropy of the empirical frequencies. These results are derived for different statistical models (Dirichlet model, independent and pairwise dependent spin models, and restricted Boltzmann machines). Second, MDL codes sit precisely at a second order phase transition point where the symmetry between the sampled outcomes is spontaneously broken. The order parameter controlling the phase transition is the coding cost of the samples. The phase transition is a manifestation of the optimality of MDL codes, and it arises because codes that achieve a higher compression do not exist. These results suggest a clear interpretation of the widespread occurrence of statistical criticality as a characterization of samples which are maximally informative on the underlying generative process.


Introduction
It is not infrequent to find empirical data that exhibit broad frequency distributions in the most disparate domains. Broad distributions manifest in the fact that, if outcomes are ranked in order of decreasing frequency of occurrence, the rank-frequency plot spans several orders of magnitude on both axes. Figure 1 reports a few cases (see caption for details), but many more have been reported in the literature (see, e.g., [1,2]). A straight line in the rank plot corresponds to a power law frequency distribution, where the number of outcomes that are observed $k$ times behaves as $m_k \sim k^{-\mu-1}$ (with $1/\mu$ being the slope of the rank plot). Yet, as Figure 1 shows, empirical distributions are not always power laws, even though they are broad nonetheless. Countless mechanisms have been advanced to explain this behaviour [1,2,3,4,5,6].

It has recently been suggested that broad distributions arise from efficient representations, i.e., when the data samples relevant variables, which are those carrying the maximal amount of information on the generative process [7,8,9]. Such Maximally Informative Samples (MIS) are those for which the entropy of the frequency with which outcomes occur (called relevance in [8,9]) is maximal at a given resolution, which is measured by the number of bits needed to encode the sample (see Section 1.1). MIS exhibit power law distributions with the exponent $\mu$ governing the tradeoff between resolution and relevance [9].

This argument for the emergence of broad distributions is independent of any mechanism or model. A direct way to confirm this claim is to check that samples generated from models that are known to encode efficient representations are actually maximally informative. In this line, [10] found strong evidence that MIS occur in the representations that deep learning extracts from data. This paper explores the same issue in efficient coding as defined in Minimum Description Length [11].
Regarding empirical data as a message sent from nature, we expect it to be expressed in an efficient manner if relevant variables are chosen. This requirement can be made quantitative and precise, in information theoretic terms, following Minimum Description Length (MDL) theory [11]. MDL seeks the optimal encoding of data generated by a parametric model with unknown parameters (see Section 1.2). MDL derives a probability distribution over samples that embodies the requirement of optimal encoding. This distribution is the Normalized Maximum Likelihood (NML). This paper studies the NML as a generative process of samples and studies both its typical and atypical properties. In a series of cases, we find that samples generated by NMLs are typically close to being maximally informative, in the sense of [9], and that their frequency distribution is typically broad. In addition, we find that NMLs are critical in a very precise sense, because they sit at a second order phase transition that separates typical from atypical behavior. More precisely, we find that large deviations, for which the resolution attains atypically low values, exhibit a condensation phenomenon whereby all $N$ points in the sample coincide. This is consistent with the fact that the NML corresponds to an efficient coding of random samples generated from a model, so that codes achieving higher compression do not exist. Large deviations enforcing higher compression force parameters to corners of the allowed space where the model becomes deterministic.
The rest of the paper is organized as follows: the rest of the introduction lays the background of what follows by recalling the characterization of samples in terms of resolution and relevance, as in [9], and the derivation of NML in MDL, following [11]. Section 2.1 discusses typical properties of NML and Section 2.2 discusses large deviations of the coding cost. We conclude with a series of remarks on the significance of these results.
Setting the Scene

Let $\hat{s} = (s^{(1)}, \ldots, s^{(N)})$ be a sample of $N$ observations, $s^{(i)} \in \chi$, of a system where $\chi$ is a finite state space. We define $k_s$ as the number of observations in $\hat{s}$ for which $s^{(i)} = s$, i.e., the frequency of $s$. The number of states $s$ that occur $k$ times will be denoted as $m_k$. Both $k_s$ and $m_k$ depend on the sample $\hat{s}$.

We assume that $\hat{s}$ is generated in a series of independent experiments or observations, all in the same conditions. This is equivalent to taking $\hat{s}$ as a sequence of $N$ independent draws from an unknown distribution $p(s)$ (i.e., the generative process).
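As a concrete illustration of these definitions, the following minimal sketch computes $k_s$ and $m_k$ for a toy sample. The function name `frequency_profile` and the toy data are illustrative choices, not part of the original work.

```python
from collections import Counter

def frequency_profile(sample):
    """Return (k_s, m_k) for a sample given as a list of hashable states.

    k_s maps each observed state to the number of times it occurs;
    m_k maps each frequency k to the number of states observed k times.
    """
    k_s = Counter(sample)
    m_k = Counter(k_s.values())
    return dict(k_s), dict(m_k)

# Toy sample with N = 6 observations (hypothetical data)
sample = ["a", "b", "a", "c", "a", "b"]
k_s, m_k = frequency_profile(sample)
print(k_s)  # {'a': 3, 'b': 2, 'c': 1}
print(m_k)  # {3: 1, 2: 1, 1: 1}
```

By construction, $\sum_k k\, m_k = N$, which is the normalization used throughout.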

Resolution, Relevance and Maximally Informative Samples
The information content of the sample is measured by the number of bits needed to encode a single data point. This is given by the Shannon entropy [18]. Taking the frequency $k_s/N$ as the probability of point $s$, this leads to

$$\hat{H}[s] = -\sum_{s \in \chi} \frac{k_s}{N} \log \frac{k_s}{N} = -\sum_{k} \frac{k\, m_k}{N} \log \frac{k}{N},$$

where the hat indicates that the entropy is computed from the empirical frequency. This quantity specifies the level of detail of the description provided by the variable $s$. At one extreme, all the data points are equal, i.e., $s^{(i)} = s$, $\forall i = 1, \ldots, N$, such that $m_k = 0$ for $k = 1, \ldots, N-1$ and $m_{k=N} = 1$. With this, one finds that $\hat{H}[s] = 0$. At the other extreme, all the data points are different, i.e., $s^{(i)} \neq s^{(j)}$, $\forall i \neq j$, such that $m_{k=1} = N$ and $m_k = 0$, $\forall k > 1$. Hence, one finds that $\hat{H}[s] = \log N$. This is why we call $\hat{H}[s]$ the resolution, following [9]. The resolution clearly depends on the cardinality of $\chi$.

Only a part of $\hat{H}[s]$ provides information on the generative process $p(s)$, and this is given by the relevance

$$\hat{H}[k] = -\sum_{k} \frac{k\, m_k}{N} \log \frac{k\, m_k}{N}.$$

A simple argument, which is elaborated in detail in [9], is that the empirical frequency $k_s/N$ is the best estimate of $p(s)$, so conditional on $k_s$, the sample does not contain any further information on $p(s)$. Note that $k_s$ is a function of $s$, which implies $\hat{H}[k] \le \hat{H}[s]$ by the data processing inequality [18]. The difference $\hat{H}[s|k] = \hat{H}[s] - \hat{H}[k] = \sum_k \frac{k\, m_k}{N} \log m_k$ quantifies the amount of noise the sample contains.
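The two entropies can be evaluated directly from the degeneracies $m_k$. The sketch below is a minimal illustration using natural logarithms; the function name `resolution_and_relevance` is a hypothetical choice. It reproduces the two extreme cases discussed above.

```python
import math
from collections import Counter

def resolution_and_relevance(sample):
    """Return (H_s, H_k): the resolution and the relevance of a sample.

    Both are computed from m_k, the number of states observed k times:
      H_s = -sum_k (k m_k / N) log(k / N)
      H_k = -sum_k (k m_k / N) log(k m_k / N)
    """
    N = len(sample)
    k_s = Counter(sample)
    m_k = Counter(k_s.values())
    H_s = -sum(k * m / N * math.log(k / N) for k, m in m_k.items())
    H_k = -sum(k * m / N * math.log(k * m / N) for k, m in m_k.items())
    return H_s, H_k

# All points equal: zero resolution and zero relevance
print(resolution_and_relevance(["a"] * 8))
# All points different: resolution log N, zero relevance
print(resolution_and_relevance(list(range(8))))
# Intermediate case: 0 < H_k <= H_s
print(resolution_and_relevance(["a", "a", "b", "c"]))
```

In both extremes the relevance vanishes, which is why informative samples are found at intermediate resolution.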
We call $\hat{s}$ a Maximally Informative Sample (MIS) if $m_k$ is such that the relevance is maximal at a given resolution $\hat{H}[s] = H_0$. This implies the maximization of the functional

$$F = \hat{H}[k] + \mu \left( \hat{H}[s] - H_0 \right) + \lambda \left( \sum_k k\, m_k - N \right)$$

over $m_k$, where the Lagrange multipliers $\mu$ and $\lambda$ are adjusted to enforce the conditions $\hat{H}[s] = H_0$ and $\sum_k k\, m_k = N$, respectively. As shown in [7,8], MIS exhibit a power law frequency distribution

$$m_k = c\, k^{-\mu - 1},$$

where $c$ is a normalization constant such that $\sum_k k\, m_k = N$. As $H_0$ varies from 0 to $\log N$, MIS trace a curve in the resolution-relevance plane (see solid lines in Figures 2C,D and 3B,C) with $\mu$ as the negative slope. As discussed in [9,10], $\mu$ quantifies the trade-off between resolution and relevance: a decrease in resolution of one bit leads to an increase of $\mu$ bits in relevance. The point $\mu = 1$, which corresponds to Zipf's law, sets the limit beyond which further reduction in $\hat{H}[s]$ results in lossy compression, because, for $\mu < 1$, the increase in $\hat{H}[k]$ cannot compensate the loss in resolution.
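The power law can be recovered in a few lines. The following is a sketch that treats $m_k$ as a continuous variable and assumes the functional $F = \hat{H}[k] + \mu(\hat{H}[s] - H_0) + \lambda(\sum_k k\, m_k - N)$; the sign conventions of the multipliers are an assumption and do not affect the result.

$$
\begin{aligned}
0 = \frac{\partial F}{\partial m_k}
  &= -\frac{k}{N}\left[\log\frac{k\, m_k}{N} + 1\right] - \mu\,\frac{k}{N}\log\frac{k}{N} + \lambda k \\
\Rightarrow\quad \log\frac{k\, m_k}{N} &= \lambda N - 1 - \mu \log\frac{k}{N} \\
\Rightarrow\quad m_k &= e^{\lambda N - 1}\, N^{1+\mu}\, k^{-\mu-1} \equiv c\, k^{-\mu - 1}.
\end{aligned}
$$

The constant $c$ absorbs $\lambda$ and is then fixed by the normalization $\sum_k k\, m_k = N$.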

Minimum Description Length and the Normalized Maximum Likelihood
The main insight of MDL is that learning from data is equivalent to data compression [11]. In turn, data compression is equivalent to assigning a probability distribution over the space of samples. This section provides a brief derivation of this distribution, whereas the rest of the paper discusses its typical and atypical properties. We refer the interested reader to [11,19] for a more detailed discussion of MDL.
From an information theoretic perspective, one can think of the sample, $\hat{s}$, as a message generated by some source (e.g., nature) that we wish to compress as much as possible. This entails translating $\hat{s}$ into a sequence of bits. A code is a rule that achieves this for any $\hat{s} \in \chi^N$, and its efficiency depends on whether frequent patterns are assigned short codewords or not. Conversely, any code implies a distribution $P(\hat{s})$ over the space of samples, and the cost of encoding the sample $\hat{s}$ under the code $P$ is given by [18]

$$E = -\log P(\hat{s})$$

bits (assuming logarithm base two). Optimal compression is achieved when the code $P$ coincides with the data generating process [18].
Consider the situation where the data is generated as independent draws from a parametric model $f(s|\theta)$. If the value of $\theta$ were known, then the optimal code would be given by $P(\hat{s}) = \prod_i f(s^{(i)}|\theta) \equiv f(\hat{s}|\theta)$. MDL seeks to derive $P$ in the case where $\theta$ is not known. (Indeed, MDL aims at deriving efficient coding under $f$ irrespective of whether $f(s|\theta)$ is the "true" generative model or not. This allows one to compare different models and choose the one providing the most concise description of the data.) This applies, for example, to the situation where $\hat{s}$ is a series of experiments or observations aimed at measuring the parameters $\theta$ of a theory.
In hindsight, i.e., upon seeing the sample, the best code is $f(\hat{s}|\hat{\theta})$, where $\hat{\theta}(\hat{s})$ is the maximum likelihood estimator for $\theta$, which depends on the sample $\hat{s}$. Therefore, one can define the regret $R$ as the additional encoding cost that one needs to spend to encode the sample $\hat{s}$ if one uses the code $P(\hat{s})$ to compress $\hat{s}$, i.e.,

$$R = -\log P(\hat{s}) - \min_{\theta} \left[ -\log f(\hat{s}|\theta) \right].$$

Notice that $\min_\theta [-\log f(\hat{s}|\theta)] = -\log f(\hat{s}|\hat{\theta}(\hat{s}))$. $R$ is called the regret of $P$ relative to $f$ for sample $\hat{s}$ because it depends both on $P$ and on $\hat{s}$. MDL derives the optimal code, $P(\hat{s})$, that minimizes the regret, assuming that for any $P$ the source produces the worst possible sample [11]. The solution [20]

$$P(\hat{s}) = e^{-R} f(\hat{s}|\hat{\theta}(\hat{s})) \tag{7}$$

is called the Normalized Maximum Likelihood (NML). The optimal regret is given by

$$R = \log \sum_{\hat{s} \in \chi^N} f(\hat{s}|\hat{\theta}(\hat{s})), \tag{8}$$

which is known in MDL as the parametric complexity. (Notice that $e^{R}$ can be seen as a partition sum. Hence, throughout the paper, we shall refer to $e^{R}$ as the UC partition function.) For models in the exponential family, Rissanen showed that the parametric complexity is asymptotically given by [21]

$$R \simeq \frac{d}{2} \log \frac{N}{2\pi} + \log \int d\theta\, \sqrt{\det I(\theta)}, \tag{9}$$

where $d$ is the number of parameters and $I(\theta)$ is the Fisher information matrix with the matrix elements defined by

$$I_{ij}(\theta) = -\left\langle \frac{\partial^2 \log f(s|\theta)}{\partial \theta_i\, \partial \theta_j} \right\rangle_{\theta},$$

with $\langle \cdot \rangle_\theta$ an expectation over the parametric model $f(s|\theta)$ (see Appendix A for a simple derivation). The NML code is a universal code because it achieves a compression per data point which is as good as the compression that would be achieved with the optimal choice of $\theta$ when one has large enough samples. This is easy to see, because the regret $R/N$ per data point vanishes in the limit $N \to \infty$, hence the NML code achieves the same compression as $f(\hat{s}|\hat{\theta})$.
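The minimax optimality of the NML can be checked with a short argument. This is a sketch of the standard reasoning, not necessarily the one given in [20]. Write $P_{\rm NML}(\hat{s}) = f(\hat{s}|\hat{\theta}(\hat{s})) / \sum_{\hat{s}'} f(\hat{s}'|\hat{\theta}(\hat{s}'))$, whose regret equals $R = \log \sum_{\hat{s}'} f(\hat{s}'|\hat{\theta}(\hat{s}'))$ for every sample. Suppose a code $Q$ had worst-case regret strictly smaller than $R$. Then

$$
\begin{aligned}
\log \frac{f(\hat{s}|\hat{\theta}(\hat{s}))}{Q(\hat{s})} < R \quad \forall \hat{s}
\quad &\Rightarrow \quad Q(\hat{s}) > e^{-R} f(\hat{s}|\hat{\theta}(\hat{s})) = P_{\rm NML}(\hat{s}) \quad \forall \hat{s} \\
&\Rightarrow \quad \sum_{\hat{s}} Q(\hat{s}) > \sum_{\hat{s}} P_{\rm NML}(\hat{s}) = 1,
\end{aligned}
$$

which contradicts the normalization of $Q$. Hence no code achieves a smaller worst-case regret than the NML.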
Notice also that the optimal regret, $R$, in Equation (8) is independent of the sample $\hat{s}$. It indeed provides a measure of complexity of the model $f$ that can be used in model selection schemes. For exponential families, the MDL procedure penalizes models with a cost which equals the one obtained in Bayesian model selection [22] under a Jeffreys prior. Indeed, considering $P(\hat{s})$ as a generative model for samples, one can show that the induced distribution on $\theta$ is given by the Jeffreys prior (see Appendix A).
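For very small samples, the NML and the parametric complexity can be obtained by brute-force enumeration. The sketch below does this for a Bernoulli model (a two-state model, chosen here purely for illustration) and checks that the regret is the same for every sample. The function names are hypothetical.

```python
import math
from itertools import product

def max_likelihood(sample):
    """f(sample | theta_hat) for a Bernoulli model with theta_hat = fraction of ones."""
    N, ones = len(sample), sum(sample)
    p = ones / N
    # In Python 0.0 ** 0 == 1.0, which matches the convention 0 log 0 = 0
    return p ** ones * (1.0 - p) ** (N - ones)

def nml(N):
    """Return (P, R): the NML distribution over all 2^N samples and the parametric complexity."""
    samples = list(product((0, 1), repeat=N))
    weights = [max_likelihood(s) for s in samples]
    Z = sum(weights)  # the partition sum e^R
    return {s: w / Z for s, w in zip(samples, weights)}, math.log(Z)

P, R = nml(10)
# The regret log[f(s | theta_hat) / P(s)] equals R for every sample
regrets = [math.log(max_likelihood(s) / p) for s, p in P.items()]
print(R, min(regrets), max(regrets))
```

The enumeration scales as $2^N$, so this is only meant as a check of the definitions, not as a practical way of computing $R$.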

NML Codes Provide Efficient Representations
In this section we consider P as a generative model for samples and we investigate its typical properties for some representative statistical models.

Dirichlet Model
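Let us start by considering the Dirichlet model distribution $f(s|\theta) = \theta_s$, $\forall s \in \chi$. The parameters $\theta_s \ge 0$ are constrained by the normalization condition $\sum_{s \in \chi} \theta_s = 1$. Let $S = |\chi|$ denote the cardinality of $\chi$ and define, for convenience, $\rho = N/S$ as the average number of points per state. Because the observations are mutually independent, the likelihood of a sample $\hat{s}$ given $\theta = (\theta_1, \ldots, \theta_S)$ can be written as

$$f(\hat{s}|\theta) = \prod_{i=1}^{N} \theta_{s^{(i)}} = \prod_{s \in \chi} \theta_s^{k_s},$$

where $k_s$ is the number of times that the state $s$ occurs in the sample $\hat{s}$. From here, it can be seen that $\hat{\theta}_s = k_s/N$ is the maximum likelihood estimator for $\theta_s$. Thus, the universal code for the Dirichlet model can now be constructed as

$$P(\hat{s}) = e^{-R} \prod_{s \in \chi} \left( \frac{k_s}{N} \right)^{k_s},$$

which can be read as saying that, for each state $s$, the code needs $-k_s \log(k_s/N) + R/S$ bits. In terms of the frequencies, $\{k_1, \ldots, k_S\}$, the universal codes can be written as

$$P(k_1, \ldots, k_S) = e^{-R}\, \frac{N!}{\prod_{s \in \chi} k_s!} \prod_{s \in \chi} \left( \frac{k_s}{N} \right)^{k_s} \delta_{\sum_s k_s,\, N}, \tag{12}$$

wherein the multinomial coefficient, $N!/\prod_{s \in \chi} k_s!$, counts the number of samples with a given frequency profile $k_1, \ldots, k_S$. In order to compute the optimal regret $R$, we have to evaluate the partition function

$$e^{R} = \sum_{k_1, \ldots, k_S} \frac{N!}{\prod_{s} k_s!} \prod_{s} \left( \frac{k_s}{N} \right)^{k_s} \delta_{\sum_s k_s,\, N} = \frac{N!\, e^{N}}{N^{N}} \int_{-\pi}^{\pi} \frac{d\mu}{2\pi}\, e^{S \phi(i\mu)}, \tag{15}$$

where

$$\phi(z) = \rho z + \log \Phi(z)$$

and

$$\Phi(z) = \sum_{k=0}^{\infty} \frac{k^k e^{-k}}{k!}\, e^{-zk}.$$

The integral in Equation (15) is dominated by the value where the function $\phi$ attains its saddle point value $z^*(\rho)$, which is given by the condition

$$\langle k \rangle_{z^*} = \rho,$$

where the average $\langle \ldots \rangle_z$ is taken with respect to the distribution

$$q(k|z) = \frac{1}{\Phi(z)}\, \frac{k^k e^{-k}}{k!}\, e^{-zk}. \tag{19}$$

Gaussian integration around the saddle point leads then to

$$R \simeq S \phi(z^*) + \frac{1}{2} \log \frac{\rho}{\langle k^2 \rangle_{z^*} - \langle k \rangle_{z^*}^2},$$

where we used the identity $\Phi''(z)/\Phi(z) - \left[ \Phi'(z)/\Phi(z) \right]^2 = \langle k^2 \rangle_z - \langle k \rangle_z^2$ and Stirling's formula for $N!$. The distribution in Equation (12) can also be written by introducing the Fourier representation of the delta function,

$$P(k_1, \ldots, k_S) = e^{-R}\, \frac{N!\, e^{N}}{N^{N}} \int_{-\pi}^{\pi} \frac{d\mu}{2\pi}\, e^{i\mu N} \prod_{s \in \chi} \frac{k_s^{k_s} e^{-k_s}}{k_s!}\, e^{-i\mu k_s}.$$

For typical sequences $k_1, \ldots, k_S$, the integral is also dominated by the value $\mu = -i z^*(\rho)$ that dominates Equation (15), which means that the distribution factorizes as

$$P(k_1, \ldots, k_S) \simeq \prod_{s \in \chi} q(k_s|z^*). \tag{22}$$

This means that the NML is, to a good approximation, equivalent to $S$ independent draws from the distribution $q(k|z^*)$ or, equivalently, that the distribution $q(k|z^*)$ is the one that characterizes typical samples. This is fully confirmed by Figure 2A, which compares $q(k|z^*)$ with the empirical distribution of $k_s$ drawn from $P$. For large $k$, we find $q(k|z^*) \sim e^{-z^* k}/\sqrt{k}$, which shows that the distribution of frequencies is broad, with a cutoff at $1/z^*$. This underlying broad distribution is confirmed by Figure 2B, which shows the dependence of the degeneracy $m_k$ on the frequency $k$.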
In the regime where $\rho \gg 1$ and $k$ is large, the cutoff extends to large values of $k$ and we find $z^*(\rho) \simeq \frac{1}{2\rho}$ (see Appendix B.1). In addition, the parametric complexity can be computed explicitly via Equation (9) in this regime, with the result

$$R \simeq \frac{S-1}{2} \log \frac{N}{2\pi} + \log \frac{\pi^{S/2}}{\Gamma(S/2)}.$$

The coding cost of a typical sample is given by

$$E = -\log P(\hat{s}) = N \hat{H}[s] + R. \tag{26}$$

The number of samples with encoding cost $E$ can be computed in the following way. The number of samples that correspond to a given degeneracy $m_k$ of the states that occur $k_s = k$ times in $\hat{s}$ is given by Therefore, the number of samples with coding cost $E$ is where $\mathcal{M}(E)$ is the set of all sequences $\{m_k\}$ that are consistent with samples in $\chi^N$ and satisfy Equation (26). The last expression assumes $\log M! \simeq M \log M - M$, which is reasonable for $M = k\, m_k \gg 1$, i.e., when $\rho \gg 1$. In this regime we expect the sum over $\mathcal{M}(E)$ to be dominated by samples with maximal $\hat{H}[k]$. Indeed, Figure 2C,D shows that samples drawn from $P$ achieve values of $\hat{H}[k]$ close to the theoretical maximum, especially in the region $\rho \gg 1$.
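The approximation $z^*(\rho) \simeq 1/(2\rho)$ can be checked numerically. The sketch below assumes that the distribution characterizing typical samples has the form $q(k|z) \propto k^k e^{-(1+z)k}/k!$, which is consistent with the large-$k$ behaviour $e^{-z^* k}/\sqrt{k}$ quoted above, and solves $\langle k \rangle_z = \rho$ by bisection. The truncation of the sum and the bracketing interval are illustrative choices.

```python
import math

def log_weight(k, z):
    """log of k^k e^{-(1+z)k} / k!, with the convention 0^0 = 1."""
    if k == 0:
        return 0.0
    return k * math.log(k) - (1.0 + z) * k - math.lgamma(k + 1)

def mean_k(z):
    """Average of k under q(k|z); the sum is truncated where e^{-zk} is negligible."""
    k_max = int(60.0 / z) + 10
    weights = [math.exp(log_weight(k, z)) for k in range(k_max + 1)]
    return sum(k * w for k, w in enumerate(weights)) / sum(weights)

def solve_z(rho, lo=1e-4, hi=50.0, iters=40):
    """Solve mean_k(z) = rho by bisection on a log scale (mean_k decreases with z)."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if mean_k(mid) > rho:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

for rho in (2.0, 10.0):
    print(rho, solve_z(rho), 1.0 / (2.0 * rho))
```

The agreement with $1/(2\rho)$ improves as $\rho$ grows, as expected for an asymptotic statement.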

A Model of Independent Spins
In order to corroborate our results for the Dirichlet model, we study the properties of the universal codes for a model of independent spins, i.e., a paramagnet. For a single spin, $s = \pm 1$, in a local field $h$, the probability distribution is given by

$$P(s|h) = \frac{e^{hs}}{2 \cosh h}.$$

Thus, for a sample $\hat{s}$ of size $N$,

$$f(\hat{s}|h) = \frac{e^{N h m}}{(2 \cosh h)^N},$$

where $m = \frac{1}{N} \sum_{i=1}^{N} s^{(i)}$ is the magnetization. The maximum likelihood estimate for $h$ is $\hat{h}(m) = \tanh^{-1} m$, hence the universal code for a single spin can be written as

$$P(\hat{s}) = e^{-R} \left( \frac{1+m}{2} \right)^{\frac{N(1+m)}{2}} \left( \frac{1-m}{2} \right)^{\frac{N(1-m)}{2}},$$

where $R \simeq \frac{1}{2} \log \frac{\pi N}{2}$ (see Appendix B.2). Note that the number of samples with magnetization $m$ is the binomial coefficient $\binom{N}{\ell}$, which counts the arrangements of the $\ell = \frac{N + Nm}{2}$ up-spins ($s = +1$) and the $N - \ell$ down-spins ($s = -1$). Consequently, the magnetization for samples drawn from $P$ has a broad distribution given by the arcsine law (see Appendix B.2)

$$P(m) \simeq \frac{1}{\pi \sqrt{1 - m^2}}.$$

It is straightforward to see that the model of a single spin is equivalent to a Dirichlet model with two states, $\chi = \{-1, +1\}$. In terms of the number $\ell$ of up-spins, using $m = \frac{2\ell - N}{N}$, the NML for a single spin can be written as

$$P(\hat{s}) = e^{-R} \left( \frac{\ell}{N} \right)^{\ell} \left( 1 - \frac{\ell}{N} \right)^{N - \ell}.$$

The NML for a paramagnet with $n$ independent spins reads as

$$P(\hat{s}) = \prod_{i=1}^{n} e^{-R} \left( \frac{\ell_i}{N} \right)^{\ell_i} \left( 1 - \frac{\ell_i}{N} \right)^{N - \ell_i},$$

where $\ell_i$ is the number of observations in which spin $i$ is up.

Figure 3 reports the properties of the typical samples of the NML of a paramagnet. We observed that the frequency distribution of typical samples is broad (Figure 3A) and that typical samples attain values of $\hat{H}[k]$ very close to the maximum for a given value of $\hat{H}[s]$ (Figure 3B,C). As the size $N$ of the data increases, the NML enters the well-sampled regime where $\hat{H}[k] \simeq \hat{H}[s]$, indicating that the data processing inequality [18] is saturated. In this regime, typical samples are those which maximize the entropy $\hat{H}[s]$.
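For a single spin, the NML can be evaluated exactly by summing over the number $\ell$ of up-spins. The following sketch assumes that the weight of all samples with $\ell$ up-spins is $\binom{N}{\ell} (\ell/N)^{\ell} (1 - \ell/N)^{N-\ell}$, i.e., the multiplicity times the maximized likelihood. It compares the exact parametric complexity with $\frac{1}{2}\log\frac{\pi N}{2}$ and draws a few magnetizations. Function names are hypothetical.

```python
import math
import random

def log_nml_weight(l, N):
    """log of C(N, l) (l/N)^l (1 - l/N)^(N - l), with the convention 0 log 0 = 0."""
    def xlog(a):
        return a * math.log(a / N) if a > 0 else 0.0
    log_binom = math.lgamma(N + 1) - math.lgamma(l + 1) - math.lgamma(N - l + 1)
    return log_binom + xlog(l) + xlog(N - l)

def single_spin_nml(N):
    """Return (probs, R): the NML distribution of l = 0..N and the parametric complexity."""
    logs = [log_nml_weight(l, N) for l in range(N + 1)]
    R = math.log(sum(math.exp(x) for x in logs))
    return [math.exp(x - R) for x in logs], R

N = 1000
probs, R = single_spin_nml(N)
print(R, 0.5 * math.log(math.pi * N / 2))

# Draw a few magnetizations m = (2l - N)/N from the NML (seed fixed for reproducibility)
rng = random.Random(0)
draws = rng.choices(range(N + 1), weights=probs, k=5)
print([(2 * l - N) / N for l in draws])
```

Under the arcsine law, the probability that $|m| \le 1/2$ is $\frac{2}{\pi}\arcsin\frac{1}{2} = \frac{1}{3}$, which gives a simple additional check on `probs`.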

Sherrington-Kirkpatrick Model
In the following sections, we extend our findings to systems of interacting variables (graphical models) and discuss the properties of typical samples drawn from the corresponding NML distribution. We shall first consider models in which the observed variables interact directly (Sherrington-Kirkpatrick model) and then restricted Boltzmann machines, where the variables interact indirectly through hidden variables.
In this section, $s = (s_1, \ldots, s_n)$ is a configuration of $n$ spins $s_i \in \{\pm 1\}$. In the Sherrington-Kirkpatrick (SK) model, the distribution of $s$ considers all interactions up to two-body,

$$P(s|J, h) = \frac{1}{Z(J, h)} \exp\left( \sum_{i} h_i s_i + \sum_{i<j} J_{ij} s_i s_j \right),$$

where the partition function $Z(J, h)$ is a normalization constant which depends on the pairwise couplings, $J$, with $J_{ij} = J_{ji}$ being the coupling strength between $s_i$ and $s_j$, and on the external local fields, $h$. Thus, given a sample, $\hat{s} = (s^{(1)}, \ldots, s^{(N)})$, of $N$ observations, the likelihood reads as

$$f(\hat{s}|J, h) = \exp\left\{ N \left[ \sum_{i} h_i m_i + \sum_{i<j} J_{ij} c_{ij} - \phi(J, h) \right] \right\},$$

where $m_i = \frac{1}{N} \sum_{l=1}^{N} s_i^{(l)}$ and $c_{ij} = \frac{1}{N} \sum_{l=1}^{N} s_i^{(l)} s_j^{(l)}$ are the magnetization and pairwise correlation, respectively. Note that all the needed information about the SK model is encapsulated in the free energy, $\phi(J, h) = \log Z(J, h)$. Indeed, the maximum likelihood estimators for the couplings, $\hat{J}$, and local fields, $\hat{h}$, are the solutions of the self-consistency equations

$$\frac{\partial \phi(J, h)}{\partial h_i} = m_i, \qquad \frac{\partial \phi(J, h)}{\partial J_{ij}} = c_{ij}.$$

The universal code for the SK model then reads as

$$P(\hat{s}) = e^{-R} \exp\left\{ N \left[ \sum_{i} \hat{h}_i m_i + \sum_{i<j} \hat{J}_{ij} c_{ij} - \phi(\hat{J}, \hat{h}) \right] \right\}. \tag{42}$$

However, unlike for the Dirichlet model and the paramagnet model, the UC partition function, $e^{R}$, for the SK model is analytically intractable. (For SK models which possess some particular structures, a calculation of the UC partition function has been done in [23].) We therefore resort to a Markov chain Monte Carlo (MCMC) approach to sample the universal codes (see Appendix C.1). Figure 4A and C show the properties of the typical samples drawn from the universal codes of the SK model in Equation (42).
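The idea of sampling a universal code by MCMC without knowing $e^{R}$ can be illustrated with a Metropolis scheme. The sketch below is not the procedure of Appendix C.1, which is not reproduced here. It is a minimal example applied to the Dirichlet model, where the maximized likelihood $\prod_s (k_s/N)^{k_s}$ is available in closed form, so that each single-point move only changes two counts. For the SK model, every proposal would instead require re-solving the maximum likelihood equations. All names are hypothetical.

```python
import math
import random

def g(k):
    """k log k, with the convention 0 log 0 = 0."""
    return k * math.log(k) if k > 0 else 0.0

def metropolis_nml_dirichlet(N, S, sweeps, seed=0):
    """Metropolis sampling of P(sample) proportional to prod_s (k_s / N)^{k_s}.

    One move reassigns a randomly chosen data point to a uniformly chosen state;
    the proposal is symmetric, so the acceptance ratio is the ratio of the weights.
    The normalization e^R is never needed.
    """
    rng = random.Random(seed)
    sample = [rng.randrange(S) for _ in range(N)]
    counts = [0] * S
    for s in sample:
        counts[s] += 1
    for _ in range(sweeps * N):
        i = rng.randrange(N)
        a, b = sample[i], rng.randrange(S)
        if a == b:
            continue
        # Change in log-weight: only the counts of states a and b are modified
        delta = g(counts[a] - 1) + g(counts[b] + 1) - g(counts[a]) - g(counts[b])
        if delta >= 0 or rng.random() < math.exp(delta):
            sample[i] = b
            counts[a] -= 1
            counts[b] += 1
    return sample, counts

sample, counts = metropolis_nml_dirichlet(N=200, S=20, sweeps=50)
print(sorted(counts, reverse=True))
```

The number of sweeps needed for equilibration is not assessed here and would have to be checked in any real application.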

Restricted Boltzmann Machines
We consider a restricted Boltzmann machine (RBM) wherein one has a layer composed of $n_v$ independent visible boolean units, $v = (v_1, \ldots, v_{n_v})$, which are interacting with $n_h$ independent hidden boolean units, $h = (h_1, \ldots, h_{n_h})$, in another layer, where $v_i, h_j \in \{0, 1\}$.
The probability distribution can be written down as

$$P(v, h|\theta) = \frac{1}{Z(\theta)} \exp\left( \sum_{i=1}^{n_v} a_i v_i + \sum_{j=1}^{n_h} b_j h_j + \sum_{i,j} v_i w_{ij} h_j \right),$$

where the partition function $Z(\theta)$ is a normalization constant and $\theta = (w, a, b)$ collects the parameters.

(see purple lines in Figure 5). In other words, $\beta = 0$ marks a localization transition where the symmetry between the states in $\chi$ is broken, because one state $s$ is sampled an extensive number of times, $k_s \propto N$. One direct way to see this is to consider the Dirichlet model and use the "tilted" distribution in Equation (56) to compute the distribution of $k_s$ following the same steps leading to Equation (19), where again $z$ is fixed by the condition $\sum_k q_\beta(k|z)\, k = N/S$. For $\beta \ge 0$, we again find, as in Equation (22), that $k_s$ can be considered as independent draws from the same distribution $q_\beta(k|z)$. For $\beta < 0$, we find that the distribution $q_\beta(k|z)$ develops a sharp maximum at $k = N$, indicating that, as mentioned above, the sample concentrates on one state $s$.

This behavior is generic whenever the underlying model $f(s|\theta)$ itself localizes for certain values $\bar{\theta}$ of the parameters, i.e., when $f(s|\bar{\theta}) = \delta_{s, s_0}$. In order to see this, notice that, in general, we can write

$$\frac{1}{N} \log f(\hat{s}|\theta) = \sum_{s \in \chi} \hat{p}_s \log f(s|\theta). \tag{57}$$

Thus, by inserting the identity $e^{-N \hat{H}[s] + N \hat{H}[s]}$, the NML distribution in Equation (7) can be re-cast as

$$P(\hat{s}) = e^{-R - N \hat{H}[s] - N D_{KL}(\hat{p} \| \hat{\theta}(\hat{s}))},$$

where $\hat{p}_s = k_s/N$ is the empirical distribution and

$$D_{KL}(\hat{p} \| \theta) = \sum_{s \in \chi} \hat{p}_s \log \frac{\hat{p}_s}{f(s|\theta)}$$

is a Kullback-Leibler divergence. Now, we observe that where the inequality in Equation (61) derives from the fact that $\hat{\theta}(\hat{s})$, the maximum likelihood estimator for sample $\hat{s}$, is replaced by a generic value $\theta_0$ and, consequently, $D_{KL}(\hat{p} \| \hat{\theta}) \le D_{KL}(\hat{p} \| \theta_0)$. The equality in Equation (62), instead, derives from the choice $\theta_0 = \bar{\theta}$ such that $f(s|\bar{\theta}) = \delta_{s, s_0}$. Under this choice, only the terms corresponding to "localized" samples, where $s^{(i)} = s_0$ for all points in the sample, survive in the sum on $\hat{s}$. For such localized samples, $\hat{H}[s] = D_{KL}(\hat{p} \| \theta_0) = 0$, hence Equation (62) follows.
Because of the logarithmic dependence of the regret $R$ on $N$ (see Equation (9)), Equation (62) implies that, for all $\beta$, $\phi(\beta) \ge 0$ for $N \gg 1$. Given that $\hat{H}[s] \ge 0$ in Equation (55), then $E \ge 0$ and therefore, Equation (53) implies that $\phi(\beta)$ is a non-decreasing function of $\beta$. In addition, $\phi(0) = 0$ by Equation (50). Taken together, these facts require that $\phi(\beta) = 0$ for all values $\beta \le 0$. On the other hand, for $\beta > 0$, the function $\phi(\beta)$ is analytic with all finite derivatives, which correspond to higher moments of $\hat{H}[s]$ under $P_\beta$. Therefore, $\beta = 0$, which corresponds to the typical behavior of the NML, coincides with a second order phase transition point because the function $\phi(\beta)$ exhibits a discontinuity in the second derivative. In terms of $P_\beta(\hat{s})$, the phase transition separates a region ($\beta \ge 0$) where all samples $\hat{s}$ have a finite probability from a region ($\beta < 0$) where only one sample, the one with $s^{(i)} = s_0$, $\forall i$, has non-zero probability and $\hat{H}[s] = 0$.
The phase transition is a natural consequence of the fact that the NML provides an efficient coding of samples generated from $f(s|\theta)$. It states that codes $P_\beta$ that achieve a compression different from the one achieved by the NML only exist for higher coding costs. Codes with lower coding cost only describe non-random samples that correspond to deterministic models $f(s|\bar{\theta}) = \delta_{s, s_0}$.
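The localization for $\beta < 0$ can be illustrated on the smallest possible case, a Dirichlet model with two states. Since the definition of the tilted distribution (Equation (56)) is not reproduced in this section, the sketch below assumes $P_\beta(\hat{s}) \propto P(\hat{s})^{1-\beta}$, i.e., a reweighting of the NML by $e^{\beta E}$ with $E = -\log P(\hat{s})$. This choice is consistent with $\phi(0) = 0$ and with $\phi$ being non-decreasing, but it is an assumption. The function name is hypothetical.

```python
import math

def xlog(a, N):
    """a log(a/N), with the convention 0 log 0 = 0."""
    return a * math.log(a / N) if a > 0 else 0.0

def tilted_max_fraction(N, beta):
    """Mean of max(l, N - l)/N under the assumed tilted code, for two states.

    The weight of all samples with l points in the first state is
    C(N, l) * f(sample | theta_hat)^(1 - beta); only the likelihood is tilted,
    the multiplicity C(N, l) is not.
    """
    log_w = []
    for l in range(N + 1):
        log_binom = math.lgamma(N + 1) - math.lgamma(l + 1) - math.lgamma(N - l + 1)
        log_ml = xlog(l, N) + xlog(N - l, N)  # log f(sample | theta_hat)
        log_w.append(log_binom + (1.0 - beta) * log_ml)
    top = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(x - top) for x in log_w]
    return sum(wl * max(l, N - l) for l, wl in enumerate(w)) / (N * sum(w))

for beta in (-0.5, 0.0, 0.5):
    print(beta, tilted_max_fraction(1000, beta))
```

Under this assumption, the largest frequency is close to $N$ for $\beta < 0$ (one state takes essentially all the points), it is broadly distributed at $\beta = 0$, and it approaches $N/2$ for $\beta > 0$.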

Discussion
The aim of this paper is to elucidate the properties of efficient representations of data corresponding to universal codes that arise in MDL. Taking the NML as a generative model, we find that typical samples are characterized by broad frequency distributions and that they achieve values of the relevance which are close to the maximal possible $\hat{H}[k]$.
In addition, we find that samples generated from the NML are critical in a very precise sense. If we force the NML to use fewer bits to encode samples, then the code localizes on deterministic samples. This is a consequence of the fact that if there were codes that required fewer bits, then the NML would not be optimal.
This contributes to the discussion on the ubiquitous finding of statistical criticality [1,4] by providing a clear understanding of its origin. It suggests that statistical criticality can be related to a precise second order phase transition in terms of large deviations of the coding cost. This phase transition separates random samples that span a large range of possible outcomes (the set $\chi$ in the models discussed above) from deterministic ones, where one outcome occurs most of the time. The phase transition is accompanied by a spontaneous symmetry breaking in the permutation between samples. The frequencies of outcomes in the symmetric phase ($\beta \ge 0$) are generated as independent draws from the same distribution, which is sharply peaked for $\beta > 0$, as can be checked in the case of the Dirichlet model. Instead, for $\beta < 0$, only one state is sampled. In the typical case, $\beta = 0$, the symmetry between outcomes is weakly broken, as there are outcomes that occur more frequently than others. At $\beta = 0$, the samples maintain the maximal discriminative power over outcomes. This type of phase transition is very generic, and it occurs in large deviations whenever the underlying distribution develops fat tails (see, e.g., [27]).
This leads to the conjecture that broad distributions arise as a consequence of efficient coding. More precisely, broad distributions arise when the variables sampled are relevant, i.e., when they provide an optimal representation. This is precisely the point which has been made in [7,8,9]. The results in the present paper add a new perspective whereby maximally informative samples can be seen as universal codes.
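Hence, the parametric complexity, $R = \log \sum_{\hat{s}} f(\hat{s}|\hat{\theta}(\hat{s}))$, is asymptotically given by Equation (9) when $N \gg 1$. Notice also that $P(\hat{s})$ induces a distribution over the space of parameters $\theta$. With the choice the same procedure as above shows that

$$P(\theta) = \frac{\sqrt{\det I(\theta)}}{\int d\theta'\, \sqrt{\det I(\theta')}},$$

which is the Jeffreys prior.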

B Calculating the Parametric Complexity
In this section, we calculate the parametric complexity for the Dirichlet model in the regime $\rho = N/S \gg 1$, where $N$ is the number of observations in the sample $\hat{s}$ and $S$ is the size of the state space $\chi$, and for the paramagnetic Ising model.

B.1 Dirichlet Model
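In the regime where $\rho \gg 1$ and $k$ is large, such that we can employ Stirling's approximation, $k! \simeq \sqrt{2\pi k}\, k^k e^{-k}$, the normalization can be calculated as

where $\epsilon$ is the learning rate parameter. The corresponding gradients for the parameters, $w$, $a$ and $b$, can then be written down respectively as

$$\frac{\partial \log L(\theta)}{\partial w_{ij}} = \langle v_i h_j \rangle_{\rm data} - \langle v_i h_j \rangle_{\rm model}, \qquad \frac{\partial \log L(\theta)}{\partial a_i} = \langle v_i \rangle_{\rm data} - \langle v_i \rangle_{\rm model}, \qquad \frac{\partial \log L(\theta)}{\partial b_j} = \langle h_j \rangle_{\rm data} - \langle h_j \rangle_{\rm model},$$

where the first terms denote averages over the data distribution while the second terms denote averages over the model distribution.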
Here, we use the contrastive divergence (CD) approach, which is a variation of the steepest gradient descent of $L(\theta)$. Instead of performing the integration over the model distribution, CD approximates it by averaging over the distribution obtained after taking $\kappa$ Gibbs sampling steps away from the data distribution.
To do this, we exploit the factorizability of the conditional distributions of the RBM. In particular, the conditional probability for the forward propagation (i.e., sampling the hidden variables given the visible variables) from $v$ to $h_j$ reads as

$$P(h_j = 1|v) = \frac{1}{1 + \exp\left( -b_j - \sum_i v_i w_{ij} \right)}.$$

Similarly, the conditional probability for the backward propagation (i.e., sampling the visible variables from the hidden variables) from $h$ to $v_i$ reads as

$$P(v_i = 1|h) = \frac{1}{1 + \exp\left( -a_i - \sum_j w_{ij} h_j \right)}.$$

The Gibbs sampling is done by propagating a sample, $v^{(k)} = v^{(k)}(0)$, forward and backward $\kappa$ times: $v^{(k)}(0) \to h^{(k)}(0) \to v^{(k)}(1) \to \ldots \to h^{(k)}(\kappa - 1) \to v^{(k)}(\kappa) \to h^{(k)}(\kappa)$. Thus, the Gibbs sampling approximates the gradient in Equation (96) as

$$\frac{\partial \log L(\theta)}{\partial w_{ij}} \approx \left\langle v_i(0)\, h_j(0) \right\rangle - \left\langle v_i(\kappa)\, h_j(\kappa) \right\rangle,$$

where the averages run over the samples $k$ in the batch.

In the CD approach, each parameter update for a batch is called an epoch. While a larger $\kappa$ gives a better approximation of the average over the model distribution, it also induces an additional computational cost. To find the global minimum more efficiently, we randomly divided the samples into groups of mini-batches. This approach introduces stochasticity and consequently reduces the likelihood of the learning algorithm being confined in a local minimum. However, a mini-batch approach can result in data-biased sampling. To circumvent this issue, we adopted the Persistent CD (PCD) algorithm, where the Gibbs sampling extends to several epochs, each using different mini-batches. In the PCD approach, the initial visible variable configuration, $v^{(k)}(0)$, was set to random for the first mini-batch, but the final configurations, $(v^{(k)}(\kappa), h^{(k)}(\kappa))$, of the current batches become the initial configurations for the next mini-batches. In this paper, we performed Gibbs sampling with $\kappa = 10$ steps, and the parameters, $\theta$, were updated for 2500 epochs at a rate $\epsilon = 0.01$ with 200 mini-batches per epoch. For other details regarding inference of the parameters of the RBM, we refer the reader to [24,25].
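A single CD-$\kappa$ update can be sketched in a few lines of plain Python. The sketch assumes the standard RBM convention in which $a$ are the visible biases, $b$ the hidden biases and $w$ the weights; this assignment is an assumption, as are all function names. It implements plain CD rather than the persistent variant, and it is not the implementation in the authors' repository.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sample_hidden(v, w, b, rng):
    """Forward step: draw each h_j with P(h_j = 1 | v) = sigmoid(b_j + sum_i v_i w_ij)."""
    return [int(rng.random() < sigmoid(b[j] + sum(v[i] * w[i][j] for i in range(len(v)))))
            for j in range(len(b))]

def sample_visible(h, w, a, rng):
    """Backward step: draw each v_i with P(v_i = 1 | h) = sigmoid(a_i + sum_j w_ij h_j)."""
    return [int(rng.random() < sigmoid(a[i] + sum(w[i][j] * h[j] for j in range(len(h)))))
            for i in range(len(a))]

def cd_step(batch, w, a, b, kappa, eps, rng):
    """One CD-kappa parameter update (in place) from a batch of visible configurations."""
    n_v, n_h = len(a), len(b)
    dw = [[0.0] * n_h for _ in range(n_v)]
    da, db = [0.0] * n_v, [0.0] * n_h
    for v0 in batch:
        h0 = sample_hidden(v0, w, b, rng)
        v, h = v0, h0
        for _ in range(kappa):  # kappa backward/forward Gibbs steps
            v = sample_visible(h, w, a, rng)
            h = sample_hidden(v, w, b, rng)
        for i in range(n_v):
            da[i] += v0[i] - v[i]
            for j in range(n_h):
                dw[i][j] += v0[i] * h0[j] - v[i] * h[j]
        for j in range(n_h):
            db[j] += h0[j] - h[j]
    scale = eps / len(batch)
    for i in range(n_v):
        a[i] += scale * da[i]
        for j in range(n_h):
            w[i][j] += scale * dw[i][j]
    for j in range(n_h):
        b[j] += scale * db[j]

# Toy usage: 4 visible and 2 hidden units, hypothetical data
rng = random.Random(0)
w = [[0.01 * rng.gauss(0, 1) for _ in range(2)] for _ in range(4)]
a, b = [0.0] * 4, [0.0] * 2
data = [[1, 1, 0, 0], [0, 0, 1, 1]] * 5
for _ in range(20):
    cd_step(data, w, a, b, kappa=1, eps=0.01, rng=rng)
```

A persistent variant would keep the final configuration of each chain and reuse it as the starting point for the next mini-batch, instead of restarting from the data.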

C.3 Source Codes
All the calculations in this manuscript were done using custom scripts written in Python 3. The source codes are accessible online (https://github.com/rcubero/UniversalCodes).
FIGURE 1. Rank plot of the frequencies across a broad range of datasets. Log-log plots of rank versus frequency from diverse datasets: survey of 4962 species of trees across 116 families sampled from the Amazonian lowlands [12], survey of 1053 species of trees across 376 genera and 89 families sampled across a 50 hectare plot in the Barro Colorado Island (BCI), Panama [13], counts indicating the inclusion of each of 13,001 LEGO parts in 2613 distributed toy sets [14,15] and the number of genes that are regulated by each of the 203 transcription factors (TFs) in E. coli [16] and 188 TFs in S. cerevisiae (yeast) [17] through binding with transcription factor binding sites (TFBS).

FIGURE 2. Properties of the typical samples generated from the NML of the Dirichlet model. (A) A plot showing the frequency distribution of the typical samples of the Dirichlet NML code. Given $S$, the cardinality of the state space, $\chi$, with $S = 1.0 \times 10^3$ (orange dots), $5.0 \times 10^3$ (green squares), and $1.0 \times 10^4$ (red triangles), we compute the average frequency distribution across 100 generated samples from the Dirichlet NML of size $N = 10S$ such that the average frequency per state, $\rho$, is fixed. This is compared against the theoretical calculations (solid black line) for $q(k|z^*)$ in Equation (19). (B) Plot showing the degeneracy, $m_k$, of the frequencies, $k$, in a representative typical sample of length $N = 10^3$ generated from the Dirichlet NML code with average frequency per state: $\rho = 100$ (yellow triangle), $\rho = 10$ (orange x-mark) and $\rho = 2$ (red cross). The corresponding dashed lines depict the best-fit line. (C-D) Plots of $\hat{H}[s]$ versus $\hat{H}[k]$ for the typical samples of the Dirichlet NML code. For a fixed size of the data, $N$ ($N = 10^3$ in C and $N = 10^4$ in D), we have drawn 100 samples from the Dirichlet NML code varying $\rho$, ranging from 2 to 100. The results are compared against the $\hat{H}[k]$ and $\hat{H}[s]$ for maximally informative samples (MIS, solid black line) and random samples (dashed black lines). For the MIS, the theoretical lower bound is reported [8]. For the random samples, we compute the averages of $\hat{H}[s]$ and $\hat{H}[k]$ over $10^7$ realizations of random distributions of $N$ balls in $L$ boxes, with $L$ ranging from 2 to $10^7$. Here, each box corresponds to one state $s = 1, \ldots, L$ and $k_s$ is the number of balls in box $s$. Note that all the calculated values for $\hat{H}[k]$ and $\hat{H}[s]$ are normalized by $\log N$.

FIGURE 3. Properties of typical samples for the NML codes of the paramagnet. (A) Plots showing the degeneracy, $m_k$, of the frequencies, $k$, in a representative typical sample of length $N = 10^4$ generated from the NML of a paramagnet with different numbers of independent spins: $n = 4$ (blue star), $n = 12$ (red cross) and $n = 20$ (yellow diamond). The corresponding dashed lines depict the best-fit line. (B-C) Plots of $\hat{H}[k]$ versus $\hat{H}[s]$ of the typical samples generated from the paramagnet NML code for varying sizes of the data, $N = 10^4$ (B) and $N = 10^5$ (C), and for varying number of spins, $n$, ranging from 3 to 20. Given $N$ and $n$, we compute $\hat{H}[k]$ and $\hat{H}[s]$ over 100 realizations of the NML code of a paramagnet. The results are compared against the $\hat{H}[k]$ and $\hat{H}[s]$ for maximally informative samples (solid black line) and random samples (dashed black line) as described in Figure 2. Note that all the calculated $\hat{H}[k]$ and $\hat{H}[s]$ are normalized by $\log N$.

FIGURE 4. Properties of typical samples for the NML codes of two graphical models: the Sherrington-Kirkpatrick (SK) model and the restricted Boltzmann machine (RBM). Left panels (A,C) show plots of the degeneracy, $m_k$, of the frequency, $k$, for representative typical samples generated from the NML codes for the SK model (A) and the RBM given a number of hidden variables, $n_h = 7$ (C), for different numbers of (visible) spins, $n$. The corresponding dashed lines show the best-fit lines. On the other hand, right panels (B,D) show plots of $\hat{H}[k]$ versus $\hat{H}[s]$ of the typical samples drawn from the NML codes for the SK model (B) and the RBM with $n_h = 7$ (D) for $N = 10^3$ and for varying number of spins, $n$, ranging from 3 to 12. Given $N$ and $n$ of a graphical model, we compute $\hat{H}[k]$ and $\hat{H}[s]$ for 100 samples drawn from the respective NML codes through a Markov chain Monte Carlo (MCMC) approach (see Appendix C.1). Note that, for the RBM, varying $n_h$ does not qualitatively affect the observations made in this paper. As before, $\hat{H}[k]$ and $\hat{H}[s]$ are normalized by $\log N$ and the typical NML samples are compared against maximally informative samples (solid black line) and random samples (dashed black line) as described in Figure 2.

FIGURE 5. Typical realizations of large deviations from the NML code of the Dirichlet model. For a fixed parameter, $\beta$, ranging from $\beta = -1$ to $\beta = 1$, samples are obtained from $P_\beta$ in Equation (56) for varying length of the dataset, $N$ ($N = 10^4$ in solid lines with circle markers and $N = 10^5$ in dashed lines with square markers). The resolution $\hat{H}[s]$ normalized by $\log N$ (in green lines) and the maximal frequency $k_s$ normalized by $N$ (in purple lines) are calculated as an average over 100 realizations of $P_\beta$ given $\beta$. The point $\beta = 0$ corresponds to the typical samples that are realized from the Dirichlet NML code in Equation (12).