Thermodynamics of the Ising Model Encoded in Restricted Boltzmann Machines

The restricted Boltzmann machine (RBM) is a two-layer energy-based model that uses its hidden-visible connections to learn the underlying distribution of visible units, whose interactions are often complicated by high-order correlations. Previous studies on the Ising model of small system sizes have shown that RBMs are able to accurately learn the Boltzmann distribution and reconstruct thermal quantities at temperatures away from the critical point Tc. How the RBM encodes the Boltzmann distribution and captures the phase transition is, however, not well explained. In this work, we perform RBM learning of the 2d and 3d Ising models and carefully examine how the RBM extracts useful probabilistic and physical information from Ising configurations. We find several indicators derived from the weight matrix that can characterize the Ising phase transition. We verify that the hidden encoding of a visible state tends to have an equal number of positive and negative units, whose sequence is randomly assigned during training and can be inferred by analyzing the weight matrix. We also explore the physical meaning of the visible energy and loss function (pseudo-likelihood) of the RBM and show that they can be harnessed to predict the critical point or estimate physical quantities such as entropy.


Introduction
The tremendous success of deep learning in multiple areas over the last decade has revived the interplay between physics and machine learning, in particular neural networks [1]. On the one hand, (statistical) physics ideas [2], such as the renormalization group (RG) [3], the energy landscape [4], free energy [5], glassy dynamics [6], jamming [7], Langevin dynamics [8], and field theory [9], shed some light on the interpretation of deep learning and statistical inference in general [10]. On the other hand, machine learning and deep learning tools are harnessed to solve a wide range of physics problems, such as interaction potential construction [11], phase transition detection [12,13], structure encoding [14], physical concepts' discovery [15], and many others [16,17]. At the very intersection of these two fields lies the restricted Boltzmann machine (RBM) [18], which serves as a classical paradigm to investigate how an overarching perspective could benefit both sides.
The RBM uses hidden-visible connections to encode (high-order) correlations between visible units [19]. Its precursor, the (unrestricted) Boltzmann machine, was inspired by spin glasses [20,21] and is often used in the inverse Ising problem to infer physical parameters [22][23][24]. The restriction of hidden-hidden and visible-visible connections in RBMs allows for more efficient training algorithms and, therefore, leads to recent applications in Monte Carlo simulation acceleration [25], quantum wavefunction representation [26,27], and polymer configuration generation [28]. Deep neural networks formed by stacks of RBMs have been mapped onto the variational RG due to their conceptual similarity [29]. RBMs are also shown to be equivalent to tensor network states from quantum many-body physics [30] and interpretable in light of statistical thermodynamics [31][32][33]. As simple as it seems, energy-based models like the RBM could eventually become the building blocks of autonomous machine intelligence [34].
Besides the above-mentioned efforts, the RBM has also been applied extensively in the study of the minimal model for second-order phase transitions, the Ising model. For the small systems under investigation, it was found that RBMs with enough hidden units can encode the Boltzmann distribution, reconstruct thermal quantities, and generate new Ising configurations fairly well [35][36][37]. The visible → hidden → visible · · · generating sequence of the RBM can be mapped onto an RG flow in physical temperature (often towards the critical point) [38][39][40][41][42]. However, the mechanism and power of the RBM to capture physics concepts and principles have not been fully explored. First, in what way is the Boltzmann distribution of the Ising model learned by the RBM? Second, can the RBM learn and even quantitatively predict the phase transition without extra human knowledge? An affirmative answer to the second question is particularly appealing, because simple unsupervised learning methods such as principal component analysis (PCA) using configuration information alone do not provide quantitative predictions for the transition temperature [43][44][45], and supervised learning with neural networks requires human labeling of the phase type or temperature of a given configuration [46,47].
In this article, we report a detailed numerical study on RBM learning of the Ising model with a system size much larger than those used previously. The purpose is to thoroughly dissect the various parts of the RBM and reveal how each part contributes to the learning of the Boltzmann distribution of the input Ising configurations. Such understanding allows us to extract several useful machine learning estimators or predictors for physical quantities, such as entropy and phase transition temperature. Conversely, the analysis of a physical model helps us to obtain important insights about the meaning of RBM parameters and functions, such as the weight matrix, visible energy, and pseudo-likelihood. Below, we first introduce our Ising datasets and the RBM and its training protocols in Section 2. We then report and discuss the results of the model parameters, hidden layers, visible energy, and pseudo-likelihood in Section 3. After the conclusion, more details about the Ising model and the RBM are provided in the Appendices A-C. Sample codes of the RBM are shared on GitHub at https://github.com/Jing-DS/isingrbm (accessed on 18 November 2022).

Dataset of Ising Configurations Generated by Monte Carlo Simulations
The Hamiltonian of the Ising model with N = L^d spins in a configuration s = [s_1, s_2, · · · , s_N]^T on a d-dimensional hypercubic lattice of linear dimension L in the absence of a magnetic field is

H(s) = −J ∑_⟨i,j⟩ s_i s_j,

where the spin variable s_i = ±1 (i = 1, 2, · · · , N), the coupling parameter J > 0 (set to unity) favors ferromagnetic configurations (parallel spins), and the notation ⟨i, j⟩ means to sum over nearest neighbors [48]. At a given temperature T, a configuration s drawn from the sample space of 2^N states follows the Boltzmann distribution

p_T(s) = e^{−H(s)/(k_B T)} / Z_T,

where Z_T = ∑_s e^{−H(s)/(k_B T)} is the partition function. After being fully equilibrated, M = 50,000 configurations at each T are collected into a dataset D_T for that T. For 2d systems, we also use a dataset D_∪T consisting of 50,000 configurations per temperature from all Ts.
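The Monte Carlo sampling used to generate such datasets can be sketched as follows. This is a minimal single-spin-flip Metropolis illustration in Python for the 2d case (function names are ours, with J = k_B = 1), not the production simulation code:

```python
import numpy as np

def metropolis_sweep(s, T, rng, J=1.0):
    """One Monte Carlo sweep (N proposed single-spin flips) for the 2d Ising
    model with periodic boundaries; s is an L x L array of +/-1 spins."""
    L = s.shape[0]
    for _ in range(L * L):
        i, j = rng.integers(L, size=2)
        # Sum over the four nearest neighbors (periodic boundary conditions).
        nn = (s[(i + 1) % L, j] + s[(i - 1) % L, j]
              + s[i, (j + 1) % L] + s[i, (j - 1) % L])
        dE = 2.0 * J * s[i, j] * nn  # energy change if spin (i, j) is flipped
        if dE <= 0 or rng.random() < np.exp(-dE / T):  # Metropolis rule, k_B = 1
            s[i, j] = -s[i, j]
    return s

rng = np.random.default_rng(0)
s = rng.choice([-1.0, 1.0], size=(16, 16))
for _ in range(200):  # equilibrate below T_c ~ 2.269
    metropolis_sweep(s, T=1.5, rng=rng)
```

Near T_c, cluster algorithms (e.g., Wolff) are commonly preferred over single-spin flips to combat critical slowing down.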
Analytical results of the thermal quantities of the 2d Ising model, such as internal energy E , (physical) entropy S, heat capacity C V , and magnetization m , are well known [50][51][52][53]. Numerical simulation methods and results of the 3d Ising model have also been reported [54]. The thermodynamic definitions and relations used in this work are summarized in Appendix A.

Restricted Boltzmann Machine
The restricted Boltzmann machine (RBM) is a two-layer energy-based model with n_h hidden units (or neurons) h_i = ±1 (i = 1, 2, · · · , n_h) in the hidden layer, whose state vector is h = [h_1, h_2, · · · , h_{n_h}]^T, and n_v visible units v_j = ±1 (j = 1, 2, · · · , n_v) in the visible layer, whose state vector is v = [v_1, v_2, · · · , v_{n_v}]^T (Figure 1) [55]. In this work, the visible layer is just the Ising configuration vector, i.e., v = s, with n_v = N. We chose the binary unit {−1, +1} (instead of {0, 1}) to better align with the definition of the Ising spin variable s_i. The total energy E_θ(v, h) of the RBM is defined as

E_θ(v, h) = −b^T v − c^T h − h^T W v,

where b = [b_1, b_2, · · · , b_{n_v}]^T is the visible bias, c = [c_1, c_2, · · · , c_{n_h}]^T is the hidden bias, and W (of dimension n_h × n_v) is the interaction weight matrix between visible and hidden units. Under this notation, each row vector w_i^T (of dimension n_v) is a filter mapping from the visible state v to a hidden unit i, and each column vector w_{:,j} (of dimension n_h) is an inverse filter mapping from the hidden state h to a visible unit j. All parameters are collectively written as θ = {W, b, c}. "Restricted" refers to the lack of interaction between hidden units or between visible units.
The joint distribution for an overall state (v, h) is

p_θ(v, h) = e^{−E_θ(v,h)} / Z_θ,

where the partition function of the RBM is Z_θ = ∑_{v,h} e^{−E_θ(v,h)}. The learned model distribution for a visible state v follows from the marginalization of p_θ(v, h):

p_θ(v) = ∑_h p_θ(v, h) = e^{−E_θ(v)} / Z_θ,

where the visible energy (an effective energy for visible state v, often termed "free energy" in the machine learning literature) is

E_θ(v) = −b^T v − ∑_{i=1}^{n_h} ln[2 cosh(c_i + w_i^T v)].

The conditional distributions to generate h from v, p_θ(h|v), and to generate v from h, p_θ(v|h), can be written as products:

p_θ(h|v) = ∏_{i=1}^{n_h} p_θ(h_i|v),  p_θ(v|h) = ∏_{j=1}^{n_v} p_θ(v_j|h),

because the h_i are independent of each other (at fixed v) and the v_j are independent of each other (at fixed h). It can be shown that

p_θ(h_i = +1|v) = σ(2(c_i + w_i^T v)),  p_θ(v_j = +1|h) = σ(2(b_j + w_{:,j}^T h)),

where the sigmoid function σ(z) = 1/(1 + e^{−z}) (Appendix B).
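These relations can be checked numerically. The following minimal sketch (helper names are ours) implements the ±1-unit conditionals and the visible energy; the marginalization e^{−E_θ(v)} = ∑_h e^{−E_θ(v,h)} can then be verified by brute force for a tiny model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_hidden_given_visible(W, c, v):
    """p(h_i = +1 | v) = sigma(2 * (c_i + w_i^T v)) for {-1, +1} units."""
    return sigmoid(2.0 * (c + W @ v))

def p_visible_given_hidden(W, b, h):
    """p(v_j = +1 | h) = sigma(2 * (b_j + w_:,j^T h)) for {-1, +1} units."""
    return sigmoid(2.0 * (b + W.T @ h))

def visible_energy(W, b, c, v):
    """E_theta(v) = -b^T v - sum_i ln[2 cosh(c_i + w_i^T v)]."""
    return -b @ v - np.sum(np.log(2.0 * np.cosh(c + W @ v)))
```

Note the factor of 2 inside the sigmoid, which is specific to {−1, +1} units; for {0, 1} units it is absent.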

Loss Function and Training of RBMs
Given a dataset D = [v_1, v_2, · · · , v_M]^T of M samples drawn from a data distribution p_D(v) (v ∼ p_D(v)), the goal of RBM learning is to find a model distribution p_θ(v) that approximates p_D(v). In the context of this work, the data samples v are Ising configurations, and the data distribution p_D(v) is or is related to the Ising-Boltzmann distribution p_T(s).
Based on maximum likelihood estimation, the optimal parameters θ* = arg min_θ L(θ) can be found by minimizing the negative log likelihood

L(θ) = −(1/M) ∑_{m=1}^M ln p_θ(v_m) = (1/M) ∑_{m=1}^M E_θ(v_m) + ln Z_θ,

which serves as the loss function of RBM learning. Note that the partition function Z_θ only depends on the model, not on the data. Since the calculation of Z_θ involves a summation over all possible (v, h) states, which is not feasible, L(θ) cannot be evaluated exactly, except for very small systems [56]. Special treatments have to be devised, for example by mean-field theory [57] or by importance sampling methods [58]. An interesting feature of the RBM is that, although the actual loss function L(θ) is not accessible, its gradient

∇_θ L(θ) = ⟨∇_θ E_θ(v)⟩_{v∼p_D} − ⟨∇_θ E_θ(v)⟩_{v∼p_θ}

can be sampled, which enables a gradient descent learning algorithm. From step t to step t + 1, the model parameters are updated with learning rate η as

θ^{(t+1)} = θ^{(t)} − η ∇_θ L(θ^{(t)}).

To evaluate the loss function, we used its approximate form, the pseudo-(negative log) likelihood [59]

L̃(θ) = −(1/M) ∑_{m=1}^M ∑_{i=1}^{n_v} ln p_θ(v_i^{(m)} | v_{−i}^{(m)}),

where the notation p_θ(v_i | v_{−i}) denotes the conditional probability for component v_i given that all the other components v_j (j ≠ i) are fixed [37]. Practically, to avoid the time-consuming sum over all visible units, it is suggested to randomly sample one i_0 ∈ {1, 2, · · · , n_v} and estimate

L̃(θ) ≈ −(n_v/M) ∑_{m=1}^M ln p_θ(v_{i_0}^{(m)} | v_{−i_0}^{(m)}),

if all the visible units are on average translation-invariant [60]. To monitor the reconstruction error, we also calculated the cross-entropy CE between the initial configuration v and the reconstructed configuration. For both 2d and 3d Ising systems, we first trained single-temperature RBMs (T-RBM): the M = 50,000 Ising configurations at each T forming a dataset D_T are used to train one model, so that there are n_T T-RBMs in total. While n_v = N, we tried various numbers of hidden units, with n_h = 400, 900, 1600, 2500 in 2d and n_h = 400, 900, 1600 in 3d. For 2d systems, we also trained an all-temperature RBM (∪T-RBM), for which 50,000 Ising configurations per temperature are drawn to compose a dataset D_∪T of M = 50,000 n_T = 8 × 10^5 samples.
The number of hidden units for this ∪T-RBM is n_h = 400, 900, or 1600. The weight matrix W is initialized with Glorot normal initialization [61] (b and c are initialized to zero). Parameters are optimized with the stochastic gradient descent algorithm with learning rate η = 1.0 × 10^−4 and batch size 128. The negative phase (model term) of the gradient, ⟨∇_θ E_θ(v)⟩_{v∼p_θ}, is calculated using CD-k Gibbs sampling with k = 5. We stopped training once L̃ and CE converged, typically at 100-2000 epochs (see the Supplementary Materials). Three Nvidia GPU cards (GeForce RTX 3090 and 2070) were used to train the models, which took about two minutes per epoch for an M = 50,000 dataset.
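The gradient update with CD-k sampling described above can be sketched as follows. This is an illustrative minibatch update for ±1 units (function names are ours), not the actual training code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_pm1(p, rng):
    """Sample {-1, +1} units given p = P(unit = +1)."""
    return np.where(rng.random(p.shape) < p, 1.0, -1.0)

def cd_k_update(W, b, c, v_data, k, eta, rng):
    """One CD-k stochastic gradient step on a minibatch v_data (M x n_v)
    of {-1, +1} configurations.  Uses mean hidden activations
    <h_i> = tanh(c_i + w_i^T v) = 2*sigmoid(2(c_i + w_i^T v)) - 1."""
    # Positive phase: hidden activations driven by the data.
    ph_data = sigmoid(2.0 * (v_data @ W.T + c))
    # Negative phase: k steps of Gibbs sampling starting from the data.
    v = v_data.copy()
    for _ in range(k):
        h = sample_pm1(sigmoid(2.0 * (v @ W.T + c)), rng)
        v = sample_pm1(sigmoid(2.0 * (h @ W + b)), rng)
    ph_model = sigmoid(2.0 * (v @ W.T + c))
    M = v_data.shape[0]
    # theta <- theta - eta * (<grad E>_data - <grad E>_model)
    W += eta * ((2 * ph_data - 1).T @ v_data - (2 * ph_model - 1).T @ v) / M
    b += eta * (v_data.mean(axis=0) - v.mean(axis=0))
    c += eta * ((2 * ph_data - 1).mean(axis=0) - (2 * ph_model - 1).mean(axis=0))
    return W, b, c
```

A full training run would loop this update over minibatches and epochs while monitoring L̃ and CE.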

Results and Discussion
In this section, we investigate how the RBM uses its weight matrix W and hidden layer h to encode the Boltzmann distributed states of the Ising model and what physical information can be extracted from machine learning concepts such as the visible energy and loss function.

Filters and Inverse Filters
It can be verified that the trained weight matrix elements W_ij of a T-RBM follow a Gaussian distribution of zero mean, with the largest variance at T ∼ T_c (Figure 2a) [62]. The low temperature distribution here is different from the uniform distribution observed in [35], which results from the uniform initialization scheme used there. This suggests that the training of RBMs could converge to different minima when initialized differently. According to Equation (10), the biases c_i and b_j can be associated with the activation thresholds of a hidden unit and a visible unit, respectively. For example, whether a hidden unit is activated (h_i = +1) or anti-activated (h_i = −1) depends on whether the incoming signal w_i^T v exceeds the threshold −c_i. The distributions of the trained biases are close to Gaussian with near-zero mean (Figure 2b,c). A non-zero mean can be caused by an unbalanced dataset with an unequal number of m > 0 and m < 0 Ising configurations. The corresponding filter or inverse filter sum may also be distributed with a non-zero mean in order to compensate for the asymmetric bias, as will be shown next. Since v = s is an Ising configuration with ±1 units in our problem, w_i^T v will be more positive (or negative) if the components of w_i^T better match (or anti-match) the signs of the spin variables. In this sense, we can think of w_i^T as a filter extracting certain patterns in Ising configurations. Knowing the representative spin configurations of the Ising model below, close to, and above the critical temperature T_c, we expect that the filters w_i^T (i = 1, 2, · · · , n_h), wrapped into an L^d arrangement, exhibit similar features. Representative filters at different temperatures are shown in Figure 3a, and the variance of the filter sum sum(w_i^T) = ∑_j W_ij is largest near T_c (Figure 3b). This suggests that the peak of the variance as a function of temperature coincides with the Ising phase transition (inset of Figure 3b). More detailed results for the 2d and 3d Ising models are in the Supplementary Materials.
Figure 3. (b) Variance of the filter sum as a function of temperature. (c) PDF of the distribution of the n_v = 4096 inverse filter sums (normalized by n_h). Inset: correlation between a pair of inverse filters w_{:,j} and w_{:,j′} (normalized by the auto-correlation) as a function of spin-spin distance r_{jj′}.
When a hidden layer h is provided, the RBM reconstructs the visible layer v by applying the n_v inverse filters w_{:,j} (j = 1, 2, · · · , n_v) on h. The distribution of the inverse filter sum sum(w_{:,j}) = ∑_{i=1}^{n_h} W_ij is Gaussian with a mean close to zero (Figure 3c), where a large deviation from zero mean is accompanied by a non-zero average bias ∑_j b_j/n_v, as mentioned above (Figure 2b). We find that this is a result of the unbalanced dataset, which has ∼60% m < 0 Ising configurations. Because the activation probability of a visible unit v_j is determined by w_{:,j}, the correlation between visible units (Ising spins) is reflected in the correlation between inverse filters. This is equivalent to the analysis of the n_v × n_v matrix W^T W or its eigenvectors as in [38,42], whose entries are the inner products w_{:,j}^T w_{:,j′} of inverse filters. We can therefore locate the Ising phase transition by identifying the temperature with the strongest correlation among the w_{:,j}s, e.g., the peak of w_{:,j}^T w_{:,j′} at a given distance r_{jj′} (inset of Figure 3c). See the Supplementary Materials for results in 2d and 3d.
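The inverse-filter correlation analysis can be sketched as follows (an illustrative Python snippet with our own function names; it averages the normalized inner product w_{:,j}^T w_{:,j′} over site pairs at fixed separation r on an L × L visible layer):

```python
import numpy as np

def inverse_filter_correlation(W, L, r):
    """Average normalized correlation between inverse filters (columns of W,
    shape n_h x n_v with n_v = L*L) for site pairs separated by r lattice
    sites along the x direction, with periodic boundaries."""
    G = W.T @ W                               # n_v x n_v inner products
    norm = np.sqrt(np.outer(np.diag(G), np.diag(G)))
    C = G / norm                              # normalize by auto-correlation
    idx = np.arange(L * L).reshape(L, L)
    shifted = np.roll(idx, -r, axis=1)        # periodic shift by r along x
    return C[idx.ravel(), shifted.ravel()].mean()
```

Scanning this quantity over the T-RBMs trained at different temperatures, its peak at fixed r would locate the transition, as described above.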
In contrast, the filters of the ∪T-RBM trained from 2d Ising configurations at all temperatures have background patterns like the high temperature T-RBM (in the paramagnetic phase). A clear difference is that most ∪T-RBM filters have one large domain of positive or negative elements (Figure 4a), similar to the receptive field in a deep neural network [29]. This domain randomly covers an area of the visual field of the L × L Ising configuration (see the Supplementary Materials for all the n h filters). The existence of such domains in the filter causes the filter sum and the corresponding bias c i to be positive or negative with a bimodal distribution (Figure 4b,c). The inverse filter sum and its corresponding bias b j still have a Gaussian distribution, although the unbalanced dataset shifts the mean of b j away from zero.

Hidden Layer
Whether a hidden unit uses +1 or −1 to encode a pattern of the visible layer v is randomly assigned during training. In the former case, the filter w_i^T matches the pattern (w_i^T v is positive); in the latter case, the filter anti-matches the pattern (w_i^T v is negative). For a visible layer v of magnetization m, the sign of w_i^T v and of the encoding h_i is largely determined by the sign of sum(w_i^T) (Table 1). Since the distribution of sum(w_i^T) is symmetric about zero, the hidden layer of a T-RBM roughly consists of an equal number of +1 and −1 units: the "magnetization" m_h = (1/n_h) ∑_{i=1}^{n_h} h_i of the hidden layer is always close to zero, and its average ⟨m_h⟩ ≈ 0. The histogram of m_h for all hidden encodings of visible states is expected to be symmetric about zero (Figure 5). We found that, for the smallest n_h, the histogram of m_h at temperatures close to T_c is bimodal due to the relatively large randomness of small hidden layers. As more hidden units are added, the two peaks merge into one and the distribution of m_h becomes narrower. This suggests that a larger hidden layer tends to have a smaller deviation from m_h = 0.

Table 1. When sum(w_i^T) > 0, a visible layer pattern v with magnetization m > 0 (or m < 0) is more likely to be encoded by a hidden unit h_i = +1 (or h_i = −1). When sum(w_i^T) < 0, the encoding is opposite.
The order of the h_i = ±1 sequence in each hidden encoding h is arbitrary, but relatively fixed once the T-RBM is trained. The permutation of hidden units together with their corresponding filters (swapping the rows of the matrix W) results in an equivalent T-RBM. Examples of hidden layers of T-RBMs with n_h = 400 at different temperatures are shown in the inset of Figure 5, where the vector h is wrapped into a 20 × 20 arrangement. Note that there are actually no spatial relationships between different hidden units, and any apparent pattern in this 2d illustration is an artifact of the wrapping protocol.

To generate visible states, the RBM runs a Markov chain h^(0) → v^(1) → h^(1) → · · · until the steady state is achieved [31]. Based on the above-mentioned observations, we can design an algorithm to initialize h^(0) that better captures the hidden encoding of visible states (equilibrium Ising configurations), thus enabling faster convergence of the Markov chain. After choosing a low temperature T_L and a high temperature T_H, we generate the hidden layer as follows:
• At low T ≤ T_L < T_c, assign h_i = +1 if sum(w_i^T) > 0 and h_i = −1 if sum(w_i^T) < 0. This will be an encoding of an m > 0 ferromagnetic configuration. To encode an m < 0 ferromagnetic configuration, just flip the sign of h_i.
• At high T ≥ T_H > T_c, randomly assign h_i = +1 or −1 with equal probability. This will be an encoding of a paramagnetic configuration with m ≈ 0.
• At intermediate T_L < T < T_H, to encode an m > 0 Ising configuration, if sum(w_i^T) > 0, assign h_i = +1 with probability p_h ∈ (0.5, 1.0) and h_i = −1 with probability 1 − p_h; if sum(w_i^T) < 0, assign h_i = −1 with probability p_h and h_i = +1 with probability 1 − p_h.
Here, p_h is a predetermined parameter, and the first two rules are just the special cases with p_h = 1.0 (T ≤ T_L) and p_h = 0.5 (T ≥ T_H), respectively. In practice, one may approximately use p_h = (⟨|m|⟩ + 1)/2 or use linear interpolation within T_L < T < T_H, p_h = 0.5 + 0.5(T_H − T)/(T_H − T_L).
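The initialization rules above can be implemented in a few lines. This is an illustrative sketch (function names are ours; sign_m selects the sign of the magnetization to be encoded):

```python
import numpy as np

def init_hidden(W, T, T_L, T_H, rng, sign_m=+1):
    """Initialize a hidden layer h^(0) encoding an Ising configuration with
    magnetization of sign sign_m at temperature T.  Stepwise rule:
    p_h = 1.0 for T <= T_L, 0.5 for T >= T_H, linear in between."""
    filter_sum = W.sum(axis=1)          # sum(w_i^T) for each hidden unit
    if T <= T_L:
        p_h = 1.0
    elif T >= T_H:
        p_h = 0.5
    else:
        p_h = 0.5 + 0.5 * (T_H - T) / (T_H - T_L)
    # A unit whose filter sum matches the magnetization sign is set to +1
    # with probability p_h, and to -1 otherwise.
    match = np.sign(filter_sum) * sign_m
    probs = np.where(match > 0, p_h, 1.0 - p_h)
    return np.where(rng.random(filter_sum.shape) < probs, 1.0, -1.0)
```

At T ≤ T_L the assignment is deterministic (the ±1 pattern follows the filter sums exactly), while at T ≥ T_H it reduces to an unbiased coin flip per unit.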
Below, we compare the (one-step) reconstructed thermal quantities using two different initial hidden encodings with results from a conventional multi-step Markov chain (Figure 6). The hidden encoding methods proposed here are quite reliable at low and high T, but less accurate at T close to T_c.

Figure 6. One-step reconstructed thermal quantities using the proposed hidden encodings (stepwise p_h). Reconstruction by a seven-step Markov chain from a random h^(0) is compared (v^(7)). Analytical and Monte Carlo simulation results are also shown.

Visible Energy
When a T-RBM for temperature T is trained, we expect that p_θ(v) ≈ p_D(v) ≈ p_T(s), the Boltzmann distribution at that T. Although formally related to the physical energy in the Boltzmann factor (with the temperature absorbed), the visible energy E_θ(v) of an RBM should really be considered as the negative log (relative) probability of a visible state v. For single-temperature T-RBMs, the mean visible energy ⟨E_θ(v)⟩ increases monotonically with temperature (except for the largest n_h, which might be due to overfitting) (Figure 7a,b). The value of ⟨E_θ(v)⟩ and its trend, however, cannot be used to identify the physical phase transition. In fact, E_θ(v) can differ from the reduced Hamiltonian H(s)/(k_B T) by an arbitrary (temperature-dependent) constant while still maintaining the Boltzmann distribution p_θ(v) ≈ p_T(s) (if the partition function Z_θ is calibrated accordingly).
The trend of ⟨E_θ(v)⟩ for T-RBMs can be understood by considering the following approximate forms. First, due to the symmetry of +1 and −1, the biases b_j and c_i are all close to zero. A constrained T-RBM with zero bias has a visible energy

E_W(v) = −∑_{i=1}^{n_h} ln(e^{w_i^T v} + e^{−w_i^T v}) = −∑_{i=1}^{n_h} ln[2 cosh(w_i^T v)]  (17)

that approximates the visible energy of the full T-RBM, i.e., E_θ(v) ≈ E_W(v). Next, unless w_i^T v is close to zero, one of the two exponential terms in Equation (17) always dominates, so that

Ẽ_W(v) = −∑_{i=1}^{n_h} |w_i^T v|.  (18)

Equation (18) can further be approximated by setting v = 1 with all v_j = +1, i.e.,

Ẽ_W(1) = −∑_{i=1}^{n_h} |sum(w_i^T)|.

In summary, E_W(v), Ẽ_W(v), and Ẽ_W(1) are all good approximations to the original E_θ(v) (Figure 7a). The increase of the mean ⟨E_θ(v)⟩ with temperature coincides with the increase of −∑_i |sum(w_i^T)| with temperature, which is evident from Figure 3b. At fixed temperature, the decrease of E_θ(v) with n_h is a consequence of the sum over hidden units in the definition of the visible energy.
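The chain of approximations E_θ(v) ≈ E_W(v) ≈ −∑_i |w_i^T v| can be checked numerically with a minimal sketch (function names are ours):

```python
import numpy as np

def E_theta(W, b, c, v):
    """Full visible energy of the RBM with {-1, +1} units."""
    return -b @ v - np.sum(np.log(2.0 * np.cosh(c + W @ v)))

def E_W(W, v):
    """Zero-bias approximation: -sum_i ln[2 cosh(w_i^T v)]."""
    return -np.sum(np.log(2.0 * np.cosh(W @ v)))

def E_W_abs(W, v):
    """Dominant-exponential approximation: -sum_i |w_i^T v|.
    ln(2 cosh x) = |x| + ln(1 + exp(-2|x|)), so the error per hidden
    unit vanishes exponentially when |w_i^T v| >> 1."""
    return -np.sum(np.abs(W @ v))
```

With zero biases, E_theta and E_W coincide exactly; the absolute-value form becomes accurate as soon as the filter projections are a few units away from zero.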
The variance ⟨E_θ²⟩ − ⟨E_θ⟩² is a useful quantity for phase transition detection, because it reflects the fluctuation of the probability p_θ(v). In both the low T ferromagnetic and high T paramagnetic regimes, p_θ(v) is relatively homogeneous among different states. When T is close to T_c, the variance of p_θ(v) and E_θ(v) is expected to peak (Figure 7d,e). The abnormal rounded (and even shifted) peaks at large n_h could be a sign of overfitting. For the all-temperature ∪T-RBM, the Ising phase transition can be revealed by either the sharp increase of the mean ⟨E_θ(v)⟩ or the peak of the variance ⟨E_θ²⟩ − ⟨E_θ⟩² (Figure 7c,f). However, this apparent detection can be a trivial consequence of the special composition of the dataset D_∪T, which contains Ising configurations at different temperatures in equal proportion. Only configurations at a specific T are fed into the model to calculate the average quantity at that T. Technically, a visible state v in D_∪T is not subject to the Boltzmann distribution at any specific temperature. Instead, the true ensemble of D_∪T is a collection of n_T different Boltzmann-distributed subsets. Many replicas of the same or similar ferromagnetic states are in D_∪T, giving rise to a large multiplicity, a high probability, and a low visible energy for such states. In comparison, high temperature paramagnetic states are all different from each other and, therefore, each has a low p_θ(v) (high E_θ(v)). Knowing this caveat, one should be cautious when monitoring the visible energy of a ∪T-RBM to detect the phase transition, because changing the proportion of Ising configurations at different temperatures in D_∪T can modify the relative probability of each state.

Pseudo-Likelihood and Entropy Estimation
The likelihood L(θ) defined in Equation (11) is conceptually equivalent to the physical entropy S defined by the Gibbs entropy formula, apart from the Boltzmann constant k_B difference (Appendix A). However, just as the entropy S cannot be directly sampled, the exact value of L(θ) is not accessible. In order to estimate S, we calculated the pseudo-likelihood L̃(θ) instead, which is based on the mean-field-like approximation p_θ(v) ≈ ∏_i p_θ(v_i | v_{−i}). Similar ideas to estimate the free energy or entropy were put forward with the aid of variational autoregressive networks [63] or neural importance sampling [64]. The true and estimated entropy of the 2d and 3d Ising models using T-RBMs with different n_h are shown in Figure 8a,b. As a comparison, we also considered a "pseudo-entropy" with a similar approximation:

S̃ = −k_B ⟨∑_{i=1}^N ln p_T(s_i | s_{−i})⟩_{s∼p_T},

where the conditional probability

p_T(s_i | s_{−i}) = e^{J s_i ∑_{j∈nn(i)} s_j/(k_B T)} / [2 cosh(J ∑_{j∈nn(i)} s_j/(k_B T))],

with nn(i) the nearest neighbors of spin i, and the ensemble average ⟨· · ·⟩_{s∼p_T} is taken over states obtained from Monte Carlo sampling. In both 2d and 3d, S̃ is lower than the true S, especially at high T, because a mean-field treatment tends to underestimate fluctuations. While increasing model complexity by adding hidden units is usually believed to reduce the reconstruction error, e.g., of energy and heat capacity [35,36] (see also the Supplementary Materials), a recent study suggested that a trade-off could exist between the accuracy of different statistical quantities [65]. Here, we found that the pseudo-likelihood of T-RBMs with the fewest hidden units in our trials (n_h = 400) appears to provide the best prediction for entropy. Increasing n_h leads to larger deviations from the true S at higher T. The decrease of L̃ with n_h at fixed temperature agrees with the trend of the visible energy. A lower E_θ(v) corresponds to a higher p_θ(v) and, thus, a lower L̃ according to its definition.
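The pseudo-entropy defined above can be evaluated directly from sampled configurations; a minimal 2d sketch (function name is ours, with k_B = 1) reads:

```python
import numpy as np

def ising_pseudo_entropy(configs, T, J=1.0):
    """Pseudo-entropy estimate (k_B = 1) from 2d Ising configurations
    (shape M x L x L, +/-1 entries, periodic boundaries):
    S_pseudo = -< sum_i ln p_T(s_i | s_-i) >, with the heat-bath conditional
    p_T(s_i | s_-i) = exp(beta*J*s_i*nn_i) / [2 cosh(beta*J*nn_i)]."""
    beta = 1.0 / T
    nn = (np.roll(configs, 1, axis=1) + np.roll(configs, -1, axis=1)
          + np.roll(configs, 1, axis=2) + np.roll(configs, -1, axis=2))
    log_p = beta * J * configs * nn - np.log(2.0 * np.cosh(beta * J * nn))
    return -log_p.sum(axis=(1, 2)).mean()
```

As sanity limits, the estimate approaches N ln 2 for random configurations at very high T and 0 for a fully ordered configuration at very low T, consistent with the exact entropy in both limits.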
The surprisingly good performance of L̃ in approximating S could be due to the fact that the visible units v_i in RBMs are only indirectly correlated through the hidden units, which collectively serve as an effective mean field on each visible unit. We also calculated L̃(θ) with the all-temperature ∪T-RBM in 2d (Figure 8c). Compared with single-temperature T-RBMs of the same n_h (Figure 8a), the ∪T-RBM predicts a higher L̃(θ), with considerable deviations even at low T. The trend of L̃(θ) also agrees with that of ⟨E_θ(v)⟩ (Figure 7c).
The knowledge of the entropy allows us to estimate the phase transition point according to the thermodynamic relation C_V = T dS/dT. We constructed this estimated C_V as a function of temperature using L̃(θ) and its numerical fit, whose peaks are expected to be located at T_c (Figure 9). The predicted T_c values are compared with the results from the entropy and pseudo-entropy, as well as the Monte Carlo simulation results for our finite systems and the known exact values for infinite systems, in Table 2. It can be seen that single-temperature T-RBMs capture the transition point fairly well, within an error of about 1-3%.
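Extracting T_c from an entropy-like curve via C_V = T dS/dT amounts to numerical differentiation followed by peak finding; a minimal sketch (function name is ours) is:

```python
import numpy as np

def tc_from_entropy(Ts, S):
    """Estimate T_c as the peak of C_V = T * dS/dT, given an entropy-like
    curve S sampled on an increasing temperature grid Ts (both 1d arrays)."""
    cv = Ts * np.gradient(S, Ts)   # centered finite differences
    return Ts[np.argmax(cv)]
```

In practice, fitting or smoothing the sampled L̃(T) before differentiating reduces the noise amplified by the finite-difference derivative.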

Conclusions
In this work, we trained RBMs using equilibrium Ising configurations in 2d and 3d collected from Monte Carlo simulations at various temperatures. For single-temperature T-RBMs, the filters (row vectors) and the inverse filters (column vectors) of the weight matrix exhibit different characteristic patterns and correlations, respectively, below, around, and above the phase transition. These metrics, such as filter sum fluctuation and inverse filter correlation, can be used to locate the phase transition point. The hidden layer h on average contains an equal number of +1 and −1 units, whose variance decreases as more hidden units are added. The sign of a particular hidden unit h i is determined by the signs of the filter sum sum(w T i ) and the magnetization m of the visible pattern. However, there is no spatial pattern in the sequence of positive and negative units in a hidden encoding.
The visible energy reflects the relative probability of visible states in the Boltzmann distribution. Although the mean of the visible energy is not directly related to the (physical) internal energy and does not reveal a clear transition, its fluctuation, which peaks at the critical point, can be used to identify the phase transition. The value and trend of the visible energy can be understood from its several approximate forms, in particular the sum of the absolute values of the filter sums. The pseudo-likelihood of RBMs is conceptually related to, and can be used to estimate, the physical entropy. Numerical differentiation of the pseudo-likelihood yields an estimate of the heat capacity and, thus, another estimator of the transition temperature. All these predictions of the critical temperature were made by unsupervised RBM learning, for which human labeling of phase types is not needed.
As a comparison, we also trained an all-temperature ∪T-RBM, whose dataset is a mixture of Boltzmann-distributed states over a range of temperatures. Each filter of this ∪T-RBM features one large domain in its receptive field. Although the visible energy and pseudo-likelihood of the ∪T-RBM show a certain signature of the phase transition, one should be cautious, as this detection could be an artifact of the composition of the dataset. Changing the proportions of Ising configurations at different temperatures could bias the probability and the transition learned by the ∪T-RBM.
By extracting the underlying (Boltzmann) distribution of the input data, RBMs capture the rapid (phase) transition of such a distribution as the tuning parameter (temperature) is changed, without knowledge of the physical Hamiltonian. Information about the distribution is completely embedded in the configurations and their frequencies in the dataset. It would be interesting to see if such a general scheme of RBM learning can be extended to study other physical models of phase transition.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

RBM	restricted Boltzmann machine
T-RBM	single-temperature restricted Boltzmann machine
∪T-RBM	all-temperature restricted Boltzmann machine

Appendix A. Statistical Thermodynamics of Ising Model
In this Appendix, we review the statistical thermodynamics of the Ising model covered in this work. The internal energy at a given temperature is

⟨E⟩ = ⟨H(s)⟩,

where ⟨· · ·⟩ means to take the thermal average over equilibrated configurations. The heat capacity is

C_V = ∂⟨E⟩/∂T = k_B β² (⟨H²⟩ − ⟨H⟩²),

where β = 1/(k_B T), and the heat capacity per spin (or specific heat) is c_V = C_V/N. The magnetization per spin is

m = (1/N) ∑_{i=1}^N s_i.

In small finite systems, because flips from m to −m configurations are common, we need to take the absolute value |m| before the thermal average, i.e., ⟨|m|⟩. The physical entropy can be defined using the Gibbs entropy formula

S = −k_B ∑_s p_T(s) ln p_T(s).

For the 2d Ising model, the critical temperature solved from sinh(2J/(k_B T_c)) = 1 is T_c = 2J/[k_B ln(1 + √2)] ≈ 2.269 J/k_B. Among the analytical results of the 2d Ising model, the spontaneous magnetization per spin for T < T_c is [52]

⟨m⟩ = [1 − sinh^{−4}(2J/(k_B T))]^{1/8},

while the closed-form expressions for the internal energy per spin [50], the specific heat [53], and the partition function per spin (or free energy per spin f = F/N) [51] involve complete elliptic integrals and are given in the references. The equation for the entropy can be obtained from the thermodynamic relation F = E − TS.
For the 3d Ising model, ⟨m⟩, ⟨E⟩, and c_V can be calculated directly from Monte Carlo sampling [54]. The numerical prediction for the critical temperature is T_c ≈ 4.511 J/k_B [66].
Special techniques are needed to compute the free energy or entropy. We used thermodynamic integration in the high temperature regime,

S(T) = N k_B ln 2 − ∫_T^∞ [C_V(T′)/T′] dT′,

and

S(T) = ∫_0^T [C_V(T′)/T′] dT′

in the low temperature regime, since S(T → 0) = 0 and C_V(T → 0) → 0 for the Ising model.
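The low-temperature branch of this thermodynamic integration can be sketched as follows (function name is ours; trapezoidal rule, assuming C_V/T′ → 0 as T′ → 0; the high-temperature branch subtracts the analogous integral from N k_B ln 2):

```python
import numpy as np

def entropy_low_T(Ts, Cv):
    """S(T) = int_0^T C_V(T')/T' dT' on an increasing grid Ts starting at or
    near 0, with S(0) = 0; trapezoidal rule, k_B = 1."""
    # Guard the T' = 0 endpoint, where C_V/T' -> 0 for the Ising model.
    integrand = np.where(Ts > 0, Cv / np.maximum(Ts, 1e-12), 0.0)
    return np.concatenate(([0.0], np.cumsum(
        0.5 * (integrand[1:] + integrand[:-1]) * np.diff(Ts))))
```

The returned array gives S at every grid temperature, so a single pass over the C_V data yields the full entropy curve.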

Appendix B. Energy and Probability of RBMs
In this Appendix, we review the derivations of the energy and probability of RBMs, which can be found in the standard machine learning literature [67]. The visible energy E_θ(v) follows from marginalizing over h:

e^{−E_θ(v)} = ∑_h e^{−E_θ(v,h)} = e^{b^T v} ∏_{i=1}^{n_h} [e^{c_i + w_i^T v} + e^{−(c_i + w_i^T v)}],

so that E_θ(v) = −b^T v − ∑_{i=1}^{n_h} ln[2 cosh(c_i + w_i^T v)]. The conditional probability is

p_θ(h|v) = p_θ(v, h)/p_θ(v) = e^{c^T h + h^T W v} / Ω_θ(v),

where the h-independent constant Ω_θ(v) = e^{−b^T v − E_θ(v)} = ∑_h e^{c^T h + h^T W v}, such that ∑_h p_θ(h|v) = 1, from which it can be recognized that p_θ(h_i|v) ∝ e^{h_i(c_i + w_i^T v)}. The single-unit conditional probability is therefore

p_θ(h_i = +1|v) = e^{c_i + w_i^T v} / [2 cosh(c_i + w_i^T v)] = σ(2(c_i + w_i^T v)).

Other relations for p_θ(h_i = −1|v), p_θ(v_j = +1|h), and p_θ(v_j = −1|h) can be found similarly.
The gradient of the visible energy ∇_θ E_θ(v) has components

∂E_θ(v)/∂b_j = −v_j,  ∂E_θ(v)/∂c_i = −tanh(c_i + w_i^T v),  ∂E_θ(v)/∂W_ij = −tanh(c_i + w_i^T v) v_j.

To evaluate the expectation value ⟨∇_θ E_θ(v)⟩, in the positive phase, v can be directly drawn from the dataset, while in the negative phase, v must be sampled from the model distribution p_θ(v). In practice, as an approximation, the Markov chain Monte Carlo (MCMC) method is used to generate v states that obey the distribution p_θ(v). Using the conditional probabilities p_θ(h|v) and p_θ(v|h), we can generate a sequence of states

v^(0) → h^(0) → v^(1) → h^(1) → · · · → v^(t) → h^(t) → · · · .

As t → ∞, the MCMC converges with (v^(t), h^(t)) ∼ p_θ(v, h) and v^(t) ∼ p_θ(v). The Markov chain starting from a random v^(0) takes many steps to equilibrate. There are two ways to speed up the sampling [68]:
• k-step contrastive divergence (CD-k): For each parameter update, draw v^(0) (or a minibatch) from the training data D = [v_1, v_2, · · · , v_M]^T and run Gibbs sampling for k steps. Even CD-1 can work reasonably well.
• Persistent contrastive divergence (PCD-k): Always keep the same Markov chain during the entire training process. For each parameter update, run this persistent chain for another k steps to collect v states.