Information Perspective to Probabilistic Modeling: Boltzmann Machines versus Born Machines

We compare and contrast the statistical physics and quantum physics inspired approaches for unsupervised generative modeling of classical data. The two approaches represent probabilities of observed data using energy-based models and quantum states, respectively. Classical and quantum information patterns of the target datasets therefore provide principled guidelines for structural design and learning in these two approaches. Taking the Restricted Boltzmann Machines (RBM) as an example, we analyze the information theoretical bounds of the two approaches. We also estimate the classical mutual information of the standard MNIST datasets and the quantum Rényi entropy of corresponding Matrix Product States (MPS) representations. Both information measures are much smaller compared to their theoretical upper bound and exhibit similar patterns, which imply a common inductive bias of low information complexity. By comparing the performance of RBM with various architectures on the standard MNIST datasets, we found that the RBM with local sparse connection exhibit high learning efficiency, which supports the application of tensor network states in machine learning problems.


Introduction
The fruitful interplay between statistical physics and machine learning dates back to at least the early studies of spin glasses and neural networks [1,2]. The two fields share common interests on the emergent collective behavior of complex systems with a large number of degrees of freedom. In particular, unsupervised generative modeling is closely related to the inverse statistical problems [3], where one infers the parameters of a model based on observations. The model can generate new samples according to the learned probability distribution, hence the name generative modeling. Inspired by the statistical physics, one can model the data probability according to the Boltzmann distribution with an energy function of the observed variables where Z = ∑ v e −E(v) , the partition function, is the normalization factor of the probability density.
The functional form of E(v), which is usually called the energy function in statistical physics, is typically predetermined to deliver certain prior knowledge about the data. Structured probabilistic models of the form in Equation (1) are collectively denoted as energy-based models [4], in which the prominent examples are the Boltzmann Machine [5]. Figure 1a shows an example of the Restricted Boltzmann Machines (RBM) [6] whose energy function reads E RBM (v) = − ∑ i a i v i − ∑ j ln(1 + e b j +∑ i v i W ij ) [7].  On the other hand, by exploiting the inherent probabilistic nature of quantum mechanics, one can model the probability distribution of classical data using a quantum pure state where N = ∑ v |Ψ(v)| 2 is the normalization factor. The square ensures the positivity of the probability.
In particular, Equation (2) translates the generative modeling of probability density to the problem of learning a quantum state. In fact, the necessity of this quantum interpretation was also anticipated in earlier machine learning literature. The mathematical structure of quantum mechanics appears naturally when one explores more flexible models than Equation (1) while still attempts to ensure the positivity of the probability density [26,27]. We call these approaches Born Machines to acknowledge the probabilistic interpretation of the quantum mechanics [28]. Besides preparing the quantum state using an actual quantum devise [29], one can employ many of the efficient classical ansatz developed by quantum physicists to represent the quantum many-body state. An example is the matrix product states (MPS) shown in Figure 1b, which is one of the simplest tensor network states. The MPS parameterizes the wavefuntion amplitude of a state of N variables as To illustrate the contrast between Boltzman Machines and Born Machines, you can imagine that Boltzman Machines maps the probability distribution of the images into a very complex potential function. The sample with high probability corresponds to the local minimum of the potential function. The process of the image generating is similar to statistical mechanics hopping between different local minimums. In Born Machines, the probability distribution of images is mapped to a pure quantum state in Hilbert space. Each basis of this Hilbert space corresponds to one of all possible image patterns. The projection of the pure state on each basis corresponds to the probability of each image pattern. Therefore, the process of the image generating is more like the collapse of a quantum state.
Both Equations (1) and (2) allow one to import insights and concepts of statistical and quantum physics, such as symmetry, locality, sparseness and entanglement [30], to unsupervised generative modeling. Physical considerations can be used to assess the complexity, such as entropy or mutual information [31,32], of the dataset and the representational power of the corresponding models [9,10,18].
Moreover, one can employ the mathematical and computational tools developed for statistical and quantum physics for machine learning. For example, mean-field theory and Markov chain Monte Carlo methods originate from statistical physics research are by now standard tools for learning structured probabilistic models [33]. Furthermore, we anticipate that approaches in quantum physics such as tensor networks and quantum algorithms will play an increasingly significant role in generative modeling through the quantum inspired representation of probabilities (Equation (2)).
The purpose of this paper is to compare and contrast the Boltzmann Machines (Equation (1)) and Born Machines (Equation (2)) approaches for probabilistic modeling, therefore build up a unified view and motivate future studies, especially for the quantum machine learning algorithm [24] and the potential of applying the tensor network method into the machine learning problems [17,20,21]. Classical and quantum information theories provide crucial guidelines for such comparison. Classical information theory lays a common foundation for many problems in machine learning and statistical physics [34,35]. On the other hand, quantum information theory has played a crucial role in characterizing, modeling and simulating quantum states of matter [36]. For example, many methods with polynomial parameters can successfully model quantum states with exponential possibilities. It turns out the reason is many of the physically interesting quantum states only occupy a tiny corner of the Hilbert space, which fulfills the area law of the entanglement entropy [37]. Similar observations were independently made in the machine learning community [4,30] that the natural images encountered in machine learning applications occupy a negligible proportion of the volume of all possible images. This should result in the sparseness of the classic information of the images. If this is true, then modeling the probability distribution of classical dataset in terms of the quantum states would become reasonable (Equation (2)), insights for modeling quantum states [36,37] can be transferred into generative modeling of classical data.
Although early works have noticed this similarity, further research on them has not progressed much, because calculating the information of high-dimensional data such as pictures is very challenging and still a cutting-edge research area. In this work, we try to deepen the similarity between the classical Boltzmann model and the quantum Born model from the information perspective. In the theoretical analysis aspect, we point out the formalism similarity of classical and quantum information, which could brings the same statistical bias. Furthermore, we prove an inequality to make the theoretical analysis of information more useful for model design. In the numerical experiments, we applied Annealed Importance Sampling, MPS and RBM with only local sparse connection to further verify that the information in dataset of natural images is indeed local and sparse.
The organization of this paper is as follows. Section 2 defines the complexity of a dataset from the classical and quantum information theoretical perspectives. Section 3 discusses the implication of the information theoretic considerations on the probabilistic modeling using the restricted Boltzmann machines. These two sections together point out the similarities in the formulas of the information quantity of those two models and proved the inequalities between them. Section 4 compares the information measures in natural images with the information upper bond, which further illustrate from the perspective of natural images data that the two models have similar characterizations. Then, we carried out numerical experiments on the standard MNIST dataset to support our claims that the information in natural images data is sparse and local. Finally, Section 5 summarizes our main points and outlook for future directions.

Complexity of Dataset: Classical Mutual Information and Quantum Entanglement Entropy
Modeling data probability using an energy based model (Equation (1)) calls for a classical information theoretical analysis. Mutual information (MI) is a fundamental information theoretical concept which quantifies the complexity of probability distribution π(v) associated with the dataset. Assuming x ∈ X and y ∈ Y are two subset of the variables and v = x ∪ y, their marginal probability distributions are π(x) = ∑ y∈Y π(x, y), and π(y) = ∑ x∈X π(x, y), respectively. The MI reads The MI measures the amount of information shared between the two sets of variables. MI is zero only for independent variables. In this sense, the MI is a stronger criterion than the correlation of variables since having zero correlation does not necessarily imply vanishing MI. The MI can be used as the objective functions in machine learning applications [38][39][40]. Here, we adopt a different point view, which treats MI as a complexity measure of the dataset to be modeled.
On the other hand, if we view the target dataset as snapshots of the same quantum state collapsed on a fixed basis (Equation (2)), it is natural to measure its complexity using the second Rényi entanglement entropy is the reduced density matrix, and Ψ(v = x ∪ y) is the probability amplitude associated with the probability, such that p(v) in Equation (2) approaches to the data probability distribution π(v). The second Rényi entanglement entropy is a lower bound of the von Neumann entanglement entropy To reveal connection of the classical and quantum information theoretical measures, we write the MI as and the second Rényi entropy as where the expected value · · · x,y is with respect to the dataset probability π(x, y).
There are apparent similarities between Equations (5) and (6). Both equations contain swap ratios of probability or probability amplitude [41,42]. To illustrated the effect of the swap ratio, Figure 2a shows two samples from the MNIST dataset ((x, y) and (x , y )) and Figure 2b,c shows the corresponding swapped images ((x , y) and (x, y )) for up/down and checkerboard bipartitions. The ratio in Equations (5) and (6) would be smaller if the swapped images are less likely to appear in the original dataset π(v), and therefore make larger contribution to the mutual information or the entanglement entropy. Earlier work argued that the dominant correlations in the natural datasets encountered in physics and machine learning applications are the local ones due to the physical law of the nature [43]. Therefore, it is natural to expect that the checkerboard bipartition ( Figure 2c) has higher MI and entanglement entropy compared to the up/down bipartition (Figure 2b) because of strong local correlations between nearby pixels of natural images. Similar discussions on the information measures of different bipartitions were also considered in machine learning [19] and in quantum physics [44,45] studies.
The formal similarity between Equations (5) and (6) underlines the analogy between modeling classical data and modeling quantum states [17][18][19][20][21][22][23][24]. Quantum entanglement entropy is not merely a "metaphorical vehicle" to measure the complexity of classical dataset, but is also of practical relevance if one models the data using the quantum approach (Equation (2)). Since the general theories about the entanglement entropy scaling for various quantum states [37] are very instructive for estimating required resources to model the target quantum states, developing similar theory for typical datasets in machine learning would be very helpful for selecting generative models.  (5) and (6)  There are nevertheless differences in the two information measures of Equations (5) and (6). First, the swap operation in Equation (5) is defined for the probability density other than the quantum wavefunction. The probability amplitude may contain phase information which is however irrelevant to probabilistic modeling of the dataset [20]. Second, the logarithmic functions is sandwiched between two expectations in Equation (5), which hiders direct Monte Carlo estimate of the MI similar to the Rényi entanglement entropy [41,42]. To circumvent this difficulty, one may consider computing alternative quantities such as the Rényi mutual information [46].

Probabilistic Modeling Using Restricted Boltzmann Machine
As a concrete example, we consider the RBM [47] for probabilistic modeling. RBM is a prominent approach for generative modeling with deep connections to statistical physics. It has also played an important role in the recent resurgence of deep learning [48,49]. Recently, the RBMs have attracted heated attentions in the quantum many-body physics community. Viewed as a variational ansatz for quantum states [8,50], the representational power of RBM was investigated from a quantum entanglement [9] and computational complexity theory [10] perspectives. Moreover, its connection to the tensor network states was explored extensively [11,[14][15][16]18]. Besides representing quantum states, RBMs also find applications in identifying order parameters, quantum error correction and accelerating Monte Carlo simulations [51][52][53][54][55][56]. The later applications adopted the conventional usage of the RBMs, i.e., modeling probability density of observed data.
Conventionally, the RBM models probability distribution of data via an energy-based model with hidden units. By tracing out the hidden variables, the RBM represents a probability distribution of the visible variables. RBM can in principle approximate any probability density by using a sufficiently large number of hidden units [7,[57][58][59][60][61]. However, one should note that these theorems mostly concern about the worst cases and do not take into account of typical distributions of interests. It is thus crucial to exploit the inductive bias of the RBM in terms of the information measures and match them to the characteristics of the target dataset. To do this, we define the mutual information I RBM and entanglement entropy S R(vN) RBM of the RBM analogously to Equations (3) and (4), except that we now use the probability density p(v) and the corresponding probability amplitude of the RBM.
Given an RBM architecture, one can identify two set of visible variables X , Y are connected via a minimal set of hidden variables Z (see Figure 3). The variables X and Y are independent once all the values of Z are given. This conditional independence property is denoted symbolically as X ⊥Y |Z in the probabilistic graphical model notation [33]. The MI between the regions X and Y can be captured by the RBM is bounded by the size of the intermediate region where |Z | denotes the number of hidden units in the set Z. The factor ln 2 is due to the binarization of the data in this paper. The first inequality follows directly from the data-processing inequality [62], which states that the information cannot be increased through a random channel. Alternatively, one can show that I RBM (X : Y ) ≤ I RBM (X : Y ∪ Z ) using the strong subadditivity property of the MI [63] and note that I RBM (X : Y ∪ Z ) = I RBM (X : Z ) [64]. The second inequality in Equation (7) uses the fact that mutual information is bounded by the size of the subsystem. We note that the mutual information of target data is used for structural learning of fully visible probabilistic graphical model of tree structures [65]. For RBMs, information theoretical studies have mostly focused on the MI between the visible and hidden variables [66][67][68]. According to Equation (7) On the other hand, one can repurpose the RBM to represent the quantum state [8], i.e., the probability amplitude shown in Equation (2). In terms of the entanglement entropy, the representational power of RBM is also limited by its connectivity [9][10][11]18], Equations (7) and (8) quantify the expressibility of the RBM in terms of information theoretical measures solely by its architecture. For an RBM with dense connection, the region Z will span to all the hidden units irrespective of the information pattern of the target dataset. Information perspective provides a guiding principle for RBM architecture design conditioned on the typical information pattern of the target dataset. Equation (7) shows that two sets of visible variables of an RBM should connect to at least I(X : Y ))/ ln 2 hidden neurons to adequately capture the MI of the dataset. We anticipate that the MI of natural images and physical model should be much smaller than the maximum value min (|X |, |Y |) ln 2 due to physical nature of the probability distributions. The connectivity of the RBM puts a constraint on the maximum information that can be captured, therefore limits its expressibility. Conversely, this also provides an inductive bias towards natural datasets of low information complexity. Interpreting the generative modeling in terms of capturing the MI or entanglement of the target dataset sheds new light on the learning process.
An important question relevant to quantum machine learning is to identify realistic datasets which are significantly easier to model in the quantum approach than the classical approach [70]. In light of the above discussion, one is inclined to look for those cases in datasets where the entanglement entropy lower bound in Equation (8) is much smaller than the classical mutual information lower bound in Equation (7). We verified numerically that in general there is no definite inequality between Equations (5) and (6). Therefore, it would be interesting to construct explicit examples where the quantum approach requires fewer resources.
One should nevertheless be careful when drawing the analogy between modeling quantum states and classical datasets. For example, it is sometimes argued that one needs deep neural nets to model classical dataset with critical correlations in analog to critical quantum systems which can only be captured by hierarchical tensor networks [71]. However, the scaling behavior of the mutual information of a critical classical system is different from the entanglement entropy of a critical quantum system. The MI of statistical physics model with short-range interactions scales only with the boundary size between subsystems [64], which holds irrespective whether the system is critical or not. As a concrete example, the critical Ising model only requires a shallow RBM to be modeled exactly [18], which is in line with the area law scaling of its mutual information [72,73]. Other results [55] also showed that deep architectures do not seem to exhibit advantages for modeling the critical Ising data.
We also mensioned the work [12] using an RBM to model the probability of a quantum state on a fixed basis for quantum state tomography. Since the approach corresponds to Equation (1), the required resources are determined by the Shannon mutual information of quantum states [74,75], instead of the entanglement entropy. The two entropies exhibit similar scaling behavior for the examples discussed in Refs. [74,75]. However, in general, this may not be the case. Thus, it remains open to see whether it is advantageous to use an RBM to model the probability or using a complex valued RBM to model the quantum state directly.

Information Pattern of MNIST Dataset and Its Implication to Generative Modeling
We considered generative modeling the MNIST dataset by exploiting the information pattern of the target dataset. First, it is extremely challenging to accurately compute the MI of high-dimensional distribution such as the images with 784 pixels. We employed the approach of Ref. [76] to estimate the MI of the MNIST dataset. The approach is based on the nearest neighbor estimate of the Shannon entropy widely adopted in the statistics literature. Figure 4a shows how the MI increases as one cuts into the center of the image. The vertical and horizontal bipartition of the images exhibits a similar behavior of MI. The MI between the margin and the remaining part of the image is zero since the margin of the MNIST image is always fixed.
The maximum of MI is much smaller than its theoretical limit in Equation (7), which indicates densely connected RBM has significant redundancy. Meanwhile, MI of the checkerboard bipartitions (e.g., Figure 2c) is significantly higher than the left/right or up/down bipartitions. This suggests that introducing hidden units which couple to the nearby pixels are more efficient in capturing the MI of the MNIST dataset.
One should note that the MI estimator [76] is only approximate, especially for highly dependent variables [77]. It is generally a difficult task to estimate the MI of image dataset rigorously. On the other hand, estimating the Reńyi entropy (6) is feasible by using tensor network [17,20,21] or Monte Carlo approaches [41,42]. We estimated the Reńyi entropy of the MNIST dataset in Figure 4b. First, we employed the approach of [20] to train an MPS on 10,000 MNIST images. When the MPS is trained well, we calculate its reduced density matrix (ρ X ) x,x = ∑ y∈Y Ψ(x, y)Ψ(x , y) at each bipartition position. The goodness of the learning is measured by the negative log-likelihood (NLL) evaluated on the where |D| = 10000. In this case, the NLL of MPS for training data is 39.634. Then, we obtained the second Reńyi entropy (Equation (4)). The second Reńyi entropy can be used as a measure of the correlation between two parts of the pictures by left-right bipartition or up-down bipartition, depending on the lay out of the MPS on the 2D image. The value of Reńyi entropy in the MNIST dataset has reached the one of the 2D quantum spin system computed using DMRG [78], which shows a similar level of complexity. Interestingly, not only its value, but also the distribution of the Reńyi entropy exhibit similar behavior as the mutual information, which confirms the connection between Equations (5)   For sparsely connected model, the relatively higher values of the checkboard bipartition than the up-down/left-right bipartition suggests the local connections are more important for capturing the dataset probability distribution. If this is correct, we could expect a sparsely connected model with only sparse local connection can perform relatively well than the sparse random connected model. This assumption also can be easily checked by numerical experiments.
We confirm these two assumptions by training RBMs with the same number of parameters but with different connection architectures and different number of hidden neurons. Dense connection means that the visible and hidden units of the RBM are fully connected and hidden units will be less than visible units. For 1D, 2D and Random RBM, they have the same number of hidden and visible neurons. Random means that we randomly connect the visible and hidden neurons. While 1D connection means that each hidden neurons of the RBM is connected only to a 2l 1 + 1 fragment of the entire image vector, where l 1 is denoted as the 1D connection length, see Figure 3, 2D RBM means that each hidden neuron connects to a small 2(l 2 + 1) × (l 2 + 1) window of a 2D image, where l 2 is denoted 2D connection length.
The goodness of the learning is measured by the NLL evaluated on the test dataset , where |D| is the size of the test set. To compute the NLL one has to estimate the intractable partition function, for which we employed the Annealed Importance Sampling approach [79,80]. The estimated NLL provides an up bound of the entropy of the dataset, which also bounds the MI between two arbitrary division of the variables, i.e. L ≥ I(X : Y ).
One clearly sees that, in Figure 5, the RBM structure with local connections which respects the 2D nature of the images reaches the lowest NLL quickly, and with the least number of parameters, while the NLL of the RBM with 1D connections exhibits an abrupt drop when the two nearby pixels from the different row are connected. The RBM with dense connections performs even worse than the random connections with the same number of parameters.
To further reveal the connection between the captured MI and the quality of the learned RBM, we calculated the MI of different 1D and 2D sparse RBMs ( Figure 6) via Equation (5). The MI of 1D sparse RBMs exhibits an abrupt increase at l 1 = 14, which is just the moment when the fragment is long enough to be able to include two neighboring pixels from different row. Compared with Figure 5, the abrupt increase of MI corresponds exactly to the sudden drop of NLL. Meanwhile, 2D sparse RBMs captures more MI than the RBM with 1D connections. From these comparisons, we see that learning via minimizing the NLL is also a process of learning MI of the dataset.
Our results on the NLL of sparse RBM, and the surprisingly small MI and Renyi entropy, are consistent with the previous experiments on RBMs with sparse connections [81][82][83][84]. Earlier work [85] showed the neural network works fine, even with 80% randomly dropped out. Moreover, A sparsely connected RBM with small-world network structure and found that it performs well compared with a densely connected RBM is also been proposed [86]. Since RBMs with local and sparse connections has close connections to the tensor networks which can be handled in practical computation [18], these results support the applications of tensor network states in practical machine learning problems [17,20,21]. In those applications, it is natural to adopt Equation (2) and the associated quantum information perspectives in machine learning problems.

Summary
In summary, revealing the similarity of the two information theoretical measures Equations (5) and (6) suggests that the statistical physics and quantum physics inspired approaches for generative modeling using Boltzmann Machines (Equation (1) and Born Machines (Equation (2)) have similar inductive biases. Therefore, successful wavefunction representations in quantum physics have the potential to be good generative models for machine learning, and vice versa.
Classical and quantum information theories shed light on the expressibility and architecture design of generative models. Our discussions and numerical experiments suggest that it is rewarding to design architectures which take into information pattern of the target dataset. In particular, imposing locality greatly increases the learning efficiency by exploiting the mutual information structure of the typical dataset. This is akin to the success of the convolutional neural network structure for discriminative tasks.
Besides the expressibility issue discussed in this paper, learning and sampling of the energy-based models can be slow due to the intractable partition functions. Conventionally, this is solved by using Markov chain Monte Carlo sampling or mean-field theory approaches. The quantum representation offers an alternative solution to these problems. For example, modeling the probability amplitude as matrix product states or tree tensor networks offer advantages in efficient learning and sampling [17,20,21]. Moreover, representing the probability distribution using a quantum state [24,87] obviously permits efficient sample generation by simply performing measurement to the quantum state.
In the above discussions, we have separated the energy-based models and quantum state representations for probabilistic modeling of classical data. Nevertheless, a combined approach with a "quantum statistical model" is also possible, in which one models the classical probability density using mixed quantum states. In this respect, the quantum Boltzmann machines [88][89][90] can be viewed as an example. Finally, this paper focuses on modeling the probability of data without labels. In a general setting, one could also model the joint probability distribution of the data and label. In this case, one can generate samples conditioned on the class label and elaborate on the entanglement entropy of each class individually [21].