Learning Algorithm of Boltzmann Machine Based on Spatial Monte Carlo Integration Method

: The machine learning techniques for Markov random ﬁelds are fundamental in various ﬁelds involving pattern recognition, image processing, sparse modeling, and earth science, and a Boltzmann machine is one of the most important models in Markov random ﬁelds. However, the inference and learning problems in the Boltzmann machine are NP-hard. The investigation of an effective learning algorithm for the Boltzmann machine is one of the most important challenges in the ﬁeld of statistical machine learning. In this paper, we study Boltzmann machine learning based on the (ﬁrst-order) spatial Monte Carlo integration method, referred to as the 1-SMCI learning method, which was proposed in the author’s previous paper. In the ﬁrst part of this paper, we compare the method with the maximum pseudo-likelihood estimation (MPLE) method using a theoretical and a numerical approaches, and show the 1-SMCI learning method is more effective than the MPLE. In the latter part, we compare the 1-SMCI learning method with other effective methods, ratio matching and minimum probability ﬂow, using a numerical experiment, and show the 1-SMCI learning method outperforms them.


Introduction
The machine learning techniques for Markov random fields (MRFs) are fundamental in various fields involving pattern recognition [1,2], image processing [3], sparse modeling [4], and Earth science [5,6], and a Boltzmann machine [7][8][9] is one of the most important models in MRFs. The inference and learning problems in the Boltzmann machine are NP-hard, because they include intractable multiple summations over all the possible configurations of variables. Thus, one of the major challenges of the Boltzmann machine is the design of the efficient inference and learning algorithms that it requires.
Various effective algorithms for Boltzmann machine learning were proposed by many researchers, a few of which are mean-field learning algorithms [10][11][12][13][14][15], maximum pseudo-likelihood estimation (MPLE) [16,17], contrastive divergence (CD) [18], ratio matching (RM) [19], and minimum probability flow (MPF) [20,21]. In particular, the CD and MPLE methods are widely used. More recently, the author proposed an effective learning algorithm based on the spatial Monte Carlo integration (SMCI) method [22]. The SMCI method is a Monte Carlo integration method that takes spatial information around the region of focus into account; it was proven that this method is more effective than the standard Monte Carlo integration method. The main target of this study is Boltzmann machine learning based on the first-order SMCI (1-SMCI) method, which is the simplest version of the SMCI method. We refer it to as the 1-SMCI learning method in this paper.
It was empirically shown through the numerical experiments that Boltzmann machine learning based on the 1-SMCI learning method is more effective than MPLE in the case where no model error exists, i.e., in the case where the learning model includes the generative model [22]. However, the theoretical reason for this was not revealed at all. In this paper, theoretical insights into the effectiveness of the 1-SMCI learning method as compared to that of MPLE are given from an asymptotic point of view. The theoretical results obtained in this paper state that the gradients of the log-likelihood function obtained by the 1-SMCI learning method constitute a quantitatively better approximation of the exact gradients than those obtained by the MPLE method in the case where the generative model and the learning model are the same Boltzmann machine (in Section 4.1). This is one of the contributions of this paper. In the previous paper [22], the 1-SMCI learning method was compared with only the MPLE. In this paper, we compare the 1-SMCI learning method with other effective learning algorithms, RM and MPF, through numerical experiments, and show that the 1-SMCI learning method is superior to them (in Section 5). This is the second contribution of this paper.
The remainder of this paper is organized as follows. The definition of Boltzmann machine learning and a briefly explanation of the MPLE method are given in Section 2. In Section 3, we explain Boltzmann machine learning based on the 1-SMCI method: reviews of the SMCI and 1-SMCI learning methods are presented in Sections 3.1 and 3.2, respectively. In Section 4, the 1-SMCI learning method and MPLE are compared using two different approaches, the theoretical approach (in Section 4.1) and the numerical approach (in Section 4.2), and the effectiveness of the 1-SMCI learning method as compared to the MPLE is shown. In Section 5, we numerically compare the 1-SMCI method with other effective learning algorithms and observe that the 1-SMCI learning method yields the best approximation. Finally, the conclusion is given in Section 6.

Boltzmann Machine Learning
Consider an undirected and connected graph, G = (V, E), with n nodes, where V := {1, 2, . . . , n} is the set of labels of nodes and E is the set of labels of undirected links; an undirected link between nodes i and j is labeled (i, j). Since an undirected graph is now considered, labels (i, j) and (j, i) indicate the same link. On undirected graph G, we define a Boltzmann machine with random variables x := {x i ∈ X | i ∈ V}, where X is the sample space of the variable. It is expressed as [7,9] where Z(w) is the partition function defined by where ∑ x is the multiple summation over all the possible realizations of x; i.e., ∑ x = ∏ i∈V ∑ x i ∈X .
Here and in the following, if x i is continuous, ∑ x i ∈X is replaced by integration. w := {w ij ∈ (−∞, ∞) | (i, j) ∈ E} represents the symmetric coupling parameters (w ij = w ji ). Although a Boltzmann machine can include a bias term, e.g., ∑ i∈V b i x i , in its exponent, it is ignored in this paper for the sake of the simplicity of arguments. Suppose that a set of N data points corresponding to x, D : is obtained. The goal of Boltzmann machine learning is to maximize the log-likelihood with respect to w, that is, the maximum likelihood estimation (MLE). The Boltzmann machine, with w that maximizes Equation (2), yields the distribution most similar to the data distribution (also referred to as the empirical distribution) in the perspective of the measure based on Kullback-Leibler divergence (KLD). This fact can be easily seen in the following. The empirical distribution of D is expressed as where δ(a, b) is the Kronecker delta function: δ(a, b) = 1 when a = b, and δ(a, b) = 0 when a = b. The KLD between the empirical distribution and the Boltzmann machine in Equation (1), can be rewritten as D KL [Q D P BM ] = −l(w; D) + C, where C is the constant unrelated to w. From this equation, we determine that w that maximizes the log-likelihood in Equation (2) minimizes the KLD.
Since the log-likelihood in Equation (2) is the concave function with respect to w [8], in principle, we can optimize the log-likelihood using a gradient ascent method. The gradient of the log-likelihood with respect to w ij is where E BM [· · · | w] := ∑ x (· · · )P BM (x | w) is the expectation of the assigned quantity over the Boltzmann machine in Equation (1). In the optimal point of the MLE, all the gradients are zero, and therefore, from Equation (5), the optimal w is the solution to the simultaneous equations When the data points are generated independently from a Boltzmann machine, P BM (x | w gen ), defined on the same graph as the Boltzmann machine we use in the learning, i.e., the case without the model error, the solution to the MLE, w MLE , converges to w gen as N → ∞ [23]. In other words, the MLE is asymptotically consistent. However, it is difficult to compute the second term in Equation (5), because the computations of these expectations need the summation over O(2 n ) terms. Thus, the exact Boltzmann machine learning, i.e., the MLE, cannot be performed. As mentioned in Section 1, various approximations for Boltzmann machine learning were proposed by many authors, such as the mean-field learning methods [10][11][12][13][14][15] and the MPLE [16,17], CD [18], RM [19], MPF [20,21] and SMCI [22] methods. In the following, we briefly review the MPLE method.
In MPLE, we maximize the following pseudo-likelihood [16,17,24] instead of the true log-likelihood in Equation (2). where The conditional distribution in the above equation is the conditional distribution in the Boltzmann machine expressed by where where ∂(i) ⊆ V is the set of labels of nodes directly connected to node i; i.e., ∂(i) The derivative of the pseudo-likelihood with respect to w ij is where and where, for In order to fit the magnitude of the gradient to that of the MLE, we use half of Equation (10) as the gradient of the MPLE The order of the total computational complexity of the gradients in Equation (11) is O(N|E|), where |E| is the number of links in G(V, E). The pseudo-likelihood is also the concave function with respect to w, and therefore, one can optimize it using a gradient ascent method. The typical performance of the MPLE method is almost the same as or slightly better than that of the CD method in Boltzmann machine learning [24].
From Equation (13), the optimal w in the MPLE is the solution to the simultaneous equations By comparing Equation (6) with Equation (14), it can be seen that the MPLE is the approximation . Many authors proved that the MPLE is also asymptotically consistent (for example, [24][25][26]), that is, in the case without model error, the solution to the MPLE, w MPLE , converges to w gen as N → ∞. However, the asymptotic variance of the MPLE is larger than that of the MLE [25].

Boltzmann Machine Learning Based on Spatial Monte Carlo Integration Method
In this section, we present the reviews of both the SMCI method and the application of the first-order of the SMCI method to Boltzmann machine learning, i.e., the 1-SMCI learning method.

Spatial Monte Carlo Integration Method
Assume that we have a set of i.i.d. sample points, S := {s (µ) | µ = 1, 2, . . . , N}, where s (µ) := {s (µ) i ∈ X | i ∈ V}, drawn from a Boltzmann machine, P BM (x | w), by using a Markov chain Monte Carlo (MCMC) method. Suppose that we want to know the expectation of a function f ( In the standard Monte Carlo integration (MCI) method, we approximate the desired expectation by the simple average of the given sample points S: where Q S (x) is the distribution of the sample points, which is defined in the same manner as Equation (3), and where, for a set A ⊆ V, s A is the µ-th sample point corresponding to x A ; i.e., s The SMCI method considers spatial information around x C , in contrast to the standard MCI method. For the SMCI method, we define the neighboring regions of the target region, C ⊆ V, as follows. The first-nearest-neighbor region, N 1 (C), is defined by Therefore, when Similarly, the second-nearest-neighbor region, N 2 (C), is defined by In a similar manner, for k ≥ 1, we define the k-th-nearest-neighbor region, N k (C), by where R k (C) := k m=0 N m (C) and N 0 (C) := C. An example of the neighboring regions in a square-grid graph is shown in Figure 1. By using the conditional distribution, and the marginal distribution, the desired expectation can be expressed as where, for a set A ⊆ V, ∑ x A = ∏ i∈A ∑ x i ∈X . In Equation (19), we used the Markov property of the Boltzmann machine: In the k-th-order SMCI (k-SMCI) method, (21) is approximated by The k-SMCI method takes the spatial information up to the (k − 1)-th-nearest-neighbor region into account, and it approximates the outside of it (namely, the k-th-nearest-neighbor region) by the sample distribution. For the SMCI method, two important facts were theoretically proven [22]: (i) the SMCI method is asymptotically better than the standard MCI method and (ii) a higher-order SMCI method is better asymptotically than a lower-order one.

Boltzmann Machine Learning Based on First-Order SMCI Method
Applying the 1-SMCI method to Boltzmann machine learning is achieved by approximating the intractable expectations, E BM [x i x j | w], by the 1-SMCI method in Equation (22) with k = 1. Although Equation (22) requires sample points S drawn from P BM (x | w), as discussed in the previous section, we can avoid the sampling by using dataset D instead of S [22]. We approximate Since where tanh −1 (x) is the inverse function of tanh(x). By using the 1-SMCI learning method, the true gradient, ∆ MLE ij (w; D), is thus approximated as and therefore, the optimal w in this approximation is the solution to the simultaneous equations: The order of the total computational complexity of the gradients in Equation (23) is O(N|E|), which is the same as that of the MPLE. The solution to Equation (27) is obtained by a gradient ascent method with the gradients in Equation (26).

Comparison of 1-SMCI Learning Method and MPLE
It was empirically observed in some numerical experiments that the 1-SMCI learning method discussed in the previous section is better than MPLE in the case without model error [22]. In this section, first we provide some theoretical insights into this observation, and then some numerical comparisons of the two methods in the cases with and without model error.

Comparison from Asymptotic Point of View
Suppose that we want to approximate the expectation E BM [x i x j | w] in a Boltzmann machine, and assume that the data points are generated independently from the same Boltzmann machine. Here, we re-express m MPLE ij (w; D) in Equation (11) and m 1SMCI ij (w; D) in Equation (23) as respectively, where are justified (the former equation can also be justified by using the correlation equality [27]). Therefore, from the law of large numbers, m MPLE ij (w; D) = m 1SMCI ij (w; D) = E BM [x i x j | w] in the limit of N → ∞. This implies that, in the case without model error, the 1-SMCI learning method has the same solution to the MLE in the limit of N → ∞.
respectively, for N 1. For the two variances, we obtain the following theorem.
. Next, we consider the differences between the true gradient in Equation (5) and the approximate gradients in Equations (13) and (26) for w ij : For the gradient differences in Equations (32) and (33), we obtain the following theorem.
Theorem 2. For a Boltzmann machine, P BM (x | w), defined in Equation (1), the inequality is satisfied for all (i, j) ∈ E when N → ∞, where D is the set of N data points generated independently from P BM (x | w).
The proof of this theorem is given in Appendix B. Theorem 2 states that it is likely that the magnitude of e 1SMCI ij (w; D) is smaller than (or equivalent to) that of e MPLE ij (w; D) when the data points are generated independently from the same Boltzmann machine and when N 1. Suppose that N data points are generated independently from a generative Boltzmann machine, P BM (x | w gen ), defined on G gen , and that a learning Boltzmann machine, defined on the same graph as the generative Boltzmann machine, is trained using the generative data points. In this case, since there is no model error, the solutions of the MLE, the MPLE and the 1-SMCI learning methods are expected to approach w gen as N → ∞, that is ∆ MLE The theoretical results presented in this section have not reached a rigid justification of the effectiveness of the 1-SMCI learning method, because some issues still remain, for instance: (i) since we do not specify whether the problem of solving Equation (27), i.e., a gradient ascent method with the gradients in Equation (26), is a convex problem or not, we cannot remove the possibility of existence local optimal solutions which degrade the performance of the 1-SMCI learning method, (ii) although we discussed the asymptotic property of ∆ 1SMCI ij (w; D) for each link separately, a joint analysis of them is necessary for a more rigid discussion, and (iii) a perturbative analysis around the optimal point is completely lacking. However, we can expect that they constitute evidence that is important for gaining insight into the effectiveness.

Numerical Comparison
We generated N data points from a generative Boltzmann machine, P BM (x | w gen ) and then trained a learning Boltzmann machine of the same size as the generative Boltzmann machine using the generated data points. The coupling parameters in the generative Boltzmann machine, w gen , were generated from a unique distribution, U[−λ, λ].
First, we consider the case where the graph structures of the generative Boltzmann machine and the learning Boltzmann machine are the same: a 4 × 4 square grid, that is the case "without" model error. Figure 2a shows the mean absolute errors between the solutions to the MLE and the approximate methods (the MLPE and the 1-SMCI learning method), ∑ (i,j)∈E |w MLE ij − w approx ij |/|E|, against N. Here, we set λ = 0.3. Since the size of the Boltzmann machine used here is not large, we can obtain the solution to the MLE. We observe that the solutions to the two approximate methods converge to the solution to the MLE as N increases, and the 1-SMCI learning method is better than the MPLE as the approximation of the MLE. These results are consistent with the results obtained in [22] and the theoretical arguments in the previous section.
Next, we consider the case in which the graph structure of the generative Boltzmann machine is fully connected with n = 16 and that in which the learning Boltzmann machine is again a 4 × 4 square grid, namely the case "with" model error. Thus, this case is completely outside the theoretical arguments in the previous section. Figure 2b shows the mean absolute errors between the solution to the MLE and that to the approximate methods against N. Here, we set λ = 0.2. Unlike the above case, the solutions to the two approximate methods do not converge to the solution to the MLE because of the model error. The 1-SMCI learning method is again better than the MPLE as the approximation of the MLE in this case.
By comparing Figure 2a,b, we observed that the 1-SMCI learning method in (b) is worse than in (a). The following reason can be considered. In Section 3.2, we replaced S, which is the sample points drawn from the Boltzmann machine, by D in order to avoid the sampling cost. However, this replacement implies the assumption of the case "without" model error, and therefore, it is not justified in the case "with" model error.

Numerical Comparison with Other Methods
In this section, we demonstrate a numerical comparison of the 1-SMCI learning method with other approximation methods, RM [19] and MPF [20,21]. The orders of the computational complexity of these two methods are the same as that of the MPLE and 1-SMCI learning methods. The four methods were implemented by a simple gradient ascent method, w ij + η∆ ij , where η > 0 is the learning rate.
As described in Section 4.2, we generated N data points from a generative Boltzmann machine, P BM (x | w gen ) and then trained a learning Boltzmann machine of the same size as the generative Boltzmann machine using the generated data points. The coupling parameters in the generative Boltzmann machine, w gen , were generated from U[−0.3, 0.3]. The graph structures of the generative Boltzmann machine and of the learning Boltzmann machine are the same: a 4 × 4 square grid. Figure 3 shows the learning curves of the four methods. The horizontal axis represents the number of the step, t, of the gradient ascent method, and the vertical axis represents the mean absolute errors between the solution to the MLE, w MLE , and the values of the coupling parameters at the step, w (t) . In this experiment, we set η = 0.2, and the values of w were initialized as zero. Since the vertical axises in Figure 3 represents the the mean absolute error from the solution to the MLE, the lower one is the better approximation of the MLE. We can observe that the MPF shows the fastest convergence and the MPLE, RM, and MPF converge to almost the same values, while the 1-SMCI learning method converges to the lowest values. This concludes that, among the four methods, the 1-SMCI learning method is the best as the approximation of the MLE. However, the 1-SMCI learning method has a drawback. The MPLE, RM, and MPF are convex optimization problems and they have unique solutions, whereas, we do not specify whether the 1-SMCI learning method is a convex problem or not in the present stage. We cannot eliminate the possibility of the existence of local optimal solutions that degrade the accuracy of approximation.
As mentioned above, the orders of the computational complexity of these four methods, the MPLE, RM, MPF, and 1-SMCI learning methods, are the same, O(N|E|). However, it is important to check the real computational times of these methods. Table 1 shows the total computational times needed for the one-time learning (until convergence), where the setting of the experiment is the same as that of Figure 3b. The MPF method is the fastest, and the 1-SMCI learning method is the slowest which is about 6-7 times slower than the MPF method.

Conclusions
In this paper, we examined the effectiveness of Boltzmann machine learning based on the 1-SMCI method proposed in [22] where, by numerical experiments, it was shown that the 1-SMCI learning method is more effective than the MPLE in the case where no model error exists. In Section 4.1, we gave the theoretical results for the empirical observation from the asymptotic point of view. The theoretical results improved our understanding of the advantage of the 1-SMCI learning method as compared to the MPLE. The numerical experiments in Section 4.2 showed that the 1-SMCI learning method is a better approximation of the MPLE in the case with and without model error. Furthermore, we compared the 1-SMCI learning method with the other effective methods, RM and MPF, using the numerical experiments in Section 5. The numerical results showed that the 1-SMCI learning method is the best method. However, issues related to the 1-SMCI learning method still remain. Since the objective function of the 1-SMCI learning method, e.g., Equation (7) for the MPLE, is not revealed, it is not straightforward to specify whether the problem of solving Equation (27), i.e., a gradient ascent method with the gradients in Equation (26), is a convex problem or not. This is one of the most challenging issues of the method. As shown in Section 4.2, the performance of the 1-SMCI learning method decreases when model error exists, i.e., when the learning Boltzmann machine does not include the generative model. The decrease may be caused by the replacement of the sample points, S, by the data points, D, as discussed in the same section. It is expected that combining the 1-SMCI learning method with an effective sampling method, e.g., the persistent contrastive divergence [28], relaxes the problem of the performance degradation.
The presented the 1-SMCI learning method can be applied to other types of Boltzmann machines, e.g., restricted Boltzmann machine [1], deep Boltzmann machine [2,29]. Although we focused on the Boltzmann machine learning in this paper, the SMCI method can be applied to various MRFs [22]. Hence, there are many future directions of application of the SMCI: for example, graphical LASSO problem [4], Bayesian image processing [3], Earth science [5,6] and brain-computer interface [30][31][32][33].
From Equations (A2) and (A3), we obtain: where we use the relation M i (x ∂(i) , w) = ∑ x i ∈X x i P BM (x i | x N 1 ({i}) , w). From this equation, we obtain Finally, from Equations (A1) and (A5), the inequality: is obtained.