## 1. Introduction

The machine learning techniques for Markov random fields (MRFs) are fundamental in various fields involving pattern recognition [

1,

2], image processing [

3], sparse modeling [

4], and Earth science [

5,

6], and a Boltzmann machine [

7,

8,

9] is one of the most important models in MRFs. The inference and learning problems in the Boltzmann machine are NP-hard, because they include intractable multiple summations over all the possible configurations of variables. Thus, one of the major challenges of the Boltzmann machine is the design of the efficient inference and learning algorithms that it requires.

Various effective algorithms for Boltzmann machine learning were proposed by many researchers, a few of which are mean-field learning algorithms [

10,

11,

12,

13,

14,

15], maximum pseudo-likelihood estimation (MPLE) [

16,

17], contrastive divergence (CD) [

18], ratio matching (RM) [

19], and minimum probability flow (MPF) [

20,

21]. In particular, the CD and MPLE methods are widely used. More recently, the author proposed an effective learning algorithm based on the spatial Monte Carlo integration (SMCI) method [

22]. The SMCI method is a Monte Carlo integration method that takes spatial information around the region of focus into account; it was proven that this method is more effective than the standard Monte Carlo integration method. The main target of this study is Boltzmann machine learning based on the first-order SMCI (1-SMCI) method, which is the simplest version of the SMCI method. We refer it to as the 1-SMCI learning method in this paper.

It was empirically shown through the numerical experiments that Boltzmann machine learning based on the 1-SMCI learning method is more effective than MPLE in the case where no model error exists, i.e., in the case where the learning model includes the generative model [

22]. However, the theoretical reason for this was not revealed at all. In this paper, theoretical insights into the effectiveness of the 1-SMCI learning method as compared to that of MPLE are given from an asymptotic point of view. The theoretical results obtained in this paper state that the gradients of the log-likelihood function obtained by the 1-SMCI learning method constitute a quantitatively better approximation of the exact gradients than those obtained by the MPLE method in the case where the generative model and the learning model are the same Boltzmann machine (in

Section 4.1). This is one of the contributions of this paper. In the previous paper [

22], the 1-SMCI learning method was compared with only the MPLE. In this paper, we compare the 1-SMCI learning method with other effective learning algorithms, RM and MPF, through numerical experiments, and show that the 1-SMCI learning method is superior to them (in

Section 5). This is the second contribution of this paper.

The remainder of this paper is organized as follows. The definition of Boltzmann machine learning and a briefly explanation of the MPLE method are given in

Section 2. In

Section 3, we explain Boltzmann machine learning based on the 1-SMCI method: reviews of the SMCI and 1-SMCI learning methods are presented in

Section 3.1 and

Section 3.2, respectively. In

Section 4, the 1-SMCI learning method and MPLE are compared using two different approaches, the theoretical approach (in

Section 4.1) and the numerical approach (in

Section 4.2), and the effectiveness of the 1-SMCI learning method as compared to the MPLE is shown. In

Section 5, we numerically compare the 1-SMCI method with other effective learning algorithms and observe that the 1-SMCI learning method yields the best approximation. Finally, the conclusion is given in

Section 6.

## 2. Boltzmann Machine Learning

Consider an undirected and connected graph,

$G=(V,E)$, with

n nodes, where

$V:=\{1,2,\dots ,n\}$ is the set of labels of nodes and

E is the set of labels of undirected links; an undirected link between nodes

i and

j is labeled

$(i,j)$. Since an undirected graph is now considered, labels

$(i,j)$ and

$(j,i)$ indicate the same link. On undirected graph

G, we define a Boltzmann machine with random variables

$\mathit{x}:=\{{x}_{i}\in \mathcal{X}\mid i\in V\}$, where

$\mathcal{X}$ is the sample space of the variable. It is expressed as [

7,

9]

where

$Z(\mathit{w})$ is the partition function defined by

where

${\sum}_{\mathit{x}}$ is the multiple summation over all the possible realizations of

$\mathit{x}$; i.e.,

${\sum}_{\mathit{x}}={\prod}_{i\in V}{\sum}_{{x}_{i}\in \mathcal{X}}$. Here and in the following, if

${x}_{i}$ is continuous,

${\sum}_{{x}_{i}\in \mathcal{X}}$ is replaced by integration.

$\mathit{w}:=\{{w}_{ij}\in (-\infty ,\infty )\mid (i,j)\in E\}$ represents the symmetric coupling parameters (

${w}_{ij}={w}_{ji}$). Although a Boltzmann machine can include a bias term, e.g.,

${\sum}_{i\in V}{b}_{i}{x}_{i}$, in its exponent, it is ignored in this paper for the sake of the simplicity of arguments.

Suppose that a set of

N data points corresponding to

$\mathit{x}$,

$\mathcal{D}:=\{{\mathbf{x}}^{(\mu )}\mid \mu =1,2,\dots ,N\}$ where

${\mathbf{x}}^{(\mu )}:=\{{\mathrm{x}}_{i}^{(\mu )}\in \mathcal{X}\mid i\in V\}$, is obtained. The goal of Boltzmann machine learning is to maximize the log-likelihood

with respect to

$\mathit{w}$, that is, the maximum likelihood estimation (MLE). The Boltzmann machine, with

$\mathit{w}$ that maximizes Equation (

2), yields the distribution most similar to the data distribution (also referred to as the empirical distribution) in the perspective of the measure based on Kullback–Leibler divergence (KLD). This fact can be easily seen in the following. The empirical distribution of

$\mathcal{D}$ is expressed as

where

$\delta (a,b)$ is the Kronecker delta function:

$\delta (a,b)=1$ when

$a=b$, and

$\delta (a,b)=0$ when

$a\ne b$. The KLD between the empirical distribution and the Boltzmann machine in Equation (

1),

can be rewritten as

${D}_{\mathrm{KL}}[{Q}_{\mathcal{D}}\parallel {P}_{\mathrm{BM}}]=-l(\mathit{w};\mathcal{D})+C$, where

C is the constant unrelated to

$\mathit{w}$. From this equation, we determine that

$\mathit{w}$ that maximizes the log-likelihood in Equation (

2) minimizes the KLD.

Since the log-likelihood in Equation (

2) is the concave function with respect to

$\mathit{w}$ [

8], in principle, we can optimize the log-likelihood using a gradient ascent method. The gradient of the log-likelihood with respect to

${w}_{ij}$ is

where

${\mathrm{E}}_{\mathrm{BM}}[\cdots \mid \mathit{w}]:={\sum}_{\mathit{x}}(\cdots ){P}_{\mathrm{BM}}(\mathit{x}\mid \mathit{w})$ is the expectation of the assigned quantity over the Boltzmann machine in Equation (

1). In the optimal point of the MLE, all the gradients are zero, and therefore, from Equation (

5), the optimal

$\mathit{w}$ is the solution to the simultaneous equations

When the data points are generated independently from a Boltzmann machine,

${P}_{\mathrm{BM}}(\mathit{x}\mid {\mathit{w}}^{gen})$, defined on the same graph as the Boltzmann machine we use in the learning, i.e., the case without the model error, the solution to the MLE,

${\mathit{w}}^{\mathrm{MLE}}$, converges to

${\mathit{w}}^{\mathrm{gen}}$ as

$N\to \infty $ [

23]. In other words, the MLE is asymptotically consistent.

However, it is difficult to compute the second term in Equation (

5), because the computations of these expectations need the summation over

$O({2}^{n})$ terms. Thus, the exact Boltzmann machine learning, i.e., the MLE, cannot be performed. As mentioned in

Section 1, various approximations for Boltzmann machine learning were proposed by many authors, such as the mean-field learning methods [

10,

11,

12,

13,

14,

15] and the MPLE [

16,

17], CD [

18], RM [

19], MPF [

20,

21] and SMCI [

22] methods. In the following, we briefly review the MPLE method.

In MPLE, we maximize the following pseudo-likelihood [

16,

17,

24] instead of the true log-likelihood in Equation (

2).

where

${\mathit{x}}_{\mathcal{A}}:=\{{x}_{i}\mid i\in \mathcal{A}\subseteq V\}$ is the variables in

$\mathcal{A}$ and

$-\mathcal{A}:=V\setminus \mathcal{A}$; i.e.,

${\mathit{x}}_{-\left\{i\right\}}=\mathit{x}\setminus \left\{{x}_{i}\right\}$. The conditional distribution in the above equation is the conditional distribution in the Boltzmann machine expressed by

where

where

$\partial (i)\subseteq V$ is the set of labels of nodes directly connected to node

i; i.e.,

$\partial (i):=\{j\mid (i,j)\in E\}$. The derivative of the pseudo-likelihood with respect to

${w}_{ij}$ is

${m}_{ij}^{\mathrm{MPLE}}(\mathit{w};\mathcal{D})$ is defined by

where

and where, for a set

$\mathcal{A}\subseteq V$,

${\mathbf{x}}_{\mathcal{A}}^{(\mu )}$ is the

$\mu $-th data point corresponding to

${\mathit{x}}_{\mathcal{A}}$; i.e.,

${\mathbf{x}}_{\mathcal{A}}^{(\mu )}=\{{\mathrm{x}}_{i}^{(\mu )}\mid i\in \mathcal{A}\subseteq V\}$. When

$\mathcal{X}=\{-1,+1\}$,

${M}_{i}({\mathit{x}}_{\partial (i)},\mathit{w})=\mathrm{tanh}{U}_{i}({\mathit{x}}_{\partial (i)},\mathit{w})$. In order to fit the magnitude of the gradient to that of the MLE, we use half of Equation (

10) as the gradient of the MPLE

The order of the total computational complexity of the gradients in Equation (

11) is

$O(N|E|)$, where

$\left|E\right|$ is the number of links in

$G(V,E)$. The pseudo-likelihood is also the concave function with respect to

$\mathit{w}$, and therefore, one can optimize it using a gradient ascent method. The typical performance of the MPLE method is almost the same as or slightly better than that of the CD method in Boltzmann machine learning [

24].

From Equation (

13), the optimal

$\mathit{w}$ in the MPLE is the solution to the simultaneous equations

By comparing Equation (

6) with Equation (

14), it can be seen that the MPLE is the approximation of the MLE such that

${\mathrm{E}}_{\mathrm{BM}}[{x}_{i}{x}_{j}\mid \mathit{w}]\approx {m}_{ij}^{\mathrm{MPLE}}(\mathit{w};\mathcal{D})$. Many authors proved that the MPLE is also asymptotically consistent (for example, [

24,

25,

26]), that is, in the case without model error, the solution to the MPLE,

${\mathit{w}}^{\mathrm{MPLE}}$, converges to

${\mathit{w}}^{\mathrm{gen}}$ as

$N\to \infty $. However, the asymptotic variance of the MPLE is larger than that of the MLE [

25].

## 4. Comparison of 1-SMCI Learning Method and MPLE

It was empirically observed in some numerical experiments that the 1-SMCI learning method discussed in the previous section is better than MPLE in the case without model error [

22]. In this section, first we provide some theoretical insights into this observation, and then some numerical comparisons of the two methods in the cases with and without model error.

#### 4.1. Comparison from Asymptotic Point of View

Suppose that we want to approximate the expectation

${\mathrm{E}}_{\mathrm{BM}}[{x}_{i}{x}_{j}\mid \mathit{w}]$ in a Boltzmann machine, and assume that the data points are generated independently from the same Boltzmann machine. Here, we re-express

${m}_{ij}^{\mathrm{MPLE}}(\mathit{w};\mathcal{D})$ in Equation (

11) and

${m}_{ij}^{1\mathrm{SMCI}}(\mathit{w};\mathcal{D})$ in Equation (23) as

respectively, where

Since

${\mathbf{x}}^{(\mu )}$s are the i.i.d. points sampled from

${P}_{\mathrm{BM}}(\mathit{x}\mid \mathit{w})$,

${\rho}_{ij}^{\mathrm{MPLE}}({\mathbf{x}}^{(\mu )},\mathit{w})$ and

${\rho}_{ij}^{1\mathrm{SMCI}}({\mathbf{x}}^{(\mu )},\mathit{w})$ can also be regarded as i.i.d. sample points. Thus,

${m}_{ij}^{\mathrm{MPLE}}(\mathit{w};\mathcal{D})$ and

${m}_{ij}^{1\mathrm{SMCI}}(\mathit{w};\mathcal{D})$ are the sample averages over the i.i.d. points. One can easily verify that the two equations

${\sum}_{\mathit{x}}{P}_{\mathrm{BM}}(\mathit{x}\mid \mathit{w}){\rho}_{ij}^{\mathrm{MPLE}}(\mathit{x},\mathit{w})={\mathrm{E}}_{\mathrm{BM}}[{x}_{i}{x}_{j}\mid \mathit{w}]$ and

${\sum}_{\mathit{x}}{P}_{\mathrm{BM}}(\mathit{x}\mid \mathit{w}){\rho}_{ij}^{1\mathrm{SMCI}}(\mathit{x},\mathit{w})={\mathrm{E}}_{\mathrm{BM}}[{x}_{i}{x}_{j}\mid \mathit{w}]$ are justified (the former equation can also be justified by using the correlation equality [

27]). Therefore, from the law of large numbers,

${m}_{ij}^{\mathrm{MPLE}}(\mathit{w};\mathcal{D})={m}_{ij}^{1\mathrm{SMCI}}(\mathit{w};\mathcal{D})={\mathrm{E}}_{\mathrm{BM}}[{x}_{i}{x}_{j}\mid \mathit{w}]$ in the limit of

$N\to \infty $. This implies that, in the case without model error, the 1-SMCI learning method has the same solution to the MLE in the limit of

$N\to \infty $.

From the central limit theorem, the distributions of

${m}_{ij}^{\mathrm{MPLE}}(\mathit{w};\mathcal{D})$ and

${m}_{ij}^{1\mathrm{SMCI}}(\mathit{w};\mathcal{D})$ asymptotically converge to Gaussians with mean

${\mathrm{E}}_{\mathrm{BM}}[{x}_{i}{x}_{j}\mid \mathit{w}]$ and variances

respectively, for

$N\gg 1$. For the two variances, we obtain the following theorem.

**Theorem** **1.** For a Boltzmann machine, ${P}_{\mathrm{BM}}(\mathit{x}\mid \mathit{w})$, defined in Equation (1), the inequality ${v}_{ij}^{\mathrm{MPLE}}(\mathit{w})\ge {v}_{ij}^{1\mathrm{SMCI}}(\mathit{w})$ is satisfied for all $(i,j)\in E$ and for any N. The proof of this theorem is given in

Appendix A. Theorem 1 states that the variance in the distribution of

${m}_{ij}^{\mathrm{MPLE}}(\mathit{w};\mathcal{D})$ is always larger than (or equal to) that of

${m}_{ij}^{1\mathrm{SMCI}}(\mathit{w};\mathcal{D})$. This means that, when

$N\gg 1$, the distribution of

${m}_{ij}^{1\mathrm{SMCI}}(\mathit{w};\mathcal{D})$ converges to a Gaussian around the mean value (i.e., the exact expectation) that is sharper than a Gaussian to which the distribution of

${m}_{ij}^{\mathrm{MPLE}}(\mathit{w};\mathcal{D})$ converges, and therefore, it is likely that

${m}_{ij}^{1\mathrm{SMCI}}(\mathit{w};\mathcal{D})$ is closer to the exact expectation than

${m}_{ij}^{\mathrm{MPLE}}(\mathit{w};\mathcal{D})$ when

$N\gg 1$; that is,

${m}_{ij}^{1\mathrm{SMCI}}(\mathit{w};\mathcal{D})$ is a better approximation of

${\mathrm{E}}_{\mathrm{BM}}[{x}_{i}{x}_{j}\mid \mathit{w}]$ than

${m}_{ij}^{\mathrm{MPLE}}(\mathit{w};\mathcal{D})$.

Next, we consider the differences between the true gradient in Equation (

5) and the approximate gradients in Equations (

13) and (26) for

${w}_{ij}$:

For the gradient differences in Equations (32) and (33), we obtain the following theorem.

**Theorem** **2.** For a Boltzmann machine, ${P}_{\mathrm{BM}}(\mathit{x}\mid \mathit{w})$, defined in Equation (1), the inequalityis satisfied for all $(i,j)\in E$ when $N\to \infty $, where $\mathcal{D}$ is the set of N data points generated independently from ${P}_{\mathrm{BM}}(\mathit{x}\mid \mathit{w})$. The proof of this theorem is given in

Appendix B. Theorem 2 states that it is likely that the magnitude of

${e}_{ij}^{1\mathrm{SMCI}}(\mathit{w};\mathcal{D})$ is smaller than (or equivalent to) that of

${e}_{ij}^{\mathrm{MPLE}}(\mathit{w};\mathcal{D})$ when the data points are generated independently from the same Boltzmann machine and when

$N\gg 1$.

Suppose that N data points are generated independently from a generative Boltzmann machine, ${P}_{\mathrm{BM}}(\mathit{x}\mid {\mathit{w}}^{\mathrm{gen}})$, defined on ${G}^{\mathrm{gen}}$, and that a learning Boltzmann machine, defined on the same graph as the generative Boltzmann machine, is trained using the generative data points. In this case, since there is no model error, the solutions of the MLE, the MPLE and the 1-SMCI learning methods are expected to approach ${\mathit{w}}^{\mathrm{gen}}$ as $N\to \infty $, that is ${\Delta}_{ij}^{\mathrm{MLE}}({\mathit{w}}^{\mathrm{gen}};\mathcal{D})$, ${\Delta}_{ij}^{\mathrm{MPLE}}({\mathit{w}}^{\mathrm{gen}};\mathcal{D})$, and ${\Delta}_{ij}^{1\mathrm{SMCI}}({\mathit{w}}^{\mathrm{gen}};\mathcal{D})$ are expected to approach zero as $N\to \infty $. Consider the case in which N is very large and ${\Delta}_{ij}^{\mathrm{MLE}}({\mathit{w}}^{\mathrm{gen}};\mathcal{D})=0$. From the statement in Theorem 2, ${\Delta}_{ij}^{1\mathrm{SMCI}}({\mathit{w}}^{\mathrm{gen}};\mathcal{D})$ is statistically closer to zero than ${\Delta}_{ij}^{\mathrm{MPLE}}({\mathit{w}}^{\mathrm{gen}};\mathcal{D})$. This implies that the solution of the 1-SMCI learning method converges to that of the MLE faster than the MPLE.

The theoretical results presented in this section have not reached a rigid justification of the effectiveness of the 1-SMCI learning method, because some issues still remain, for instance: (i) since we do not specify whether the problem of solving Equation (27), i.e., a gradient ascent method with the gradients in Equation (26), is a convex problem or not, we cannot remove the possibility of existence local optimal solutions which degrade the performance of the 1-SMCI learning method, (ii) although we discussed the asymptotic property of ${\Delta}_{ij}^{1\mathrm{SMCI}}(\mathit{w};\mathcal{D})$ for each link separately, a joint analysis of them is necessary for a more rigid discussion, and (iii) a perturbative analysis around the optimal point is completely lacking. However, we can expect that they constitute evidence that is important for gaining insight into the effectiveness.

#### 4.2. Numerical Comparison

We generated N data points from a generative Boltzmann machine, ${P}_{\mathrm{BM}}(\mathit{x}\mid {\mathit{w}}^{\mathrm{gen}})$ and then trained a learning Boltzmann machine of the same size as the generative Boltzmann machine using the generated data points. The coupling parameters in the generative Boltzmann machine, ${\mathit{w}}^{\mathrm{gen}}$, were generated from a unique distribution, $U[-\lambda ,\lambda ]$.

First, we consider the case where the graph structures of the generative Boltzmann machine and the learning Boltzmann machine are the same: a

$4\times 4$ square grid, that is the case “without” model error.

Figure 2a shows the mean absolute errors between the solutions to the MLE and the approximate methods (the MLPE and the 1-SMCI learning method),

${\sum}_{(i,j)\in E}|{w}_{ij}^{\mathrm{MLE}}-{w}_{ij}^{\mathrm{approx}}|/|E|$, against

N. Here, we set

$\lambda =0.3$. Since the size of the Boltzmann machine used here is not large, we can obtain the solution to the MLE. We observe that the solutions to the two approximate methods converge to the solution to the MLE as

N increases, and the 1-SMCI learning method is better than the MPLE as the approximation of the MLE. These results are consistent with the results obtained in [

22] and the theoretical arguments in the previous section.

Next, we consider the case in which the graph structure of the generative Boltzmann machine is fully connected with

$n=16$ and that in which the learning Boltzmann machine is again a

$4\times 4$ square grid, namely the case “with” model error. Thus, this case is completely outside the theoretical arguments in the previous section.

Figure 2b shows the mean absolute errors between the solution to the MLE and that to the approximate methods against

N. Here, we set

$\lambda =0.2$. Unlike the above case, the solutions to the two approximate methods do not converge to the solution to the MLE because of the model error. The 1-SMCI learning method is again better than the MPLE as the approximation of the MLE in this case.

By comparing

Figure 2a,b, we observed that the 1-SMCI learning method in (b) is worse than in (a). The following reason can be considered. In

Section 3.2, we replaced

$\mathcal{S}$, which is the sample points drawn from the Boltzmann machine, by

$\mathcal{D}$ in order to avoid the sampling cost. However, this replacement implies the assumption of the case “without” model error, and therefore, it is not justified in the case “with” model error.

## 5. Numerical Comparison with Other Methods

In this section, we demonstrate a numerical comparison of the 1-SMCI learning method with other approximation methods, RM [

19] and MPF [

20,

21]. The orders of the computational complexity of these two methods are the same as that of the MPLE and 1-SMCI learning methods. The four methods were implemented by a simple gradient ascent method,

${w}_{ij}^{(t+1)}\leftarrow {w}_{ij}^{(t)}+\eta {\Delta}_{ij}$, where

$\eta >0$ is the learning rate.

As described in

Section 4.2, we generated

N data points from a generative Boltzmann machine,

${P}_{\mathrm{BM}}(\mathit{x}\mid {\mathit{w}}^{\mathrm{gen}})$ and then trained a learning Boltzmann machine of the same size as the generative Boltzmann machine using the generated data points. The coupling parameters in the generative Boltzmann machine,

${\mathit{w}}^{\mathrm{gen}}$, were generated from

$U[-0.3,0.3]$. The graph structures of the generative Boltzmann machine and of the learning Boltzmann machine are the same: a

$4\times 4$ square grid.

Figure 3 shows the learning curves of the four methods. The horizontal axis represents the number of the step,

t, of the gradient ascent method, and the vertical axis represents the mean absolute errors between the solution to the MLE,

${\mathit{w}}^{\mathrm{MLE}}$, and the values of the coupling parameters at the step,

${\mathit{w}}^{(t)}$. In this experiment, we set

$\eta =0.2$, and the values of

$\mathit{w}$ were initialized as zero.

Since the vertical axises in

Figure 3 represents the the mean absolute error from the solution to the MLE, the lower one is the better approximation of the MLE. We can observe that the MPF shows the fastest convergence and the MPLE, RM, and MPF converge to almost the same values, while the 1-SMCI learning method converges to the lowest values. This concludes that, among the four methods, the 1-SMCI learning method is the best as the approximation of the MLE. However, the 1-SMCI learning method has a drawback. The MPLE, RM, and MPF are convex optimization problems and they have unique solutions, whereas, we do not specify whether the 1-SMCI learning method is a convex problem or not in the present stage. We cannot eliminate the possibility of the existence of local optimal solutions that degrade the accuracy of approximation.

As mentioned above, the orders of the computational complexity of these four methods, the MPLE, RM, MPF, and 1-SMCI learning methods, are the same,

$O(N|E|)$. However, it is important to check the real computational times of these methods.

Table 1 shows the total computational times needed for the one-time learning (until convergence), where the setting of the experiment is the same as that of

Figure 3b.

The MPF method is the fastest, and the 1-SMCI learning method is the slowest which is about 6–7 times slower than the MPF method.

## 6. Conclusions

In this paper, we examined the effectiveness of Boltzmann machine learning based on the 1-SMCI method proposed in [

22] where, by numerical experiments, it was shown that the 1-SMCI learning method is more effective than the MPLE in the case where no model error exists. In

Section 4.1, we gave the theoretical results for the empirical observation from the asymptotic point of view. The theoretical results improved our understanding of the advantage of the 1-SMCI learning method as compared to the MPLE. The numerical experiments in

Section 4.2 showed that the 1-SMCI learning method is a better approximation of the MPLE in the case with and without model error. Furthermore, we compared the 1-SMCI learning method with the other effective methods, RM and MPF, using the numerical experiments in

Section 5. The numerical results showed that the 1-SMCI learning method is the best method.

However, issues related to the 1-SMCI learning method still remain. Since the objective function of the 1-SMCI learning method, e.g., Equation (

7) for the MPLE, is not revealed, it is not straightforward to specify whether the problem of solving Equation (27), i.e., a gradient ascent method with the gradients in Equation (26), is a convex problem or not. This is one of the most challenging issues of the method. As shown in

Section 4.2, the performance of the 1-SMCI learning method decreases when model error exists, i.e., when the learning Boltzmann machine does not include the generative model. The decrease may be caused by the replacement of the sample points,

$\mathcal{S}$, by the data points,

$\mathcal{D}$, as discussed in the same section. It is expected that combining the 1-SMCI learning method with an effective sampling method, e.g., the persistent contrastive divergence [

28], relaxes the problem of the performance degradation.

The presented the 1-SMCI learning method can be applied to other types of Boltzmann machines, e.g., restricted Boltzmann machine [

1], deep Boltzmann machine [

2,

29]. Although we focused on the Boltzmann machine learning in this paper, the SMCI method can be applied to various MRFs [

22]. Hence, there are many future directions of application of the SMCI: for example, graphical LASSO problem [

4], Bayesian image processing [

3], Earth science [

5,

6] and brain-computer interface [

30,

31,

32,

33].