2.1. Restricted Boltzmann Machines
Let us start with a brief introduction of the RBM [1,2,3]. As shown in Figure 1, the RBM is composed of two layers: the visible layer and the hidden layer. Possible configurations of the visible and hidden layers are represented by the random binary vectors $\mathbf{v} = (v_1, \dots, v_N)$ and $\mathbf{h} = (h_1, \dots, h_M)$, respectively. The interaction between the visible and hidden layers is given by the so-called weight matrix $w$, where the weight $w_{ij}$ is the connection strength between a visible unit $v_i$ and a hidden unit $h_j$. The biases $a_i$ and $b_j$ are applied to visible unit $i$ and hidden unit $j$, respectively. Given random vectors $\mathbf{v}$ and $\mathbf{h}$, the energy function of the RBM is written as an Ising-type Hamiltonian

        $E(\mathbf{v}, \mathbf{h}) = -\sum_{i=1}^{N} \sum_{j=1}^{M} v_i w_{ij} h_j - \sum_{i=1}^{N} a_i v_i - \sum_{j=1}^{M} b_j h_j$,    (1)
        where the set of model parameters is denoted by $\theta = \{w, a, b\}$. The joint probability of finding $\mathbf{v}$ and $\mathbf{h}$ of the RBM in a particular state is given by the Boltzmann distribution

        $p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z}\, e^{-E(\mathbf{v}, \mathbf{h})}$,    (2)

        where the partition function, $Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}$, is the sum over all possible configurations. The marginal probabilities $p(\mathbf{v})$ and $p(\mathbf{h})$ for the visible and hidden layers are obtained by summing over the hidden or visible variables, respectively,

        $p(\mathbf{v}) = \sum_{\mathbf{h}} p(\mathbf{v}, \mathbf{h})$,    (3)
        $p(\mathbf{h}) = \sum_{\mathbf{v}} p(\mathbf{v}, \mathbf{h})$.    (4)
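For a small RBM, the Boltzmann distribution and the marginal probabilities above can be evaluated exactly by enumerating every configuration. The following sketch is our own illustration (layer sizes, seed, and variable names are assumptions, not the paper's code):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 6                         # illustrative layer sizes
w = rng.normal(0.0, 0.1, (N, M))    # weights w_ij
a = np.zeros(N)                     # visible biases a_i
b = np.zeros(M)                     # hidden biases b_j

def energy(v, h):
    # Ising-type energy: -v.w.h - a.v - b.h
    return -(v @ w @ h) - a @ v - b @ h

# Enumerate all 2^(N+M) configurations to build the exact distribution.
vs = np.array(list(itertools.product([0, 1], repeat=N)), dtype=float)
hs = np.array(list(itertools.product([0, 1], repeat=M)), dtype=float)
E = np.array([[energy(v, h) for h in hs] for v in vs])
Z = np.exp(-E).sum()                # partition function
p_joint = np.exp(-E) / Z            # joint Boltzmann probability
p_v = p_joint.sum(axis=1)           # marginal of the visible layer
p_h = p_joint.sum(axis=0)           # marginal of the hidden layer
```

This brute-force route is only feasible because $2^{N+M}$ stays small here; it is what makes the exact thermodynamic quantities of the later sections computable.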
The training of the RBM adjusts the model parameters $\theta$ such that the marginal probability of the visible layer $p(\mathbf{v})$ becomes as close as possible to the unknown probability $q(\mathbf{v})$ that generates the training data. Given independently and identically sampled training data $\mathcal{D}$, the optimal model parameters $\theta^*$ can be obtained by maximizing the likelihood function of the parameters, $\mathcal{L}(\theta)$, or equivalently by maximizing the log-likelihood function $\ln \mathcal{L}(\theta)$. Maximizing the likelihood function is equivalent to minimizing the Kullback–Leibler divergence, or relative entropy, of $p(\mathbf{v})$ from $q(\mathbf{v})$ [15,16]

        $D_{\mathrm{KL}}(q \,\|\, p) = \sum_{\mathbf{v}} q(\mathbf{v}) \ln \frac{q(\mathbf{v})}{p(\mathbf{v})}$,    (5)

        where $q(\mathbf{v})$ is the unknown probability that generates the training data $\mathcal{D}$. Another method of monitoring the progress of training is the cross-entropy cost between the input visible vector $\mathbf{v}$ and a reconstructed visible vector $\mathbf{v}'$ of the RBM,

        $C = -\sum_{i=1}^{N} \left[ v_i \ln v'_i + (1 - v_i) \ln(1 - v'_i) \right]$.    (6)
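The reconstruction cross entropy is straightforward to compute; a minimal sketch (our own helper, with an assumed `eps` clip to guard the logarithm):

```python
import numpy as np

def reconstruction_cross_entropy(v, v_rec, eps=1e-12):
    # Cross entropy between the binary input v and the reconstructed
    # visible probabilities v_rec; eps guards against log(0).
    v_rec = np.clip(v_rec, eps, 1.0 - eps)
    return -np.sum(v * np.log(v_rec) + (1.0 - v) * np.log(1.0 - v_rec))

v = np.array([1.0, 0.0, 1.0, 0.0])
print(reconstruction_cross_entropy(v, np.array([0.9, 0.1, 0.8, 0.2])))
print(reconstruction_cross_entropy(v, v))   # perfect reconstruction: essentially zero
```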
The stochastic gradient ascent method for the log-likelihood function is used to train the RBM. Estimating the gradient of the log-likelihood function requires Monte-Carlo sampling of the model probability distribution. Well-known sampling methods are contrastive divergence, denoted by CD-$k$, and persistent contrastive divergence, PCD-$k$. For details of the RBM algorithm, see References [2,3,4]. Here, we employ the CD-$k$ method.
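A single CD-$k$ update can be sketched as follows. This is our own minimal illustration; the learning rate, layer sizes, and sampling details are assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_step(v0, w, a, b, k=1, lr=0.1, rng=rng):
    # One CD-k update: start the Gibbs chain at the data vector v0,
    # run k steps of block Gibbs sampling, and use the difference of
    # data statistics and k-step model statistics as the gradient.
    ph0 = sigmoid(v0 @ w + b)                  # p(h_j = 1 | v0)
    v, ph = v0, ph0
    for _ in range(k):
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(h @ w.T + a)              # p(v_i = 1 | h)
        v = (rng.random(pv.shape) < pv).astype(float)
        ph = sigmoid(v @ w + b)
    # positive phase (data) minus negative phase (reconstruction)
    w += lr * (np.outer(v0, ph0) - np.outer(v, ph))
    a += lr * (v0 - v)
    b += lr * (ph0 - ph)
    return w, a, b

# usage: one update of a 4-visible / 6-hidden RBM on a single training vector
w = rng.normal(0.0, 0.1, (4, 6)); a = np.zeros(4); b = np.zeros(6)
v0 = np.array([1.0, 0.0, 1.0, 0.0])
w, a, b = cd_k_step(v0, w, a, b, k=1)
```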
  2.2. Free Energy, Entropy, and Internal Energy
From a physics point of view, the RBM is a finite classical system composed of two subsystems, similar to an Ising spin system. The training of the RBM is considered the driving of the system from an initial equilibrium state to the target equilibrium state by switching the model parameters. It may be interesting to see how thermodynamic quantities such as free energy, entropy, internal energy, and work change as the training progresses.
It is straightforward to write down various thermodynamic quantities for the total system. The free energy $F$ is given by the logarithm of the partition function $Z$,

        $F = -\ln Z$.    (7)

The internal energy $U$ is given by the expectation value of the energy function,

        $U = \langle E(\mathbf{v}, \mathbf{h}) \rangle$,    (8)
        $U = \sum_{\mathbf{v}, \mathbf{h}} E(\mathbf{v}, \mathbf{h})\, p(\mathbf{v}, \mathbf{h})$.    (9)

The entropy $S$ of the total system comprising the hidden and visible layers is given by

        $S = -\sum_{\mathbf{v}, \mathbf{h}} p(\mathbf{v}, \mathbf{h}) \ln p(\mathbf{v}, \mathbf{h})$.    (10)

Here, the convention $0 \ln 0 = 0$ is employed if $p(\mathbf{v}, \mathbf{h}) = 0$ [17]. The free energy (7) is related to the difference between the internal energy (9) and the entropy (10),

        $F = U - TS$,    (11)

        where $T$ is set to 1.
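For a toy RBM, all three quantities and the relation $F = U - TS$ (with $T = 1$) can be checked by the same brute-force enumeration; the sketch below uses random parameters of our own choosing purely for illustration:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
N, M = 3, 3                       # small layers so exact enumeration is cheap
w = rng.normal(0.0, 0.5, (N, M))
a = rng.normal(0.0, 0.5, N)
b = rng.normal(0.0, 0.5, M)

vs = np.array(list(itertools.product([0, 1], repeat=N)), dtype=float)
hs = np.array(list(itertools.product([0, 1], repeat=M)), dtype=float)
E = np.array([[-(v @ w @ h) - a @ v - b @ h for h in hs] for v in vs])

Z = np.exp(-E).sum()
p = np.exp(-E) / Z                # Boltzmann probabilities (all strictly positive)
F = -np.log(Z)                    # free energy
U = (E * p).sum()                 # internal energy
S = -(p * np.log(p)).sum()        # total entropy
assert np.isclose(F, U - S)       # F = U - TS with T = 1
```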
Generally, it is very challenging to calculate the thermodynamic quantities, even numerically. The number of possible configurations of $N$ visible units and $M$ hidden units grows exponentially as $2^{N+M}$. Here, for a feasible benchmark test, the 2×2 bar-and-stripe data are considered [18,19]. Figure 2 shows the 6 possible 2×2 bar-and-stripe patterns out of 16 possible configurations, which are used as the training data in this work. We take the sizes of the visible and hidden layers as $N = 4$ and $M = 6$, respectively. One may take a larger hidden layer, e.g., $M = 8$ or 10, but it makes no appreciable difference in our results. $M = 6$ is not a magic number; it was chosen as an example because our numerical capacity was rather limited. To better understand how the RBM is trained, the thermodynamic quantities are calculated numerically for this small benchmark system.
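The bar-and-stripe set can be generated programmatically; a minimal sketch (our own helper, not taken from the references):

```python
import itertools
import numpy as np

def bar_and_stripe(n=2):
    # All n x n bar patterns (constant rows) and stripe patterns
    # (constant columns), flattened to tuples and deduplicated:
    # all-zero and all-one images are both a bar and a stripe.
    pats = set()
    for bits in itertools.product([0, 1], repeat=n):
        bits = np.array(bits)
        pats.add(tuple(np.outer(bits, np.ones(n, int)).ravel()))  # bars
        pats.add(tuple(np.outer(np.ones(n, int), bits).ravel()))  # stripes
    return sorted(pats)

patterns = bar_and_stripe(2)
print(len(patterns))   # 6 of the 16 possible 2x2 binary images
```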
Figure 3 shows how the weight $w_{ij}$, the bias $a_i$ on visible unit $i$, and the bias $b_j$ on hidden unit $j$ change as the training goes on. The weights $w_{ij}$ are clustered into three classes. The evolution of the bias $a_i$ on the visible layer is somewhat different from that of the bias $b_j$ on the hidden layer, and the magnitudes of their changes differ. Figure 4 shows the change in the marginal probabilities $p(\mathbf{v})$ of the visible layer and $p(\mathbf{h})$ of the hidden layer before and after training. Note that the marginal probability $p(\mathbf{v})$ after training is not distributed exclusively over the six possible outcomes corresponding to the training data set in Figure 2.
Typically, the progress of learning of the RBM is monitored by the loss function. Here, the Kullback–Leibler divergence, Equation (5), and the reconstructed cross entropy, Equation (6), are used. Figure 5 plots the reconstructed cross entropy $C$, the Kullback–Leibler divergence $D_{\mathrm{KL}}$, the entropy $S$, the free energy $F$, and the internal energy $U$ as a function of the epoch. As shown in Figure 5a, it is interesting to see that even after a large number of epochs, the cost function $C$ continues approaching zero while the entropy $S$ and the Kullback–Leibler divergence $D_{\mathrm{KL}}$ become steady. On the other hand, the free energy $F$ continues decreasing together with the internal energy $U$, as depicted in Figure 5b. The Kullback–Leibler divergence is a well-known indicator of the performance of RBMs. Our result thus implies that the entropy may be another good indicator for monitoring the progress of the RBM, while the other thermodynamic quantities may not be.
In addition to the thermodynamic quantities of the total system of the RBM, Equations (7)–(9), it is interesting to see how the two subsystems of the RBM evolve. Since the RBM has no intra-layer connections, the correlation between the visible layer and the hidden layer may increase as the training proceeds. This correlation can be measured by the difference between the total entropy and the sum of the entropies of the two subsystems. The entropies of the visible and hidden layers are given by

        $S_v = -\sum_{\mathbf{v}} p(\mathbf{v}) \ln p(\mathbf{v})\,, \qquad S_h = -\sum_{\mathbf{h}} p(\mathbf{h}) \ln p(\mathbf{h})$.    (12)

The entropy $S_v$ of the visible layer is closely related to the Kullback–Leibler divergence of $p(\mathbf{v})$ to the unknown probability $q(\mathbf{v})$ which produces the data. Equation (5) is expanded as

        $D_{\mathrm{KL}}(q \,\|\, p) = \sum_{\mathbf{v}} q(\mathbf{v}) \ln q(\mathbf{v}) - \sum_{\mathbf{v}} q(\mathbf{v}) \ln p(\mathbf{v})$.    (13)

The second term depends on the parameters $\theta$. As the training proceeds, $p(\mathbf{v})$ becomes close to $q(\mathbf{v})$, so the behavior of the second term is very similar to that of the entropy $S_v$ of the visible layer. If the training is perfect, we have $p(\mathbf{v}) = q(\mathbf{v})$, which leads to $D_{\mathrm{KL}}(q \,\|\, p) = 0$ while $S_v$ remains nonzero.
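The expansion of the divergence into a data-entropy term and a cross-entropy term can be verified numerically. The sketch below is illustrative only: the RBM parameters are random, and the data distribution $q$ is taken uniform over the six bar-and-stripe indices (an assumption on our part):

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
N, M = 4, 6
w = rng.normal(0.0, 0.5, (N, M))
a = rng.normal(0.0, 0.5, N)
b = rng.normal(0.0, 0.5, M)

vs = np.array(list(itertools.product([0, 1], repeat=N)), dtype=float)
hs = np.array(list(itertools.product([0, 1], repeat=M)), dtype=float)
E = np.array([[-(v @ w @ h) - a @ v - b @ h for h in hs] for v in vs])
p = np.exp(-E) / np.exp(-E).sum()
p_v = p.sum(axis=1)                          # model marginal p(v)

# Uniform data distribution over the six 2x2 bar-and-stripe patterns,
# indexed by the binary value of the flattened image.
q = np.zeros(len(vs))
q[[0, 3, 5, 10, 12, 15]] = 1.0 / 6.0
mask = q > 0                                 # 0 ln 0 = 0 convention
kl = (q[mask] * np.log(q[mask] / p_v[mask])).sum()
S_q = -(q[mask] * np.log(q[mask])).sum()     # entropy of the data distribution
cross = -(q[mask] * np.log(p_v[mask])).sum() # cross-entropy term
assert np.isclose(kl, cross - S_q)           # D_KL(q||p) = -S_q + cross entropy
```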
The difference between the total entropy and the sum of the entropies of the subsystems is written as

        $S - (S_v + S_h) = \sum_{\mathbf{v}, \mathbf{h}} p(\mathbf{v}, \mathbf{h}) \ln \frac{p(\mathbf{v})\, p(\mathbf{h})}{p(\mathbf{v}, \mathbf{h})}$.    (14)

Equation (14) tells us that if the visible random vector $\mathbf{v}$ and the hidden random vector $\mathbf{h}$ are independent, i.e., $p(\mathbf{v}, \mathbf{h}) = p(\mathbf{v})\, p(\mathbf{h})$, then the entropy $S$ of the total system is the sum of the entropies of the subsystems. In general, the entropy $S$ of the total system is always less than or equal to the sum of the entropy of the visible layer, $S_v$, and the entropy of the hidden layer, $S_h$ [20],

        $S \le S_v + S_h$.    (15)

This is called the subadditivity of entropy, one of the basic properties of the Shannon entropy, which is also valid for the von Neumann entropy [17,21]. This property can be proved using the log inequality, $\ln x \le x - 1$. Alternatively, Equation (15) may be proved using the log-sum inequality, which states that for two sets of nonnegative numbers, $\{x_1, \dots, x_n\}$ and $\{y_1, \dots, y_n\}$,

        $\sum_{i=1}^{n} x_i \ln \frac{x_i}{y_i} \ge \left( \sum_{i=1}^{n} x_i \right) \ln \frac{\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} y_i}$.    (16)

In other words, Equation (14) can be regarded as the negative of the relative entropy, or Kullback–Leibler divergence, of the joint probability $p(\mathbf{v}, \mathbf{h})$ to the product probability $p(\mathbf{v})\, p(\mathbf{h})$,

        $S - (S_v + S_h) = -D_{\mathrm{KL}}\big( p(\mathbf{v}, \mathbf{h}) \,\|\, p(\mathbf{v})\, p(\mathbf{h}) \big)$.    (17)
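Subadditivity and its restatement as a relative entropy between the joint and product distributions can both be checked on a toy RBM by exact enumeration (random parameters and sizes are our own illustrative choices):

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
N, M = 3, 4
w = rng.normal(0.0, 1.0, (N, M))
a = rng.normal(0.0, 0.5, N)
b = rng.normal(0.0, 0.5, M)

vs = np.array(list(itertools.product([0, 1], repeat=N)), dtype=float)
hs = np.array(list(itertools.product([0, 1], repeat=M)), dtype=float)
E = np.array([[-(v @ w @ h) - a @ v - b @ h for h in hs] for v in vs])
p = np.exp(-E) / np.exp(-E).sum()
p_v, p_h = p.sum(axis=1), p.sum(axis=0)

S = -(p * np.log(p)).sum()                        # total entropy
S_v = -(p_v * np.log(p_v)).sum()                  # visible-layer entropy
S_h = -(p_h * np.log(p_h)).sum()                  # hidden-layer entropy
kl = (p * np.log(p / np.outer(p_v, p_h))).sum()   # D_KL(p(v,h) || p(v)p(h))
assert np.isclose(S - (S_v + S_h), -kl)           # entropy difference = -D_KL
assert S <= S_v + S_h + 1e-12                     # subadditivity
```

The quantity `kl` is the mutual information between the two layers, which is why the entropy difference is never positive.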
For the 2×2 bar-and-stripe pattern, the entropies of the visible and hidden layers, $S_v$ and $S_h$, are calculated numerically. Figure 6 plots the entropies $S_v$, $S_h$, $S$, and the Kullback–Leibler divergence $D_{\mathrm{KL}}$ as a function of the epoch. Figure 6a shows that the Kullback–Leibler divergence $D_{\mathrm{KL}}$ becomes saturated, though above zero, as the training proceeds. Similarly, the entropy $S_v$ of the visible layer saturates. This implies that the entropy of the visible layer, as well as the total entropy shown in Figure 5, can be a better indicator of learning than the reconstructed cross entropy $C$, Equation (6). The same can be said of the entropy of the hidden layer, $S_h$. If information measures such as the entropy and the Kullback–Leibler divergence become steady, one may presume the training is done.
The difference between the total entropy and the sum of the entropies of the two subsystems, $S - (S_v + S_h)$, becomes less than 0, as shown in Figure 6b. This demonstrates the subadditivity of entropy, i.e., the buildup of correlation between the visible and hidden layers as the training proceeds. Since this difference saturates after a large number of epochs, just as the total entropy and the entropies of the visible and hidden layers do, the correlation between the visible and hidden layers can also serve as a good quantifier of the training progress of the RBM.
  2.3. Work, Free Energy, and Jarzynski Equality
The training of the RBM may be viewed as driving a finite classical spin system from an initial equilibrium state to a final equilibrium state by changing the system parameters $\theta$ slowly. If the parameters $\theta$ are switched infinitely slowly, the classical system remains in quasi-static equilibrium. In this case, the total work done on the system is equal to the Helmholtz free energy difference between before-training and after-training, $W = \Delta F$. For switching $\theta$ at a finite rate, the system may not evolve immediately to an equilibrium state, and the work done on the system depends on the specific path taken by the system in configuration space. Jarzynski [22,23] proved that for any switching rate, the free energy difference $\Delta F$ is related to the average of the exponential of the work $W$ over the paths,

        $e^{-\Delta F} = \left\langle e^{-W} \right\rangle$.    (18)
The RBM is trained by changing the parameters $\theta$ through a sequence $\theta_0 \to \theta_1 \to \cdots \to \theta_n$, as shown in Figure 3. To calculate the work done during the training, we perform a Monte-Carlo simulation of the trajectory of a state $\mathbf{x} = (\mathbf{v}, \mathbf{h})$ of the RBM in configuration space. From the initial configuration $\mathbf{x}_0$, which is sampled from the initial Boltzmann distribution, Equation (2), the trajectory $\mathbf{x}_0 \to \mathbf{x}_1 \to \cdots \to \mathbf{x}_n$ is obtained using the Metropolis–Hastings algorithm of the Markov chain Monte-Carlo method [24,25]. Assuming the evolution is Markovian, the probability of taking a specific trajectory is the product of the transition probabilities at each step,

        $P(\mathbf{x}_0 \to \mathbf{x}_1 \to \cdots \to \mathbf{x}_n) = p(\mathbf{x}_0) \prod_{i=0}^{n-1} P(\mathbf{x}_i \to \mathbf{x}_{i+1})$.    (19)

The transition $\mathbf{x}_i \to \mathbf{x}_{i+1}$ can be implemented by the Metropolis–Hastings algorithm based on the detailed balance condition for the fixed parameter $\theta_{i+1}$,

        $\frac{P(\mathbf{x}_i \to \mathbf{x}_{i+1})}{P(\mathbf{x}_{i+1} \to \mathbf{x}_i)} = e^{-\left[ E(\mathbf{x}_{i+1};\, \theta_{i+1}) - E(\mathbf{x}_i;\, \theta_{i+1}) \right]}$.    (20)

The work done on the RBM at epoch $i$ may be given by

        $W_i = E(\mathbf{x}_i;\, \theta_{i+1}) - E(\mathbf{x}_i;\, \theta_i)$.    (21)

The total work $W$ performed on the system is written as [26]

        $W = \sum_{i=0}^{n-1} W_i = \sum_{i=0}^{n-1} \left[ E(\mathbf{x}_i;\, \theta_{i+1}) - E(\mathbf{x}_i;\, \theta_i) \right]$.    (22)
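The protocol just described, switch the parameters, accumulate the work of the switch at fixed configuration, then relax with Metropolis moves, can be exercised on a toy system small enough that the free energy difference is exactly computable. Everything below (system size, the linear switching schedule from $\theta = 0$, the number of relaxation sweeps) is an assumption for illustration, not the paper's setup:

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
N, M = 2, 2                          # tiny system so the exact Z is available

def energy(x, wm, av, bh):
    v, h = x[:N], x[N:]
    return -(v @ wm @ h) - av @ v - bh @ h

def exact_free_energy(wm, av, bh):
    confs = np.array(list(itertools.product([0, 1], repeat=N + M)), dtype=float)
    return -np.log(np.sum([np.exp(-energy(x, wm, av, bh)) for x in confs]))

# Final parameters; the protocol switches linearly from theta = 0 to theta_1.
w1 = rng.normal(0.0, 0.5, (N, M))
a1 = rng.normal(0.0, 0.5, N)
b1 = rng.normal(0.0, 0.5, M)

n_steps, n_paths = 80, 600
works = np.empty(n_paths)
for path in range(n_paths):
    x = rng.integers(0, 2, N + M).astype(float)   # theta = 0: uniform start
    W = 0.0
    for i in range(1, n_steps + 1):
        s_new, s_old = i / n_steps, (i - 1) / n_steps
        # work of the parameter switch at fixed configuration x
        W += energy(x, s_new * w1, s_new * a1, s_new * b1) \
           - energy(x, s_old * w1, s_old * a1, s_old * b1)
        for _ in range(3):                        # Metropolis relaxation at new theta
            y = x.copy()
            k = rng.integers(N + M)
            y[k] = 1.0 - y[k]                     # propose a single spin flip
            dE = energy(y, s_new * w1, s_new * a1, s_new * b1) \
               - energy(x, s_new * w1, s_new * a1, s_new * b1)
            if rng.random() < np.exp(-dE):
                x = y
    works[path] = W

dF_exact = exact_free_energy(w1, a1, b1) - exact_free_energy(0 * w1, 0 * a1, 0 * b1)
dF_jarzynski = -np.log(np.mean(np.exp(-works)))   # exponential work average
print(dF_exact, dF_jarzynski)
```

With a slow schedule the two estimates should agree closely; speeding up the switching increases the dissipated work and the spread of `works`.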
Given the sequence of model parameters $\theta_i$, the Markov evolution of the visible and hidden vectors $(\mathbf{v}, \mathbf{h})$ may be considered a discrete random walk. Random walkers move toward points with low energy in configuration space. Figure 7 shows the heat map of the energy function $E(\mathbf{v}, \mathbf{h})$ of the RBM for the 2×2 bar-and-stripe pattern after training. One can see that the energy function has deep wells at the visible vectors corresponding to the bar-and-stripe patterns of the training data set in Figure 2, representing a high probability of generating the trained patterns. Furthermore, note that the energy function has many local minima. Figure 8 plots a few Monte-Carlo trajectories of the visible vector $\mathbf{v}$ as a function of the epoch. Before training, the visible vector $\mathbf{v}$ is distributed over all possible configurations. As the training progresses, the visible vector $\mathbf{v}$ becomes trapped in one of the six outcomes corresponding to the training patterns.
In order to examine the relation between the work done on the RBM during the training and the free energy difference, a Monte-Carlo simulation is performed to calculate the average of the work over paths generated by the Metropolis–Hastings algorithm of the Markov chain Monte-Carlo method. Each path starts from an initial state sampled from the uniform distribution over the configuration space, as shown in Figure 4a. Since the work done on the system depends on the path, the distribution of the work is calculated by generating many trajectories. Figure 9 shows the distribution of the work over 50,000 paths at 5000 training epochs, together with its Monte-Carlo average $\langle W \rangle$ and standard deviation $\sigma$. The distribution of the work generated by the Monte-Carlo simulation is well fitted by a Gaussian distribution, as depicted by the red curve in Figure 9. This agrees with the statement in Reference [23] that for slow switching of the model parameters the probability distribution of the work is approximately Gaussian.
We perform the Monte-Carlo calculation of the exponential average of the work, $\langle e^{-W} \rangle$, to check the Jarzynski equality, Equation (18). The free energy difference can be estimated as

        $\Delta F \approx -\ln \left[ \frac{1}{N_s} \sum_{k=1}^{N_s} e^{-W_k} \right]$,    (23)

        where $N_s$ is the number of Monte-Carlo samples. At small epoch numbers, the Monte-Carlo estimate of the free energy difference is close to $\Delta F$ calculated from the partition function. However, this Monte-Carlo calculation yields a poor estimate of the free energy difference once the epoch number exceeds 5000. This numerical error can be explained by the fact that the exponential average of the work is dominated by rare realizations [27,28,29,30,31]. As shown in Figure 9, the distribution of the work is given by a Gaussian distribution $\rho(W)$ with mean $\langle W \rangle$ and standard deviation $\sigma$. If the standard deviation $\sigma$ becomes larger, the peak of $e^{-W} \rho(W)$ moves into the long tail of the Gaussian distribution, so the main contribution to the integral of $e^{-W} \rho(W)$ comes from rare realizations. Figure 10 shows that the standard deviation $\sigma$ grows with the epoch, so the error of the Monte-Carlo estimate of the exponential average of the work grows quickly.
If the spread of the work is small, i.e., $\sigma \ll 1$, the free energy difference is related to the average of the work and its variance as

        $\Delta F = \langle W \rangle - \frac{\sigma^2}{2}$.    (24)

Here, the case is the opposite: the spread of the work values is large, i.e., $\sigma \gg 1$, so the central limit theorem does not apply and the above equation cannot be used [32]. Figure 10 shows how the average of the work, $\langle W \rangle$, over the Markov chain Monte-Carlo paths changes as a function of the epoch. The standard deviation of the Gaussian distribution of the work also grows as a function of the training epoch. The free energy difference between before-training and after-training is called the reversible work, $W_r = \Delta F$. The difference between the actual work and the reversible work is called the dissipative work, $W_{\mathrm{diss}} = W - \Delta F$ [26]. As depicted in Figure 10, the magnitude of the dissipative work grows with the training epoch.