Average Contrastive Divergence for Training Restricted Boltzmann Machines

This paper studies the contrastive divergence (CD) learning algorithm and proposes a new algorithm for training restricted Boltzmann machines (RBMs). We show that CD is a biased estimator of the log-likelihood gradient and analyze the bias. We then propose a new learning algorithm, called average contrastive divergence (ACD), for training RBMs; it improves on CD by averaging over several Gibbs chains. Experimental results show that ACD is a better approximation of the log-likelihood gradient and outperforms the traditional CD algorithm.


Introduction
The learning of restricted Boltzmann machines (RBMs) has been an important and active topic in machine learning. Learning is the process of inferring the model parameters. General learning algorithms, such as the gradient method, are challenging to apply to RBMs. Hinton proposed a learning algorithm called the contrastive divergence (CD) algorithm [1]. The CD algorithm has become a popular way to train this model [1-7]. Recently, more and more researchers have studied the properties of the CD algorithm [6,8-12]. Bengio and Delalleau [6] derived the bias of the expectation of the CD approximation of the log-likelihood gradient for RBMs. Fischer and Igel [13] gave an upper bound on the bias.
This paper makes two main contributions. The first is an analysis of the CD algorithm. We derive the bias of the CD approximation of the log-likelihood gradient and analyze both the bias and the approximation error of CD, generalizing the conclusions of Bengio and Delalleau [6]. Our analysis of the approximation error shows explicitly that the expectation of CD is closer to the log-likelihood gradient than CD itself; the idea of our new learning algorithm is derived from this conclusion. The second is a new algorithm, called the average contrastive divergence (ACD) algorithm, for training RBMs. We show that ACD is a better approximation of the log-likelihood gradient than CD and that the ACD algorithm outperforms the traditional CD algorithm.
The rest of the paper is organized as follows. In Section 2, we introduce the CD algorithm and give some analysis results for CD. In Section 3, we propose a new algorithm, called ACD, for training RBMs and provide a theoretical analysis of ACD. In Section 4, we show experimentally that the ACD algorithm outperforms the traditional CD algorithm. We draw some conclusions in the final section.

Contrastive Divergence Algorithm
Consider a probability distribution over a vector $x$:

$$p(x; w) = \frac{\sum_h e^{-\varepsilon(x,h;w)}}{Z(w)},$$

where $w$ is the model parameter, $Z(w) = \sum_{x,h} e^{-\varepsilon(x,h;w)}$ is a normalization constant and $\varepsilon(x, h; w)$ is an energy function.
Learning the parameters of the model is the central task, and the common learning method is the gradient method. The log-likelihood gradient with respect to the model parameter $w$, given a training datum $x^{(0)}$, is:

$$\frac{\partial \log p(x^{(0)}; w)}{\partial w} = -\sum_h p(h; w \mid x^{(0)}) \frac{\partial \varepsilon(x^{(0)}, h; w)}{\partial w} + \sum_{x,h} p(x, h; w) \frac{\partial \varepsilon(x, h; w)}{\partial w}. \tag{2}$$

The first term can be computed exactly; however, the second term is intractable, because its complexity is exponential in the size of the smallest layer. Obtaining unbiased estimates of the log-likelihood gradient using Markov chain Monte Carlo (MCMC) methods typically requires many sampling steps. However, it has been shown that estimates obtained after running the chain for just a few steps can be sufficient for training the model. This leads to contrastive divergence (CD) learning.
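The two terms of the log-likelihood gradient follow from a short calculation. The sketch below uses only the definitions of $p(x; w)$ and $Z(w)$ given above; the intermediate steps are written out for illustration and are not taken from the original derivation.

```latex
\begin{aligned}
\log p(x^{(0)}; w) &= \log \sum_h e^{-\varepsilon(x^{(0)},h;w)} - \log Z(w), \\
\frac{\partial}{\partial w} \log \sum_h e^{-\varepsilon(x^{(0)},h;w)}
  &= -\sum_h \frac{e^{-\varepsilon(x^{(0)},h;w)}}{\sum_{h'} e^{-\varepsilon(x^{(0)},h';w)}}
     \frac{\partial \varepsilon(x^{(0)},h;w)}{\partial w}
   = -\sum_h p(h; w \mid x^{(0)}) \frac{\partial \varepsilon(x^{(0)},h;w)}{\partial w}, \\
\frac{\partial}{\partial w} \log Z(w)
  &= -\sum_{x,h} \frac{e^{-\varepsilon(x,h;w)}}{Z(w)} \frac{\partial \varepsilon(x,h;w)}{\partial w}
   = -\sum_{x,h} p(x,h;w) \frac{\partial \varepsilon(x,h;w)}{\partial w}.
\end{aligned}
```

Subtracting the third line from the second gives a data-dependent term, which sums over $h$ only, and a model term, which sums over all pairs $(x, h)$ and is the source of the intractability.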
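The idea of k-step contrastive divergence learning (CD-k) is simple: instead of approximating the second term in the log-likelihood gradient by a sample from the RBM distribution (which would require running a Markov chain until the stationary distribution is reached), a Markov chain is run for only k steps. The Markov chain is generated by Gibbs sampling, so it is also called a Gibbs chain. The Gibbs chain is initialized with a training example $x^{(0)}$ from the training set and yields the sample $x^{(k)}$ after k steps. Each step t consists of sampling $h^{(t)}$ from $p(h; w \mid x^{(t)})$ and subsequently sampling $x^{(t+1)}$ from $p(x; w \mid h^{(t)})$. The gradient in Equation (2) with respect to $w$ of the log-likelihood for one training example $x^{(0)}$ is approximated by:

$$CD_k(w, x^{(0)}) = -\sum_h p(h; w \mid x^{(0)}) \frac{\partial \varepsilon(x^{(0)}, h; w)}{\partial w} + \sum_h p(h; w \mid x^{(k)}) \frac{\partial \varepsilon(x^{(k)}, h; w)}{\partial w}. \tag{3}$$

The expectation of CD (ECD) can be written as:

$$ECD_k(w, x^{(0)}) = -\sum_h p(h; w \mid x^{(0)}) \frac{\partial \varepsilon(x^{(0)}, h; w)}{\partial w} + \sum_{\tilde{x},h} p_k(\tilde{x}, h; w) \frac{\partial \varepsilon(\tilde{x}, h; w)}{\partial w}, \tag{4}$$

where $p_k(\tilde{x}, h; w)$ is the empirical distribution function on the samples obtained by starting from the data $x^{(0)}$ and running the Markov chain forward for k steps:

$$p_k(\tilde{x}, h; w) = p(h; w \mid \tilde{x})\, p(x^{(k)} = \tilde{x} \mid x^{(0)}).$$

In this paper, we consider the case where both $x$ and $h$ can only take a finite number of values. We assume that there is no pair $(x, h)$ such that $p(x \mid h; w) = 0$ or $p(h \mid x; w) = 0$. This ensures that the Markov chain associated with Gibbs sampling is irreducible, and there exists a unique stationary distribution to which the chain converges. We also assume that $\partial \varepsilon(x, h; w)/\partial w$ is bounded. Using the definitions of CD, ECD and the log-likelihood gradient, we obtain the following theorem.

Theorem 1. For a converging Gibbs chain $x^{(0)} \Rightarrow h^{(0)} \Rightarrow x^{(1)} \Rightarrow h^{(1)} \Rightarrow \cdots$ starting at data point $x^{(0)}$, the log-likelihood gradient can be written as:

$$\frac{\partial \log p(x^{(0)}; w)}{\partial w} = CD_k(w, x^{(0)}) + v_k + u_k,$$

where:

$$v_k = \sum_{x,h} p(x, h; w) \frac{\partial \varepsilon(x, h; w)}{\partial w} - \sum_{\tilde{x},h} p_k(\tilde{x}, h; w) \frac{\partial \varepsilon(\tilde{x}, h; w)}{\partial w},$$

$$u_k = \sum_{\tilde{x},h} p_k(\tilde{x}, h; w) \frac{\partial \varepsilon(\tilde{x}, h; w)}{\partial w} - \sum_h p(h; w \mid x^{(k)}) \frac{\partial \varepsilon(x^{(k)}, h; w)}{\partial w},$$

$E_{p(x^{(k)} \mid x^{(0)})}[u_k] = 0$, and $\|v_k\|$ converges to zero as k goes to infinity. Here, $\|w\| = (\sum_{i=1}^{n} w_i^2)^{1/2}$, and $\|\cdot\|$ stands for the Euclidean norm in $\mathbb{R}^n$.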
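Proof. Using Equations (2) and (4), we have:

$$\frac{\partial \log p(x^{(0)}; w)}{\partial w} - ECD_k(w, x^{(0)}) = \sum_{x,h} p(x, h; w) \frac{\partial \varepsilon(x, h; w)}{\partial w} - \sum_{\tilde{x},h} p_k(\tilde{x}, h; w) \frac{\partial \varepsilon(\tilde{x}, h; w)}{\partial w} = v_k.$$

Using Equations (2) and (3), we have:

$$\frac{\partial \log p(x^{(0)}; w)}{\partial w} - CD_k(w, x^{(0)}) = \sum_{x,h} p(x, h; w) \frac{\partial \varepsilon(x, h; w)}{\partial w} - \sum_h p(h; w \mid x^{(k)}) \frac{\partial \varepsilon(x^{(k)}, h; w)}{\partial w} = v_k + u_k.$$

We have:

$$E_{p(x^{(k)} \mid x^{(0)})}[u_k] = \sum_{\tilde{x},h} p_k(\tilde{x}, h; w) \frac{\partial \varepsilon(\tilde{x}, h; w)}{\partial w} - \sum_{\tilde{x}} p(x^{(k)} = \tilde{x} \mid x^{(0)}) \sum_h p(h; w \mid \tilde{x}) \frac{\partial \varepsilon(\tilde{x}, h; w)}{\partial w} = 0.$$

Using the definition of $p_k(\tilde{x}, h; w)$, we have:

$$\|v_k\| \le \sum_{\tilde{x},h} p(h; w \mid \tilde{x}) \left| p(\tilde{x}; w) - p(x^{(k)} = \tilde{x} \mid x^{(0)}) \right| \left\| \frac{\partial \varepsilon(\tilde{x}, h; w)}{\partial w} \right\|.$$

Since $\partial \varepsilon(\tilde{x}, h; w)/\partial w$ is bounded, $x$ and $h$ can only take a finite number of values, and $p(x^{(k)} = \tilde{x} \mid x^{(0)})$ converges to $p(\tilde{x}; w)$ for a converging Gibbs chain, $v_k$ converges to zero as k goes to infinity.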
The theorem is proven.
Theorem 1 gives the bias of the CD approximation of the log-likelihood gradient; the bias converges to zero as k goes to infinity. Theorem 1 also gives the approximation error of the CD approximation of the log-likelihood gradient. The error consists of two terms, $v_k$ and $u_k$: $v_k$ is the approximation error of the ECD approximation of the log-likelihood gradient (which is also the bias of the CD approximation), and $u_k$ is a stochastic term whose expectation is zero. Theorem 1 thus shows that ECD is closer to the log-likelihood gradient than CD.

Contrastive Divergence Algorithm for RBMs
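The RBM structure is a bipartite graph consisting of one layer of observable variables $x = (x_j)$ and one layer of hidden variables $h = (h_i)$. The model distribution is given by $p(x, h) = e^{-\varepsilon(x,h;w)}/Z(w)$, where $Z(w) = \sum_{x,h} e^{-\varepsilon(x,h;w)}$, and the energy function is given by:

$$\varepsilon(x, h; w) = -\sum_{i}\sum_{j} w_{ij} h_i x_j - \sum_{j} b_j x_j - \sum_{i} c_i h_i,$$

with $w_{ij}$, $b_j$ and $c_i$ being real-valued parameters, which are denoted by $w$.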
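The CD-k approximation can be sketched for an RBM in a few lines of NumPy. This is a minimal illustration under assumptions that the text does not state: the units are binary, the energy is $-h^\top W x - b^\top x - c^\top h$ with `W` of shape (hidden, visible), and only the gradient with respect to `W` is computed. The function names are ours.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_step(x, W, b, c, rng):
    """One Gibbs step: sample h ~ p(h | x), then a new x ~ p(x | h)."""
    p_h = sigmoid(c + W @ x)                     # p(h_i = 1 | x)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_x = sigmoid(b + W.T @ h)                   # p(x_j = 1 | h)
    return (rng.random(p_x.shape) < p_x).astype(float)

def cd_k_weight_gradient(x0, W, b, c, k, rng):
    """CD-k estimate of d log p(x0) / dW for a binary RBM (assumed setting)."""
    xk = x0.copy()
    for _ in range(k):
        xk = gibbs_step(xk, W, b, c, rng)
    # With d(energy)/dW = -h x^T, the two terms of CD-k become
    # E[h | x0] x0^T (data term) minus E[h | xk] xk^T (chain term).
    positive = np.outer(sigmoid(c + W @ x0), x0)
    negative = np.outer(sigmoid(c + W @ xk), xk)
    return positive - negative

rng = np.random.default_rng(0)
n_hidden, n_visible = 10, 12
W = 0.1 * rng.standard_normal((n_hidden, n_visible))  # illustrative initialization
b = np.zeros(n_visible)
c = np.zeros(n_hidden)
x0 = rng.integers(0, 2, size=n_visible).astype(float)
grad = cd_k_weight_gradient(x0, W, b, c, k=1, rng=rng)
```

With `k=0` the chain never leaves `x0`, so the two terms cancel and the estimate is exactly zero, which is a quick sanity check on the sign conventions.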
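There are some theoretical results about the CD algorithm for training RBMs [6,8-10,12]. The theoretical results from Bengio and Delalleau [6] give a good understanding of the CD approximation and the corresponding bias by showing that the log-likelihood gradient can, based on a Markov chain, be expressed as a sum of terms containing the k-th sample:

Theorem 2. (Bengio and Delalleau, 2009) For a converging Gibbs chain $x^{(0)} \Rightarrow h^{(0)} \Rightarrow x^{(1)} \Rightarrow h^{(1)} \Rightarrow \cdots$ starting at data point $x^{(0)}$, the log-likelihood gradient can be written as:

$$\frac{\partial \log p(x^{(0)}; w)}{\partial w} = -\sum_h p(h; w \mid x^{(0)}) \frac{\partial \varepsilon(x^{(0)}, h; w)}{\partial w} + E_{p(x^{(k)} \mid x^{(0)})}\left[ \sum_h p(h; w \mid x^{(k)}) \frac{\partial \varepsilon(x^{(k)}, h; w)}{\partial w} \right] + E_{p(x^{(k)} \mid x^{(0)})}\left[ \frac{\partial \log p(x^{(k)}; w)}{\partial w} \right], \tag{6}$$

and the final term converges to zero as k goes to infinity.
The first two terms in Equation (6) correspond to the expectation of CD (ECD), and the bias of the CD approximation of the log-likelihood gradient is given by the final term; Fischer and Igel gave a bound on this bias [13]. Theorem 2 gives the bias of the CD approximation of the log-likelihood gradient for RBMs, whereas Theorem 1 gives it for the general energy model. In addition, Theorem 1 gives the approximation error of the CD approximation of the log-likelihood gradient. Theorem 2 can therefore be considered a corollary of Theorem 1, which we prove next.

Theorem 3. Theorem 2 is a corollary of Theorem 1.
Proof. To prove that Theorem 2 is a corollary of Theorem 1, it is enough to prove:

$$E_{p(x^{(k)} \mid x^{(0)})}\left[ \frac{\partial \log p(x^{(k)}; w)}{\partial w} \right] = \sum_{x,h} p(x, h; w) \frac{\partial \varepsilon(x, h; w)}{\partial w} - \sum_{\tilde{x},h} p_k(\tilde{x}, h; w) \frac{\partial \varepsilon(\tilde{x}, h; w)}{\partial w},$$

where the right-hand side is the bias term $v_k$ of Theorem 1 and $p_k(\tilde{x}, h; w) = p(h; w \mid \tilde{x})\, p(x^{(k)} = \tilde{x} \mid x^{(0)})$. Taking conditional expectations with respect to $p(x^{(k)} \mid x^{(0)})$ in Equation (2) evaluated at $x^{(k)}$,

$$E_{p(x^{(k)} \mid x^{(0)})}\left[ \frac{\partial \log p(x^{(k)}; w)}{\partial w} \right] = -E_{p(x^{(k)} \mid x^{(0)})}\left[ \sum_h p(h; w \mid x^{(k)}) \frac{\partial \varepsilon(x^{(k)}, h; w)}{\partial w} \right] + \sum_{x,h} p(x, h; w) \frac{\partial \varepsilon(x, h; w)}{\partial w}.$$

Since:

$$E_{p(x^{(k)} \mid x^{(0)})}\left[ \sum_h p(h; w \mid x^{(k)}) \frac{\partial \varepsilon(x^{(k)}, h; w)}{\partial w} \right] = \sum_{\tilde{x},h} p_k(\tilde{x}, h; w) \frac{\partial \varepsilon(\tilde{x}, h; w)}{\partial w},$$

we have:

$$E_{p(x^{(k)} \mid x^{(0)})}\left[ \frac{\partial \log p(x^{(k)}; w)}{\partial w} \right] = v_k.$$

The proof is completed.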
Using Theorem 1, we have the following corollary.
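Corollary 1. For a converging Gibbs chain $x^{(0)} \Rightarrow h^{(0)} \Rightarrow x^{(1)} \Rightarrow h^{(1)} \Rightarrow \cdots$ starting at data point $x^{(0)}$, the log-likelihood gradient can be written as:

$$\frac{\partial \log p(x^{(0)}; w)}{\partial w} = -\sum_h p(h; w \mid x^{(0)}) \frac{\partial \varepsilon(x^{(0)}, h; w)}{\partial w} + \sum_h p(h; w \mid x^{(k)}) \frac{\partial \varepsilon(x^{(k)}, h; w)}{\partial w} + u_k + E_{p(x^{(k)} \mid x^{(0)})}\left[ \frac{\partial \log p(x^{(k)}; w)}{\partial w} \right], \tag{7}$$

where $u_k$ is defined in Theorem 1, and the final term converges to zero as k goes to infinity.
The first two terms in Equation (7) correspond to the CD approximation, and the approximation error of the CD approximation of the log-likelihood gradient for RBMs is given by the final two terms.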
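For an RBM small enough to enumerate, the bias of CD-k can be computed exactly rather than estimated, which makes its decay with k visible. The sketch below is an illustration under our own assumptions (binary units, gradient with respect to the weights only, hypothetical function names): it builds the Gibbs transition matrix over visible states, raises it to the k-th power to get $p(x^{(k)} \mid x^{(0)})$, and returns the norm of the difference between the log-likelihood gradient and ECD.

```python
import itertools
import numpy as np

def enumerate_states(n):
    """All binary vectors of length n, one per row."""
    return np.array(list(itertools.product([0.0, 1.0], repeat=n)))

def cd_bias_norm(W, b, c, x0_index, k):
    """Norm of (gradient - ECD_k) w.r.t. W, computed exactly by enumeration."""
    n_hidden, n_visible = W.shape
    X = enumerate_states(n_visible)                 # (2^m, m)
    H = enumerate_states(n_hidden)                  # (2^n, n)
    # energy[a, j] = -h^T W x - b^T x - c^T h for x = X[a], h = H[j]
    energy = -(X @ W.T @ H.T) - (X @ b)[:, None] - (H @ c)[None, :]
    joint = np.exp(-energy)
    joint /= joint.sum()                            # p(x, h)
    p_h_given_x = joint / joint.sum(axis=1, keepdims=True)
    p_x_given_h = joint / joint.sum(axis=0, keepdims=True)
    # Gibbs transition over visible states: T[a, a2] = sum_h p(h|x_a) p(x_a2|h)
    T = p_h_given_x @ p_x_given_h.T
    p_k = np.linalg.matrix_power(T, k)[x0_index]    # p(x^(k) = . | x^(0))
    expected_h = p_h_given_x @ H                    # E[h | x] for every x
    model_term = H.T @ joint.T @ X                  # E_p[h x^T]
    chain_term = expected_h.T @ (p_k[:, None] * X)  # E_{p_k}[E[h|x] x^T]
    return np.linalg.norm(chain_term - model_term)

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))                     # 2 hidden, 3 visible units
b, c = np.zeros(3), np.zeros(2)
norms = [cd_bias_norm(W, b, c, x0_index=0, k=k) for k in (1, 5, 200)]
```

Because the chain distribution converges to the model distribution, the last entry of `norms` is numerically zero, in line with the statement that the final term vanishes as k goes to infinity. The enumeration costs $2^m \cdot 2^n$ memory, so this is only practical for toy sizes.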

Average Contrastive Divergence Algorithm
Empirical comparisons of the CD approximation and the true log-likelihood gradient for RBMs show that the bias can lead to convergence to parameters that do not reach the maximum likelihood. More recently proposed learning algorithms try to obtain better approximations of the log-likelihood gradient [14-18]. In this section, we propose a new algorithm for training RBMs. From Section 2, we know that ECD is closer to the log-likelihood gradient than the traditional CD. Unfortunately, in practical problems ECD cannot be computed, just as the log-likelihood gradient cannot. However, the average of samples of a random variable approximates the expectation of that random variable. Hence, we look for a quantity that approximates ECD. This leads to our new learning algorithm, called average contrastive divergence (ACD).
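The idea of average contrastive divergence learning (ACD-k-l) is as follows: approximate the second term in the log-likelihood gradient by the average of l samples from the k-step Gibbs distribution. The samples from the k-step Gibbs distribution are obtained in the same way for ACD and CD. The Gibbs chain is initialized with a training datum $x^{(0)}$ from the training set and yields the sample $x^{(k)}$ after k steps (each step t consists of sampling $h^{(t)}$ from $p(h; w \mid x^{(t)})$ and subsequently sampling $x^{(t+1)}$ from $p(x; w \mid h^{(t)})$). The k-step Gibbs chain is repeated l times, which gives the samples $x^{(k,1)}, x^{(k,2)}, \ldots, x^{(k,l)}$. The gradient in Equation (2) with respect to $w$ of the log-likelihood for the training datum $x^{(0)}$ is approximated by:

$$ACD_{k,l}(w, x^{(0)}) = -\sum_h p(h; w \mid x^{(0)}) \frac{\partial \varepsilon(x^{(0)}, h; w)}{\partial w} + \frac{1}{l} \sum_{i=1}^{l} \sum_h p(h; w \mid x^{(k,i)}) \frac{\partial \varepsilon(x^{(k,i)}, h; w)}{\partial w}. \tag{8}$$

To further understand the ACD algorithm, we give the bias of the ACD-k-l approximation of the log-likelihood gradient in the following theorem.

Theorem 4. For a converging Gibbs chain $x^{(0)} \Rightarrow h^{(0)} \Rightarrow x^{(1)} \Rightarrow h^{(1)} \Rightarrow \cdots$ starting at data point $x^{(0)}$, the log-likelihood gradient can be written as:

$$\frac{\partial \log p(x^{(0)}; w)}{\partial w} = ACD_{k,l}(w, x^{(0)}) + v_k + z_k,$$

where:

$$z_k = \sum_{\tilde{x},h} p_k(\tilde{x}, h; w) \frac{\partial \varepsilon(\tilde{x}, h; w)}{\partial w} - \frac{1}{l} \sum_{i=1}^{l} \sum_h p(h; w \mid x^{(k,i)}) \frac{\partial \varepsilon(x^{(k,i)}, h; w)}{\partial w},$$

$v_k$ is defined in Theorem 1, $E_{p(x^{(k)} \mid x^{(0)})}[z_k] = 0$, and $v_k$ converges to zero as k goes to infinity.
Proof. Using Equations (2) and (8), we have:

$$\frac{\partial \log p(x^{(0)}; w)}{\partial w} - ACD_{k,l}(w, x^{(0)}) = \sum_{x,h} p(x, h; w) \frac{\partial \varepsilon(x, h; w)}{\partial w} - \frac{1}{l} \sum_{i=1}^{l} \sum_h p(h; w \mid x^{(k,i)}) \frac{\partial \varepsilon(x^{(k,i)}, h; w)}{\partial w} = v_k + z_k,$$

where:

$$z_k = \sum_{\tilde{x},h} p_k(\tilde{x}, h; w) \frac{\partial \varepsilon(\tilde{x}, h; w)}{\partial w} - \frac{1}{l} \sum_{i=1}^{l} \sum_h p(h; w \mid x^{(k,i)}) \frac{\partial \varepsilon(x^{(k,i)}, h; w)}{\partial w}.$$

We have:

$$E[z_k] = \sum_{\tilde{x},h} p_k(\tilde{x}, h; w) \frac{\partial \varepsilon(\tilde{x}, h; w)}{\partial w} - \frac{1}{l} \sum_{i=1}^{l} E\left[ \sum_h p(h; w \mid x^{(k,i)}) \frac{\partial \varepsilon(x^{(k,i)}, h; w)}{\partial w} \right].$$

Since $x^{(k,i)}$ and $x^{(k)}$ have the same distribution, we have:

$$E\left[ \sum_h p(h; w \mid x^{(k,i)}) \frac{\partial \varepsilon(x^{(k,i)}, h; w)}{\partial w} \right] = \sum_{\tilde{x},h} p_k(\tilde{x}, h; w) \frac{\partial \varepsilon(\tilde{x}, h; w)}{\partial w}, \quad i = 1, \ldots, l.$$

Then, we have:

$$E[z_k] = 0.$$

By the proof of Theorem 1, $v_k$ converges to zero as k goes to infinity. The theorem is proven.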
The theorem gives the bias of the ACD approximation of the log-likelihood gradient; the bias is $v_k$, which converges to zero as k goes to infinity. The theorem also gives the approximation error of the ACD approximation of the log-likelihood gradient, denoted by $\mathrm{Error}_{ACD}$, which is $\|v_k + z_k\|$. From Theorem 1, the approximation error of the CD approximation of the log-likelihood gradient, denoted by $\mathrm{Error}_{CD}$, is $\|v_k + u_k\|$. The following theorem gives the relationship between $\mathrm{Error}_{ACD}$ and $\mathrm{Error}_{CD}$.
Proof. Using the definition of $z_k$, we have: The fourth and fifth equalities made use of the fact that $x^{(k)}$ and $x^{(k,i)}$ are independent and identically distributed random variables.
Using the definition of $u_k$, we have: Then, we have: According to Theorems 1 and 4, we have: Then, we have: The theorem is proven.
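The effect of averaging on the two errors can be sketched with a short second-moment computation. This is an illustrative reconstruction, not necessarily the exact statement proved above. It assumes that the l chains are run independently given $x^{(0)}$, so that $z_k$ is the average of l independent copies of $u_k$, and it takes all expectations over the chain samples, with $v_k$ fixed given $x^{(0)}$.

```latex
\begin{aligned}
\mathbb{E}\,\|v_k + u_k\|^2 &= \|v_k\|^2 + \mathbb{E}\,\|u_k\|^2
  && \text{(cross term vanishes since } \mathbb{E}[u_k] = 0\text{)}, \\
\mathbb{E}\,\|v_k + z_k\|^2 &= \|v_k\|^2 + \mathbb{E}\,\|z_k\|^2
   = \|v_k\|^2 + \tfrac{1}{l}\,\mathbb{E}\,\|u_k\|^2
  && \text{(average of } l \text{ i.i.d. zero-mean terms)}, \\
\mathbb{E}\,\|v_k + u_k\|^2 - \mathbb{E}\,\|v_k + z_k\|^2
  &= \left(1 - \tfrac{1}{l}\right) \mathbb{E}\,\|u_k\|^2 \;\ge\; 0 .
\end{aligned}
```

Under these assumptions the bias part $\|v_k\|^2$ is shared by both estimators, and only the stochastic part shrinks, by a factor of $1/l$.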
Intuitively, the smaller the approximation error of the log-likelihood gradient estimate, the higher the chance of converging quickly to a maximum likelihood solution. Still, even small deviations in a few gradient components can degrade the learning process. An important goal when proposing a new learning algorithm is therefore to obtain a better approximation of the log-likelihood gradient. From Theorems 1 and 4, we know that ACD and CD have the same bias. Theorem 5 gives the relationship between $\mathrm{Error}_{CD}$ and $\mathrm{Error}_{ACD}$. Since $l \ge 1$, and by the definition of $\|\cdot\|$, the value of $\mathrm{Error}^2_{CD}$ is not smaller than that of $\mathrm{Error}^2_{ACD}$ with probability one. This conclusion shows that ACD is a better approximation than the traditional CD.
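The ACD-k-l estimate and its reduced spread can be illustrated with a short NumPy sketch. As before, this assumes binary units and the weight gradient only, and the function names are ours rather than the authors'. The l chains are run as one batch, and `l = 1` recovers CD-k.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_chain_ends(x0, W, b, c, k, l, rng):
    """Run l independent k-step Gibbs chains from x0; return the l end states."""
    x = np.tile(x0, (l, 1))                              # (l, visible)
    for _ in range(k):
        p_h = sigmoid(c + x @ W.T)                       # (l, hidden)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_x = sigmoid(b + h @ W)                         # (l, visible)
        x = (rng.random(p_x.shape) < p_x).astype(float)
    return x

def acd_weight_gradient(x0, W, b, c, k, l, rng):
    """ACD-k-l estimate of d log p(x0) / dW (binary RBM assumed)."""
    xk = sample_chain_ends(x0, W, b, c, k, l, rng)
    positive = np.outer(sigmoid(c + W @ x0), x0)
    negative = sigmoid(c + xk @ W.T).T @ xk / l          # average over the l chains
    return positive - negative

rng = np.random.default_rng(0)
W = 0.5 * rng.standard_normal((10, 12))                  # illustrative parameters
b, c = np.zeros(12), np.zeros(10)
x0 = rng.integers(0, 2, size=12).astype(float)

def total_variance(l, repeats=300):
    """Summed per-component variance of the estimate over repeated runs."""
    grads = np.stack([acd_weight_gradient(x0, W, b, c, 1, l, rng)
                      for _ in range(repeats)])
    return grads.var(axis=0).sum()

var_cd = total_variance(l=1)     # CD-1
var_acd = total_variance(l=10)   # ACD-1-10
```

Both estimators share the same expectation, so the difference between `var_cd` and `var_acd` reflects only the stochastic term; the bias is unaffected by l.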

Experiments
This section presents some experiments illustrating the ACD algorithm. In the first two experiments, we train an RBM with 12 visible units and 10 hidden units, so that the log-likelihood gradient can be calculated exactly. Then, in the third experiment, we consider the Modified National Institute of Standards and Technology (MNIST) task using an RBM with 500 hidden units.

The Artificial Data
Popular methods to train RBMs include CD and persistent contrastive divergence (PCD), which is also known as stochastic maximum likelihood [14,19]. Since ACD, CD and PCD are all biased with respect to the log-likelihood gradient, we investigate the approximation errors of these algorithms empirically. In our experiments, ACD, CD, PCD and the log-likelihood gradient are tested under exactly the same conditions (unless otherwise stated). The log-likelihood gradient is intractable for regular-sized RBMs, because its complexity is exponential in the size of the smallest layer, so we consider a small RBM with 12 visible units and 10 hidden units in this section. We randomly generate 100 data points and use 10 data points in each gradient estimate. To illustrate Theorem 5, we report the square of the approximation error (the approximation error itself gives the same results). We also set the bias parameters $c_i = b_j = 0$ for all i and j; the learning rate is 0.01.
It is known that CD-k moves closer to the log-likelihood gradient as k grows. For the same number of iterations, ACD-1-k and CD-k have the same computational complexity. We give the results for 10 iterations. More iterations could be considered at the cost of more training time; however, 10 iterations are enough to illustrate the approximation error of these algorithms. Figure 1 shows the approximation errors of ACD and CD. The approximation error of ACD is smaller than that of CD, so ACD is a better approximation of the log-likelihood gradient than CD; these experimental results are consistent with the conclusion of Theorem 5. For the same number of iterations, the computational complexity of CD-20 is greater than that of ACD-1-10; however, Figure 1 shows that the approximation error of ACD-1-10 is smaller than that of CD-20, and even ACD-1-2 has a smaller approximation error than CD-20. One may notice that the approximation error is very small when the number of iterations is small. The reason is that all algorithms are tested under exactly the same conditions, with the same initial parameter values. Figure 2 shows the approximation errors of ACD and PCD. The results are similar: ACD has smaller approximation errors than PCD.
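For an RBM of this size, the exact log-likelihood gradient can be obtained by summing over the $2^{10}$ hidden configurations only, since the visible units can be summed out analytically. The sketch below shows one way to do this. It assumes binary units, uniformly random binary data and a small Gaussian weight initialization, none of which are specified in the text, and the function name is hypothetical.

```python
import itertools
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def exact_weight_gradient(batch, W, b, c):
    """Exact mean log-likelihood gradient w.r.t. W over a batch (binary RBM)."""
    n_hidden = W.shape[0]
    H = np.array(list(itertools.product([0.0, 1.0], repeat=n_hidden)))  # (2^n, n)
    act = b + H @ W                                   # visible pre-activations per h
    # log of unnormalized p(h): c.h + sum_j log(1 + exp(b_j + (W^T h)_j))
    log_p_h = H @ c + np.logaddexp(0.0, act).sum(axis=1)
    p_h = np.exp(log_p_h - log_p_h.max())
    p_h /= p_h.sum()
    model_term = (H * p_h[:, None]).T @ sigmoid(act)  # E_p[h x^T]
    data_term = sigmoid(c + batch @ W.T).T @ batch / len(batch)
    return data_term - model_term

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(100, 12)).astype(float)  # assumed: uniform binary data
W = 0.1 * rng.standard_normal((10, 12))                  # assumed initialization
b, c = np.zeros(12), np.zeros(10)                        # biases fixed at zero
learning_rate = 0.01
batch = data[:10]                                        # 10 points per gradient estimate
W_new = W + learning_rate * exact_weight_gradient(batch, W, b, c)
```

The squared approximation error of an estimator at a given step would then be the squared norm of its gradient estimate minus `exact_weight_gradient(batch, W, b, c)`. The cost grows as $2^n$ in the number of hidden units, which is why this reference is only available for small models.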

The MNIST Task
The dataset is the MNIST dataset of handwritten digit images [20]. The images are 28 by 28 pixels, and the dataset consists of 60,000 training cases and 10,000 test cases. We use the mini-batch strategy for learning, using only a small number of training cases for each gradient estimate. We used 100 training points in each mini-batch for most datasets. Following [14,18,21,22], we set the number of hidden units to 500 in our experiments. One evaluation criterion is how well the learned RBM models the test data, i.e., the test log-likelihood. This is intractable for regular-sized RBMs, because the time complexity of that computation is exponential in the size of the smallest layer (visible or hidden). Salakhutdinov and Murray [23] showed that a Monte Carlo-based method, annealed importance sampling (AIS), can be used to efficiently estimate the normalization constant Z of RBMs [16,23-26]. We adopt AIS in our experiments as well.
The CD algorithm and the PCD algorithm have become two popular methods for training RBMs. Tieleman and Hinton proposed an improved PCD algorithm called fast PCD (FPCD) [15]. The FPCD algorithm attempts to improve upon PCD's mixing properties by introducing a group of additional parameters, called fast parameters, that are only used for sampling. With these fast parameters, FPCD tries to escape any single mode of the distribution and achieves better results in approximating the RBM gradient. We consider the CD-1 algorithm, the CD-10 algorithm, the PCD algorithm, the FPCD algorithm and the ACD-1-10 algorithm for the MNIST task. The results on the MNIST task are shown in Figure 3, which gives the average log-likelihood on the test dataset. A lower average log-likelihood on the test dataset indicates a larger effect of the gradient approximation error on learning. It is clear that ACD-1-10 outperforms CD-1, CD-10, PCD and FPCD. In the initial stages of training, the result of ACD-1-10 is close to those of the other algorithms; as training time increases, ACD-1-10 performs better than the other algorithms.
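Once an estimate of $\log Z$ is available, the average test log-likelihood reduces to a free-energy computation. The sketch below assumes binary units and uses hypothetical function names. It does not implement AIS: for a small model, `exact_log_z` stands in for the AIS estimate that would be needed with 500 hidden units.

```python
import itertools
import numpy as np

def free_energy(X, W, b, c):
    """F(x) such that p(x) = exp(-F(x)) / Z for a binary RBM; one row of X per x."""
    return -(X @ b) - np.logaddexp(0.0, c + X @ W.T).sum(axis=1)

def exact_log_z(W, b, c):
    """log Z by enumerating hidden states; feasible only for few hidden units."""
    H = np.array(list(itertools.product([0.0, 1.0], repeat=W.shape[0])))
    log_terms = H @ c + np.logaddexp(0.0, b + H @ W).sum(axis=1)
    m = log_terms.max()
    return m + np.log(np.exp(log_terms - m).sum())   # stable log-sum-exp

def average_log_likelihood(X, W, b, c, log_z):
    """Mean of log p(x) over the rows of X, given an estimate of log Z."""
    return np.mean(-free_energy(X, W, b, c) - log_z)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))                      # toy model: 3 hidden, 4 visible
b, c = rng.standard_normal(4), rng.standard_normal(3)
X_test = rng.integers(0, 2, size=(20, 4)).astype(float)
log_z = exact_log_z(W, b, c)
avg_ll = average_log_likelihood(X_test, W, b, c, log_z)
```

Any error in the estimate of $\log Z$ shifts the reported average log-likelihood by the same amount for every test point, so comparisons between training algorithms depend on estimating $\log Z$ consistently for each trained model.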