Maximum Entropy Learning with Deep Belief Networks

Conventionally, the maximum likelihood (ML) criterion is applied to train a deep belief network (DBN). We present a maximum entropy (ME) learning algorithm for DBNs, designed specifically to handle limited training data. Maximizing only the entropy of parameters in the DBN allows more effective generalization capability, less bias towards data distributions, and robustness to over-fitting compared to ML learning. Results on text classification and object recognition tasks demonstrate that the ME-trained DBN outperforms the ML-trained DBN when training data is limited.


Introduction
Understanding how a nervous system computes requires determining the input, the output, and the transformations necessary to convert the input into the desired output [1]. Artificial neural networks are a conceptual framework that provides insight into how these transformations are carried out, and they have also played a crucial role in the success of many pattern recognition tasks such as handwriting [2] and object [3] detection. An important feature of neural networks is their ability to capture the underlying regularities in a task domain by representing the input with multiple layers of active neurons. This distributed representation of the input is based on the hierarchical processing and information flow of biological systems [4,5]. In a multi-layered network, complex internal representations can also be constructed by repeatedly adjusting the weights of the connections in order to ensure that the output is close to the desired output [6]. The method of weight adjustment is called "back-propagation" and is interpreted as a maximum likelihood (ML) procedure if each output is specified by a conditional probability distribution [7]. This learning procedure generalizes correctly to new cases after sufficient training with an external supervisor; however, the learning time scales poorly and is not incremental in nature [8]. In contrast, humans learn concepts with richer representations from fewer examples [9,10] and can even generate their own new examples using top-down processing of high-level representations to disambiguate low-level perceptions [11]. Therefore, the long-term goal is to build deep [12] hierarchical structures [13] that are highly flexible [14], efficient to train, and provide generative models that can produce examples with the same probability distribution as the examples they are shown [15].
Recent advances in deep learning have generated much interest in hierarchical generative models such as Deep Belief Networks (DBNs). Although the increased depth of deep neural networks (DNNs) has led to significant performance gains, training becomes difficult where the cost surface is non-convex and high-dimensional with many local minima [16]. Therefore, training is often carried out in two phases to (1) initialize and then (2) fine-tune the weights of deep architectures. The first phase involves greedy layer-wise unsupervised pretraining where each layer of a DBN learns a non-linear transformation of its own input that captures the main variation in its input [17-21]. The DBN is constructed from multiple layers of feature detectors, each characterized by a restricted Boltzmann machine (RBM). Each RBM uses the Contrastive Divergence (CD) [22] approximation of the log-likelihood gradient to accelerate the learning process. By pairing each feed-forward layer with a feed-back layer that reconstructs the input of the layer from its output, the generative model guarantees that most of the information contained in the input is preserved in the output of the layer. After initializing weights, a logistic layer is placed on top to perform discriminative tasks. The second phase involves fine-tuning the combined DBN-DNN with a supervised training criterion for gradient-based optimization of the prediction error. The generative DBN pretraining in the first phase is thought to benefit discriminative DNNs by discovering latent variables [23], providing marginal conditioning [24], disentangling factors of variation [25], optimizing the hidden layers [26], constraining parameter spaces to basins of attraction [16], and providing a regularization mechanism [27] that minimizes variance and introduces a bias towards configurations that are useful for unsupervised learning. The most important conclusion [28] from all of these studies is that DBN pretraining improves robustness compared to random initialization, especially with increasing network depth [24] and at the lower layers of the DNN where the gradient information becomes less informative as it is back-propagated through more layers (referred to as the problem of vanishing gradients [29]).
The present study explores whether the maximum entropy (ME) principle can improve generative DBN architectures for pretraining DNNs. The overriding principle of ME is that distributions should be as uniform and unbiased as possible when dealing with estimation problems given incomplete knowledge or data [30]. ME probability distributions have been widely reported in both living neural networks and artificial neural networks to account for spatial and temporal correlation structure. When applied to neuronal data, ME models have quantitatively described the activity in biological networks [31], ensembles [32], connectivity matrices [33], spike trains [34], and receptive fields [35]. When applied to artificial neural networks, the ME principle has been explored with source separation [36], associative memory [37], signal reconstruction [38], and energy contrast functions [39]. The ME approach is especially attractive in neural networks since the high parallelism spreads the computational costs over the units to overcome the problems of low speeds and robustness [40,41]. The ME model can also be extended as a weight matrix that directly connects the input and output layers [42], and DNNs were recently considered as an ME classifier in which the feature functions are learned (because learning the parameters in the softmax layer is equivalent to training a conditional ME model on features of the input vector) [43,44]. In this paper, we specifically analyze how ME learning can be further incorporated into deep architectures by expanding on our preliminary DBN pretraining work [45], which was motivated by a seminal study comparing the benefits of latent maximum entropy in a Boltzmann Machine [46].
Traditionally, the ML learning algorithm is used with RBMs and DBNs [47,48] (denoted ML-RBM and ML-DBN hereinafter). The basic principle of the ML method [49] is that the event with the greatest probability is most likely to occur, so parameters should be chosen that maximize the probability of the observed sample data. ML estimation possesses several attractive properties, including consistency, asymptotic normality, and efficiency when the sample size tends to infinity [50]. However, DNNs that are pretrained with ML-DBNs are prone to be over-trained and mismatched with the test environments [51]. In addition, ML-RBMs provide no guarantees that the features implemented by their hidden layers will be related to the supervised learning task [52]. Furthermore, ML-RBMs can result in overcomplete representations with redundant hidden units [53]. Previous research has addressed the deficiencies of ML-DBNs and ML-RBMs by developing regularization to compensate for small perturbations over training samples [51], introducing discriminative components to improve their performance as a classifier [52], adding terms to enforce sparse solutions [54], and even replacing ML-DBNs with ML deep Boltzmann machines (DBMs) [55]. Rather than adhering to ML learning methods, our previous work [45] proposed ME learning for RBMs and DBNs (denoted ME-RBM and ME-DBN hereinafter), as the principle is known to be effective for combining different sources of evidence from complex and unstructured natural systems [56]. Due to the ability to fit the data without making specific assumptions or having to commit extensively to unseen data, we hypothesized that pretraining with generative ME-DBNs would enable the resulting discriminative DNNs to be less biased towards data distributions and more robust to over-fitting issues compared to those pretrained by ML-DBNs. The preliminary results in several object recognition tasks showed that ME-DBNs outperformed ML-DBNs, with noticeable gains when the amount of training data was limited. These conclusions were particularly encouraging as it is known that the outcome of discriminative training often relies on the amount of available training data [52], and in many real-world applications it is impossible or expensive to collect labeled training data for rebuilding models [57-59].
The most difficult problem in building network architectures is discovering parallel organizations that do not require much problem-dependent information [15], and the ultimate goal of machine learning research is to produce methods that are capable of learning complex behaviors without human intervention or prior knowledge [14]. Moreover, generalization from sparse data is central for human concept-learning [60]. Neural recordings from living neural networks have shown that the ME approach uniquely specifies higher-order neuronal correlations in a hierarchical organization [31] by assuming that the best model fits the correlations found in the data up to that particular level and is maximally unconstrained for all higher-order correlations [32]. Likewise, generative DBN models are intended to characterize higher-order correlation properties, as each layer captures the complicated and higher-order correlations between the activities of hidden features in the layer below. In fact, our original ME-DBN learning algorithm can be expanded by incorporating constraints [61] such as weight decay and sparsity during the training process (denoted as constrained maximum entropy (CME) for CME-RBM and CME-DBN hereinafter). Since more complex interactions make it less obvious how to construct correlation functions that capture the underlying order among neurons in a network [62], the analysis of ME and CME learning in DBNs could provide insight into the multitude of subsystems in the nervous system that acquire, process, and communicate information to recode the incoming signals into a more efficient form [63]. In this paper, results in both object recognition and text classification demonstrate that DNNs initialized by ME-DBN outperform DNNs initialized by ML-DBN, suggesting that ME-DBN provides a better initial starting point for DNN classifiers. Towards the goal of an "efficient coding hypothesis" [64], we also show that the inclusion of weight decay and sparsity constraints in CME-DBNs enhances ME learning even further compared to ML learning when training data is limited.
The rest of this paper is organized as follows: Section 2 introduces the relevant background. Section 3 presents the proposed ME and CME algorithms with derivations and justifications. Section 4 presents the experimental setup and results. Section 5 concludes this work.

Preliminaries and Background
This section reviews background related to RBM, DBN, and the proposed ME learning algorithm.

Restricted Boltzmann Machine
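The RBM is an undirected bipartite graphical model whose probability distribution is governed by an energy function, $E$, with the parameter set $\lambda = \{W, b, c\}$:

$$E(v, h; \lambda) = -\sum_{i=1}^{V}\sum_{j=1}^{H} w_{ij} v_i h_j - \sum_{i=1}^{V} c_i v_i - \sum_{j=1}^{H} b_j h_j \tag{1}$$

where $v_i$ and $h_j$ are the binary states of visible unit $i$ and hidden unit $j$; $c_i$ and $b_j$ are their corresponding biases; $w_{ij}$ is the weight between visible unit $i$ and hidden unit $j$; and $V$ and $H$ are the numbers of visible and hidden units, respectively. The bipartite connection of the graph makes it easy to infer the hidden states given the visible states. The joint probability for $\{v, h\}$ is formulated as follows:

$$p(v, h; \lambda) = \frac{\exp(-E(v, h; \lambda))}{Z} \tag{2}$$

where $Z$ is the partition function, $Z = \sum_{v,h} \exp(-E(v, h; \lambda))$, which ensures that the function in Equation (2) is a valid probability function. The conditional probabilities of the RBM are:

$$p(h_j = 1 \mid v; \lambda) = \mathrm{sigm}\Big(\sum_{i=1}^{V} w_{ij} v_i + b_j\Big), \qquad p(v_i = 1 \mid h; \lambda) = \mathrm{sigm}\Big(\sum_{j=1}^{H} w_{ij} h_j + c_i\Big) \tag{3}$$

where $\mathrm{sigm}(\cdot)$ is the sigmoid function, $\mathrm{sigm}(x) = (1 + \exp(-x))^{-1}$. Conventionally, the ML criterion is utilized to compute the parameters in the RBM. The log probability of the training data, given the model parameters, can be derived as follows: $\log p(v; \lambda) = \log \sum_{h} p(v, h; \lambda)$. Taking the partial derivative of $\log p(v; \lambda)$ with respect to the parameter set $\lambda$ yields

$$\frac{\partial \log p(v; \lambda)}{\partial \lambda} = -\Big\langle \frac{\partial E(v, h; \lambda)}{\partial \lambda} \Big\rangle_{p(h \mid v; \lambda)} + \Big\langle \frac{\partial E(v, h; \lambda)}{\partial \lambda} \Big\rangle_{p(v, h; \lambda)} \tag{4}$$

From Equation (4), the parameter updating rule becomes:

$$w_{ij} \leftarrow w_{ij} + \eta \big( \langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{model}} \big) \tag{5}$$

with analogous updates for the biases, where $\eta$ denotes the step size, $\langle \cdot \rangle_{\mathrm{data}}$ denotes the empirical expectation, and $\langle \cdot \rangle_{\mathrm{model}}$ denotes the expectation with respect to the model distribution.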
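The energy function and the conditional probabilities described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names (`energy`, `p_h_given_v`, `p_v_given_h`), the toy sizes, and the parameter values are assumptions made for the example. It follows the convention of the text, where `c` holds the visible biases and `b` the hidden biases.

```python
import numpy as np

def sigm(x):
    """Logistic sigmoid, sigm(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def energy(v, h, W, b, c):
    """E(v, h) = -v^T W h - c^T v - b^T h (c: visible biases, b: hidden biases)."""
    return -(v @ W @ h) - c @ v - b @ h

def p_h_given_v(v, W, b):
    """p(h_j = 1 | v) = sigm(b_j + sum_i w_ij v_i), one entry per hidden unit."""
    return sigm(b + v @ W)

def p_v_given_h(h, W, c):
    """p(v_i = 1 | h) = sigm(c_i + sum_j w_ij h_j), one entry per visible unit."""
    return sigm(c + W @ h)

# Toy RBM with V = 3 visible and H = 2 hidden units (illustrative values only).
rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, size=(3, 2))
b = np.array([0.1, -0.2])
c = np.array([0.0, 0.3, -0.1])
v = np.array([1.0, 0.0, 1.0])
h = np.array([0.0, 1.0])

print("energy:", energy(v, h, W, b, c))
print("p(h=1|v):", p_h_given_v(v, W, b))
print("p(v=1|h):", p_v_given_h(h, W, c))
```

Because the graph is bipartite, the hidden units are conditionally independent given the visible units, which is why each conditional reduces to a single matrix-vector product followed by an element-wise sigmoid.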
The learning process, given by Equation (5), can be interpreted as follows: to perform parameter learning, the probability that the model assigns to the observed data should be increased (positive phase), whereas the probability of what the model believes should be reduced (negative phase) [15]. The learning process is finished (meaning that the gradient has vanished) when the predictions of the model become the same as the observed training data. However, a critical issue of the learning algorithm is that the computation of the expected values under the model distribution is intractable. Even when the Monte Carlo method is used, it still takes a long time to get the Markov chain to mix. To solve this problem, the CD method was proposed to approximate the learning process. In CD-k, the negative-phase samples from the model are drawn by alternating Gibbs sampling only k times rather than until equilibrium is reached. Previous studies confirmed that CD-k provides satisfactory performance [65,66].
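The positive and negative phases of CD-k can be sketched as follows. This is a hedged illustration of the general technique rather than the authors' code: the function name `cd_k_update`, the batch layout (one binary visible vector per row), and the default step size are assumptions, and hidden probabilities (rather than sampled states) are used for the final statistics, which is a common practical choice.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(V0, W, b, c, k=1, eta=0.1, rng=None):
    """One CD-k update on a batch V0 whose rows are binary visible vectors.

    W has shape (V, H); b are hidden biases, c are visible biases.
    Returns the updated (W, b, c) without modifying the inputs.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Positive phase: hidden probabilities driven by the data.
    ph0 = sigm(V0 @ W + b)
    # Negative phase: k steps of alternating Gibbs sampling started at the data.
    Vk, phk = V0, ph0
    for _ in range(k):
        Hk = (rng.random(phk.shape) < phk).astype(float)   # sample h ~ p(h | v)
        pv = sigm(Hk @ W.T + c)
        Vk = (rng.random(pv.shape) < pv).astype(float)     # sample v ~ p(v | h)
        phk = sigm(Vk @ W + b)
    n = V0.shape[0]
    # <v h>_data - <v h>_model, with the model term taken from the k-step chain.
    W_new = W + eta * (V0.T @ ph0 - Vk.T @ phk) / n
    b_new = b + eta * (ph0 - phk).mean(axis=0)
    c_new = c + eta * (V0 - Vk).mean(axis=0)
    return W_new, b_new, c_new
```

Starting the chain at the data is what makes a small k workable: the samples do not need to reach equilibrium, they only need to move far enough to indicate the direction in which the model distribution differs from the data.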

Gaussian Unit RBM for Continuous Data
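Equations (1)-(3) can be utilized to estimate model parameters in the RBM when the input is binary data. To model continuous data, the binary visible units can be replaced by linear units with independent Gaussian noise. The energy function in Equation (1) is thus modified as follows:

$$E(v, h; \lambda) = \sum_{i=1}^{V} \frac{(v_i - c_i)^2}{2\sigma^2} - \sum_{i=1}^{V}\sum_{j=1}^{H} \frac{v_i}{\sigma} w_{ij} h_j - \sum_{j=1}^{H} b_j h_j \tag{6}$$

where $\sigma$ is the standard deviation of the Gaussian noise. Accordingly, the conditional probabilities become:

$$p(h_j = 1 \mid v; \lambda) = \mathrm{sigm}\Big(\sum_{i=1}^{V} w_{ij} \frac{v_i}{\sigma} + b_j\Big), \qquad p(v_i \mid h; \lambda) = \mathcal{N}\Big(c_i + \sigma \sum_{j=1}^{H} w_{ij} h_j,\; \sigma^2\Big) \tag{7}$$

where $\mathcal{N}(\cdot)$ represents a Gaussian distribution. Since the input data are now modeled using a Gaussian distribution, this RBM is called a Gaussian-Bernoulli RBM [48].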

Deep Belief Network
Figure 1 shows an efficient greedy layer-wise learning procedure developed for training DBNs [18]. The parameters of the first RBM are estimated using the observed training data. In the following layers, the greedy learning algorithm uses the hidden activations of the previous layer as input data to train another RBM. A previous investigation showed that the variational lower bound on the data log-likelihood is guaranteed under this greedy layer-wise learning framework, suggesting that stacking another RBM layer does not reduce the generative power of the model. A hybrid model is established by stacking up RBMs, as illustrated in Figure 2. The top two layers form the RBM and the lower layers form a directed belief net [16,18]. This hybrid model is called a deep belief network (DBN), whose probability function is given by:

$$p(v, h^1, h^2, \ldots, h^L) = p(v \mid h^1)\, p(h^1 \mid h^2) \cdots p(h^{L-2} \mid h^{L-1})\, p(h^{L-1}, h^L) \tag{8}$$

where $L$ denotes the total number of layers. Now, let $D = \{h^1, h^2, \ldots, h^L\}$ denote the entire set of hidden layers in the DBN, so we then have

$$p(v) = \sum_{D} p(v, D) \tag{9}$$
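The greedy layer-wise procedure can be sketched as a loop that trains one RBM, pushes the data through it, and trains the next RBM on the resulting activations. The sketch below is an illustration under stated assumptions: `train_rbm` and `greedy_pretrain` are hypothetical helper names, the per-layer learner is a plain full-batch CD-1 with mean-field reconstructions, and the epoch count and step size are arbitrary.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(X, n_hidden, epochs=5, eta=0.1, rng=None):
    """Fit one RBM with full-batch CD-1 on the rows of X; returns (W, b, c)."""
    rng = np.random.default_rng() if rng is None else rng
    n_visible = X.shape[1]
    W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
    b = np.zeros(n_hidden)    # hidden biases
    c = np.zeros(n_visible)   # visible biases
    for _ in range(epochs):
        ph0 = sigm(X @ W + b)
        H0 = (rng.random(ph0.shape) < ph0).astype(float)
        V1 = sigm(H0 @ W.T + c)           # mean-field reconstruction
        ph1 = sigm(V1 @ W + b)
        W += eta * (X.T @ ph0 - V1.T @ ph1) / len(X)
        b += eta * (ph0 - ph1).mean(axis=0)
        c += eta * (X - V1).mean(axis=0)
    return W, b, c

def greedy_pretrain(X, hidden_sizes, rng=None):
    """Stack RBMs: each layer is trained on the activations of the layer below."""
    rng = np.random.default_rng() if rng is None else rng
    layers, data = [], X
    for n_hidden in hidden_sizes:
        W, b, c = train_rbm(data, n_hidden, rng=rng)
        layers.append((W, b, c))
        data = sigm(data @ W + b)   # hidden activations become the next input
    return layers

# Example: a 12-8-4 stack on random binary data (sizes are illustrative).
rng = np.random.default_rng(0)
X = (rng.random((20, 12)) < 0.3).astype(float)
layers = greedy_pretrain(X, [8, 4], rng=rng)
```

Each lower layer is frozen once trained, so the cost of adding a layer is the cost of training a single RBM regardless of how deep the stack already is.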

Maximum Entropy Learning
ME learning aims to estimate a set of parameters, $\lambda^{*}$, which maximizes the entropy of a probabilistic model, $p(x; \lambda)$, where $x$ is a random variable denoting the data:

$$\lambda^{*} = \arg\max_{\lambda} H\big(p(x; \lambda)\big) = \arg\max_{\lambda} \Big( -\sum_{x} p(x; \lambda) \log p(x; \lambda) \Big) \tag{10}$$

$$\text{subject to} \quad \sum_{x} p(x; \lambda) f_r(x) = \sum_{\tilde{x}} \tilde{p}(\tilde{x}) f_r(\tilde{x}), \qquad r = 1, \ldots, R \tag{11}$$

where $\tilde{x}$ is the observed training data with empirical distribution $\tilde{p}(\tilde{x})$, $f_r(\cdot)$ denotes the r-th type of feature of a data instance, and $R$ is the number of distinct features. Equation (11) stands for a constraint that the expected sufficient statistics of the model and the data should match. This constraint prevents learning from resulting in a naive model (the uniform distribution). The ME learning criterion has been extensively adopted in the field of natural language processing (NLP) research [61,67-69], where the input dimension is the size of a typically large vocabulary. In these cases, the training data is often insufficient for a detailed probabilistic model to be estimated. In previous NLP studies, ME learning exhibited a superior capacity to make the model unbiased with respect to unseen data. Therefore, probabilistic models trained by ME learning can offer a more favorable generalization capacity.
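A small worked example may help make the ME criterion concrete. The classic case is a six-sided die for which only the mean is known: among all distributions with that mean, the ME solution has the exponential form p(x) ∝ exp(λx), and the multiplier λ can be found by bisection on the mean constraint. The example is not taken from the paper; the function names `maxent_die` and `entropy` are chosen for illustration.

```python
import math

def maxent_die(target_mean, faces=(1, 2, 3, 4, 5, 6), tol=1e-12):
    """Maximum entropy distribution over `faces` with a prescribed mean.

    The solution has the form p(x) proportional to exp(lam * x); the mean is
    monotonically increasing in lam, so lam is found by bisection.
    """
    def dist(lam):
        w = [math.exp(lam * x) for x in faces]
        z = sum(w)
        return [wi / z for wi in w]

    def mean(lam):
        return sum(x * p for x, p in zip(faces, dist(lam)))

    lo, hi = -50.0, 50.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return dist(0.5 * (lo + hi))

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p_biased = maxent_die(4.5)    # constraint: mean of 4.5
p_uniform = maxent_die(3.5)   # constraint equals the unconstrained mean
print([round(q, 4) for q in p_biased], round(entropy(p_biased), 4))
```

When the constraint carries no information beyond what the uniform distribution already satisfies (a mean of 3.5), the ME solution is uniform; a tighter constraint moves the distribution only as far from uniform as the constraint requires.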

Maximum Entropy Deep Belief Network
This section explains the proposed maximum entropy deep belief network (ME-DBN). First, the ME learning criterion is presented for estimating parameters in the RBM. Next, the greedy layer-wise learning method for building a DBN is introduced. Finally, constraint terms are developed to be integrated into the objective function to regularize and encourage the sparsity of DBN model parameters.

Maximum Entropy Restricted Boltzmann Machine
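The objective function of the ME-RBM is formulated as

$$\lambda^{*} = \arg\max_{\lambda} H\big(p(v, h; \lambda)\big) \tag{12}$$

$$\text{subject to} \quad \mathbb{E}_{p(v,h;\lambda)}[f(v)] = \mu(\tilde{v}), \quad \mathbb{E}_{p(v,h;\lambda)}[f(h)] = \mu(h), \quad \mathbb{E}_{p(v,h;\lambda)}[f(v, h)] = \mu(\tilde{v}, h) \tag{13}$$

where we have defined three types of features $f(v)$, $f(h)$, and $f(v, h)$, corresponding to the activated states of $v$, $h$, and $v h^{T}$, respectively; $\mu(\tilde{v})$, $\mu(h)$, and $\mu(\tilde{v}, h)$ are their empirical expectations, which are estimated from the data by

$$\mu(\tilde{v}) = \frac{1}{N} \sum_{n=1}^{N} \tilde{v}_n, \qquad \mu(h) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_{p(h \mid \tilde{v}_n; \lambda)}[h], \qquad \mu(\tilde{v}, h) = \frac{1}{N} \sum_{n=1}^{N} \tilde{v}_n\, \mathbb{E}_{p(h \mid \tilde{v}_n; \lambda)}[h]^{T} \tag{14}$$

where $\tilde{v}_n$ is the n-th data sample, $N$ is the number of data samples, and $(\cdot)^{T}$ is the transpose operator.
A previous study showed that solving the constrained optimization problem (i.e., Equations (12)-(14)) is equivalent to solving another unconstrained optimization problem [70,71]. Therefore, our problem can be solved by applying the expectation maximization (EM) algorithm:

$$\lambda^{(t+1)} = \arg\max_{\lambda} Q(\lambda, \lambda^{(t)}) \tag{15}$$

where the Q function is the expectation of the complete likelihood ($\sum_{v \in X} \log p(v, h; \lambda)$):

$$Q(\lambda, \lambda^{(t)}) = \sum_{v \in X} \sum_{h} p(h \mid v; \lambda^{(t)}) \log p(v, h; \lambda) \tag{16}$$

where $X$ is the whole training set and $\lambda^{(t)}$ is the parameter set from the previous iteration.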
In the E-step, the expectations of the three features are computed, as defined earlier. In the M-step, the Q function is maximized with respect to the model parameters. The computation of expectations in the E-step is easy for the RBM, while maximizing the Q function in the M-step is intractable. Accordingly, some approximation methods are required. The original estimation of Equations (12)-(14) is not a convex optimization problem because there are latent variables in the model, whereas maximizing the Q function itself is a convex problem. Thus, we can apply either iterative scaling [72,73] or generalized EM [74] along with Monte Carlo approximation to solve the problem. In this study, the generalized EM is used, and the overall learning procedure is summarized in Algorithm 1.

E-step:
Given the current estimate $\lambda^{(t)}$, compute the empirical expected values of all features given by Equation (14).

M-step:
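Maximize the Q function iteratively until $|\lambda_{\mathrm{new}} - \lambda_{\mathrm{old}}| < \epsilon$, where $\epsilon$ is a pre-defined value.
For the k-th parameter in $\lambda$, namely $\lambda_k$, the update is done by performing gradient ascent with

$$\frac{\partial Q}{\partial \lambda_k} = \mu_k - \langle f_k \rangle_{p(v, h; \lambda)}$$

where $\mu_k$ denotes the empirical expectation of the feature associated with the k-th parameter in $\lambda$, which is obtained from Equations (12)-(14), and $\langle f_k \rangle_{p(v, h; \lambda)}$ is the expectation of the same feature under the model. CD is applied to approximate the expectation in the second term.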
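The E-step and M-step described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' implementation: the function names (`e_step`, `m_step`, `generalized_em`), the step size, the stopping threshold `eps`, the iteration caps, and the use of CD-1 with mean-field reconstructions for the model expectation are all assumptions. The point it illustrates is that the empirical expectations are held fixed during the inner gradient ascent and only refreshed in the next E-step.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def e_step(X, W, b):
    """Empirical feature expectations (mu_v, mu_h, mu_vh) under the current parameters."""
    ph = sigm(X @ W + b)                       # E[h | v_n] for every sample
    return X.mean(axis=0), ph.mean(axis=0), X.T @ ph / len(X)

def m_step(X, W, b, c, mu, eta=0.05, eps=1e-4, max_iter=50, rng=None):
    """Gradient ascent on Q; the model expectation is approximated by CD-1."""
    rng = np.random.default_rng() if rng is None else rng
    mu_v, mu_h, mu_vh = mu                     # fixed during the whole M-step
    for _ in range(max_iter):
        ph0 = sigm(X @ W + b)
        H = (rng.random(ph0.shape) < ph0).astype(float)
        V1 = sigm(H @ W.T + c)                 # one Gibbs step away from the data
        ph1 = sigm(V1 @ W + b)
        dW = mu_vh - V1.T @ ph1 / len(X)       # mu_k - <f_k>_model for the weights
        db = mu_h - ph1.mean(axis=0)
        dc = mu_v - V1.mean(axis=0)
        W, b, c = W + eta * dW, b + eta * db, c + eta * dc
        # Stop when the parameter change |lambda_new - lambda_old| is below eps.
        if eta * np.sqrt((dW ** 2).sum() + (db ** 2).sum() + (dc ** 2).sum()) < eps:
            break
    return W, b, c

def generalized_em(X, n_hidden, n_rounds=10, rng=None):
    """Alternate E-steps and approximate M-steps for a single RBM."""
    rng = np.random.default_rng() if rng is None else rng
    W = rng.normal(0.0, 0.01, size=(X.shape[1], n_hidden))
    b, c = np.zeros(n_hidden), np.zeros(X.shape[1])
    for _ in range(n_rounds):
        mu = e_step(X, W, b)
        W, b, c = m_step(X, W, b, c, mu, rng=rng)
    return W, b, c
```

In contrast to plain CD training, where the positive-phase statistics are recomputed at every update, the targets here change only once per EM round.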

Greedy Layer-Wise Learning
For the greedy layer-wise learning of ML-DBN, adding another layer of RBM using previous hidden activations as new data will not reduce the total log-likelihood [16]. When using the ME learning algorithm to estimate a DBN, this justification still holds because each RBM is guaranteed to have data log-likelihood at a locally optimal point. Therefore, the variational bound for likelihood holds, suggesting that stacking another layer of RBM will not reduce the total likelihood. The difference between ME-DBN and ML-DBN is that, among the local optima to be determined, ML-DBN chooses the one with a higher likelihood, whereas ME-DBN chooses the one with a higher entropy [70,71]. Given that the log-likelihood is also locally optimal in ME-DBN, we can use the same argument to conclude that greedy layer-wise learning in ME-DBN still guarantees the variational lower bound on the data log-likelihood. In addition, we can derive

$$H(v, h^1, \ldots, h^{L}, h^{L+1}) \geq H(v, h^1, \ldots, h^{L}) \tag{17}$$

where the inequality holds by the property of entropy ($H(a, b, c) \geq H(a, b)$ for any random variables $a$, $b$, and $c$). Therefore, whenever an additional layer of RBM is stacked, the entropy of the overall graphical model does not decrease. The greedy layer-wise learning algorithm could thus improve both likelihood and entropy.

Imposing Weight Decay and Sparsity Constraints
Regularization and sparsity terms are often used to reduce redundant components and provide a compact representation of models for superior performance in various domains [58,75-77]. Here, we incorporate weight decay and sparsity constraints into the learning of model parameters. The weight decay regularizes the learned parameters while the sparsity constraint makes the expected activation of each neuron small. When the two constraints are imposed, the objective function used in the M-step is derived as:

$$\tilde{Q}(\lambda, \lambda^{(t)}) = Q(\lambda, \lambda^{(t)}) - \alpha\, \psi(W) - \beta\, \phi(h, v) \tag{18}$$

where $\alpha$ and $\beta$ are the weight decay and sparsity penalty parameters, respectively. In this study, we choose $\psi(W) = \|W\|^2$ and $\phi(h, v) = \mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$, where $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$ represents the Kullback-Leibler (KL) divergence [78] between two Bernoulli distributions:

$$\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j} \tag{19}$$

where $\rho$ is the sparsity target, and $\hat{\rho}_j$ is estimated by

$$\hat{\rho}_j = \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}[h_j \mid \tilde{v}_n] \tag{20}$$

where $\mathbb{E}[\cdot]$ denotes the expectation. The derived algorithm, namely Equation (18), is denoted as the constrained maximum entropy (CME) algorithm. Notably, we fix the weight decay and sparsity schemes for all layers in Equation (18) in this study. One can use different constraint terms in different layers of the DBN by introducing more parameters. The comparison of different regularization terms for different layers in the DBN will be investigated in future studies.
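The two penalty terms can be computed as in the sketch below. It is a minimal illustration with several assumptions that are not stated in the text: the function names are hypothetical, the sparsity target defaults to 0.05, the average hidden activation is taken over the rows of a matrix of hidden probabilities, the KL terms are summed over hidden units, and the estimate is clipped to avoid taking the logarithm of zero.

```python
import numpy as np

def kl_bernoulli(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat), element-wise."""
    rho_hat = np.clip(rho_hat, 1e-8, 1.0 - 1e-8)   # guard against log(0)
    return (rho * np.log(rho / rho_hat)
            + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))

def cme_penalty(W, hidden_probs, alpha, beta, rho=0.05):
    """alpha * ||W||^2 + beta * sum_j KL(rho || rho_hat_j).

    hidden_probs has one row per training sample and one column per hidden
    unit, so its column means are the average activations rho_hat_j.
    """
    rho_hat = hidden_probs.mean(axis=0)
    weight_decay = np.sum(W ** 2)
    sparsity = np.sum(kl_bernoulli(rho, rho_hat))
    return alpha * weight_decay + beta * sparsity

# Example: two hidden units, one near the target and one far too active.
W = np.ones((2, 2))
hidden_probs = np.array([[0.05, 0.9], [0.05, 0.7]])
print(cme_penalty(W, hidden_probs, alpha=1e-6, beta=1e-6))
```

The KL term is zero when a unit's average activation equals the target and grows as the unit becomes more active, so subtracting it from the objective pushes each hidden unit towards firing rarely.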

Experiment
This section introduces the experimental setup and results. The effectiveness of the proposed ME learning algorithm is evaluated using two benchmark object recognition datasets (MNIST and NORB) and three benchmark text classification datasets (Newsgroup, Sector, and WebKB).

Experiments on Object Recognition
This section presents the object recognition results for DBNs estimated by the ML, ME, and CME learning algorithms. For the three DBNs, the initial biases were set to zero, and the weights were drawn from N(0, 1). For CME, α and β in Equation (18) were optimized empirically by a 3 × 3 grid search from 0.0 to 0.000001. The same procedure was used to determine the best results for each set of experiments in this study. For each of the three DBNs, a logistic layer was added on the top, and the combined model was treated as a DNN model. The standard back-propagation training method was then applied to the three models to refine their parameters. Additionally, the dropout technique was used when computing the parameters in these networks [79]. The training data was used for both the initialization and back-propagation processes on these DNN models. In the following discussion, the DNNs initialized by the ML, ME, and CME criteria are denoted as ML-DBN, ME-DBN, and CME-DBN, respectively. The following experiments are divided into two parts. First, the feature learning capabilities of ML-RBM, ME-RBM, and CME-RBM are visually compared. Next, the classification performance of DNNs initialized by ML-DBN, ME-DBN, and CME-DBN is evaluated.

Datasets and Protocol
The left- and right-hand panels in Figure 3 show samples of the MNIST and NORB datasets, respectively. MNIST is a benchmark computer vision dataset of images of ten handwritten digits (0 to 9), each image containing 28 × 28 gray-scale pixels [2]. The training and testing sets contained 60,000 and 10,000 images, respectively. Of the training set, 10,000 images were retained as the validation set. Pre-processing was carried out to divide each pixel by 255 to make it suitable for the RBM. For NORB, the benchmark 3D object recognition dataset, we selected a uniform-normalized set [3] containing 24,300 pairs of stereo images in the training set and 24,300 pairs of stereo images in the test set. The NORB dataset included images in five generic categories: cars, trucks, planes, animals, and humans. Each image has 96 × 96 gray-scale pixel values, so each training example contains 2 × 96 × 96 = 18,432 dimensions. To reduce the computation, the outer part of the image was ignored and only the center 64 × 64 pixels were kept (since objects were mostly in the centers). Then, each image was resized to 48 × 48, yielding 2 × 48 × 48 = 4608 dimensions for each sample. Next, the procedure suggested in [55] was implemented to train a Gaussian-Bernoulli RBM with a 4608-4000 architecture to convert the gray-scale images into binary variables. Finally, these 4000 variables were treated as the pre-processed data to evaluate the proposed algorithm.
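The pixel scaling and center-cropping steps, together with the dimension arithmetic, can be sketched as follows. The helper names `scale_pixels` and `center_crop` are hypothetical, and the arrays are synthetic placeholders rather than actual MNIST or NORB data. The resize from 64 × 64 to 48 × 48 is not shown because the interpolation method is not specified in the text.

```python
import numpy as np

def scale_pixels(img):
    """Map 8-bit gray-scale values in [0, 255] to [0, 1]."""
    return np.asarray(img, dtype=np.float64) / 255.0

def center_crop(img, size):
    """Keep the central size x size window of the last two axes."""
    h, w = img.shape[-2:]
    top, left = (h - size) // 2, (w - size) // 2
    return img[..., top:top + size, left:left + size]

# A synthetic stereo pair standing in for one NORB example: 2 x 96 x 96 pixels.
stereo = np.zeros((2, 96, 96), dtype=np.uint8)
print(stereo.size)                 # 2 * 96 * 96 input dimensions
cropped = center_crop(stereo, 64)  # discard the outer border
print(cropped.shape)
print(2 * 48 * 48)                 # dimensions after resizing each view to 48 x 48
```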

Classification Performance
The second set of experiments evaluates classification performance on the MNIST and NORB datasets. The DBNs were first estimated by ML, ME, and CME using training data without considering any label information. Next, another logistic layer was added on top of each model, and the combined model was treated as a DNN model. Then, the standard back-propagation training process was applied to the three models to refine their model parameters. The DNNs that are initialized by ML, ME, and CME are still referred to as ML-DBN, ME-DBN, and CME-DBN throughout this paper. DBNs with 784-500-500-2000 and 4000-2000-2000 architectures were used for MNIST and NORB, respectively.
Table 1 shows classification results for ML-DBN, ME-DBN, and CME-DBN under a limited amount of training data. To obtain the results in Table 1, we formed four new training sets by randomly selecting 5%, 10%, 15%, and 20% of the data from the original training set. These four training sets were used to estimate ML-DBN, ME-DBN, and CME-DBN and to conduct back-propagation to obtain three DNNs. Table 1 shows that ME-DBN lowered the error rates compared to ML-DBN for small amounts of training data. The results also show that imposing constraints on the objective function (CME-DBN) improved the performance even further. Table 2 evaluates the full set of training data (100%), and compares classification error rates (in %) for support vector machine (SVM), ML-DBN, ME-DBN, and CME-DBN. For MNIST, ME-DBN gives a 1.04% classification error rate, which already outperforms ML-DBN with 1.2% [17] and SVM with 1.4% [80]. Next, by using the weight decay and sparsity constraints, CME-DBN provided an even better 0.94% classification error rate. For NORB, ME-DBN yielded a 10.80% classification error rate, which already outperforms ML-DBN with 11.90% [81] and SVM with 11.60% [14]. On this database, CME-DBN provided similar improvements over ML-DBN and SVM. By progressively varying the amounts of available training data, the results of Tables 1 and 2 confirm that ME-DBN and CME-DBN outperform ML-DBN by providing a better initial starting point for DNNs on these recognition tasks.
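Forming the reduced training sets can be sketched as drawing index subsets without replacement. This is only one plausible reading of "randomly selecting": the text does not say whether the subsets were nested, stratified by class, or drawn independently, so the independent draws below, the function name `subsample_indices`, and the fixed seed are assumptions.

```python
import numpy as np

def subsample_indices(n_samples, fractions=(0.05, 0.10, 0.15, 0.20), seed=0):
    """Return {fraction: index array} with each subset drawn without replacement."""
    rng = np.random.default_rng(seed)
    subsets = {}
    for frac in fractions:
        size = int(round(frac * n_samples))
        subsets[frac] = rng.choice(n_samples, size=size, replace=False)
    return subsets

# Example with a hypothetical training set of 1000 samples.
subsets = subsample_indices(1000)
for frac, idx in subsets.items():
    print(f"{frac:.0%}: {len(idx)} samples")
```

The returned indices can be used to slice both the inputs and the labels, so that the same subset is used for pretraining and for back-propagation.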

Experiments on Text Classification
In this set of experiments, we evaluate the proposed ME algorithm on three benchmark text classification tasks: Newsgroup, Sector, and WebKB. Similar to the object classification experiments in Section 4.1, α and β in Equation (18) are optimized empirically for CME. Also coinciding with Section 4.1, the three DBNs were first individually estimated by ML, ME, and CME, using the training data without any label information. For the three DBNs, the initial biases were set to zero, and the weights were drawn from N(0, 1).

Datasets and Protocol
The Newsgroup dataset contains around 20,000 articles that are distributed evenly across 20 classes [82]. Of this set of documents, 13,000 and 5648 were selected for training and testing, respectively. For this task, stop-word removal and stemming were performed. A feature selection process was carried out to select the top 3000 words to generate the feature vector for each document.
The Sector dataset contains more than 70 classes, which can be divided into a two-level hierarchy [83]. Only the 12 main categories in the top level with 9554 documents were used in the experiments. Finally, 6600 and 2867 documents were selected for training and testing, respectively. Stop-word removal was performed while stemming was not performed. The 3000 words with the highest mutual information were selected to generate the feature vector for each document.
The WebKB dataset contains documents that were collected from web pages from the computer science departments of four universities [84]. The documents were divided into seven classes: student, faculty, staff, project, course, department, and other. In this study, the most popular four classes were selected: student, faculty, project, and course, which amounted to 4199 documents. From these documents, we selected 2900 and 1260 documents for training and testing, respectively. For this task, stop-word removal and stemming were not performed. A feature selection process was carried out on the entire word set by selecting the top 1000 words with the highest mutual information to form the feature vector for each document.
For the above three tasks, we followed the pre-processing steps (stemming, stoplist) suggested in [69], where word count data were further scaled by dividing by the total number of words in a document, subsequently allowing each value to be interpreted as the probability for a word to appear in the document. Three DBN models were prepared to test performance, namely ML-DBN, ME-DBN, and CME-DBN. These three DBNs were first pretrained by ML, ME, and CME, respectively. Then, a logistic layer was added on the top of each model, and the standard back-propagation training was applied to refine the parameters. The training data were used for both initialization and back-propagation on these DNN models. The network architectures for Newsgroup, Sector, and WebKB were 3000-1024-1024-20, 3000-2000-1500-12, and 1000-700-500-4, respectively. In addition to the three networks, several popular text classification algorithms are reported for comparison, including naive Bayes, SVM, and MaxEnt. For naive Bayes, a standard multinomial model with Laplace smoothing was used [85]. For SVM, a linear kernel support vector classifier was used [86]. For the maximum entropy classifier (MaxEnt), standard L2 penalized models were adopted [69]. Experiments for naive Bayes, SVM, and MaxEnt were conducted using scikit-learn, a machine learning toolkit in Python [87] (Python Software Foundation, Beaverton, OR, USA).
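The scaling of word counts by document length can be sketched as a row-wise normalization of a document-term count matrix. The function name `normalize_counts` and the handling of empty documents (left as all zeros) are assumptions made for the example.

```python
import numpy as np

def normalize_counts(counts):
    """Divide each row of a document-term count matrix by its total word count.

    Each entry can then be read as the probability of a word appearing in the
    document. Rows with no words are left as zeros rather than dividing by zero.
    """
    counts = np.asarray(counts, dtype=np.float64)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0
    return counts / totals

# Two toy documents over a three-word vocabulary; the second one is empty.
docs = [[2, 1, 1],
        [0, 0, 0]]
print(normalize_counts(docs))
```

Since the normalized values lie in [0, 1], they can be fed to the visible layer of an RBM in the same way as the scaled pixel intensities used for the image tasks.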

Classification Performance
We first compare ML-DBN and ME-DBN without weight decay and sparsity constraints (by setting α = 0.0 and β = 0.0 in Equation (18)). Figures 5-7 display the results of text classification with varying amounts of training data for Newsgroup, Sector, and WebKB, respectively. In each figure, the horizontal and vertical axes represent the amount of training data and the classification accuracy rate, respectively. The results in Figures 5 and 6 show that ME-DBN outperformed ML-DBN and all other approaches when training data was limited. Figure 7 shows that ME-DBN outperforms MaxEnt, naive Bayes, and linear SVM, but only provided marginal improvements over ML-DBN. Since there are only four document classes in the WebKB dataset of Figure 7, it is possible that over-fitting issues were less problematic compared to the Newsgroup and Sector tasks of Figures 5 and 6. Overall, the results confirm that ME-DBN can offer improved performance over ML-DBN when the amount of training data is limited. Since the ME-DBN and ML-DBN algorithms differed only in their unsupervised pretraining process, the different classification capacities shown in Figures 5-7 could only have resulted from the discovery of different local optima through the unique spaces of the different objective functions.
Table 3 shows classification results for ML-DBN, ME-DBN, and CME-DBN under a limited amount of training data. The parameters α and β in Equation (18) were optimized empirically. Table 3 shows that CME-DBN outperforms ML-DBN consistently over the four different amounts of training data (5%, 10%, 15%, and 20%) for all of the Newsgroup, Sector, and WebKB datasets. Table 4 evaluates the full set of training data (100%), and compares classification accuracy rates (in %) for SVM, ML-DBN, ME-DBN, and CME-DBN. By progressively varying the amounts of available training data, the results of Tables 3 and 4 confirm that ME-DBN and CME-DBN outperform ML-DBN by providing a better initial starting point for DNNs. These findings were consistent with the recognition results in Tables 1 and 2.

Conclusions
The present study demonstrates the effectiveness of ME learning for modeling DNN parameters using a DBN. Maximizing only the entropy offers robustness to over-fitting by keeping distributions as uniform and unbiased as possible. Object recognition results show that ME-DBN consistently outperforms ML-DBN for different amounts of training data, and especially when training data is limited. In addition, we show that the integration of weight decay and sparsity constraints (CME) enables even further improvements for ME learning. Text classification results similarly showed that DNNs initialized with ME-DBN and CME-DBN outperform those initialized by ML-DBN. Therefore, DBN training for a variety of tasks can benefit from ME learning and progressive regularization with the CME criteria.

Figure 2. Hybrid model of the DBN after greedy layer-wise learning. The top two layers form the RBM and the bottom layers form a directed belief net.

Figure 3. Samples from the (a) Modified National Institute of Standards and Technology (MNIST) and (b) NYU Object Recognition Benchmark (NORB) databases.

Figure 5. Classification accuracies with varying amounts of training data on the Newsgroup dataset.

Figure 6. Classification accuracies with varying amounts of training data on the Sector dataset.

Figure 7. Classification accuracies with varying amounts of training data on the WebKB dataset.

Table 2. Classification error rates (in %) for support vector machine (SVM), ML-DBN, ME-DBN, and CME-DBN using a full set of training data (100%) for the two object classification tasks.