Regularizing Neural Networks via Retaining Confident Connections

Regularization of neural networks can alleviate overfitting in the training phase. Current regularization methods, such as Dropout and DropConnect, randomly drop neural nodes or connections based on a uniform prior. Such a data-independent strategy does not take into consideration the quality of individual units or connections. In this paper, we aim to develop a data-dependent approach to regularizing neural networks in the framework of Information Geometry. A measurement for the quality of connections, namely confidence, is proposed. Specifically, the confidence of a connection is derived from its contribution to the Fisher information distance. The network is adjusted by retaining the confident connections and discarding the less confident ones. The adjusted network, named ConfNet, carries the majority of variations in the sample data. The relationships among confidence estimation, Maximum Likelihood Estimation and classical model selection criteria (like the Akaike information criterion) are investigated and discussed theoretically. Furthermore, a Stochastic ConfNet is designed by adding a self-adaptive probabilistic sampling strategy. The proposed data-dependent regularization methods achieve promising experimental results on three data collections: MNIST, CIFAR-10 and CIFAR-100.


Introduction
Neural networks (NNs) that consist of multiple hidden layers can automatically learn effective representations for a learning task, such as speech recognition [1][2][3], image classification [4][5][6], and natural language processing [7]. However, a neural network with too many layers or units, especially a deep neural network (DNN) [8], easily overfits in the training phase, which leads to poor predictive performance in the testing phase. In order to alleviate the overfitting problem in DNNs, many regularization methods have been developed, including data augmentation [9], early stopping, amending cost functions with weight penalties ($\ell_1$ or $\ell_2$), and modifying networks by randomly dropping a certain percentage of units (Dropout [10]) or connections (DropConnect [11]). This paper focuses on the last method.
The Dropout strategy randomly drops units (along with their connections) in a neural network during training. A large number of sub-networks with randomly dropped units are thereby trained. In the testing phase, a scaled-down version of the network weights is used to approximate the averaged prediction of these sub-networks. The performance of DNNs has been significantly improved by this ensemble strategy. In DropConnect [11], a network is regularized by randomly drawing a subset of connections independently from a uniform prior in the training phase and using a Gaussian sampling procedure for activations in the inference phase. Both Dropout and DropConnect assume a uniform prior for the dropping strategy.
From the density estimation point of view, the hidden layers of DNNs can be interpreted as an attempt to recover a set of parameters for a generative model that describes the underlying distribution of the observed data [12]. A well-designed prior is needed to incorporate into the network the degree of importance of each unit or connection in describing the data distribution. Specifically, we need an efficient approach to recognize how useful a unit or connection is for revealing the underlying structure of the current data. Recently, a general parametric reduction criterion, named the Confident-Information-First (CIF) principle [13], has been proposed in the theoretical framework of Information Geometry (IG) [14]. From a model selection perspective, its authors proved that both the fully Visible Boltzmann Machine (VBM) and the Boltzmann Machine (BM) with hidden units can be derived from the general multivariate binary distribution using the CIF principle. Such a geometric method offers a more intuitive criterion for parametric reduction on the parameter space.
In this paper, we study the regularization of DNNs in the theoretical framework of IG. Each pair of fully connected neighboring layers in a DNN is assigned a BM that shares the same topology [15]. The confidence level of each connection in a BM is evaluated according to the sample data, and the confident connections are retained in the DNN. In contrast to the uniform prior adopted by Dropout and DropConnect, the Confident Network (ConfNet for short) is a reduced model that reduces the number of free parameters by a data-dependent method. In addition, we adopt a self-adaptive dynamic sampling mechanism to reduce the network, and the re-adjusted network is called the Stochastic Confident Network (Stochastic ConfNet for short). Data-independent and data-dependent regularization are compared on several image datasets.

Parametric Coordinate Systems
A family of probability distributions is considered as a differentiable manifold with certain parametric coordinate systems. Let $n$ be the number of variables and $S$ denote the open simplex of all probability distributions over binary vectors $x \in \{0, 1\}^n$. Four basic coordinate systems are often used [16,17] to characterize multivariate binary distributions.
p-coordinates [p]: the probability distribution over the $2^n$ states of $x$ can be completely specified by any $2^n - 1$ positive numbers indicating the probabilities of the corresponding exclusive states on $n$ binary variables. For example, the p-coordinates of $n = 2$ variables could be $[p] = (p_{01}, p_{10}, p_{11})$. We use capital letters $I, J, \ldots$ to index the coordinate parameters of a probability distribution. An index $I$ can be regarded as a subset of $\{1, 2, \ldots, n\}$, and $p_I$ stands for the probability that all variables indicated by $I$ equal one and the complementary variables are zero. For example, if $I = \{2\}$ and $n = 2$, we have:
$$p_I = p_{01} = \mathrm{Prob}(x_1 = 0, x_2 = 1). \quad (1)$$
Note that the null set is also a legal index of the p-coordinates; it indicates the probability that all variables are zero, denoted as $p_{0 \ldots 0}$.

η-coordinates [η]:
$$\eta_I = E[X_I] = \mathrm{Prob}\Big(\prod_{i \in I} x_i = 1\Big),$$
where the index $I$ is a nonempty subset of $\{1, 2, \ldots, n\}$, the value of $X_I$ is given by $\prod_{i \in I} x_i$, and the expectation is taken with respect to the probability distribution over $x$. For example, if $n = 2$, the η-coordinates are $[\eta] = (\eta_1, \eta_2, \eta_{12})$.
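As an illustration of the two coordinate systems above, the following minimal sketch estimates the p-coordinates and η-coordinates of $n = 2$ binary variables from a toy sample. The function names `p_coordinates` and `eta_coordinate` and the sample values are assumptions made for this example only.

```python
from itertools import product

def p_coordinates(samples, n):
    """Empirical probability of each of the 2^n joint binary states."""
    counts = {state: 0 for state in product((0, 1), repeat=n)}
    for x in samples:
        counts[tuple(x)] += 1
    return {state: c / len(samples) for state, c in counts.items()}

def eta_coordinate(p, index_set):
    """eta_I = E[prod_{i in I} x_i]; index_set holds 1-based variable indices."""
    return sum(prob for state, prob in p.items()
               if all(state[i - 1] == 1 for i in index_set))

# Eight illustrative observations of (x_1, x_2).
samples = [(0, 0), (0, 0), (0, 1), (0, 1), (1, 0), (1, 1), (1, 1), (1, 1)]
p = p_coordinates(samples, n=2)
eta_1 = eta_coordinate(p, {1})       # Prob(x_1 = 1)
eta_2 = eta_coordinate(p, {2})       # Prob(x_2 = 1)
eta_12 = eta_coordinate(p, {1, 2})   # Prob(x_1 = 1, x_2 = 1)
print(p[(0, 1)], eta_1, eta_2, eta_12)
```

Here `p[(0, 1)]` plays the role of $p_{01}$, i.e., $p_I$ for $I = \{2\}$, while each η-coordinate sums the probabilities of all states in which the indexed variables are one.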

The Fisher Information Matrix
Previously, we have introduced four commonly used coordinates. Another important concept in IG is the Fisher information matrix for parametric coordinates. For a general coordinate system [ξ], the element in the i-th row and j-th column of the Fisher information matrix for [ξ] (denoted by $G_\xi$) is defined as the covariance of the scores of $\xi_i$ and $\xi_j$ [18]:
$$g_{ij} = E\left[\frac{\partial \log p(x; \xi)}{\partial \xi_i} \cdot \frac{\partial \log p(x; \xi)}{\partial \xi_j}\right],$$
under the regularity condition that the partial derivatives exist. The Fisher information measures the amount of information in the data that a statistic carries about the unknown parameters [19]. The Fisher information between $\theta_I$ and $\theta_J$ in [θ] is given by [20]:
$$g_{IJ}(\theta) = \eta_{I \cup J} - \eta_I \eta_J.$$
The Fisher information between $\eta_I$ and $\eta_J$ in [η] is given by [20]:
$$g_{IJ}(\eta) = \sum_{K \subseteq I \cap J} (-1)^{|I - K| + |J - K|} \cdot \frac{1}{p_K},$$
where $|\cdot|$ denotes the cardinality operator.
Another important concept related to our analysis is the orthogonality defined by Fisher information. Two coordinate parameters $\xi_i$ and $\xi_j$ are called orthogonal if and only if their Fisher information vanishes, i.e., $g_{ij} = 0$, meaning that their influences on the log-likelihood function are uncorrelated. Based on $G_\eta$ and $G_\theta$, we can calculate the Fisher information matrix $G_\zeta$ for the l-mixed coordinates $[\zeta]_l$ [13]:
$$G_\zeta = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix},$$
where $A = ((G_\eta^{-1})_{I_\eta})^{-1}$, $B = ((G_\theta^{-1})_{J_\theta})^{-1}$, $G_\eta$ and $G_\theta$ are the Fisher information matrices of [η] and [θ], respectively, $I_\eta$ is the index set of the parameters shared by [η] and $[\zeta]_l$, i.e., $\{\eta^1_i, \ldots, \eta^l_{i,j,\ldots}\}$, and $J_\theta$ is the index set of the parameters shared by [θ] and $[\zeta]_l$, i.e., $\{\theta^{i,j,\ldots}_{l+1}, \ldots, \theta^{1,\ldots,n}_n\}$. From $G_\zeta$, we can see that the Fisher information between any lower-order η-coordinate $\eta_I$ and any higher-order θ-coordinate $\theta_J$ is zero in the mixed coordinates [ζ]. For example, in $[\zeta] = (\eta_1, \eta_2, \theta_{12})$, the Fisher information between $\theta_{12}$ and $\eta_1$ (or $\eta_2$) is zero. This indicates that $\theta_{12}$ is orthogonal to both $\eta_1$ and $\eta_2$ in [ζ]. The orthogonality property of [ζ] allows us to decompose the distance between two distributions into unrelated parts.
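For the θ-coordinates, the Fisher information is the covariance of the statistics $X_I$ and $X_J$, which equals $\eta_{I \cup J} - \eta_I \eta_J$. The sketch below evaluates this matrix for $n = 2$; the helper names and the probability values are illustrative assumptions, not taken from the experiments.

```python
def eta(p, index_set):
    """eta_I: total probability of the states where every variable in I equals 1."""
    return sum(prob for state, prob in p.items()
               if all(state[i - 1] == 1 for i in index_set))

def fisher_theta(p, index_sets):
    """G_theta[I][J] = Cov(X_I, X_J) = eta_{I u J} - eta_I * eta_J."""
    return [[eta(p, I | J) - eta(p, I) * eta(p, J) for J in index_sets]
            for I in index_sets]

# Illustrative joint distribution over (x_1, x_2); all states have positive mass.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
index_sets = [frozenset({1}), frozenset({2}), frozenset({1, 2})]
G_theta = fisher_theta(p, index_sets)
for row in G_theta:
    print(row)
```

The resulting matrix is symmetric, and its diagonal entries are the variances $\eta_I (1 - \eta_I)$ of the binary statistics $X_I$.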

The Fisher Information Distance
The Fisher information distance (FID), i.e., the Riemannian distance induced by the Fisher-Rao metric [14], is adopted as the distance measure between two distributions, since it is shown to be the unique metric meeting a set of natural axioms for the distribution metric [14,21,22], e.g., invariance with respect to reparametrizations and monotonicity with respect to random maps on variables. Let ξ be the distribution parameters. For two close distributions $p_1$ and $p_2$ with parameters $\xi_1$ and $\xi_2$, the Fisher information distance between $p_1$ and $p_2$ is:
$$D(p_1, p_2) = \sqrt{(\xi_1 - \xi_2)^T G_\xi (\xi_1 - \xi_2)},$$
where $G_\xi$ is the Fisher information matrix [17].
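A minimal numerical sketch of the local distance follows, assuming the usual quadratic approximation in which the squared distance between two close distributions is $(\xi_1 - \xi_2)^T G_\xi (\xi_1 - \xi_2)$. The matrix and parameter values are made up for illustration. With a block-diagonal Fisher information matrix, the squared distance splits into the contributions of the two orthogonal groups of coordinates.

```python
import math

def fisher_distance(xi1, xi2, G):
    """Local Fisher information distance sqrt(d^T G d) with d = xi1 - xi2."""
    d = [a - b for a, b in zip(xi1, xi2)]
    quad = sum(d[i] * G[i][j] * d[j]
               for i in range(len(d)) for j in range(len(d)))
    return math.sqrt(quad)

# Illustrative block-diagonal metric: coordinates 0 and 1 are orthogonal to coordinate 2.
G = [[2.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 4.0]]
xi1 = [0.3, 0.1, 0.5]
xi2 = [0.1, 0.2, 0.0]

total_sq = fisher_distance(xi1, xi2, G) ** 2
part_a = fisher_distance(xi1[:2], xi2[:2], [row[:2] for row in G[:2]]) ** 2
part_b = fisher_distance(xi1[2:], xi2[2:], [[G[2][2]]]) ** 2
print(total_sq, part_a + part_b)  # the two values agree up to rounding
```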

Motivation
Consider a fully connected layer of a DNN with input $x = [x_1, x_2, \ldots, x_n]^T$, weight parameters $W$ (of size n × m) and bias $b$ (of size m × 1). The output of this layer, $y = [y_1, y_2, \ldots, y_m]^T$, is computed as $y = a(W^T x + b)$, where $a$ is a non-linear activation function.

Data-Independent Regularization
Dropout is a widely used form of regularization on the structure of neural networks. It regularizes the network as follows: the activation of each output unit is kept with probability $p$ and set to 0 with probability $1 - p$. Given a sample $x$, the activations of the output units can be calculated by
$$y = M \otimes a(W^T x + b).$$
The operator ⊗ denotes the element-wise product and $M$ is an (m × 1) binary mask vector. Each element of $M$ is randomly drawn from a Bernoulli prior with parameter $p$. This breaks up co-adaptation of feature detectors, since the dropped-out units cannot influence the retained units [23]. In addition, the number of trained models is $2^m$, and these models share the same parameters; thus, the final trained NN is an ensemble NN.
For DropConnect, connections are chosen at random using the Bernoulli prior during the training stage. The activation function of DropConnect is modified as:
$$y = a((M \otimes W)^T x + g \otimes b),$$
where $M$ is an (n × m) binary mask for the weights and $g$ is an (m × 1) mask for the biases. DropConnect can also be interpreted as model averaging.
Both Dropout and DropConnect provide a data-independent strategy that assumes a uniform prior on the model parameters to sample a sub-network. Such a strategy does not take into consideration the importance of the units or connections. In this paper, we focus on the selection of retained connections. Intuitively, the prior probability of keeping a certain connection should be proportional to its importance in estimating the data distribution. One simple solution is to estimate the data distribution using both the fully connected NN and the sub-network that removes certain connections, and then to evaluate the likelihood loss caused by removing them. However, this is generally infeasible in practice, since we would have to investigate all sub-networks. For example, if the NN has $K$ free parameters, we need to exhaustively test all $2^K - 1$ possible sub-networks and calculate the log-likelihood value of each. Therefore, a heuristic data-dependent method is needed to reduce the computational complexity and improve the effectiveness of regularization.
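The Dropout rule described above can be sketched in a few lines, assuming a sigmoid activation and a mask applied to the output activations. The function `dropout_forward` and the example values are hypothetical and serve only as an illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dropout_forward(W, b, x, p, rng):
    """Return (y, mask) where y = mask * a(W^T x + b) and mask_j ~ Bernoulli(p)."""
    n, m = len(W), len(W[0])
    act = [sigmoid(sum(W[i][j] * x[i] for i in range(n)) + b[j]) for j in range(m)]
    mask = [1 if rng.random() < p else 0 for _ in range(m)]  # 1 keeps the unit
    return [mk * a for mk, a in zip(mask, act)], mask

rng = random.Random(0)
W = [[0.5, -1.0, 0.2], [1.5, 0.3, -0.7]]  # n = 2 inputs, m = 3 output units
b = [0.0, 0.1, -0.1]
y, mask = dropout_forward(W, b, [1.0, 0.5], p=0.5, rng=rng)
print(mask, y)
```

Every draw of the mask selects one of the $2^m$ sub-networks, all of which share the same `W` and `b`.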
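For comparison with Dropout, a minimal DropConnect sketch masks individual weights and biases before the activation is applied. The sigmoid activation, the function name `dropconnect_forward` and the example values are assumptions of this illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dropconnect_forward(W, b, x, p, rng):
    """Return (y, M, g) with y = a((M * W)^T x + g * b); mask entries ~ Bernoulli(p)."""
    n, m = len(W), len(W[0])
    M = [[1 if rng.random() < p else 0 for _ in range(m)] for _ in range(n)]
    g = [1 if rng.random() < p else 0 for _ in range(m)]
    y = [sigmoid(sum(M[i][j] * W[i][j] * x[i] for i in range(n)) + g[j] * b[j])
         for j in range(m)]
    return y, M, g

rng = random.Random(0)
W = [[0.5, -1.0], [1.5, 0.3]]  # n = 2 inputs, m = 2 output units
b = [0.2, -0.1]
y, M, g = dropconnect_forward(W, b, [1.0, 0.5], p=0.5, rng=rng)
print(M, g, y)
```

Unlike Dropout, a unit whose incoming connections are all dropped still outputs $a(0)$ rather than zero, because the mask acts inside the activation function.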
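To give a sense of scale for the exhaustive search, the short calculation below counts the decimal digits of $2^K - 1$ for a single layer with 784 inputs (28 × 28 pixels) and 800 hidden units. Counting only the weights of that layer as free parameters is a simplifying assumption of this example.

```python
import math

n_inputs, n_hidden = 28 * 28, 800
K = n_inputs * n_hidden  # number of connections in one fully connected layer

# 2^K is not a power of ten, so 2^K - 1 has floor(K * log10(2)) + 1 decimal digits.
digits = math.floor(K * math.log10(2)) + 1
print(K, digits)
```

A per-connection score, in contrast, needs only one evaluation per connection, i.e., $K$ evaluations in total.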

Data-Dependent Regularization
The problem of modifying networks can be restated in the theoretical framework of IG as follows. A general model $S$ (with $K$ free parameters) can be seen as a K-dimensional manifold, and $T$ (with $k < K$ free parameters) is a smooth sub-manifold of $S$. The motivation for modifying DNNs is that the original geometric structure of $S$ should be preserved as much as possible after projecting onto the sub-manifold $T$.
Let $p_t, p_s \in S$ be the true distribution and the sampling distribution, respectively. Then, the choice of sub-manifold can be defined as the optimization problem of maximally preserving the expected Fisher information distance when projecting distributions from the parameter space of $S$ onto that of the reduced sub-manifold $T$, subject to $T$ having $k$ free parameters (Equation (12)). The reason for selecting the Fisher information distance as the distance measure is given in Section 2.3.
The CIF principle was proposed to solve the optimization problem in Equation (12), as follows. The Fisher information distance $D(p_t, p_s)$ can be decomposed into the distances of two orthogonal parts [17]. Moreover, it is possible to divide the system parameters in $S$ into two categories (corresponding to the two decomposed distances), i.e., the parameters with major variations and the parameters with minor variations, according to their contributions to the whole information distance. The former are important for reliably distinguishing the true distribution from the sampling distribution, while the parameters with minor contributions can be considered less reliable.
Regularization of neural networks can be regarded as the optimization problem in Equation (12) with $p_s$ known. In data-dependent regularization, the confidence of a connection is defined as its contribution to the whole information distance. The confident connections are preserved, while the less confident connections are ruled out. With this strategy, we can directly determine the reliable set of parameters among all k-dimensional sub-models by selecting the top-k confident parameters. We then need only $K$ trials to reach a reliable solution.

Restricted Boltzmann Machine in IG
The Restricted Boltzmann Machine (RBM) is a special case of the general BM: an RBM only has connections between visible and hidden units, and has no connections between visible and visible units or between hidden and hidden units. Let $v = (v_i)$, $v_i \in \{0, 1\}$, be the state of the visible units, and $h = (h_j)$, $h_j \in \{0, 1\}$, be the state of the hidden units. The entire state of an RBM can be represented as $\{v, h\}$. The energy function of the RBM is as follows:
$$E(v, h; \xi) = -v^T W h - b_v^T v - b_h^T h,$$
where $\xi = \{W, b_v, b_h\}$ are the parameters: $W$ are the connection weights between visible and hidden units, and $b_v$ and $b_h$ are the thresholds of the visible and hidden units, respectively. An RBM produces a stationary distribution $p(v, h; \xi) \in S_{vh}$ over $\{v, h\}$. Let $B$ denote the manifold of probability distributions $p(v, h; \xi)$ realizable by an RBM. The distribution $p(v, h; \xi)$ is:
$$p(v, h; \xi) = \frac{1}{Z} \exp\{-E(v, h; \xi)\},$$
where $Z$ is a normalization factor. The θ-coordinates (Equation (3)) for $B$ are:
$$[\theta] = (\theta_{v_i} = b_{v_i},\ \theta_{h_j} = b_{h_j},\ \theta_{v_i h_j} = W_{ij},\ 0, \ldots, 0).$$
Based on Theorem 1 in [14], $W_{ij}$ (i.e., $\theta_{ij}$) can be expressed as follows:
$$W_{ij} = \log \frac{p(v_i = 1, h_j = 1 \mid A)\, p(v_i = 0, h_j = 0 \mid A)}{p(v_i = 1, h_j = 0 \mid A)\, p(v_i = 0, h_j = 1 \mid A)},$$
where the relation holds for any condition $A$ on the remaining variables. However, it is often infeasible to calculate the exact value of $W_{ij}$ because of data sparseness. To tackle this problem, we propose to approximate the value of $W_{ij}$ by using the marginal distribution $p(v_i, h_j)$, which avoids the effect of the condition $A$. According to Equation (2), $\eta_i = p(v_i = 1)$ and $\eta_j = p(h_j = 1)$. Then, the parametrization for the RBM in 1-mixed ζ-coordinates can be calculated.
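The marginal approximation can be illustrated as follows, assuming it takes the form of a log odds ratio computed from the 2 × 2 marginal table of $(v_i, h_j)$ with all four cells positive. The function names and the table values are assumptions of this sketch.

```python
import math

def theta_from_marginal(p):
    """Log odds ratio of a 2x2 table p[(v_i, h_j)]; all four cells must be positive."""
    return math.log((p[(1, 1)] * p[(0, 0)]) / (p[(1, 0)] * p[(0, 1)]))

def eta_from_marginal(p):
    """First-order eta-coordinates: Prob(v_i = 1) and Prob(h_j = 1)."""
    eta_i = p[(1, 0)] + p[(1, 1)]
    eta_j = p[(0, 1)] + p[(1, 1)]
    return eta_i, eta_j

# Illustrative marginal table for one (visible, hidden) pair.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
theta_ij = theta_from_marginal(p)
eta_i, eta_j = eta_from_marginal(p)
zeta = (eta_i, eta_j, theta_ij)  # 1-mixed coordinates of this pair
print(zeta)
```

When the two units are independent, the odds ratio is one and the estimated `theta_ij` is zero, which corresponds to an absent connection.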

The Confidence of a Connection
Each $\theta_{ij}$ corresponds to one connection $W_{ij}$. The calculations of confidence over individual connections are not conditionally independent, so a greedy approach is adopted that measures the confidence of each connection independently. To decide whether to keep or drop $W_{ij}$, we consider two close distributions $p_1$ and $p_2$ with coordinates $\zeta_1 = \{\eta_i, \eta_j, \theta_{ij}\}$ and $\zeta_2 = \{\eta_i, \eta_j, 0\}$, respectively. Here, $p_1$ represents the case in which $W_{ij}$ is kept and $p_2$ the case in which $W_{ij}$ is dropped. Definition 1. The confidence of $\theta_{ij}$, denoted as $\rho(\theta_{ij})$, is defined as the squared Fisher information distance between $p_1$ and $p_2$:
$$\rho(\theta_{ij}) = (\zeta_1 - \zeta_2)^T G_\zeta (\zeta_1 - \zeta_2) = g_\zeta(\theta_{ij}) \cdot \theta_{ij}^2,$$
where $G_\zeta$ is the Fisher information matrix of [ζ] in Section 2.2 and $g_\zeta(\theta_{ij})$ is the Fisher information for $\theta_{ij}$.
Note that the second equality holds since $\theta_{ij}$ is orthogonal to $\eta_i$ and $\eta_j$. Therefore, the Fisher information distance between two distributions can be decomposed into two independent parts: the information distance contributed by $\{\eta_i, \eta_j\}$ and that contributed by $\{\theta_{ij}\}$. The detailed definition and calculation of the Fisher information matrix of [ζ] are described in Equation (5).
The rationality of retaining confident connections can be interpreted from the viewpoints of Maximum Likelihood Estimation (MLE) and the Akaike information criterion (AIC), respectively.
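A minimal sketch of the confidence of a single connection is given below. It assumes that the confidence is the Fisher information of $\theta_{ij}$ times $\theta_{ij}^2$, and that for a single pair of binary units the Fisher information of $\theta_{ij}$ in mixed coordinates is the reciprocal of the sum of $1/p$ over the four cells of the marginal table (the familiar asymptotic variance of a log odds ratio). Both the reduction to a 2 × 2 table and the function names are assumptions of this example.

```python
import math

STATES = [(0, 0), (0, 1), (1, 0), (1, 1)]

def theta_from_marginal(p):
    """Log odds ratio of the 2x2 marginal table p[(v_i, h_j)]."""
    return math.log((p[(1, 1)] * p[(0, 0)]) / (p[(1, 0)] * p[(0, 1)]))

def fisher_info_theta(p):
    """Assumed Fisher information of theta_ij with eta_i and eta_j held fixed."""
    return 1.0 / sum(1.0 / p[s] for s in STATES)

def confidence(p):
    """rho(theta_ij) = g(theta_ij) * theta_ij^2 under the assumptions above."""
    return fisher_info_theta(p) * theta_from_marginal(p) ** 2

dependent = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(confidence(dependent), confidence(independent))
```

A pair of statistically independent units receives zero confidence, while stronger dependence between $v_i$ and $h_j$ yields a larger value.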
The log-likelihood value of a NN mentioned above is: The log-likelihood values of including $W_{ij}$ and excluding $W_{ij}$ in coordinates [ζ], respectively, are $l_1 = \log p(x; \zeta_1)$ and $l_2 = \log p(x; \zeta_2)$. The gap between $l_1$ and $l_2$ is estimated as Then, we have where $N$ is the number of samples, and $D(\cdot, \cdot)$ denotes the Kullback-Leibler divergence. The first approximation holds in an asymptotic sense, and the third approximation is entailed by the approximate relation between the Kullback-Leibler divergence and the Riemannian distance induced by the metric tensor $G_\zeta$ [24]. When the number of samples is fixed at $N$, the confidence of a connection is approximately equal to $2/N$ times the log-likelihood gap attributable to that connection. Hence, a higher confidence means a bigger log-likelihood gap, i.e., more of the primary likelihood structure is preserved.
In model selection, given a model with $K$ parameters $\xi_K$, a sub-model with $k$ parameters $\xi_k$ is selected. AIC is a common model selection criterion [25]: $\mathrm{AIC} = -2 \log L(\hat{\xi}_k) + 2k$, where $L$ is the likelihood function given the observed samples and $\hat{\xi}_k$ is the maximum likelihood estimate of $\xi_k$. Geometrically speaking, when $k$ is fixed, the minimization of AIC, i.e., the maximization of the log-likelihood $\log L(\hat{\xi}_k)$, is asymptotically equivalent to the minimization of the Kullback-Leibler divergence $KL(p_s; p_{\xi_k})$ [26]. This reveals the consistency of the data-dependent regularization with AIC when $k$ is fixed.
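The role of a fixed $k$ can be seen in a small sketch: when all candidate sub-models have the same number of parameters, the penalty term of AIC is constant, so ranking by AIC coincides with ranking by maximized log-likelihood. The sub-model names and log-likelihood values below are made up for illustration.

```python
def aic(log_likelihood, k):
    """Akaike information criterion: -2 log L + 2k."""
    return -2.0 * log_likelihood + 2 * k

# Hypothetical maximized log-likelihoods of three sub-models with the same k.
candidates = {"sub_model_a": -1203.4, "sub_model_b": -1198.7, "sub_model_c": -1210.2}
k = 50

best_by_aic = min(candidates, key=lambda name: aic(candidates[name], k))
best_by_likelihood = max(candidates, key=candidates.get)
print(best_by_aic, best_by_likelihood)
```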

ConfNet
In this section, we introduce the implementation details of ConfNet, as shown in Algorithm 1. As mentioned, the confidence of the connections between fully connected neighboring layers equals the confidence of the connections in the corresponding RBM. First, for each input $v$, we sample the hidden state $h$ from its activation probabilities. Then, we get an extended dataset of $N$ samples from the joint space of $\langle v, h \rangle$. Let $p(v_i, h_j)$ denote the marginal sampling distribution of input unit $v_i$ and hidden unit $h_j$. To estimate $p(v_i, h_j)$, we go through all $N$ samples and count the number of samples for each assignment of $v_i$ and $h_j$. For example, $p(v_i = 0, h_j = 0)$ is the number of samples with $v_i = 0$ and $h_j = 0$ divided by $N$, etc. The 1-mixed ζ-coordinates for the RBM can be estimated from the marginal distribution $p(v_i, h_j)$ as mentioned in Section 4.1. Then, the Fisher information is calculated by Equation (7), and the confidence is calculated by Equation (15).
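The sampling and counting steps described above can be sketched as follows, assuming binary inputs and sigmoid activation probabilities. The function names, the toy weights and the randomly generated inputs are assumptions of this illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample_hidden(W, b, v, rng):
    """Sample each h_j ~ Bernoulli(a(W^T v + b)_j)."""
    n, m = len(W), len(W[0])
    probs = [sigmoid(sum(W[i][j] * v[i] for i in range(n)) + b[j]) for j in range(m)]
    return [1 if rng.random() < q else 0 for q in probs]

def estimate_marginal(D_vh, i, j):
    """Relative frequency of each assignment of (v_i, h_j) over the extended samples."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for v, h in D_vh:
        counts[(v[i], h[j])] += 1
    return {state: c / len(D_vh) for state, c in counts.items()}

rng = random.Random(0)
W = [[2.0, -1.0], [0.5, 0.0]]  # n = 2 visible units, m = 2 hidden units
b = [0.0, 0.0]
D_v = [[rng.randint(0, 1) for _ in range(2)] for _ in range(1000)]
D_vh = [(v, sample_hidden(W, b, v, rng)) for v in D_v]
p_00 = estimate_marginal(D_vh, 0, 0)  # marginal table of (v_1, h_1)
print(p_00)
```

With few samples some cells of the table can be empty, in which case a log odds ratio computed from it is undefined; how such cells are handled is not specified here.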
To decide which connections to retain or discard, we need to judge whether the confidence of $\theta_{ij}$, i.e., the Fisher information distance in the coordinate direction of $\{\theta_{ij}\}$, is significant or not. We set up a hypothesis test for the confidence ρ, i.e., the null hypothesis $\rho = 0$ versus the alternative $\rho \neq 0$. Based on the analysis in [27], we have $N \cdot \rho \sim \chi^2(1)$ asymptotically, where $\chi^2(1)$ is the chi-square distribution with one degree of freedom and $N$ is the number of samples. Then, we can calculate the confidence probability to reject the null hypothesis as follows:
$$\pi = \mathrm{cdf}_{\chi^2(1)}(N \cdot \rho),$$
where cdf is the cumulative distribution function. Following the conventions of hypothesis testing, we set the significance level α to 0.05. If $(1 - \pi) \cdot 2 < \alpha$, then the connection $W_{ij}$ is kept in the DNN; otherwise, we set $W_{ij}$ to zero. Based on this hypothesis test scheme, we can directly derive the mask $M$.
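The decision rule can be sketched with the standard library alone, using the identity that the CDF of the chi-square distribution with one degree of freedom at $x$ equals $\mathrm{erf}(\sqrt{x/2})$. The sketch assumes that π is this CDF evaluated at $N \cdot \rho$; the function names are hypothetical.

```python
import math

def chi2_1_cdf(x):
    """CDF of the chi-square distribution with one degree of freedom."""
    return math.erf(math.sqrt(max(x, 0.0) / 2.0))

def keep_connection(rho, N, alpha=0.05):
    """Keep W_ij when (1 - pi) * 2 < alpha, with pi = cdf(N * rho)."""
    pi = chi2_1_cdf(N * rho)
    return (1.0 - pi) * 2.0 < alpha

N = 1000
print(keep_connection(0.0, N))    # zero confidence: the connection is dropped
print(keep_connection(0.006, N))  # N * rho = 6: the connection is kept
```

Because of the factor of two in the rule, a connection is kept only when $1 - \pi$ is below $\alpha / 2$.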

Stochastic ConfNet
In ConfNet, the confident connections are retained, while the less confident ones are dropped without discrimination. In addition, because the calculations of confidence over individual connections are not conditionally independent, such a greedy strategy can usually find only a locally optimal solution instead of a global one. In particular, the final performance of the greedy process is likely to be worse if the previously selected sub-network is far from the optimal one.
To relieve this problem, a probabilistic sampling strategy (Stochastic ConfNet) is adopted to choose a subset of confident connections from a biased distribution. The probability of retaining a connection under this distribution is proportional to its confidence. With such a sampling strategy, connections with higher confidence are more likely to be chosen than those with lower confidence, whereas the binary strategy directly drops the connections with lower confidence. Any connection can be chosen in Stochastic ConfNet, even one with low confidence, which leads to a more robust optimization process.
The number of retained connections adapts dynamically during the training process. Figure 1 explains the intention behind the self-adaptive sampling probability. The average confidence of all connections tends to increase as training proceeds, while the network carries more variations. Intuitively, the quality of the connections should improve steadily from the randomly initialized weights to the well-trained weights, which motivates the self-adaptive sampling probability. Min-max normalization of the confidence ρ is carried out, and the sampling probability is then calculated from the dropping ratios, where $p_i$ and $p_{i-1}$ are the dropping ratios at the i-th and (i − 1)-th iterations, respectively, while $1 - p_i$ and $1 - p_{i-1}$ are the corresponding ratios of retained connections. β and $p^*$ are constants between 0 and 1. The initial value $p_0$ can be set to the dropping ratio of ConfNet at the first iteration, or to 0.5 for simplicity.
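The following sketch illustrates only the general idea of confidence-biased sampling: confidences are min-max normalized and each connection is kept with a probability that grows with its normalized confidence. It does not reproduce the self-adaptive update of the dropping ratio. The parameter `floor`, which gives low-confidence connections a nonzero chance of being kept, is a hypothetical choice made for this example.

```python
import random

def min_max_normalize(values):
    """Rescale values to [0, 1]; a constant list maps to all ones (arbitrary choice)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def sample_mask(confidences, floor, rng):
    """Keep connection c with probability floor + (1 - floor) * normalized confidence.

    `floor` is a hypothetical parameter of this sketch, not part of the described method.
    """
    probs = [floor + (1.0 - floor) * c for c in min_max_normalize(confidences)]
    mask = [1 if rng.random() < q else 0 for q in probs]
    return mask, probs

rng = random.Random(0)
confidences = [0.002, 0.010, 0.154, 0.040]  # illustrative values
mask, probs = sample_mask(confidences, floor=0.1, rng=rng)
print(probs, mask)
```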

Training with Back-Propagation
In Algorithm 1, a regularization mask $M$ for the weights $W$ is calculated. A reduced NN is obtained by $W \leftarrow M \otimes W$, while $b$ remains unchanged. Then, the reduced NN is trained by back-propagation. The training of ConfNet or Stochastic ConfNet stops when the classification error converges on the training dataset. The above approach is described in Algorithm 2.
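One iteration of the masked training procedure can be sketched for a single sigmoid layer. The squared-error loss, the fixed mask and all numerical values are assumptions of this example; in the procedure above the mask is recomputed from the confidences at every iteration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reduce_weights(W, M):
    """W <- M * W (element-wise)."""
    return [[mij * wij for mij, wij in zip(m_row, w_row)] for m_row, w_row in zip(M, W)]

def forward(W, b, x):
    n, m = len(W), len(W[0])
    return [sigmoid(sum(W[i][j] * x[i] for i in range(n)) + b[j]) for j in range(m)]

def loss(W, b, x, target):
    return 0.5 * sum((y - t) ** 2 for y, t in zip(forward(W, b, x), target))

def masked_training_step(W, b, M, x, target, lr=0.1):
    """Reduce the network, feed forward, back-propagate, and update W and b."""
    W = reduce_weights(W, M)
    y = forward(W, b, x)
    delta = [(y[j] - target[j]) * y[j] * (1.0 - y[j]) for j in range(len(y))]
    W = [[W[i][j] - lr * delta[j] * x[i] for j in range(len(y))] for i in range(len(W))]
    b = [b[j] - lr * delta[j] for j in range(len(y))]
    return W, b

W = [[0.2, -0.4], [0.1, 0.3]]
b = [0.0, 0.0]
M = [[1, 0], [1, 1]]  # hypothetical mask: connection (1, 2) is dropped
x, target = [1.0, 0.5], [1.0, 0.0]
before = loss(reduce_weights(W, M), b, x, target)
for _ in range(100):
    W, b = masked_training_step(W, b, M, x, target)
after = loss(reduce_weights(W, M), b, x, target)
print(before, after)
```

Since the update is applied to all weights, a dropped connection receives a gradient and is reset to zero at the next reduction step; with a recomputed mask it could instead re-enter the network.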

Experimental Setup
The MNIST dataset [28] consists of 28 × 28 pixel handwritten digit images. The task is to classify the images into 10 digit classes. The dataset has 60,000 training images and 10,000 test images. We scale the pixel values to the [0, 1] range before inputting them to our models. The first model includes an 800-unit fully connected layer with the sigmoid activation function and is trained with pre-training [15]. We also test a model with two 800-unit fully connected layers and the sigmoid activation function, without pre-training. Back-propagation is used to train the neural networks with the learning rate set to 0.1. No data augmentation is used in the experiments.
The CIFAR-10 dataset [29] consists of 32 × 32 natural RGB images. The task is to classify the images into 10 classes. The dataset has 50,000 training images and 10,000 testing images. We also scale the pixel values to the [0, 1] range and subtract from each image the mean value of each channel computed over the dataset. The feature extractor is: a convolutional layer with 32 feature maps and 5 × 5 filters, a max pooling layer with region 3 × 3 and stride 2, a convolutional layer with 32 feature maps and 5 × 5 filters, a max pooling layer with region 3 × 3 and stride 2, a convolutional layer with 64 feature maps and 5 × 5 filters, and a max pooling layer with region 3 × 3 and stride 2. After that, we input the extracted features into a fully connected layer with 64 units. Sigmoid activation functions are used. Regularization is applied on the fully connected layer. No pre-training is used.
The CIFAR-100 dataset [29] is similar to CIFAR-10, but the task is to classify images into 100 classes. Each class has 600 images, including 500 training images and 100 testing images. The feature extractor, parameters and training settings are the same as for CIFAR-10, except that there are two fully connected layers with 128 units each on top of the feature extractor. Sigmoid activation functions are used. Regularization is applied on both fully connected layers. No pre-training is used.

Experimental Results
The classification results on MNIST, CIFAR-10 and CIFAR-100 are shown in Table 1. On all experimental datasets, the DNNs with regularization achieve better performance than the DNNs without regularization. This reflects the existence of overfitting during training and the effectiveness of regularization by reducing the DNN. Moreover, in most cases, the test classification errors of the DNNs with data-dependent regularization (i.e., ConfNet, Stochastic ConfNet) are clearly lower than those of the DNNs with data-independent regularization (i.e., Dropout, DropConnect). On all experimental datasets, Stochastic ConfNet gives the best performance, for both the top-1 and the top-5 classification errors. ConfNet also performs well. The results in bold are those with the lowest error rate. On all three datasets, the retention probability $p$ of both Dropout and DropConnect is 0.5, and the initial value $p_0$ of Stochastic ConfNet is 0.6. The hyper-parameters β and $p^*$ of Stochastic ConfNet are 0.025 and 0.1, respectively, on MNIST, and 0.1 and 0.4, respectively, on CIFAR-10 and CIFAR-100.
The following analyses are all based on the experiments on MNIST. Figure 2a displays the training error curve and the test error curve over the training iterations. We can see that the gap between the training error curve and the test error curve is reduced by regularization, and all DNNs with regularization (Dropout, DropConnect, ConfNet and Stochastic ConfNet) significantly outperform the fully connected DNNs. ConfNet and Stochastic ConfNet achieve superior performance. Figure 2b shows the performance of the different regularization methods as the network size increases. Seven settings for the number of hidden units, i.e., [200, 400, 600, 800, 1000, 1200, 1400], are tested. For almost all model sizes, ConfNet and Stochastic ConfNet consistently give a lower error rate than the fully connected DNNs, especially for larger model sizes (e.g., 800, 1000).
In Figure 3a, the average confidence of each model increases steadily over epochs, especially for ConfNet and Stochastic ConfNet. It is reasonable that the network carries more useful information as training proceeds, and regularization by ConfNet and Stochastic ConfNet achieves this better. Figure 3b shows the standard deviation of the confidence of the connections. A large standard deviation indicates a strong ability to discover the significant connections. Stochastic ConfNet is the most outstanding on these two indices. A control experiment that removes the confident connections at each epoch is also implemented. Figure 5 shows the results. The test error remains high, and the model does not converge. At about 150 epochs, the test error fluctuates around 18.5%.

Conclusions
In conventional regularization methods (Dropout, DropConnect), the network is regularized by randomly drawing a sub-network from a uniform prior.In this paper, we propose a data-dependent regularization method inspired by the Confident Information First (CIF) principle based on Information Geometry.We have proven that retaining highly confident connections means that the more primary likelihood-structure is preserved, and such a strategy is consistent with the AIC.Moreover, a self-adaptive probabilistic sampling process is used to fit the change of confidence along training to obtain more effective regularization.Empirical evaluation results have demonstrated the superiority of our approach.
Our future research will be focused on the following directions.First, a more generalized theory to demonstrate the relationships among CIF, compressed sensing and random projection will be explored.Second, the ConfNet is currently obtained by a greedy algorithm, so the relative error and time complexity need to be measured.Third, there is a sampling process before calculating the confidence of a connection, which has a negative impact on efficiency.Therefore, a more efficient method will be investigated.Fourth, the application and evaluation of the proposed methods on extra large networks and extending them onto the convolution layer will also be considered in future work.

Algorithm 1: Data-dependent Regularization.
Input: An RBM with input {v_1, v_2, ..., v_n}, weight parameters W (of size n × m) and bias b (of size m × 1); N input samples D_v; significance level α
Output: Regularization mask M for weights W
// generate the extended samples D_vh
D_vh ← {}
for v ∈ D_v do
    Calculate the activation of output: y = a(W^T × v + b)
    Sample h based on the calculated activations
    D_vh ← D_vh ∪ <v, h>
end
M ← zero matrix
for W_ij ∈ W do
    Estimate marginal distribution p(v_i, h_j) from D_vh
    Parameterize to ζ-coordinates (η_i, η_j, θ_ij)
    Calculate the confidence ρ(θ_ij) and test the null hypothesis ρ = 0
    if the null hypothesis is rejected then M_ij ← 1
end
return M

Figure 1. The average and standard deviation of confidence of connections. (a) the connections of the fully connected layer on MNIST; (b) the connections of the first fully connected layer on CIFAR-10; (c) the connections of the first fully connected layer on CIFAR-100; (d) the connections of the second fully connected layer on CIFAR-100.

Algorithm 2: Training with Back-propagation.
Input: A fully connected layer of a DNN with input {x_1, x_2, ..., x_n}; N input samples D_x
Output: weights W and biases b
Initialize W and b
while classification error has not converged do
    Calculate the mask M via Algorithm 1
    Reduce NN by W ← M ⊗ W
    Feed-forward: y = a(W^T × x + b)
    Differentiate loss with respect to W and b
    Update W and b using the back-propagated gradients
end
return W and b

Figure 2. The classification performances on MNIST.(a) the classification performances on MNIST as the iterations progress; (b) the classification performances on MNIST for different model sizes.

Figure 3. The average and standard deviation of confidence of the connections in the training phase on MNIST. (a) the average confidence of the connections; (b) the standard deviation of confidence of the connections.

In Figure 4, the confidence of connections is normalized. The x-axis shows the normalized confidence, and the distributions of confidence at several different epochs are shown. As the number of training iterations increases, the distribution curve of confidence moves continually to the right. This means that the confidence becomes larger with training, as the network carries more useful information.

Figure 4. The distribution of confidence of connections in the training phase on MNIST. (a) the distribution of confidence of connections in the training phase; (b) the magnified Figure 4a.

Figure 5.The classification performances on MNIST with the removal of confident connections.