Smooth Group L 1/2 Regularization for Pruning Convolutional Neural Networks

Abstract: In this paper, a novel smooth group L 1/2 (SGL 1/2 ) regularization method is proposed for pruning hidden nodes of the fully connected layer in convolutional neural networks. Usually, the selection of nodes and weights is based on experience, and the convolution filter is symmetric in the convolutional neural network. The main contribution of SGL 1/2 is to drive the weights toward 0 at the group level, so that a hidden node can be pruned if its corresponding weights are all close to 0. Furthermore, because the regularizer is smooth, the feasibility analysis of this new method is carried out under some reasonable assumptions. The numerical results demonstrate the superiority of the SGL 1/2 method with respect to sparsity, without damaging the classification performance.


Introduction
CNNs have been widely applied in many applications, such as intelligent information processing, pattern recognition, feature extraction [1][2][3][4][5], etc. As usual, slightly more hidden layer nodes were selected based on experience in neural networks. However, as is well known, too many nodes and weights in a deep network will increase the computational load, memory size and the risk of overfitting [6]. In fact, some hidden layer nodes and weights have little contribution to improving the performance of the network [7]. Therefore, choosing an appropriate number of hidden layer nodes and weights has become an important research topic in optimizing neural networks. Many algorithms have been proposed in order to optimize the number of nodes and weights in the neural network.
As one of the most effective methods to reduce the number of weights in the network, regularization terms were introduced into the learning process. This is generally realized with L p regularization, which penalizes the sum of the weight norms during training. The L 1 norm is the sum of the absolute values of the elements in a vector, and penalizing it drives the weight values toward zero [8]. In [9], the L 1 -norm was combined with the capped L 1 -norm to denote the amount of information extracted through the filter and to control regularization. Gou et al. [10] proposed a discriminative collaborative representation-based classification (DCRC) method through L 2 regularizations to improve the classification capabilities. Xu et al. adopted the L 1/2 regularizer to transform a non-convex problem into a series of L 1 regularizer problems, and showed many superior properties, such as robustness, sparsity and oracle properties, compared to the L 1 and L 2 regularizers [11]. In [12], Xiao introduced sparse logistic regression (SLR) based on L 1/2 regularization to impose a sparsity constraint on logistic regression. The algorithms mentioned above optimize the network only by pruning the weights.
Regularization methods have become more impressive than before, but all of them were designed mainly for pruning the superfluous weights, and the node can be deleted only if all its outgoing weights are close to zero. Then, researchers tried to prune the nodes to optimize the neural network. Simon et al. provided a group lasso method, which produced sparse effects both on and within the group, and showed the expected effect of group-wise and within-group sparsity [13][14][15]. Moreover, [16] considered a more general penalty and blended the lasso with the group lasso, which yielded solutions that are sparse at both the group and the individual feature level. For pruning the nodes of the network, the popular group lasso method (GL 2 ) imposes sparsity at the group level, so that either all the weights between nodes in the fully connected layer and all nodes of the output layer approach zero simultaneously, or none of them are close to zero. In other words, the group lasso regularization prunes the nodes of the fully connected layer, but does not prune redundant weights of surviving nodes.
It was shown that combining the L 1/2 regularization with the group lasso (GL 1/2 ) for feedforward neural networks can prune not only hidden nodes but also the redundant weights of the surviving hidden nodes, and can achieve better performance in terms of sparsity [17]. However, L 1/2 regularization is not smooth at the origin, which results in oscillation during the numerical computation and causes difficulty in the feasibility analysis. To overcome these issues, the regularizer was approximated with a smooth function in our earlier work [18]. Furthermore, in [19], the smooth L 1/2 was applied to train the Sigma-Pi-Sigma neural network, and achieved better performance regarding both sparsity at the weight level and accuracy compared to the non-smooth L 1/2 .
In this article, we combine the smooth L 1/2 regularization with the group lasso method, and propose a smooth group L 1/2 (SGL 1/2 ) regularization algorithm. This novel algorithm inherits the advantages of the smooth function and L 1/2 regularization. As an application, the smooth group L 1/2 regularization algorithm is employed for the fully connected layer of CNNs. The main contribution of smooth group L 1/2 is to prune unnecessary nodes and control the magnitude of the weights of the surviving nodes. In addition, due to the differentiability of the error function with smooth group L 1/2 regularization, it becomes easier to analyze the feasibility of the learning algorithm in theory. In the process of training the network, compared with GL 1 , GL 2 , and GL 1/2 , smooth group L 1/2 regularization can not only prune the nodes and weights (improve the sparsity), but also overcome the oscillation in GL 1/2 .
This paper is organized as follows. We first describe the simple process of the convolutional neural network and the smooth group L 1/2 regularization in the next section. Then, in Section 3, the feasibility analysis of the SGL 1/2 algorithm in CNNs is given, in which the training convergence with the SGL 1/2 term is proven theoretically. Numerical comparisons of several methods on four real-world datasets are carried out in Section 4. Finally, some conclusions are drawn in Section 5. In order to highlight the key points of this paper, the proofs of the theorems are given in Appendix A.

Brief Description of CNNs and Smooth Group L 1/2 Regularization
In this section, we first demonstrate the simple calculation process of the convolutional neural network. After introducing the penalty term, the SGL 1/2 regularization is briefly described in Section 2.2.

Convolutional Neural Network
CNNs consist of three building blocks: convolution [20], pooling [21] and the fully connected layer [22]. Generally, the convolution filter is set as a symmetric matrix in CNNs. A filter in a convolution layer carries out a convolution operation on input images to obtain new feature maps, which can be expressed as:
$$x_j^l = \sum_{i \in P_j} x_i^{l-1} * k_j^l + b_j^l,$$
where $x_j^l$ is the j-th feature map in the l-th layer, $k_j^l$ denotes the convolution filter, the convolution operation is denoted by $*$, $b_j^l$ is the bias, and $P_j$ is a set of feature maps in the (l − 1)-th layer activated by filter $k_j^l$. After $x_j^l$ is activated by a function, such as ReLU [23,24], the pooling layer uses the max or mean approach to progressively reduce the spatial size of the representation, giving $\mathrm{pool}(\mathrm{ReLU}(x_j^l))$, where the pool function can be selected as the maximum or average as needed and the ReLU function can be written as:
$$\mathrm{ReLU}(x) = \max(0, x).$$
A CNN may include several convolution-ReLU-pooling parts. The output of the last pooling layer is flattened into one large vector $\mathrm{reshape}(\mathrm{pool}(\mathrm{ReLU}(x_j^l)))$ [25] and is fed to a fully connected layer for classification purposes. The final classification decision is driven by the following equation:
$$O = g\big(U \cdot \mathrm{reshape}(\mathrm{pool}(\mathrm{ReLU}(x_j^l)))\big),$$
where O is the actual output vector, U denotes the weight of the fully connected layer, g(·) represents the activation function, and reshape(·) denotes a function that transforms a specified matrix into a matrix of specific dimensions. The image is assigned to the i-th category if the i-th element of the output g(·) is the largest one. For a two-class problem, U degenerates into a vector and the activation function g(·) is generally the sigmoid function.
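The convolution-ReLU-pooling-fully connected pipeline described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the implementation used in the paper: a single filter and a single input channel, no padding, stride 1, mean pooling over non-overlapping 2 × 2 windows, a sigmoid for g(·), and the array sizes are all choices made here for illustration. As in most CNN code, the filter is applied without flipping.

```python
import numpy as np

def conv2d_valid(x, k, b=0.0):
    """Slide filter k over x (no padding, stride 1) and add the bias b."""
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k) + b
    return out

def relu(x):
    return np.maximum(0.0, x)

def pool2d(x, size=2, mode="mean"):
    """Non-overlapping pooling; rows/columns that do not fill a window are dropped."""
    h, w = x.shape[0] // size, x.shape[1] // size
    blocks = x[:h * size, :w * size].reshape(h, size, w, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, k, b, U):
    """O = g(U . reshape(pool(ReLU(x * k + b)))) for one filter."""
    feature = pool2d(relu(conv2d_valid(x, k, b)))
    return sigmoid(U @ feature.reshape(-1))

# Hypothetical sizes: 28 x 28 input, 5 x 5 filter, 10 output nodes.
rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28))
k = rng.standard_normal((5, 5))
U = 0.01 * rng.standard_normal((10, 12 * 12))
O = forward(x, k, 0.1, U)
predicted_class = int(np.argmax(O))   # index of the largest output element
```

The predicted category is the index of the largest element of O, which mirrors the decision rule stated in the text.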

SGL 1/2 Regularization for Fully Connected Layer
The error in the CNN is usually calculated by the following equation: where J represents the number of samples, and $T_j$ and $O_j$ are the target and actual output vectors of the j-th sample, respectively. Let r be the number of output nodes and q be the number of nodes in the fully connected layer. The error function with a penalty term is defined as where the vector $u_k$ is the weight vector connecting the k-th node of the fully connected layer and all output nodes, and λ is the penalty term coefficient. The norm could be the 1-norm, 2-norm or 1/2-norm. When $\|u_k\| = \sum_{i=1}^{r} |u_{ik}|$, it is the GL 1 method, while $\|u_k\| = \big(\sum_{i=1}^{r} u_{ik}^2\big)^{1/2}$ is the GL 2 method. Specifically, we take the 1/2-norm [26]. Then, where | · | is the normal absolute value function. When the norm takes the 1/2-norm, we call it the GL 1/2 method. Nevertheless, the partial derivative of E with respect to $u_{ik}$ does not exist at the origin, which creates difficulties for the gradient descent method. Even though the partial derivative can be expressed with a piecewise function, it still causes fluctuations in the process of training. In order to overcome this drawback, an SGL 1/2 regularization is proposed in this paper: where f (·) represents a smooth function that approximates | · | (the absolute value function). Specifically, the following piecewise polynomial function is used: where m is a small positive constant. This function f (·) has the following characteristics: In summary, when the norm is taken as the 1/2-norm and the absolute value function in the 1/2-norm is approximated by a smooth polynomial function near x = 0, the method is called the SGL 1/2 method.
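To make the four penalties concrete, the following sketch evaluates them on a weight matrix U whose k-th column is $u_k$. Two points are assumptions made here rather than statements from the text: the group 1/2-norm is read as $\big(\sum_{i=1}^{r} |u_{ik}|\big)^{1/2}$, and the smoothing function is taken to be the piecewise polynomial commonly used in the smooth L 1/2 literature, $f(x) = |x|$ for $|x| \ge m$ and $f(x) = -\frac{x^4}{8m^3} + \frac{3x^2}{4m} + \frac{3m}{8}$ for $|x| < m$. The function names are hypothetical.

```python
import numpy as np

def smooth_abs(x, m=0.05):
    """Assumed smooth approximation of |x|: polynomial on (-m, m), |x| elsewhere."""
    x = np.asarray(x, dtype=float)
    inner = -x**4 / (8 * m**3) + 3 * x**2 / (4 * m) + 3 * m / 8
    return np.where(np.abs(x) >= m, np.abs(x), inner)

def group_penalty(U, kind="SGL1/2", m=0.05):
    """Sum over the columns u_k of U (shape r x q) of the chosen group norm."""
    if kind == "GL1":
        per_group = np.abs(U).sum(axis=0)
    elif kind == "GL2":
        per_group = np.sqrt((U**2).sum(axis=0))
    elif kind == "GL1/2":
        per_group = np.sqrt(np.abs(U).sum(axis=0))
    elif kind == "SGL1/2":
        per_group = np.sqrt(smooth_abs(U, m).sum(axis=0))
    else:
        raise ValueError(f"unknown penalty: {kind}")
    return float(per_group.sum())

# Two fully connected nodes, two output nodes; the second node has all-zero weights.
U_example = np.array([[3.0, 0.0],
                      [4.0, 0.0]])
penalties = {kind: group_penalty(U_example, kind) for kind in ("GL1", "GL2", "GL1/2", "SGL1/2")}
```

With this choice of f, the polynomial piece matches |x| in value, slope and curvature at |x| = m and equals 3m/8 at the origin, so the quantity under the square root in the smooth penalty never reaches zero.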

Feasibility Analysis of the SGL 1/2 Algorithm in CNNs
We now give the feasibility analysis of the SGL 1/2 algorithm. In order to obtain the convergence results, we first express the CNN in mathematical formulae. Then, we proceed to give the convergence results.

Transform Convolution and Mean Pooling into Mathematical Equations
In regular neural networks, every layer is made up of a set of neurons, where each neuron is fully connected to all neurons in the previous layer. This operation is easily expressed by multiplying matrices. However, in CNNs, the neurons in one layer do not connect to all the neurons in the next layer but only to a small part of it. The convolution operation is often described graphically. Thus, our first task is to transform the convolution operation into mathematical equations.
Although the convolution filter is usually symmetrical, for universal applicability in the proof, we choose a general matrix. Let an input array be filtered by a 2 × 2 filter, where the padding is 0 and the step is 1. As shown in Figure 1, when the filter slides over the input, a matrix multiplication of a submatrix of the input and the filter is performed and the sum of the convolution moves into the feature map, i.e., the output of this layer.

[Figure 1: input array, filter and output.]
To express this operation with mathematical equations, we squash each submatrix that is multiplied with the filter into a vector. More specifically, the red square of the input array in Figure 2 is squashed into the vector $(x_{11}, x_{12}, x_{21}, x_{22})$. Then, we put all squashed vectors into a matrix X in the order of the filter sliding, as shown in Figure 2. The filter is also squashed into a vector $(v_{11}, v_{12}, v_{21}, v_{22})^T$ accordingly, and then is repeatedly put into the diagonal position of the matrix V, as shown in Figure 3. Other elements of V are 0. With X and V, the operation of the convolutional layer can be described with the matrix multiplication of X and V, as shown in Figure 4, i.e., the product XV (Equation (10)).
The mean pooling is assumed to be applied in 2 × 2 patches of the feature map with a stride of (2, 2). It can also be expressed with the matrix multiplication, as shown in Figure 5. Each patch of the feature map is flattened into a vector and all the vectors are merged into a matrix as a reshaped feature map. The sliding mean window is flattened as a vector $(\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4})^T$ and is repeatedly put into the diagonal position of the mean matrix M. As in Equation (10), the mean pooling operation can be expressed as the matrix multiplication of the reshaped feature map and the mean matrix M.
Figure 5. The mean pooling is described with the matrix multiplication of the reshaped feature map and the mean matrix.
Now, given an input array X, the processing procedure from the convolution to the output layer can be expressed by mathematical equations. The output of the convolution layer is
$$A = G(X)V,$$
where the function G means the reshape operation shown in Figure 2. After the ReLU layer, the matrix ReLU(A) is reshaped as G(ReLU(A)). In the pooling layer, the mean function is used, which gives $G(\mathrm{ReLU}(A))M$. Then, the output matrix of the pooling layer is vectorized by column scan, and this process is denoted by $F(G(\mathrm{ReLU}(A))M)$, where the function F denotes vectorization by column. Finally, the fully connected layer is
$$O = g\big(U F(G(\mathrm{ReLU}(A))M)\big).$$
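The matrix form of the convolution and of the mean pooling can be checked numerically on a toy input. The sketch below is one possible reading of the construction in Figures 2-5, which are not reproduced here: all patches are laid out in a single row, and the flattened filter (or the mean vector) is repeated on the block diagonal by means of a Kronecker product. The 5 × 5 input, the filter values and the helper name `patches` are illustrative assumptions.

```python
import numpy as np

def patches(a, size, stride):
    """Flatten every size x size window of a, in sliding order, into one row."""
    rows = [a[i:i + size, j:j + size].reshape(-1)
            for i in range(0, a.shape[0] - size + 1, stride)
            for j in range(0, a.shape[1] - size + 1, stride)]
    return np.array(rows)

x = np.arange(25, dtype=float).reshape(5, 5)    # toy 5 x 5 input array
v = np.array([[1.0, 2.0], [3.0, 4.0]])          # general (non-symmetric) 2 x 2 filter

# Convolution, padding 0 and step 1: one squashed patch per filter position.
X = patches(x, size=2, stride=1)                 # shape (16, 4)
n = X.shape[0]
V = np.kron(np.eye(n), v.reshape(-1, 1))         # shape (64, 16), filter on the block diagonal
feature_map = (X.reshape(1, -1) @ V).reshape(4, 4)

# Mean pooling, 2 x 2 window and stride (2, 2), applied after ReLU.
activated = np.maximum(0.0, feature_map)
P = patches(activated, size=2, stride=2)         # reshaped feature map, shape (4, 4)
M = np.kron(np.eye(P.shape[0]), np.full((4, 1), 0.25))   # mean matrix, shape (16, 4)
pooled = (P.reshape(1, -1) @ M).reshape(2, 2)

# Direct sliding-window computation, used only to check the matrix form.
direct = np.array([[np.sum(x[i:i + 2, j:j + 2] * v) for j in range(4)]
                   for i in range(4)])
```

Because both operations reduce to a fixed linear map, the derivatives with respect to the filter entries follow from ordinary matrix calculus, which is what the convergence analysis relies on.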

Convergence Results
To prove that our proposed method is feasible, here, we give the convergence results. For ease of understanding, we take the simplest single-layer CNN case as an example. This CNN includes one convolution, one pooling and one fully connected layer, where the convolution filter size is (5, 5) and the mean pooling size is (2,2).
Given the training sample set $\{X_j, T_j\}_{j=1}^{J}$, each $X_j$ is assumed to be a 28 × 28 input array and $T_j$ a 10 × 1 vector. According to Equations (12)-(15), the error function of Equation (8) can be expressed in terms of V, M and U; this expression is Equation (16). Training a CNN involves finding suitable V and U so that E reaches the minimum [27]. For this reason, the gradient descent method [28] is adopted. Notice that the mean matrix M does not need to be trained. In the backpropagation algorithm, V and U are changed according to the gradient descent direction of E. The partial derivative of E with respect to the element $u_{ik}$ of U is given by Equation (17). The partial derivative of E with respect to the element $v_{ij}$ of the convolution filter V is the same as in the original CNN, because the partial derivative of the penalty term in Equation (16) with respect to $v_{ij}$ is zero; it is given by Equation (18), where δ is the derivative of the rectified linear unit function, that is, $\delta(x) = 1$ for $x > 0$ and $\delta(x) = 0$ otherwise. Thus far, we have given the step direction of U and V by (17) and (18), respectively. Now, we proceed to give the step direction of the biases. The partial derivative with respect to the biases can be computed similarly, as shown in [29]; the reader can refer to this article for more details.
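The contribution of the smooth penalty to the derivative with respect to $u_{ik}$ can be sketched and checked numerically. The code assumes, as a reading of the text rather than a quotation of it, that the penalty is $\lambda \sum_{k=1}^{q} \big(\sum_{i=1}^{r} f(u_{ik})\big)^{1/2}$ with the piecewise polynomial f commonly used for smooth L 1/2 regularization. Under that assumption the chain rule gives $\lambda f'(u_{ik}) / \big(2 (\sum_{i} f(u_{ik}))^{1/2}\big)$. All function names are hypothetical.

```python
import numpy as np

def f(x, m):
    """Assumed smooth approximation of |x|."""
    return np.where(np.abs(x) >= m, np.abs(x),
                    -x**4 / (8 * m**3) + 3 * x**2 / (4 * m) + 3 * m / 8)

def f_prime(x, m):
    """Derivative of f; continuous at |x| = m and zero at the origin."""
    return np.where(np.abs(x) >= m, np.sign(x),
                    -x**3 / (2 * m**3) + 3 * x / (2 * m))

def sgl_penalty(U, lam, m):
    return lam * np.sqrt(f(U, m).sum(axis=0)).sum()

def sgl_penalty_grad(U, lam, m):
    group = np.sqrt(f(U, m).sum(axis=0))    # one value per column u_k, always > 0
    return lam * f_prime(U, m) / (2.0 * group)

def numerical_grad(fun, U, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    g = np.zeros_like(U)
    for idx in np.ndindex(*U.shape):
        step = np.zeros_like(U)
        step[idx] = eps
        g[idx] = (fun(U + step) - fun(U - step)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
U = 0.1 * rng.standard_normal((3, 5))       # r = 3 output nodes, q = 5 hidden nodes
U[:, 0] = 0.0                               # a node whose weights are exactly zero
lam, m = 0.01, 0.05
analytic = sgl_penalty_grad(U, lam, m)
numeric = numerical_grad(lambda W: sgl_penalty(W, lam, m), U)
```

The gradient stays finite even for the all-zero column, which is the practical difference from the non-smooth GL 1/2 penalty, whose derivative is undefined there.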
We combine all weights and biases into a large vector W. Then, the parameter updating algorithm of SGL 1/2 is defined as follows:
$$W^{n+1} = W^{n} - \eta \nabla E(W^{n}), \qquad (19)$$
where η is the learning rate and n is the iteration step. The convergence proof needs some assumptions as follows:
(1) $\|W^n\|$ (n = 0, 1, 2, . . .) are uniformly bounded, where $W^n$ is the parameter vector at the n-th step.
In addition, if Assumption (3) also holds, then the strong convergence result holds:
(iv) There exists a point $W^* \in \Phi_0$ such that $\lim_{n \to \infty} W^n = W^*$.
The proof is not the focus of this article, so we include it in Appendix A.
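The update rule can be illustrated on a toy problem. The sketch below is not the CNN experiment of this paper: it replaces the network by a linear map O = UX, uses a mean squared error, and assumes the same smooth penalty form as in the text with a commonly used choice of f. The data, sizes, learning rate and penalty coefficient are all illustrative assumptions. With a sufficiently small learning rate, the recorded loss values should not increase from one step to the next.

```python
import numpy as np

def f(x, m):
    return np.where(np.abs(x) >= m, np.abs(x),
                    -x**4 / (8 * m**3) + 3 * x**2 / (4 * m) + 3 * m / 8)

def f_prime(x, m):
    return np.where(np.abs(x) >= m, np.sign(x),
                    -x**3 / (2 * m**3) + 3 * x / (2 * m))

def loss_and_grad(U, X, T, lam, m):
    """Mean squared error of the linear model plus the assumed smooth group penalty."""
    J = X.shape[1]
    residual = U @ X - T                          # shape (r, J)
    group = np.sqrt(f(U, m).sum(axis=0))          # shape (q,)
    loss = 0.5 * np.sum(residual**2) / J + lam * group.sum()
    grad = residual @ X.T / J + lam * f_prime(U, m) / (2.0 * group)
    return loss, grad

rng = np.random.default_rng(0)
q, r, J = 6, 3, 50                                # hidden nodes, output nodes, samples
X = rng.standard_normal((q, J))
U_true = rng.standard_normal((r, q))
U_true[:, 3:] = 0.0                               # the last three nodes carry no signal
T = U_true @ X

U = 0.1 * rng.standard_normal((r, q))
eta, lam, m = 0.05, 0.01, 0.05
losses = []
for n in range(500):
    loss, grad = loss_and_grad(U, X, T, lam, m)
    losses.append(loss)
    U = U - eta * grad                            # W^{n+1} = W^n - eta * grad E(W^n)

column_norms = np.linalg.norm(U, axis=0)          # small norms indicate prunable nodes
```

Inspecting `column_norms` after training shows which groups the penalty has shrunk; how small a norm must be before a node is pruned is a separate threshold choice.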

Numerical Experiments and Discussion
We evaluate SGL 1/2 in terms of node sparsity [30] and weight sparsity [31], training and test accuracy, the norm of the weight gradient and the convergence speed, on four typical benchmark datasets: MNIST [32], Letter Recognition [33], CIFAR-10 [34] and Crowded Mapping. For parameter sparsity, SGL 1/2 is compared with some conventional and sparse algorithms including GL 1 , GL 2 and GL 1/2 . Moreover, we investigate the test accuracy by comparing SGL 1/2 with the above regularization algorithms.
For the following numerical experiments, we refer to the arithmetic optimization algorithm [35] and adopt a five-fold cross-validation technique [36][37][38]. We randomly divide the dataset into five parts of equal (or almost equal) size. The network learning of these four algorithms is carried out five times. Each time, one of the five parts is selected in turn as the test sample set, and the other four parts are used as the training sample set. Then, we reshuffle the samples into five new parts and start the process again. This process is repeated twenty times. The experiment process is given in Algorithm 1.
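The repeated five-fold splitting described above can be sketched as an index generator. This is an illustrative sketch: the function name, the seed and the sample count of 103 are assumptions, and the training itself is left out.

```python
import numpy as np

def repeated_kfold(n_samples, n_folds=5, n_repeats=20, seed=0):
    """Yield (train_idx, test_idx) pairs; samples are reshuffled before each repeat."""
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        folds = np.array_split(rng.permutation(n_samples), n_folds)
        for i in range(n_folds):
            train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            yield train, folds[i]

# 5 folds x 20 repeats gives 100 train/test splits per dataset and algorithm.
splits = list(repeated_kfold(n_samples=103))
```

Each of the 100 splits would then be used to train one network per regularizer and to record the pruning rates and accuracies.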

Algorithm 1. The experiment process.
Step 1: Input the data and calculate the corresponding actual output;
Step 2: Calculate the difference between the actual output and the ideal output;
Step 3: Give the error function according to these four algorithms;
Step 4: Update the weights of the fully connected layer according to the error function of Step 3;
Step 5: Keep iterating, repeating Steps 2-4;
Step 6: Calculate the pruned nodes, the pruned weights of surviving nodes and the classification accuracies under these four methods, respectively;
Step 7: Compare the results of the four methods.
Finally, for each dataset and algorithm, we obtain one hundred classification results. Each result contains the rate of pruned nodes (Rate of PN) (cf. Equation (20)), the rate of pruned weights of the remaining nodes (Rate of PW) (cf. Equation (21)), training accuracy (Training Acc.) and test accuracy (Test Acc.). The averages of these numerical results are listed in Tables 1-4 for these four datasets.

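$$\text{Rate of PN} = \frac{\text{Pruned nodes}}{\text{All nodes}}, \qquad (20)$$
$$\text{Rate of PW} = \frac{\text{Pruned weights of surviving nodes}}{\text{Surviving nodes} \times \text{Output nodes}}. \qquad (21)$$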
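The two rates can be computed directly from the trained weight matrix of the fully connected layer. In the sketch below, the threshold `tol` that decides when a weight counts as pruned is a hypothetical parameter, since the text does not state one, and the example matrix is made up for illustration.

```python
import numpy as np

def pruning_rates(U, tol=1e-3):
    """U has shape (output nodes r, fully connected nodes q); column k is u_k."""
    small = np.abs(U) < tol
    pruned_node = small.all(axis=0)            # every outgoing weight is near zero
    surviving = ~pruned_node
    rate_pn = pruned_node.sum() / U.shape[1]
    n_surviving_weights = surviving.sum() * U.shape[0]
    if n_surviving_weights == 0:
        return float(rate_pn), 0.0
    rate_pw = small[:, surviving].sum() / n_surviving_weights
    return float(rate_pn), float(rate_pw)

U = np.array([[0.5, 0.0, 0.0, 0.2],
              [0.0, 0.0, 0.3, 0.4],
              [0.7, 0.0, 0.0, 0.0]])
rate_pn, rate_pw = pruning_rates(U)            # 1 of 4 nodes pruned; 4 of 9 surviving weights pruned
```

In this example, one of the four nodes has only near-zero weights, and four of the nine weights attached to the three surviving nodes are near zero.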
For an output node, the ideal output value is 1 or 0. When we evaluate the error between the ideal and real output values, we use the following "40-20-40" standard [39]: The actual output values of the output nodes between 0.00 and 0.40 are regarded as 0, values between 0.60 and 1.00 are regarded as 1, and values between 0.40 and 0.60 are regarded as uncertain and are considered incorrect.
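The "40-20-40" standard can be written as a small decision function. How the boundary values 0.40 and 0.60 themselves are treated is not specified in the text; the sketch below assumes that they belong to the decided ranges, and it uses -1 as a hypothetical code for "uncertain".

```python
import numpy as np

def classify_40_20_40(outputs):
    """Map each output to 0, 1, or -1 (uncertain, counted as incorrect)."""
    outputs = np.asarray(outputs, dtype=float)
    decided = np.full(outputs.shape, -1, dtype=int)
    decided[outputs <= 0.40] = 0
    decided[outputs >= 0.60] = 1
    return decided

def sample_is_correct(outputs, target):
    """A sample counts as correct only if every output node matches its 0/1 target."""
    return bool(np.array_equal(classify_40_20_40(outputs), np.asarray(target, dtype=int)))

labels = classify_40_20_40([0.1, 0.5, 0.9])
```

An output vector with any element in the uncertain band is therefore counted as a misclassification, regardless of the other elements.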

MNIST Problem
MNIST is a dataset for the study of handwritten numeral recognition, which contains 70,000 examples of 28 × 28 pixel images of the digits 0-9. For these four algorithms, we set the learning rate η = 0.03. The maximum iteration training step is 1000. In order to show the sparsity, we give the node sparsity and weight sparsity performances for λ = 0.001, 0.002, 0.003, 0.004, 0.005 of these four algorithms (see Figure 6; the y-axis represents the percentage of the number of pruned nodes and pruned weights of the remaining nodes, respectively). The sparsity will become worse when λ > 0.005. Therefore, we choose λ = 0.005 to compare these algorithms. The performances of these four group lasso algorithms are compared in Table 1. We can see that, in terms of sparsity, the performance of SGL 1/2 is better than GL 2 , GL 1 and GL 1/2 . In terms of accuracy, SGL 1/2 is also the best. Figure 7a presents the loss functions of these four group lasso algorithms. We can see that the SGL 1/2 approach has the lowest error after training, and SGL 1/2 has a large fluctuation during the training process.
We show the gradient norms of GL 1 , GL 2 , GL 1/2 and SGL 1/2 in Figure 7b, where the oscillation [40,41] of GL 1/2 is presented. From Figure 7, we find that the SGL 1/2 regularizer eliminates the oscillation and guarantees the convergence, as predicted in Theorem 1.

Letter Recognition Problem
The Letter Recognition dataset consists of 20,000 samples with 16 attributes. Each 16-dimensional instance within this database represents a capital typewritten letter in one of twenty fonts. For these four algorithms, we set the learning rate η = 0.05. The maximum iteration training step is 1000. In order to show the sparsity, we give the node sparsity and weight sparsity performances for λ = 0.002, 0.004, 0.006, 0.008 of these four algorithms (see Figure 8). The sparsity will become worse when λ > 0.008. Therefore, we choose λ = 0.008 to compare these algorithms. The performances of GL 2 , GL 1 , GL 1/2 and SGL 1/2 are compared in Table 2. We see that, in terms of sparsity, the performance of SGL 1/2 is better than GL 2 , GL 1 and GL 1/2 . In terms of accuracy, SGL 1/2 is also the best among the above-mentioned four algorithms. Figure 9a presents the loss functions of these four group lasso algorithms. Obviously, we can see that the SGL 1/2 approach has the lowest error after training, and SGL 1/2 has a large fluctuation during the training process. We show the gradient norms of GL 1 , GL 2 , GL 1/2 and SGL 1/2 in Figure 9b, where the oscillation of GL 1/2 is presented. From Figure 9, we find that the SGL 1/2 regularizer eliminates the oscillation and guarantees the convergence, as predicted in Theorem 1.

CIFAR-10 Problem
The CIFAR-10 dataset consists of 60,000 images, each of which is a 32 × 32 color image. This dataset contains 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck), with 6000 images per class. There are 50,000 training images and 10,000 test images. For these four algorithms, we set the learning rate η = 0.03. The maximum iteration training step is 1000. In order to show the sparsity, we give the node sparsity and weight sparsity performances for λ = 0.001, 0.002, 0.003, 0.004, 0.005 of these four algorithms (see Figure 10). The sparsity will become worse when λ > 0.005. Therefore, we choose λ = 0.005 to compare these algorithms. The performances of these four group lasso algorithms are compared in Table 3. We see that, in terms of sparsity, the performance of SGL 1/2 is better than GL 2 , GL 1 and GL 1/2 . In terms of accuracy, SGL 1/2 is also the best. Figure 11a presents the loss functions of these four group lasso algorithms. We can see that the SGL 1/2 approach has the lowest error after training, and SGL 1/2 has a large fluctuation during the training process. We show the gradient norms of GL 1 , GL 2 , GL 1/2 and SGL 1/2 in Figure 11b, where the oscillation of GL 1/2 is presented. From Figure 11, we find that the SGL 1/2 regularizer eliminates the oscillation and guarantees the convergence, as predicted in Theorem 1.

Crowded Mapping
The Crowded Mapping dataset consists of 10,546 samples with 28 attributes, and these samples are divided into six classes. For these four algorithms, we set the learning rate η = 0.05. The maximum iteration training step is 1000. In order to show the sparsity, we give the node sparsity and weight sparsity performances for λ = 0.0015, 0.003, 0.0045, 0.006 of these four algorithms (see Figure 12). The sparsity will become worse when λ > 0.006. Therefore, we choose λ = 0.006 to compare these four algorithms. The performances of the GL 1 , GL 2 , GL 1/2 and SGL 1/2 methods are compared in Table 4. We see that, in terms of sparsity, the performance of SGL 1/2 is better than GL 2 , GL 1 and GL 1/2 . In terms of accuracy, SGL 1/2 is also the best among the above-mentioned four algorithms. Figure 13a presents the loss functions of these four group lasso algorithms. We can see that the SGL 1/2 algorithm has the lowest error after training, and SGL 1/2 has a large fluctuation during the training process. We show the gradient norms of GL 1 , GL 2 , GL 1/2 and SGL 1/2 in Figure 13b, where the oscillation of GL 1/2 is presented. From Figure 13, we find that the SGL 1/2 regularizer eliminates the oscillation and guarantees the convergence, as predicted in Theorem 1.
From the above experiments on the four datasets, it is easy to see that the GL 1 and GL 2 algorithms have better sparsity at the node level, and the GL 1/2 algorithm has better sparsity at the weight level. In some applications, the sparsity at the weight level is also of great significance. If the combined sparsity at the node and weight levels is better, the number of weights that need to be calculated and updated will be reduced in the process of training the CNNs. Furthermore, it also leads to a reduction in the amount of calculation and saves storage space.
Compared with the GL 1 and GL 2 algorithms, the SGL 1/2 algorithm has better sparsity at the node level and the weight level, and can also improve the classification performance. Compared with the GL 1/2 algorithm, the theoretical analysis and numerical experiment are carried out to verify that the SGL 1/2 algorithm improves the sparsity at the node level, and at the same time improves the classification performance.

Discussion
Tables 1-4, respectively, show the performance comparison of PN, PW, training accuracies and test accuracies under these four methods. In terms of sparsity, the PN results of the SGL 1/2 method are much better than those of the other three methods, especially the GL 1/2 method. As for the PW, although the surviving nodes of GL 1/2 have a higher rate of pruned weights, the rate of pruned nodes is too low, such that the overall sparsity of the GL 1/2 method is still far lower than that of the SGL 1/2 method. In terms of classification accuracy, the SGL 1/2 method is slightly higher than the other methods, which means that this method can improve the sparsity without damaging the classification accuracy.
We can find that the specificity of CNNs is not actually used in the experiments, so the SGL 1/2 method can be widely applied to other neural network models.

Conclusions
Our main task was to introduce the SGL 1/2 algorithm. Based on the GL 1 and GL 2 algorithms, replacing 1-norm and 2-norm with 1 2 -norm can greatly improve the sparsity of the network weight level, but it does not help to achieve better sparsity of the node level. The non-smooth penalty term at the origin is the root cause of the poor sparsity of the GL 1/2 algorithm at the node level.
To this end, in this paper, a smooth group L 1/2 (SGL 1/2 ) regularization term is introduced into the batch gradient learning algorithm to prune the CNN. The feasibility analysis of the SGL 1/2 method for the fully connected layer of the CNN is performed. Numerical experiments show that the sparsity and convergence of SGL 1/2 give better results in terms of both the rate of pruned hidden nodes and weights of the remaining hidden nodes compared to GL 1/2 , GL 1 and GL 2 . In addition, the SGL 1/2 regularizer not only overcomes the oscillation phenomenon during the training process, but also achieves better classification performance.
Appendix A
Proof for (ii). From the conclusion of (i), we know that the nonnegative sequence $E(W^n)$ decreases monotonically. Hence, there must exist an $E^* \ge 0$ such that $\lim_{n \to \infty} E(W^n) = E^*$.
Before proving (iv), we need to review the following lemma [43]:
Proof for (iv). Since the error function E(W) is continuous and differentiable, from Equation (19), Assumption (3) and Lemma A1, we can easily obtain the desired result: there exists a point $W^* \in \Phi_0$ such that $\lim_{n \to \infty} W^n = W^*$.