Data-Driven Channel Pruning towards Local Binary Convolution Inverse Bottleneck Network Based on Squeeze-and-Excitation Optimization Weights

This paper proposes a model pruning method based on local binary convolution (LBC) and squeeze-and-excitation (SE) optimization weights. We first propose an efficient depthwise separable convolution model based on the LBC kernel. By increasing the number of LBC kernels in the model, we train a larger model that achieves better results at the cost of more parameters and slower computation. We then extract the SE optimization weights of each SE module over the data samples and score the LBC kernels accordingly. Based on the score of the LBC kernel corresponding to each convolution channel, we perform channel-based model pruning, which greatly reduces the number of model parameters and accelerates computation. The proposed pruning method is verified on an image classification database. Experiments show that, in models using the LBC kernel, recognition accuracy increases with the number of LBC kernels. The experiments also show that recognition accuracy is maintained at a similar level in the small-parameter model obtained after channel-based pruning by the SE optimization weights.


Introduction
With the development and wide application of deep learning, the field of artificial intelligence has undergone tremendous changes in recent years. Driven by higher requirements on model performance, deeper and more complex network structures have been proposed [1][2][3][4][5][6][7]. The consequence is exponential growth in model parameters and memory requirements, which makes such models difficult to deploy on various hardware platforms, such as mobile devices [8][9][10]. To improve the computation speed of the model, in addition to further improving the speed of the hardware, many researchers try to reduce the number of parameters required by changing the model structure [8][10][11][12][13]. Other researchers have focused on model compression, which modifies a trained model and compresses it to minimize its storage space and computation time [14][15][16][17]. Model pruning is one type of model compression [18][19][20][21]. It rests on an assumption, now close to a consensus, that deep neural networks are over-parameterized [22,23]. Over-parameterization means that many parameters are needed in the training phase to capture subtle information in the data, but once training is complete, far fewer parameters are needed in the inference phase. This assumption supports simplifying the model before deployment.
Depending on the granularity at which model parameters are pruned, model pruning methods can be divided into unstructured pruning [18,19,21,24,25] and structured pruning [20,26,27]. In unstructured pruning, the weights of the network are pruned at the neuron level. This approach has the highest flexibility, but it makes the weight matrix sparse, requiring additional sparse matrix operation libraries or specially designed hardware.
Figure 1. The proposed model pruning method based on SE (squeeze-and-excitation) optimization weights. A model with higher accuracy is built by setting a larger expansion ratio (top model). The SE weights of the corresponding parts are computed over the training data set and used to score the depthwise convolution kernels. The low-scoring depthwise convolution kernels are cut to obtain a smaller model (bottom model). The accuracy of the model remains at a similar level.

Related Work
As deep neural networks achieve good results in various fields, the computing resources they require also increase. Model size, memory footprint, number of computing operations (FLOPs), and power consumption are the main factors that hinder the use of deep neural networks in resource-constrained environments. Large models may not fit in storage and cannot run in real time on embedded systems. To solve this problem, many methods have been proposed, such as low-rank approximation of weights [4,46], weight quantization [47,48], knowledge distillation [49], and model pruning; among these, model pruning has attracted much attention thanks to its competitive performance and compatibility.
Early work in the 1990s pruned weights using a second-order Taylor approximation of the increase in the network loss function when a weight is set to zero. In Optimal Brain Damage [18], the saliency of each parameter was computed using a diagonal Hessian approximation, the low-saliency parameters were pruned from the network, and the network was retrained. In Optimal Brain Surgeon [19], the saliency of each parameter was calculated using the inverse Hessian matrix, the low-saliency weights were pruned, and all other weights in the network were updated with second-order information. More recently, [24] proposed pruning the network weights of small magnitude and further integrated this technique into the deep compression pipeline [28] to obtain a highly compressed model. In addition, researchers have proposed various algorithms that iteratively remove redundant neurons, use variational dropout to prune excess weights [50], or learn sparse networks through L0-norm regularization based on stochastic gates [25]. However, one disadvantage of these unstructured pruning methods is that the resulting weight matrix is sparse, so without dedicated hardware or libraries they do not yield actual compression and acceleration [28].
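The limitation noted above can be made concrete with a minimal sketch of magnitude-based unstructured pruning. This is an illustration only, not the procedure of [24]: the function name `magnitude_prune`, the `sparsity` parameter, and the toy matrix are assumptions introduced here.

```python
def magnitude_prune(weights, sparsity):
    """Zero out (roughly) the smallest-magnitude fraction `sparsity` of a 2-D weight matrix.

    Hypothetical helper for illustration; ties at the threshold are all pruned.
    """
    magnitudes = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(magnitudes) * sparsity)
    if n_prune == 0:
        return [row[:] for row in weights]
    threshold = magnitudes[n_prune - 1]
    # The matrix keeps its shape; pruned entries simply become zeros.
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


W = [[0.9, -0.05, 0.3],
     [0.01, -0.7, 0.2]]
pruned = magnitude_prune(W, sparsity=0.5)
print(pruned)
```

The pruned matrix has the same dimensions as the original, which is why dense kernels gain nothing from it unless a sparse storage format or dedicated hardware is used.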
In contrast, structured pruning methods prune at the channel or even the layer level. As the original convolution structure is retained, no dedicated hardware or library is required to realize the benefits. Among structured pruning methods, channel pruning [22,51,52] is the most popular because it operates at the finest-grained level while remaining compatible with conventional deep learning frameworks. There are three classic ideas for channel pruning. The first is based on an importance factor [52]: the effectiveness of each channel is evaluated, and some channels are constrained so that the model structure itself becomes sparse, which then guides pruning. The second uses reconstruction error to guide pruning [22,51], indirectly measuring the impact of a channel on the output. The third measures the sensitivity of a channel based on the change in the optimization objective. However, [29] pointed out that, for structured pruning, once the compressed model has been obtained by the pruning algorithm, it is better to train it from random initialization than to fine-tune it with the weights inherited from the large network. For the final compressed model, the network architecture found by the pruning algorithm matters more than the "important" weights it retains. The pruned model can thus provide guidance for designing an effective network architecture. For most convolutional neural networks, convolution kernels fine-tuned from large networks are more likely to cause the model to overfit.
Similar to the way model pruning scores the parts of a model to be cut, the attention mechanism widely used in image and image-sequence processing [53] evaluates the intermediate features extracted by the model. In CNNs, inspired by the attention mechanism of the human visual system [31,32], spatial attention was applied very early, for example, to focus on the regions of the original image that need more attention [33][34][35] or to find, from global information, the relationship weight between any pixel in the image and the current pixel [36]. In addition to spatial attention, many studies have focused on attention over the convolution channels of a CNN [37]. The work of [37] uses global average-pooled features to compute channel-wise attention. Spatial attention places weights on locations in the input image, so that the more informative regions are found for each input image. Channel attention instead weights the convolution kernels: for each input image, it indicates which kernels extract the important information.
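As a rough illustration of channel attention computed from global average-pooled features, the sketch below follows the usual squeeze-and-excitation pattern (pooling, a small fully connected layer with ReLU, a second fully connected layer with a sigmoid). Biases are omitted, and the function name `se_weights` and the toy weight matrices `w1` and `w2` are assumptions made for this example, not values from any cited model.

```python
import math


def se_weights(feature_maps, w1, w2):
    """Return one attention weight in (0, 1) per channel.

    feature_maps: list of C two-dimensional maps (lists of rows).
    w1: hidden x C matrix, w2: C x hidden matrix (hypothetical toy weights).
    """
    # Squeeze: global average pooling gives one scalar per channel.
    z = [sum(map(sum, fmap)) / (len(fmap) * len(fmap[0])) for fmap in feature_maps]
    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, z))) for row in w1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    return [1.0 / (1.0 + math.exp(-v)) for v in logits]


maps = [[[1.0, 1.0], [1.0, 1.0]],   # channel 0: strong response
        [[0.0, 0.0], [0.0, 0.0]]]   # channel 1: no response
weights = se_weights(maps, w1=[[1.0, 1.0]], w2=[[2.0], [-2.0]])
print(weights)
```

The returned weights depend on the input sample, which is what later allows per-sample weights to be aggregated into a per-channel score.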
In our work, we use the LBC kernel [41] in place of the general convolution kernel and perform structured pruning to avoid the above problems. LBC is inspired by the local binary pattern (LBP) [42], a feature extraction method from traditional image processing. LBP is a simple but powerful hand-designed descriptor that is used in facial recognition [54,55] and has a wide range of applications in computer vision, such as image classification [56]. The LBP descriptor is formed by sequentially comparing the intensity of the neighboring pixels in a patch with the intensity of the center pixel: neighbors with a higher intensity than the center pixel are assigned 1, and the others 0. Based on this idea, the LBC layer comprises a set of sparse, binary, randomly generated convolutional weights taking values in −1, 0, and 1, with which the pixel intensity relationships in the receptive field can be calculated. Unlike other binarized networks [47,48,57], the LBC network combines a non-trainable LBC layer with a trainable convolutional layer. While retaining the advantages of binary networks in computational speed and storage space, it can be trained end to end in the conventional way without additional libraries, and it greatly reduces the number of parameters to be learned during training. The effectiveness of the LBC network has also been verified [41,44]. However, as other lightweight network designs have been proposed, it has become more common to replace standard convolution with smaller and more distributed operations such as depthwise convolution [58,59], pointwise convolution [5,8], and group convolution [60,61]. The computational advantage of LBC is therefore reduced, and, owing to the non-trainable nature of the LBC kernel, the performance of the LBC network depends on the number of LBC kernels [41].
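The LBP comparison rule described above can be sketched for a single 3 × 3 patch as follows. The clockwise neighbor order starting at the top-left pixel and the bit assignment are conventions assumed for this example; the function name `lbp_code` is likewise hypothetical.

```python
def lbp_code(patch):
    """8-bit LBP code of a 3x3 patch given as a list of three rows."""
    center = patch[1][1]
    # Assumed convention: clockwise from the top-left neighbor, bit 0 first.
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(offsets):
        if patch[r][c] > center:  # brighter neighbor -> 1, otherwise 0
            code |= 1 << bit
    return code


patch = [[5, 9, 1],
         [4, 6, 7],
         [2, 8, 3]]
print(lbp_code(patch))  # neighbors 9, 7, and 8 exceed the center 6 -> bits 1, 3, 5 -> 42
```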
To achieve better performance, it is necessary to increase the number of LBC kernels, which increases the number of parameters of the model. In [62], a 3D version of LBVCNN was proposed that imitates the Local Binary Pattern from Three Orthogonal Planes (LBP-TOP) feature by rotating a 3D image sequence with a time axis as input. LBVCNN rotates the W, H, and T axes of the 3D image sequence to construct the matrices (W, T, H) and (H, T, W) and uses the 3D LBC kernels to process all three matrices. This is equivalent to reusing these 3D LBC kernels, which effectively increases the number of 3D LBC kernels and yields better results.
In our work, we use the LBC kernel in place of the traditional convolution kernel and use structured model pruning to prune the trained large model. In addition to obtaining a more effective model structure, we retain the LBC kernel parameters that contribute most to the result, thereby improving the performance of the smaller LBC model.

Model Pruning Based on SE Optimization Weight
In our work, we propose a mobile inverted bottleneck convolution block based on a depthwise LBC layer and SE optimization. We score the LBC kernels according to the SE optimization weights and thereby identify the LBC kernels that perform better in the model. By enlarging the expansion ratio in the mobile inverted bottleneck [12,43,44], we obtain a large model with more LBC kernels and more parameters. From the statistics of the SE weights output by each block of the model over the database, we obtain a score for each LBC kernel. Based on this score, the large model can be pruned.

Depthwise LBC Layer and LB Mobile Inverted Bottleneck Block
Figure 2 shows the difference between the standard LBC block and the LBC blocks we propose. Figure 2a shows the structure of the LBCNN proposed in [41]; each LBC block is composed of two parts. The first part is a sparse LBC layer with non-trainable parameters. This layer has the same structure as a standard convolutional layer; its kernel size is set to 3 × 3, and its weights are not updated by backpropagation. The LBC kernels are generated by first creating a set of all-zero matrices and then randomly replacing some of the zeros with 1 or −1 according to a Bernoulli distribution, which yields sparse LBC kernels. The second part is a standard convolutional layer with a 1 × 1 kernel. An LBC block thus consists of a non-trainable convolutional layer and a trainable convolutional layer, so the model can be trained normally. The 1 × 1 convolutional layer in the second part only provides trainable parameters for the preceding LBC layer. Figure 2b shows the structure of the standard LBC block with the residual module. The residual module adds shortcut connections but does not change the structure of the LBC block. Figure 2c shows the structure of the standard LBC block with the SE-residual module.
In our work, we use depthwise convolution to construct the LBC layer. Compared with the standard LBC layer, the depthwise LBC layer not only reduces the number of parameters but also applies exactly one 3 × 3 sparse LBC kernel to each input feature map, so that each LBC kernel more directly reflects the result of its own feature extraction, as shown in Figure 2d. At the same time, we introduce SE optimization to add attention weights to the feature maps extracted by the depthwise LBC layer and use the SE optimization weights to replace the 1 × 1 convolution of the traditional LBC layer, as shown in Figure 2e. In the entire mobile inverted bottleneck, the first and last 1 × 1 convolutional layers are mainly used to adjust the numbers of input and output channels. Even if we change the number of LBC kernels through model pruning, the input and output of the whole block do not change, which makes it convenient to compute the residual.
Electronics 2021, 10, 1329
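The kernel generation step described in this section (start from zeros, then set some positions to 1 or −1 by Bernoulli draws) can be sketched as below. The parameter `nonzero_prob`, the equal probability for the two signs, and the helper names are assumptions made for illustration; the source does not state the values used.

```python
import random


def make_lbc_kernel(nonzero_prob=0.5, size=3, rng=None):
    """Generate one sparse, fixed (non-trainable) LBC kernel with entries in {-1, 0, 1}."""
    rng = rng or random.Random()
    kernel = [[0] * size for _ in range(size)]      # start from an all-zero matrix
    for r in range(size):
        for c in range(size):
            if rng.random() < nonzero_prob:         # Bernoulli draw: make this entry non-zero?
                kernel[r][c] = 1 if rng.random() < 0.5 else -1  # Bernoulli draw for the sign
    return kernel


def make_depthwise_lbc_bank(num_channels, nonzero_prob=0.5, seed=0):
    """One 3x3 kernel per channel, as in a depthwise layer (hypothetical helper)."""
    rng = random.Random(seed)
    return [make_lbc_kernel(nonzero_prob, rng=rng) for _ in range(num_channels)]


bank = make_depthwise_lbc_bank(num_channels=8)
for kernel in bank[:2]:
    print(kernel)
```

Because each channel owns exactly one kernel in the depthwise layout, removing a channel removes exactly one of these 3 × 3 matrices, which is what makes per-kernel scoring straightforward.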

Baseline LB-MBNet Model
By stacking LBC mobile inverted bottleneck blocks, we obtain our baseline model. The mobile inverted bottleneck is used in many model structures and has been shown to be efficient. In EfficientNet [12], the input resolution, depth, and width of the model are all parameterized, and an optimization search is performed to find the best-performing model structure under different parameter budgets. In our proposed model, the network width is also parameterized, but we change only the number of LBC kernels in the depthwise LBC layer of each block. At the input of the model, we add a stem block that performs the first processing step on the input image. In this step, we use a standard convolutional layer instead of an LBC layer. In the subsequent model pruning, we do not prune the stem and prune only the blocks that use LBC. The expansion rate r, that is, the ratio of the number of LBC convolution kernels in a block to the number of input channels of that block, can be adjusted individually to adapt to different pruning scales. Figure 3 shows the structure of the proposed baseline model.
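To see how the expansion rate r drives model size, the following sketch counts the trainable weights of one block. It rests on several assumptions that the text does not confirm: the block is taken to be a 1 × 1 expansion, a fixed depthwise LBC layer, an SE module made of two 1 × 1 convolutions with a hidden width `se_hidden` that does not depend on r, and a 1 × 1 projection; biases and normalization layers are ignored. All names and channel counts are hypothetical.

```python
def block_trainable_params(c_in, c_out, r, se_hidden):
    """Trainable weights of one block under the assumptions stated above."""
    expanded = c_in * r                  # number of depthwise LBC kernels
    expand = c_in * expanded             # 1x1 expansion convolution
    se = 2 * expanded * se_hidden        # two 1x1 convolutions inside the SE module
    project = expanded * c_out           # 1x1 projection convolution
    return expand + se + project


def block_fixed_lbc_values(c_in, r, k=3):
    """Non-trainable entries stored for the depthwise LBC kernels."""
    return c_in * r * k * k


small = block_trainable_params(c_in=16, c_out=16, r=6, se_hidden=4)
large = block_trainable_params(c_in=16, c_out=16, r=80, se_hidden=4)
print(small, large, large / small)
```

Under these assumptions the count is exactly proportional to r, so going from r = 6 to r = 80 multiplies the trainable weights of a block by 80/6.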

Model Pruning Based on SE Optimization Weight
As mentioned earlier, we connect the SE block directly to the depthwise LBC layer and use the SE optimization weights to express the importance of the LBC kernels. By setting the expansion rate r to a larger value, we obtain a large model with more parameters. Through experiments on the database, we compared the accuracy of the large model and the small model and found that the large model performs better. This observation is the basis for our model pruning.
SE optimization performs global average pooling on the feature map of each channel and quantifies the relative strength of the features. It optimizes the attention weights of the different feature channels for each sample, so the SE optimization weights differ from sample to sample. However, we can compute the mean and distribution of the SE optimization weights over a large number of samples and thereby score the feature channels indirectly. Figure 4 shows the relationship between each LBC kernel and its SE optimization weight in the proposed model. For the input $X_c$, $c$ is the number of input feature channels. First, channel expansion is performed through a 1 × 1 convolution to obtain $X_{rc}$. $X_i$ ($i \in 0 \dots rc$) is the $i$-th feature map in $X_{rc}$, and $X'_i$ is the corresponding output of the depthwise LBC layer. After processing by the SE block, $W_i$ ($i \in 0 \dots rc$) is obtained. For each sample, $W_{rc}$ is the weight that this sample assigns to each feature channel, and it is also the weight of the corresponding depthwise LBC kernel. By extracting the SE optimization weights for all samples of the database, we can compute their mean and distribution. Figure 5 shows the statistics of the SE optimization weights of the training set on the CIFAR-10 database. Figure 5a shows our baseline model, with randomly initialized model parameters. Figure 5b shows a large model with more parameters that we built by changing r. The parameters of (a) and (b) are initialized randomly, and the ratio of their parameter counts is roughly equal to the ratio of their expansion ratios r. Figure 5c shows the result of the proposed model pruning method. After some of the convolution channels of model (b) are pruned, (a) and (c) have the same expansion ratio and parameter count. The parameters of (c) are derived from (b).
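The per-channel statistics described above can be sketched as a simple aggregation over samples. The input is assumed to be a list with one SE weight vector per sample for a single block; the function name `channel_statistics` and the toy values are illustrative only.

```python
from statistics import mean, pstdev


def channel_statistics(se_weights_per_sample):
    """Mean and spread of the SE weight of every channel over all samples."""
    per_channel = list(zip(*se_weights_per_sample))   # transpose: samples x channels -> channels x samples
    means = [mean(ch) for ch in per_channel]
    spreads = [pstdev(ch) for ch in per_channel]
    return means, spreads


# Three samples, three channels (toy values, not measured data).
samples = [
    [0.9, 0.1, 0.5],
    [0.7, 0.0, 0.5],
    [0.8, 0.2, 0.5],
]
means, spreads = channel_statistics(samples)
print(means, spreads)
```

In this toy case the first channel would receive the highest score and the second the lowest, since the mean SE weight is used as the score of the corresponding depthwise LBC kernel.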
In most of the blocks of (b), there are many SE optimization weights, but those with high values account for only a small fraction of the channels, and many weights are close to 0. This means that the LBC kernels corresponding to the low-weight channels cannot extract significant features over the entire data set. By filtering on the SE optimization weights, we retain the better LBC kernels, each of which is a 3 × 3 sparse matrix. After model pruning, the SE weight distribution of the reconstructed model (c) is more concentrated than that of (a), and its LBC kernels have similar importance.
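A minimal sketch of the channel selection step follows, assuming the mean SE weight of each channel is already available. The block is represented as a plain dictionary with hypothetical keys (`lbc`, `expand`, `project`); slicing the SE module's own weights is omitted for brevity, and none of this is claimed to be the authors' implementation.

```python
def select_channels(mean_se_weights, keep):
    """Indices of the `keep` channels with the largest mean SE weight, in ascending order."""
    ranked = sorted(range(len(mean_se_weights)),
                    key=lambda i: mean_se_weights[i], reverse=True)
    return sorted(ranked[:keep])


def prune_block(block, keep_idx):
    """Keep only the selected expanded channels; block input/output widths are untouched."""
    return {
        "lbc": [block["lbc"][i] for i in keep_idx],                       # one 3x3 kernel per channel
        "expand": [block["expand"][i] for i in keep_idx],                 # rows = expanded channels
        "project": [[row[i] for i in keep_idx] for row in block["project"]],  # columns = expanded channels
    }


scores = [0.80, 0.02, 0.50, 0.01]            # toy mean SE weights for 4 expanded channels
block = {
    "lbc": [[[0, 1, 0], [0, 0, -1], [1, 0, 0]] for _ in range(4)],
    "expand": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],   # 4 expanded x 2 input channels
    "project": [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]],     # 2 output x 4 expanded channels
}
keep_idx = select_channels(scores, keep=2)
pruned = prune_block(block, keep_idx)
print(keep_idx, pruned["expand"], pruned["project"])
```

Only the expanded dimension shrinks, so the residual connection around the block still sees the same input and output shapes.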

Experiment Settings
To evaluate the proposed method, we use the CIFAR-10 database for image classification experiments. CIFAR-10 [45] is an image classification dataset containing a training set of 50K and a test set of 10K 32 × 32 color images across the following 10 classes: airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. CIFAR-10 is a widely used benchmark for evaluating image classification models.
In our test models, the model depth is fixed. In addition to the stem, there are four MB blocks with LBC. Except for block 0, each block is repeated six times, and the scale of the feature map is halved in the initial layer of each block. Excluding the SE blocks, each model has 59 convolutional layers, and the 18 sets of SE optimization weights involving LBC in the last three blocks participate in model pruning. Specific parameters are shown in Table 1. In addition, we set the number of repetitions of each block to 3 to build a model with 32 convolutional layers. For the input data, we use data augmentation to expand the training data. The operations include width shift, height shift, and horizontal flip; the shift range is 4 pixels. All models are trained with stochastic gradient descent (SGD); the initial learning rate is set to 0.1 and is multiplied by 0.1 at epochs 80, 140, and 180. Each model is first trained for 200 epochs, then pruned, and then trained again for 200 epochs using the same hyperparameters. We use an NVIDIA TITAN X Pascal graphics card as the hardware for our experiments. The GPU operates at a frequency of 1417 MHz.
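The learning rate schedule described above can be written as a small step function. This sketch assumes that "decays by 0.1" means multiplication by 0.1 and that the change takes effect from the listed epoch onward with epochs counted from 0; the function name and parameter names are illustrative.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(80, 140, 180), gamma=0.1):
    """Step schedule: multiply the rate by `gamma` at every milestone already reached."""
    steps_passed = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** steps_passed


for epoch in (0, 79, 80, 140, 180, 199):
    print(epoch, learning_rate(epoch))
```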

Experimental Result
In this section, we evaluate the effect of the LBC layer in the proposed baseline model and compare its performance with that of the standard convolutional layer. We also evaluate the impact of different model scales on the recognition results by changing the expansion ratio r. Table 2 shows the recognition results obtained while adjusting r. We calculated the number of parameters and the floating-point operations (FLOPs) of each model. From Table 2, we can conclude that, as r increases, the number of trainable parameters and the FLOPs of the model increase almost linearly. The accuracy of the model also increases. When r = 6, the recognition accuracy of the LB-MBNet-59-r6 model reached 92.47%, whereas that of the LB-MBNet-32-r6 model reached 92.88%. When r = 80, the recognition accuracy of the LB-MBNet-59-r80 model reached 96.06%, whereas that of the LB-MBNet-32-r80 model reached 95.63%, increases of 3.59 and 2.75 percentage points, respectively.
For the model pruning experiment, we use the model with the best results and the largest number of parameters as the basic model and obtain the mean SE optimization weights over the 50K samples of the training data set. For each target scale, we keep only the channels with the larger SE optimization weights and prune the corresponding layers in each block of the model. In our experiment, we prune the model in two ways: segmented and one-shot. In the segmented experiment, the r-value is gradually reduced to the target scale. In the one-shot experiment, we use the basic model and directly change the r-value to the target scale. Table 3 shows the results of model pruning. We use the r80 model as the basic model for one-shot pruning. Here, we transfer the weights of the retained parts to the model rebuilt with the fixed r. From Table 3, we can conclude that, after the model is pruned, the accuracy remains at a relatively high level.
In the LB-MBNet-59 model, the accuracy of the pruned r40 model reaches 96.15%, an improvement of 0.09 percentage points over the basic r80 model. We believe this is because the r80 model has a relatively large number of parameters and over-fits. After removing half of the parameters while retaining the more effective LBC kernels, the result of the r40 model rises slightly. As we continue to reduce the expansion ratio, the accuracy of the model drops slightly, but the results of the pruned and rebuilt model remain significantly better than those of the randomly initialized model. When the expansion ratio r is reduced to 6, the model parameters and FLOPs return to the level of the basic r6 model we proposed, but the accuracy of the rebuilt model is 95.24%, an improvement of 2.77 percentage points. At the same time, compared with the r80 model at the start of pruning, the accuracy is reduced by only 0.82 percentage points, while the parameters and FLOPs are reduced by 92%. This result shows that our proposed method is effective.
We further reduced the expansion ratio to r = 3. At this scale, the accuracy of the Re-LB-MBNet-59-r3 model dropped noticeably to 93.97%, which is 1.27 percentage points lower than that of the pruned r6 model. Nevertheless, compared with the randomly initialized r6 model, this model has 50% fewer parameters and FLOPs, yet its accuracy is 1.50 percentage points higher. We think that, because the expansion ratio is too small, the generalization ability of the model is insufficient, which leads to over-fitting and limits the accuracy to 93.97%. However, because the LBC kernels are not trainable and the retained kernels were selected through model pruning, the result is still better than that obtained with random initialization.
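The accuracy differences quoted for the 59-layer models can be recomputed directly from the accuracies stated in the text. The short script below does only that; the dictionary keys are informal labels chosen for this illustration, not model names from Table 3.

```python
# Accuracies (%) quoted in the text for the 59-layer models.
# Keys are informal labels: "trained" = randomly initialized and trained,
# "pruned" = rebuilt from the r80 model after SE-based pruning.
reported = {
    "r6 trained": 92.47,
    "r80 trained": 96.06,
    "r40 pruned": 96.15,
    "r6 pruned": 95.24,
    "r3 pruned": 93.97,
}


def gap(a, b):
    """Accuracy of model a minus accuracy of model b, in percentage points."""
    return round(reported[a] - reported[b], 2)


comparisons = [
    ("r40 pruned", "r80 trained"),  # text quotes a 0.09 gain
    ("r6 pruned", "r6 trained"),    # text quotes a 2.77 gain
    ("r6 pruned", "r80 trained"),   # text quotes a 0.82 loss
    ("r3 pruned", "r6 pruned"),     # text quotes a 1.27 loss
    ("r3 pruned", "r6 trained"),    # text quotes a 1.50 gain
]
for a, b in comparisons:
    print(f"{a} vs {b}: {gap(a, b):+.2f} points")
```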

Comparison with State-of-the-Art
We also compared the results of our proposed method with those of state-of-the-art methods. Because many model pruning methods use ResNetV2-56 [63] as the basic model on the CIFAR-10 database, and our proposed baseline model has a similar number of convolutional layers (59 layers), we also compared our method with the model pruning methods based on ResNetV2-56. However, it should be noted that, because we have applied the SE optimization part in the proposed model, the number of model parameters has also increased. Table 4 shows the comparison results between our proposed method and the state-of-the-art methods. Table 4 shows that our method obtained relatively better accuracy than the state-of-the-art methods. Because our model uses the SE optimization module, it has a relatively large number of parameters. However, because convolution kernels with a size of 1 × 1 are used in the SE optimization module, the increase in the number of parameters has little effect on the amount of calculation. The recognition accuracy of the two baseline models, LB-MBNet-59-r6 and LB-MBNet-32-r6, is not good. However, after the expansion and pruning operation, the recognition accuracy of Re-LB-MBNet-59-r6 reaches the state-of-the-art level while the number of model parameters and calculations remains unchanged. For the Re-LB-MBNet-59-r6 model, the most similar result is that of the ResNetV2-1001 [63] model, but our model has far fewer parameters, and its FLOPs are only about 15% of those of ResNetV2-1001. For the Re-LB-MBNet-32-r6 model, the ResNetV2-164 [63] and L1-Sparse [52] models give the closest results; our model has more parameters but fewer FLOPs.
In our proposed method, scoring and pruning the LBC kernels through the SE optimization weights allows us to make full use of the advantages of the untrainable kernels and obtain a model with less computation and high accuracy. According to [29], the weights of a network after pruning by other methods are not very important; in contrast, our method retains only the better model parameters.

Discussion
As we envisioned, stacking non-trainable and trainable layers reduces the number of trainable parameters while still allowing the model to be built as a normal end-to-end structure. However, the initialization of the non-trainable layer is very important. In a standard convolutional network, all parameters can be trained, and reasonable parameters can be found through the optimization method of the model. In a convolution model with non-trainable parameters, however, reasonable parameters must be filtered out through methods such as model pruning. Our proposed method maintains the advantages of LBC, namely its sparse and non-trainable binary convolution kernels, and at the same time finds more reasonable LBC parameters through model pruning, yielding a model with high recognition accuracy and fewer model parameters.
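To make the notion of a sparse, non-trainable binary kernel concrete, the sketch below draws one such kernel at random. It assumes the usual LBC-style construction, in which each entry is non-zero with some probability and non-zero entries are +1 or −1 with equal chance; this section does not restate the construction, so the function name `make_lbc_kernel` and the `nonzero_fraction` value are assumptions for illustration only.

```python
import random


def make_lbc_kernel(size=3, nonzero_fraction=0.5, rng=None):
    """Draw one fixed kernel with entries in {-1, 0, +1}.

    Assumption: each entry is non-zero with probability `nonzero_fraction`,
    and a non-zero entry is +1 or -1 with equal probability.
    """
    rng = rng or random.Random()
    kernel = []
    for _ in range(size):
        row = []
        for _ in range(size):
            if rng.random() < nonzero_fraction:
                row.append(rng.choice((-1, 1)))
            else:
                row.append(0)
        kernel.append(row)
    return kernel


# A seeded generator makes the draw reproducible; the kernel is never updated
# during training, so the initial draw is what pruning later has to select from.
kernel = make_lbc_kernel(rng=random.Random(0))
for row in kernel:
    print(row)
```

Because such kernels are fixed once drawn, enlarging the expansion ratio increases the pool of candidates, and pruning acts as the selection step over that pool.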
In addition, the proposed method uses data-driven SE optimization weights as the pruning criterion. These weights are obtained from the results of model training and are therefore more accurate than some manually designed evaluation indicators. The basic model we proposed can evolve further towards faster and smaller models, which requires further optimization of the depth and width of the model. However, this involves the research field of neural architecture search (NAS) and has not been carried out in this paper. The experimental results demonstrate the effectiveness of our currently proposed method.

Conclusions
Inspired by LBC and SE optimization, we propose a model structure based on depthwise LBC and SE optimization in this paper. By increasing the expansion ratio of the inverse bottleneck structure in the model, a large model with higher accuracy but a huge number of parameters can be obtained through training. According to the SE optimization weights, we perform channel-based pruning of the basic model and retain only the depthwise LBC convolution channels that contribute more to the result. Our experimental results for image classification on the CIFAR-10 database demonstrate the effectiveness of our proposed method. They show that the untrainable convolution kernel has the same feature expression capabilities as standard CNNs and that pruning based on SE weights can indeed retain the more powerful untrainable convolution kernels from the large model.