In this section, we apply our layer-wise pruning algorithm not only to image classification, the task most commonly used to evaluate network compression methods, but also to semantic segmentation, showing that our method can be used in other deep learning networks.
4.1. Image Classification
We apply our method to the VGG network [1], one of the representative networks in image classification. The VGG network is a standard DNN consisting of convolution layers and FC (fully connected) layers. In particular, VGG-16 is a very heavy network with 13 convolution layers and 3 FC layers, totaling 134M parameters. Many studies have attempted to compress this large network because it is difficult to run on resource-limited devices.
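As a sanity check, the parameter count quoted above can be verified directly in PyTorch. The following minimal sketch uses torchvision's stock ImageNet VGG-16; our CIFAR-10 variant differs in its FC layers, so its total (134M) does not match this model exactly.

```python
import torch
from torchvision.models import vgg16

# Count trainable parameters of a stock VGG-16. This is the ImageNet
# configuration; the CIFAR-10 variant used in our experiments has
# different FC layers, so the totals differ slightly.
model = vgg16()
total = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"VGG-16 parameters: {total / 1e6:.1f}M")
```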
We use the CIFAR-10 dataset to evaluate the classification performance. For implementation, we use the PyTorch library (version 1.5.0). The reference model was trained with an SGD (stochastic gradient descent) optimizer and a decaying learning rate, achieving a 7.03% error. The target compression rate is over 98%, which corresponds to a ×50 reduction by the pruning process. For comparison, we refer to the performance of other methods as reported in AutoPrune [24]. As shown in Table 1, when we use the same rate to compress VGG-16 to ×50, the error rate increases from 7.03% to 10.99%. This accuracy drop is too severe for practical use. Because the methods of Zhuang et al. [39] and Zhu et al. [40] can be categorized as channel pruning, their compression rates are comparably lower than those of the other methods. However, they have the additional advantage of reducing computational cost effectively through the direct removal of channels. Sparse VD compresses the network up to ×65 with no accuracy drop. AutoPrune [24] compresses VGG-16 up to ×75 with a small accuracy drop of 0.22%. Applying the proposed layer-wise method to compress VGG-16, the network is shrunk up to ×225, and the error rate decreases to 6.89%. The proposed method can generate a highly efficient network in which the remaining weights are less than 1% of the total weights, with an accuracy improvement over the baseline, as shown in Table 1. This validates that the proposed pruning strategy based on GMM-based weight modeling is effective.
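To make the strategy concrete, the following is a minimal sketch of the idea behind layer-wise GMM-based pruning: the weights of each layer are modeled with a Gaussian mixture, and the component concentrated around zero is treated as redundant. The two-component choice and the assignment-based pruning rule here are illustrative assumptions; the actual thresholds are determined by Equations (8) and (9).

```python
import numpy as np
import torch
from sklearn.mixture import GaussianMixture

def gmm_prune_layer(weight: torch.Tensor, n_components: int = 2) -> torch.Tensor:
    """Return a binary mask that removes the near-zero GMM component.

    Illustrative sketch: the layer's weights are fit with a GMM, and the
    weights assigned to the component whose mean is closest to zero are
    pruned. The paper's actual rule uses the control parameters of
    Equations (8) and (9), which are not reproduced here.
    """
    w = weight.detach().cpu().numpy().reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(w)
    labels = gmm.predict(w)
    zero_comp = np.argmin(np.abs(gmm.means_.ravel()))  # redundant component
    keep = (labels != zero_comp).reshape(weight.shape)
    return torch.from_numpy(keep)

# Layer-wise usage: each conv/FC layer is modeled and pruned independently.
# for name, module in model.named_modules():
#     if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
#         mask = gmm_prune_layer(module.weight)
#         module.weight.data *= mask.to(module.weight.dtype)
```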
The remaining weights and compression rate for each layer are shown in Table 2. In VGG-16, Conv 1 through Conv 8 are compressed at a relatively small rate. This means that the front layers of the network contain relatively more large-magnitude weights and less redundancy. The later convolution layers, Conv 9 through Conv 12, and FC layers 1 and 2 are massively pruned due to their large portion of small weights. Even though a huge number of parameters is pruned, this does not have a significant impact on performance. Interestingly, the compression rate of the final FC layer is comparably low, which means that the final layer carries important information in the form of large-magnitude weights.
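The per-layer statistics in Table 2 can be gathered by a simple pass over the pruned model. A sketch, assuming that pruning is realized by zeroing weights in place:

```python
import torch

def layer_sparsity_report(model: torch.nn.Module) -> None:
    """Print remaining weights and compression rate per conv/FC layer,
    assuming pruned weights are stored as exact zeros."""
    for name, module in model.named_modules():
        if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
            total = module.weight.numel()
            remaining = int((module.weight != 0).sum())
            rate = total / max(remaining, 1)  # x-fold compression
            print(f"{name}: {remaining}/{total} remaining "
                  f"({100.0 * remaining / total:.2f}%), x{rate:.1f}")
```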
In Table 2, we can see that VGG-16 is highly redundant, with only 0.6M meaningful weights out of 134M. Because the rate of remaining weights differs across layers, we can assume that the remaining rate of a specific layer indicates the importance of that layer. According to the remaining rates in Table 2, we select Conv 8 as a more important layer and Conv 12 and 13 as less important layers. As shown in Figure 5, by recovering Conv 8, i.e., leaving Conv 8 uncompressed, we achieve a more accurate result than the fully compressed model in Table 2 at the cost of 1M additional weights. However, by recovering Conv 12 or 13, the accuracy is similar to or even worse than that of the fully compressed model, even though these layers add 2M weights. From this experiment, we observe that if the remaining rate of a layer is higher, i.e., if the layer is less redundant after applying the proposed layer-wise weight pruning method, the layer can be regarded as more important.
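The recovery experiment in Figure 5 amounts to copying one layer's dense weights back from the uncompressed reference model before fine-tuning. A sketch; the example module name is hypothetical and depends on how the network is defined:

```python
import copy
import torch

def recover_layer(pruned: torch.nn.Module, reference: torch.nn.Module,
                  layer_name: str) -> None:
    """Undo pruning for a single layer by restoring its dense weights
    from the uncompressed reference model."""
    ref_modules = dict(reference.named_modules())
    for name, module in pruned.named_modules():
        if name == layer_name:
            module.weight.data = copy.deepcopy(ref_modules[name].weight.data)

# e.g., recover_layer(pruned_vgg, reference_vgg, "features.17")
# ("features.17" is a hypothetical name for Conv 8 in a torchvision-style VGG).
```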
Recently, efficient networks have been introduced by modifying the network structure or approximating the convolution computation, such as MobileNet [10]. MobileNet is a good option when developing applications that run on a mobile phone or an embedded board. However, depending on the target platform, an even smaller network may be needed due to limitations in memory or processing time. As shown in Table 3, the proposed method can reduce the number of weights to less than 20% of the original MobileNet, with an accuracy drop of less than 0.5%. Because the filter size of VGG-16 and MobileNet is 3×3, we also applied our method to GoogLeNet [29], which contains 1×1 and 5×5 filters as well. Our method compresses GoogLeNet to less than 10% of the original network, with an accuracy drop of less than 1%, as shown in Table 3. These experiments show that the proposed method can be applied to various networks, regardless of network efficiency and filter size.
To explore the limit of network size reduction, we tried to compress the network as much as possible. As shown in Figure 6a, the percentage of zeros ranges from 98.65% to 99.91% over 5 trials. Interestingly, when the percentage of zeros reaches the maximum value of 99.91%, the loss is not reduced at all as the retraining step proceeds, as shown in Figure 6b. Consequently, the accuracy does not increase in this case, as shown in Figure 6c. When the network is reduced so severely that a specific layer with very few remaining weights cannot transfer relevant information to the next layer, training cannot proceed due to the bottleneck problem explained in Section 1. When the percentage of zeros is below this maximum value, the loss is reduced and the accuracy increases, as shown in Figure 6b,c. Thus, there is a limit to network size reduction beyond which the network is untrainable due to the bottleneck problem. We should ensure that the compressed network does not reach this limit during the compression process; a minimal check for this condition is sketched below.
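The percentage of zeros in Figure 6a and the bottleneck condition can be monitored during compression with the following sketch; the per-layer floor used here is an assumed illustrative value, not a threshold from the paper.

```python
import torch

def percent_zeros(model: torch.nn.Module) -> float:
    """Overall percentage of zero-valued weights over all conv/FC layers."""
    total, zeros = 0, 0
    for m in model.modules():
        if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear)):
            total += m.weight.numel()
            zeros += int((m.weight == 0).sum())
    return 100.0 * zeros / total

def has_bottleneck(model: torch.nn.Module, min_remaining: float = 0.001) -> bool:
    """Flag any layer whose surviving-weight ratio falls below a floor
    (0.1% here is an assumed value); such a layer can no longer pass
    relevant information to the next layer."""
    for m in model.modules():
        if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear)):
            if float((m.weight != 0).sum()) / m.weight.numel() < min_remaining:
                return True
    return False
```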
In Equations (8) and (9), there are two control parameters, k and a second parameter, which can affect the performance of the proposed pruning method. Accordingly, Figure 7 shows the compression rate and accuracy for varying values of k and the second parameter. We tested k from 5.5 to 7 with a 0.5 interval, and the second parameter from 6 to 10 with a 1 interval. Each value of compression rate and accuracy is obtained by averaging over 5 runs. For small k, there is no further compression even as the second parameter increases. On the other hand, when k is sufficiently large, the network is compressed more as the second parameter increases, as shown in Figure 7a. In addition, in Figure 7b, the accuracy decreases relatively slowly for sufficiently large k. Therefore, the compression rate and accuracy exhibit a trade-off when a sufficiently large k is selected. We fix both parameters accordingly for all of our experiments.
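The sweep in Figure 7 is a small grid search over the two control parameters. A sketch, where `prune_and_retrain` is a caller-supplied (hypothetical) driver returning a (compression rate, accuracy) pair, and `p2` is a stand-in name for the second parameter of Equations (8) and (9):

```python
from statistics import mean
from typing import Callable, Dict, Tuple

def sweep(prune_and_retrain: Callable[[float, float], Tuple[float, float]],
          n_runs: int = 5) -> Dict[Tuple[float, float], Tuple[float, float]]:
    """Average (compression rate, accuracy) over n_runs per grid point."""
    k_values = [5.5, 6.0, 6.5, 7.0]   # k: 5.5 to 7 with a 0.5 interval
    p2_values = [6, 7, 8, 9, 10]      # second parameter: 6 to 10, 1 interval
    results = {}
    for k in k_values:
        for p2 in p2_values:
            runs = [prune_and_retrain(k, p2) for _ in range(n_runs)]
            rates, accs = zip(*runs)
            results[(k, p2)] = (mean(rates), mean(accs))
    return results
```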
As mentioned earlier, performance can degrade when specific layers are pruned too severely due to the bottleneck problem. Figure 8 shows the weight distributions after training for the original network, the network pruned at the same rate for every layer, and the network pruned based on the GMM, respectively. In layers with a small number of weights, shown in Figure 8a,d, the weight distribution under same-rate pruning differs from that of the original or GMM-based network, because the same pruning rate is applied to an initially small number of weights. Furthermore, in FC layer 3, most weights around zero are pruned. Conversely, in layers with a large number of weights, shown in Figure 8b,c, a great number of near-zero weights remain redundantly under same-rate pruning compared with GMM-based pruning. This layer-level analysis of weight distributions shows that the proposed GMM-based pruning method not only preserves a sufficient number of weights in small layers to avoid the bottleneck problem but also removes redundant weights to compress the network more effectively.
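The profiles in Figure 8 can be reproduced by histogramming each layer's surviving weights; a short matplotlib sketch:

```python
import matplotlib.pyplot as plt
import torch

def plot_weight_hist(module: torch.nn.Module, title: str) -> None:
    """Histogram of a layer's nonzero weights. Applying this to the same
    layer in the original, same-rate-pruned, and GMM-pruned models shows
    the qualitative differences discussed above."""
    w = module.weight.detach().cpu().flatten()
    w = w[w != 0]  # drop pruned weights so the zero spike does not dominate
    plt.hist(w.numpy(), bins=100)
    plt.title(title)
    plt.xlabel("weight value")
    plt.ylabel("count")
    plt.show()
```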
Figure 9 shows the number of remaining weights for each layer under same-rate pruning and GMM-based pruning, respectively. Compared with same-rate pruning, the proposed layer-wise pruning retains more weights in the Conv 5 to Conv 9 layers but fewer weights from Conv 9 to Conv 12. Because the FC1 and FC2 layers are pruned considerably, we can see that these layers are highly redundant compared with the other layers.
4.2. Semantic Segmentation
We apply the proposed method to another deep learning network to show its generality. Semantic segmentation is widely used in a variety of applications, such as self-driving. FCN (fully convolutional network) [5] introduced the 1×1 convolution to enable segmentation through DNNs. However, FCN is hard to run in real time due to its high computational cost. To reduce the computational cost, SegNet [30] was proposed, which reduces the number of trainable parameters so that it can execute in real time on devices with strong computing power, such as an Nvidia Titan X GPU. However, it still has 29M parameters, which cannot run in real time on resource-limited devices.
We pre-trained a SegNet model on the NYUv2 dataset with an SGD optimizer and a decaying learning rate, implemented using the PyTorch library (version 1.5.0). We attempt to compress the model by over 98%, as in Section 4.1. We compute the pixel-wise accuracy and mIoU (mean intersection over union) to evaluate our method's performance. In Table 4, the accuracy and mIoU of the reference model are 57.71 and 20.50, respectively. When compressing this model to ×50 using the same-rate pruning method [23], the accuracy and mIoU drop to 53.64 and 16.90, respectively. Our compression method reaches a compression rate of ×64 while keeping the accuracy and mIoU at 55.27 and 17.90, respectively. For the semantic segmentation network, the proposed method thus shows a smaller accuracy drop as well as a higher compression rate compared with the conventional pruning method [23]. As a result, only 0.44M out of 29M parameters are used, with a moderate accuracy drop. Because semantic segmentation must compute pixel-level classification labels, it suffers larger accuracy drops than image-level classification, which suggests that the weights in semantic segmentation are less redundant.
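For reference, both reported metrics can be computed from a confusion matrix over the predicted and ground-truth label maps. This is a standard sketch, not the paper's exact evaluation code:

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, n_classes: int):
    """Pixel-wise accuracy and mean IoU (both in %) from integer label maps.

    mIoU here is averaged over all classes; the handling of classes absent
    from the ground truth may differ from the paper's evaluation.
    """
    valid = (gt >= 0) & (gt < n_classes)  # ignore invalid/void labels
    conf = np.bincount(n_classes * gt[valid] + pred[valid],
                       minlength=n_classes ** 2).reshape(n_classes, n_classes)
    pixel_acc = np.diag(conf).sum() / conf.sum()
    union = conf.sum(0) + conf.sum(1) - np.diag(conf)
    iou = np.diag(conf) / np.maximum(union, 1)
    return 100.0 * pixel_acc, 100.0 * iou.mean()
```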
In Table 5, the remaining weights and compression rates are shown for each layer when applying the proposed method. The initial layers of the encoder network are more relevant than its last layers. Conversely, the initial layers of the decoder network are more redundant than its last layers, except for the connecting layer between the encoder and decoder networks. This information can be helpful when designing an efficient network for semantic segmentation.
To further show the effectiveness of the proposed pruning method, we apply it to a recent efficient semantic segmentation network, ENet [6]. Although ENet was designed to achieve more than 10 fps on a low-resource device, the Nvidia TX1, network compression makes it applicable to embedded boards with lower specifications or without a GPU. As shown in Table 6, the proposed method not only removes half of the weights of the original ENet [6] but also slightly improves the mIoU. This accuracy improvement may arise because pruning has an effect similar to dropout, regularizing the network. If we prune more than half of the weights, the mIoU drops noticeably. This tendency follows the generalization-stability trade-off [43] and shows that the redundancy of ENet is not as severe as that of SegNet. In summary, the proposed method can be applied to the semantic segmentation networks SegNet and ENet to reduce the number of weights while maintaining accuracy.