# Layer-Wise Compressive Training for Convolutional Neural Networks


## Abstract


## 1. Introduction

## 2. Background

#### 2.1. Overview of Training Algorithms

#### 2.2. Pruning

## 3. Compressive Training

#### $\sigma $-Constrained Stochastic Gradient Descent

## 4. A Greedy Approach for Compressive Training

#### 4.1. Pre-Training

**Step 0—Trained CNN Model**. The input of the proposed flow is the trained model of the CNN that needs to be optimized. Our solution is designed to work on classical floating-point CNN models; however, it can also be applied to quantized CNNs. It works equally well on top of pre-trained floating-point CNN models or on clean models after a standard training process.

#### 4.2. Setup

**Step 1—Pruning**. This step consists of standard magnitude-based pruning applied to both convolutional and fully-connected layers. The user specifies an a priori value for the desired sparsity percentage of the network. Since this value is unique for the entire CNN, each layer may end up with a different pruning percentage, which allows the CNN model to be represented with non-homogeneous inter-layer sparsity. We follow this direction under the assumption that each layer influences the knowledge of the CNN differently, i.e., each layer provides a specific contribution to the final prediction. For this reason, the layers do not all hold the same amount of information: knowledge is spread heterogeneously among the layers, and hence different layers retain different percentages of redundant parameters.
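As a minimal NumPy sketch (not the paper's implementation), the key property of this step is that a *single global* magnitude threshold enforces the network-wide sparsity target, which naturally yields different sparsity levels per layer; `global_magnitude_prune` and `layer_sparsity` are hypothetical helper names introduced here for illustration:

```python
import numpy as np

def global_magnitude_prune(layers, target_sparsity):
    """Zero out the smallest-magnitude weights across ALL layers at once.

    One global threshold enforces `target_sparsity` network-wide, so each
    layer ends up with its own (non-uniform) sparsity level.
    """
    all_weights = np.concatenate([w.ravel() for w in layers.values()])
    k = int(target_sparsity * all_weights.size)
    # Global threshold: magnitude of the k-th smallest weight overall.
    threshold = np.sort(np.abs(all_weights))[k - 1] if k > 0 else -1.0
    pruned, masks = {}, {}
    for name, w in layers.items():
        mask = np.abs(w) > threshold
        pruned[name] = w * mask
        masks[name] = mask
    return pruned, masks

def layer_sparsity(w):
    """Fraction of zero-valued weights in a single layer."""
    return 1.0 - np.count_nonzero(w) / w.size
```

Pruning two toy layers with different magnitude ranges at a 50% global target drives the small-magnitude layer to high sparsity while leaving the large-magnitude layer dense, illustrating the non-homogeneous inter-layer sparsity discussed above.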

**Step 2—Layers Sorting**. It is known that some layers are more significant than others: compressing less significant layers only marginally degrades the overall classification performance, whereas the most significant layers should be preserved in their original form. As a rule of thumb, we select intra-layer sparsity as the measure of significance. More specifically, we argue that layers with lower intra-layer sparsity are those that play a major role in the classification process, whereas those with higher intra-layer sparsity can be sacrificed to achieve a more compact CNN representation. In other words, we base our concept of significance on the number of activated neurons.
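The sorting criterion above can be sketched in a few lines (a NumPy illustration; `sort_layers_by_sparsity` is a hypothetical helper name, not the paper's code):

```python
import numpy as np

def sort_layers_by_sparsity(layers):
    """Rank layer names from most to least sparse after pruning.

    Sparser layers are deemed less significant and become candidates
    for compression first; denser layers are preserved longer.
    """
    sparsity = {name: 1.0 - np.count_nonzero(w) / w.size
                for name, w in layers.items()}
    return sorted(sparsity, key=sparsity.get, reverse=True)
```

Compression then proceeds down this sorted list, consuming the most expendable layers first.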

#### 4.3. Optimization

**Step 3—Re-Training**. The retraining phase is applied in order to recover the accuracy loss due to pruning. It is applied first after pruning, and then after each optimization loop.

**Step 4—Compression**. This is the compressive training described in Section 3. The weights are projected into a sub-dimensional space composed of just three values per layer, i.e., $(-\sigma, 0, +\sigma)$, with $\sigma$ defined layer-wise.
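A minimal sketch of such a ternary projection. Two assumptions are made here for illustration only: $\sigma$ is estimated as the mean absolute value of the non-zero weights, and a dead-zone threshold of $0.5\sigma$ decides which weights collapse to zero; the paper instead obtains $\sigma$ through the $\sigma$-constrained training of Section 3.

```python
import numpy as np

def ternarize(w, sigma=None):
    """Project a weight tensor onto the set {-sigma, 0, +sigma}.

    If sigma is not given, estimate it as the mean absolute value of the
    non-zero weights (an assumption of this sketch, not the paper's
    trained sigma).
    """
    if sigma is None:
        nz = np.abs(w[w != 0])
        sigma = nz.mean() if nz.size else 0.0
    # Weights with magnitude below half of sigma collapse to zero; the
    # rest keep only their sign, scaled by the layer-wise sigma.
    out = np.where(np.abs(w) < 0.5 * sigma, 0.0, np.sign(w) * sigma)
    return out, sigma
```

After projection every layer can be stored with 2-bit weights plus a single floating-point $\sigma$, which is the source of the compression rates reported in Section 5.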

**Step 5—Validation**. The model is validated in order to quantify the accuracy loss due to compression, and thus to decide whether it is worth continuing with further compression. Validation is a paramount step for the greedy approach, as it is what enables an accuracy-driven optimization. The accuracy $Acc_n$ is evaluated and stored after each compression epoch $n$.

**Step 6—Condition 1 (C1)**. The accuracy recorded during the $n$-th epoch ($Acc_n$) is used to determine whether the CNN model can be further compressed, as in Equation (18). The accuracy of the pre-trained model ($Acc_0$) serves as the baseline, whereas the parameter $\epsilon$ represents the user-defined accuracy loss tolerance:

**Step 7—Update**. This stage is applied if Equation (18) is verified. The counter $N$ indicates how many layers of the sorted list can be compressed. Each time $C1$ evaluates to true, $N$ is incremented by $\Delta N$. The latter represents another granularity knob, which affects the speed of the framework; $\Delta N$ is mainly determined by the network size: the larger the CNN model, the larger $\Delta N$.

**Step 8—Condition 2 (C2)**. This last condition is based on the maximum number of epochs ${n}_{max}$, a user-defined hyperparameter. At the $n$-th iteration, if more than ${n}_{max}$ epochs have elapsed, the algorithm stops; otherwise, the flow iterates again from Step 3.
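One plausible reading of Steps 3–8 as a single loop, sketched below. The callables `retrain`, `compress_first_n`, and `validate` are hypothetical placeholders for the framework's actual retraining, $\sigma$-projection, and evaluation routines; only the control flow (C1 gating the update of $N$, C2 bounding the epochs) is taken from the text.

```python
def greedy_compress(model, acc0, eps, n_max, delta_n, sorted_layers,
                    retrain, compress_first_n, validate):
    """Greedy accuracy-driven compression loop (Steps 3-8), as a sketch.

    `retrain`, `compress_first_n`, and `validate` are placeholders for
    the framework's routines, not the paper's API.
    """
    n_compress = delta_n                   # N: layers eligible for compression
    for epoch in range(1, n_max + 1):      # C2: stop after n_max epochs
        retrain(model)                                       # Step 3
        compress_first_n(model, sorted_layers[:n_compress])  # Step 4
        acc_n = validate(model)                              # Step 5
        if acc0 - acc_n <= eps:                              # Step 6 (C1)
            # Step 7: widen the compression front by delta_n layers.
            n_compress = min(n_compress + delta_n, len(sorted_layers))
    return model, n_compress
```

With $\Delta N = 2$ and three epochs in which C1 always holds, the front grows from 2 to 8 layers, matching the incremental behavior described above.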

## 5. Results

**CNN models**—The adopted CNN models are trained from scratch or retrained from the Torchvision package of PyTorch [24]. More specifically, we adopted the following CNNs: AlexNet [1], VGG [25], and several residual networks with increasing complexity [26].

**Datasets**—CIFAR10 and CIFAR100 [27] are two widely used image recognition benchmarks that, as their names suggest, differ in the number of available labels. Both consist of 60,000 RGB images. The raw $32\times 32$ data are pre-processed using standard contrast normalization. We applied standard data augmentation composed of 4-pixel zero-padding with a random horizontal flip. Each dataset is split into 50,000 images for training and 10,000 for validation; the training and validation sets do not overlap. The tested CNNs are AlexNet [1], VGG [25], and residual networks [26].

#### 5.1. Performance

#### 5.2. Comparison with the State-of-the-Art

## 6. Conclusions and Future Works

## Author Contributions

## Funding

## Conflicts of Interest

## References

1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105.
2. Poplin, R.; Varadarajan, A.V.; Blumer, K.; Liu, Y.; McConnell, M.V.; Corrado, G.S.; Peng, L.; Webster, D.R. Predicting cardiovascular risk factors from retinal fundus photographs using deep learning. arXiv 2017, arXiv:1708.09843.
3. Milletari, F.; Navab, N.; Ahmadi, S.A. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571.
4. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; Van Der Laak, J.A.; Van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. 2017, 42, 60–88.
5. Schlüter, J.; Grill, T. Exploring data augmentation for improved singing voice detection with neural networks. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR 2015), Malaga, Spain, 26–30 October 2015; pp. 121–126.
6. Wang, J.; Chen, Y.; Hao, S.; Peng, X.; Hu, L. Deep learning for sensor-based activity recognition: A survey. Pattern Recognit. Lett. 2018, in press.
7. Bojarski, M.; Del Testa, D.; Dworakowski, D.; Firner, B.; Flepp, B.; Goyal, P.; Jackel, L.D.; Monfort, M.; Muller, U.; Zhang, J.; et al. End to end learning for self-driving cars. arXiv 2016, arXiv:1604.07316.
8. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
9. Ba, J.; Caruana, R. Do deep nets really need to be deep? arXiv 2014, arXiv:1312.6184.
10. Rigamonti, R.; Sironi, A.; Lepetit, V.; Fua, P. Learning separable filters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2754–2761.
11. Jaderberg, M.; Vedaldi, A.; Zisserman, A. Speeding up convolutional neural networks with low rank expansions. arXiv 2014, arXiv:1405.3866.
12. LeCun, Y.; Denker, J.S.; Solla, S.A. Optimal brain damage. In Advances in Neural Information Processing Systems; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1990; pp. 598–605.
13. Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; Graf, H.P. Pruning filters for efficient convnets. arXiv 2016, arXiv:1608.08710.
14. Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv 2015, arXiv:1510.00149.
15. Gong, Y.; Liu, L.; Yang, M.; Bourdev, L. Compressing deep convolutional networks using vector quantization. arXiv 2014, arXiv:1412.6115.
16. Wu, J.; Leng, C.; Wang, Y.; Hu, Q.; Cheng, J. Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4820–4828.
17. Rastegari, M.; Ordonez, V.; Redmon, J.; Farhadi, A. XNOR-Net: ImageNet classification using binary convolutional neural networks. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 525–542.
18. Courbariaux, M.; Bengio, Y.; David, J.P. BinaryConnect: Training deep neural networks with binary weights during propagations. arXiv 2015, arXiv:1511.00363.
19. Zhu, C.; Han, S.; Mao, H.; Dally, W.J. Trained ternary quantization. arXiv 2016, arXiv:1612.01064.
20. Li, F.B.; Zhang, B.L. Ternary weight networks. arXiv 2016, arXiv:1605.04711.
21. Hashemi, S.; Anthony, N.; Tann, H.; Bahar, R.I.; Reda, S. Understanding the impact of precision quantization on the accuracy and energy of neural networks. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), Lausanne, Switzerland, 27–31 March 2017; pp. 1474–1479.
22. Grimaldi, M.; Pugliese, F.; Tenace, V.; Calimera, A. A compression-driven training framework for embedded deep neural networks. In Proceedings of the Workshop on INTelligent Embedded Systems Architectures and Applications, Turin, Italy, 4 October 2018; pp. 45–50.
23. Han, S.; Pool, J.; Tran, J.; Dally, W. Learning both weights and connections for efficient neural network. arXiv 2015, arXiv:1506.02626.
24. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in PyTorch. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
25. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
27. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009.
28. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009.
29. Zhou, S.; Wu, Y.; Ni, Z.; Zhou, X.; Wen, H.; Zou, Y. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv 2016, arXiv:1606.06160.

**Figure 2.** Weights distribution of the second layer of AlexNet trained on CIFAR10: after the preliminary training (**left**), after pruning (**center**), and after compression (**right**).

**Figure 5.** AlexNet on ImageNet: the layers after the sorting algorithm; the sparsity value $S$ is reported for each layer.

**Figure 6.** Training plot of the VGG-19 architecture on the CIFAR10 dataset. The red line shows the validation accuracy after Re-Training and Compression (Steps 3 and 4), while the blue line shows the validation accuracy trend on the $\sigma$-constrained solution space (after Compression, Step 4).

**Table 1.** Validation accuracy results of compressive training on CNNs trained on CIFAR10. For each CNN model, the accuracy loss (Acc. Loss) refers to the baseline accuracy reported in parentheses.

| CIFAR10 (Baseline) | Accuracy (OUR) | Acc. Loss | Compression Rate |
|---|---|---|---|
| AlexNet (77.22%) | 76.44% | 0.78% | $26.4\times$ |
| VGG19_bn (93.02%) | 92.20% | 0.82% | $6.7\times$ |
| ResNet110 (93.81%) | 93.32% | 0.49% | $8.0\times$ |

**Table 2.** Validation accuracy results of compressive training on CNNs trained on CIFAR100. For each CNN model, the accuracy loss (Acc. Loss) refers to the baseline accuracy reported in parentheses.

| CIFAR100 (Baseline) | Accuracy (OUR) | Acc. Loss | Compression Rate |
|---|---|---|---|
| AlexNet (44.01%) | 43.47% | 0.54% | $26.4\times$ |
| VGG19_bn (71.95%) | 71.62% | 0.33% | $6.5\times$ |
| ResNet110 (71.14%) | 70.80% | 0.34% | $7.3\times$ |

**Table 3.** Layer-wise sparsity and width before (Full Precision, FP) and after (Compressed Model, CM) the application of the compressive greedy technique on the ResNet20 architecture (CIFAR10). Convolutional layers have input shape $(n, c_{in}, k_h, k_w)$, and fully-connected layers have input shape $(n, c_{in})$, where $k_h$ and $k_w$ are, respectively, the height and width of the input planes in pixels, $c_{in}$ denotes the number of input channels, and $n$ the batch size.

| Layer | Input Shape | Sparsity [%] (FP) | Width [Bit] (FP) | Sparsity [%] (CM) | Width [Bit] (CM) |
|---|---|---|---|---|---|
| Conv1 | (16,3,3,3) | 0 | 32 | 41 | 32 |
| Conv2 | (16,16,3,3) | 0 | 32 | 18 | 2 |
| Conv3 | (16,16,3,3) | 0 | 32 | 45 | 32 |
| Conv4 | (16,16,3,3) | 0 | 32 | 23 | 2 |
| Conv5 | (16,16,3,3) | 0 | 32 | 24 | 2 |
| Conv6 | (16,16,3,3) | 0 | 32 | 31 | 2 |
| Conv7 | (16,16,3,3) | 0 | 32 | 37 | 2 |
| Conv8 | (32,16,3,3) | 0 | 32 | 32 | 2 |
| Conv9 | (32,32,3,3) | 0 | 32 | 45 | 2 |
| Conv10 | (32,16,1,1) | 0 | 32 | 29 | 32 |
| Conv11 | (32,32,3,3) | 0 | 32 | 44 | 2 |
| Conv12 | (32,32,3,3) | 0 | 32 | 50 | 2 |
| Conv13 | (32,32,3,3) | 0 | 32 | 53 | 2 |
| Conv14 | (32,32,3,3) | 0 | 32 | 62 | 2 |
| Conv15 | (64,32,3,3) | 0 | 32 | 44 | 2 |
| Conv16 | (64,64,3,3) | 0 | 32 | 58 | 2 |
| Conv17 | (64,32,1,1) | 0 | 32 | 50 | 32 |
| Conv18 | (64,64,3,3) | 0 | 32 | 70 | 2 |
| Conv19 | (64,64,3,3) | 0 | 32 | 87 | 2 |
| Conv20 | (64,64,3,3) | 0 | 32 | 90 | 2 |
| Conv21 | (64,64,3,3) | 0 | 32 | 94 | 2 |
| Fc1 | (10,64) | 0 | 32 | 39 | 32 |
| **Final Sparsity** | | 0.00% | | 68.16% | |
| **Compression Rate** | | — | | $6.1\times$ | |
| **Accuracy** | | 93.02% | | 92.47% | |

**Table 4.** Layer-wise sparsity and width before (Full Precision, FP) and after (Compressed Model, CM) the application of the compressive greedy technique on the AlexNet architecture (CIFAR100). Convolutional layers have input shape $(n, c_{in}, k_h, k_w)$, and fully-connected layers have input shape $(n, c_{in})$, where $k_h$ and $k_w$ are, respectively, the height and width of the input planes in pixels, $c_{in}$ denotes the number of input channels, and $n$ the batch size.

| Layer | Input Shape | Sparsity [%] (FP) | Width [Bit] (FP) | Sparsity [%] (CM) | Width [Bit] (CM) |
|---|---|---|---|---|---|
| Conv1 | (64,3,11,11) | 0 | 32 | 39 | 32 |
| Conv2 | (192,64,5,5) | 0 | 32 | 57 | 2 |
| Conv3 | (384,192,3,3) | 0 | 32 | 68 | 2 |
| Conv4 | (256,384,3,3) | 0 | 32 | 55 | 2 |
| Conv5 | (256,256,3,3) | 0 | 32 | 66 | 2 |
| Fc1 | (10,256) | 0 | 32 | 18 | 32 |
| **Final Sparsity** | | 0.00% | | 60.61% | |
| **Compression Rate** | | — | | $26.7\times$ | |
| **Accuracy** | | 44.01% | | 43.47% | |

**Table 5.** Validation accuracy results of compressive training on CNNs trained on CIFAR10, compared with the TTN approach [19]. For each CNN model, the accuracy loss (Acc. Loss) refers to the baseline accuracy reported in parentheses. Compression rates are reported only for our solution.

| CIFAR10 | ResNet20 (93.02%) Accuracy | ResNet20 Acc. Loss | ResNet56 (93.65%) Accuracy | ResNet56 Acc. Loss |
|---|---|---|---|---|
| OUR | 92.47% | 0.55% | 93.04% | 0.61% |
| TTN [19] | 91.13% | 1.89% | 93.56% | 0.09% |
| Compression Rate (OUR) | $6.1\times$ | | $6.9\times$ | |

**Table 6.** Validation accuracy results of compressive training on CNNs trained on ImageNet, compared with other compressive approaches. For each CNN model, the accuracy loss (Acc. Loss) refers to the baseline accuracy reported in parentheses. Compression rates are reported only for our solution.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Grimaldi, M.; Tenace, V.; Calimera, A.
Layer-Wise Compressive Training for Convolutional Neural Networks. *Future Internet* **2019**, *11*, 7.
https://doi.org/10.3390/fi11010007
