# ThriftyNets: Convolutional Neural Networks with Tiny Parameter Budget


## Abstract


## 1. Introduction

## 2. Related Work

#### 2.1. Pruning

#### 2.2. Quantization

#### 2.3. Distillation

#### 2.4. Efficient Scaling

#### 2.5. Factorization

#### 2.6. Recurrent Residual Networks as ODE

## 3. Methodology

#### 3.1. Context

#### 3.2. Thrifty Networks

A **ThriftyNet** is then defined by the following recursive sequence:

$$\mathbf{x}_0 = \mathrm{PAD}(\mathbf{x}), \qquad \mathbf{x}_{t+1} = \mathcal{D}_t\left(\mathrm{BN}_t\left(\mathbf{x}_t + \sigma\left(C \circledast \mathbf{x}_t\right)\right)\right), \qquad \hat{\mathbf{y}} = \mathrm{FC}\left(\mathrm{Pool}\left(\mathbf{x}_T\right)\right), \quad (1)$$

where $C \circledast \mathbf{x}_t$ denotes the single shared convolution applied to $\mathbf{x}_t$, $\sigma$ is a non-linear activation, `PAD` creates extra channels filled with constant values to extend the dimension of $\mathbf{x}$, `BN` is a batch normalization layer, ${\mathcal{D}}_{t}$ is a downsampling operation (typically achieved with strides or pooling) or the identity function, and `FC` is a final fully connected layer. Note that the `PAD` function is only used on the input of the network, as a means to adapt the input tensor shape to the convolutional layer. An overview of the proposed method is depicted in Figure 1.
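For intuition, the recursion can be sketched in a few lines of NumPy, under simplifying assumptions that are ours, not the paper's: a 1 × 1 convolution (a plain channel-mixing matrix `W`) stands in for the shared spatial convolution, a crude whole-tensor normalization replaces the per-iteration batch norm, and the helper names (`pad_channels`, `forward`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

f, T = 8, 6                       # filters and iterations (toy sizes)
W = rng.normal(0, 0.1, (f, f))    # single shared 1x1 "convolution" (sketch)
fc = rng.normal(0, 0.1, (f, 10))  # final fully connected layer, 10 classes
downsample_at = {2, 4}            # iterations after which D_t pools 2x2

def pad_channels(x, f):
    """PAD: extend a (c, H, W) input to f channels, extra channels zeroed."""
    c, H, Wd = x.shape
    out = np.zeros((f, H, Wd))
    out[:c] = x
    return out

def forward(img):
    x = pad_channels(img, f)                   # x_0 = PAD(input)
    for t in range(T):
        y = np.einsum('ij,jhw->ihw', W, x)     # shared convolution
        x = x + np.maximum(y, 0.0)             # shortcut + ReLU
        x = (x - x.mean()) / (x.std() + 1e-5)  # crude stand-in for BN_t
        if t in downsample_at:                 # D_t: 2x2 max pooling
            c, H, Wd = x.shape
            x = x.reshape(c, H // 2, 2, Wd // 2, 2).max(axis=(2, 4))
    feats = x.max(axis=(1, 2))                 # global max pooling
    return feats @ fc                          # FC -> class scores

logits = forward(rng.normal(size=(3, 8, 8)))
print(logits.shape)  # (10,)
```

The key point the sketch makes concrete is weight sharing: the same matrix `W` is reused at every one of the `T` iterations, so the convolutional parameter count does not grow with depth.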

#### 3.3. Augmented Thrifty Networks

An **augmented thrifty network** adds $T(h+1)$ parameters on top of a regular thrifty network, with $h$ being a hyperparameter representing how many steps of history are kept in memory when processing a new iteration. These parameters are grouped in a matrix $\alpha$; they are the coefficients weighting the contribution of past activations at each step. In augmented thrifty nets, Equation (1) is replaced as follows:

$$\mathbf{x}_{t+1} = \mathcal{D}_t\left(\mathrm{BN}_t\left(\alpha_{t,0}\,\sigma\left(C \circledast \mathbf{x}_t\right) + \sum_{i=1}^{h} \alpha_{t,i}\,\mathbf{x}_{t+1-i}\right)\right). \quad (2)$$
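The history-weighted update can be sketched the same way. This is an illustrative approximation, not the paper's implementation: `step` is our name, `alpha` holds the $T(h+1)$ shortcut coefficients, and a whole-tensor normalization again stands in for batch norm (downsampling is omitted for brevity).

```python
import numpy as np

rng = np.random.default_rng(1)
f, T, h = 8, 6, 3
W = rng.normal(0, 0.1, (f, f))           # single shared 1x1 "convolution"
alpha = rng.normal(0, 0.5, (T, h + 1))   # T(h+1) shortcut coefficients

def step(history, t):
    """One augmented iteration: convolve x_t, then take a weighted sum
    with up to h previous activations, before normalization."""
    x_t = history[-1]
    y = np.maximum(np.einsum('ij,jhw->ihw', W, x_t), 0.0)
    out = alpha[t, 0] * y
    for i in range(1, h + 1):            # history[-1] is x_t itself
        if len(history) >= i:
            out = out + alpha[t, i] * history[-i]
    return (out - out.mean()) / (out.std() + 1e-5)

history = [rng.normal(size=(f, 4, 4))]   # x_0
for t in range(T):
    history.append(step(history, t))
print(history[-1].shape)  # (8, 4, 4)
```

Early in the loop fewer than `h` past activations exist, which is why the sum is guarded; after `h` iterations every step mixes the full window of history.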

#### 3.4. Pooling Strategy

#### 3.5. Grouped Convolutions

#### 3.6. Hyperparameters and Size of the Model

- The number of filters $f$ in the convolutional layer
- The dimensions $(a,b)$ of the convolution kernel
- The number of iterations $T$
- The number of steps $h$ to consider when performing shortcuts from previous activations
- The downsampling sequence ${\left({\mathcal{D}}_{t}\right)}_{t}$
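These hyperparameters fully determine the parameter budget. As a sanity check, the size formulas summarized in Table 1 can be evaluated with a small helper (a sketch; `thrifty_param_count` is our name, not an API from the paper):

```python
def thrifty_param_count(f, a, b, T, grouped=False, h=0):
    """Parameter count of a (possibly augmented) ThriftyNet, per Table 1.

    f       : number of filters in the single shared convolution
    (a, b)  : spatial dimensions of the convolution kernel
    T       : number of iterations
    grouped : use the grouped-convolution variant
    h       : history length of the augmented variant (h = 0 -> plain)
    """
    conv = f * (a * b + f) if grouped else f * f * a * b  # shared conv
    bn = 2 * f * T         # per-iteration batch-norm scale and shift
    shortcut = h * T       # extra shortcut coefficients (augmented only)
    return conv + bn + shortcut

print(thrifty_param_count(128, 3, 3, 20))                # 152576
print(thrifty_param_count(128, 3, 3, 20, grouped=True))  # 22656
```

For example, with $f=128$, $a=b=3$ and $T=20$, a classical ThriftyNet holds 152,576 parameters, while the grouped variant needs only 22,656; the augmented shortcut coefficients add a comparatively negligible $hT$ on top.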

#### 3.7. Depth and Abstraction

## 4. Experiments

#### 4.1. Impact of Data Augmentation

#### 4.2. Comparison with Standard Architectures

#### 4.3. Factorization and Filter Usage

#### 4.4. Efficient ThriftyNets

#### 4.5. Effect of the Number of Iterations

#### 4.6. Effect of the Number of Filters

#### 4.7. Effect of the Number of Downsamplings

#### 4.8. Freezing the Shortcut Parameters in an Augmented ThriftyNet

- (a) The same model without resetting the other parameters
- (b) The same model starting from the same initialization
- (c) The same model starting from another (random) initialization

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Han, S.; Pool, J.; Tran, J.; Dally, W. Learning both weights and connections for efficient neural network. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–10 December 2015; pp. 1135–1143.
2. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv **2015**, arXiv:1503.02531.
3. Courbariaux, M.; Bengio, Y.; David, J.P. BinaryConnect: Training deep neural networks with binary weights during propagations. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–10 December 2015; pp. 3123–3131.
4. Wu, J.; Wang, Y.; Wu, Z.; Wang, Z.; Veeraraghavan, A.; Lin, Y. Deep k-Means: Re-training and parameter sharing with harder cluster assignments for compressing deep convolutions. arXiv **2018**, arXiv:1806.09228.
5. Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv **2015**, arXiv:1510.00149.
6. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv **2017**, arXiv:1704.04861.
7. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv **2016**, arXiv:1602.07360.
8. Tan, M.; Le, Q.V. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv **2019**, arXiv:1905.11946.
9. LeCun, Y.; Denker, J.S.; Solla, S.A. Optimal brain damage. In Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA, 26–29 November 1990; pp. 598–605.
10. Blalock, D.; Ortiz, J.J.G.; Frankle, J.; Guttag, J. What is the state of neural network pruning? arXiv **2020**, arXiv:2003.03033.
11. Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; Graf, H.P. Pruning filters for efficient ConvNets. arXiv **2016**, arXiv:1608.08710.
12. Luo, J.H.; Wu, J.; Lin, W. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5058–5066.
13. Hacene, G.B.; Lassance, C.; Gripon, V.; Courbariaux, M.; Bengio, Y. Attention based pruning for shift networks. arXiv **2019**, arXiv:1905.12300.
14. Gupta, S.; Agrawal, A.; Gopalakrishnan, K.; Narayanan, P. Deep learning with limited numerical precision. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1737–1746.
15. Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Binarized neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 4107–4115.
16. Merolla, P.A.; Arthur, J.V.; Alvarez-Icaza, R.; Cassidy, A.S.; Sawada, J.; Akopyan, F.; Jackson, B.L.; Imam, N.; Guo, C.; Nakamura, Y.; et al. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science **2014**, 345, 668–673.
17. Farabet, C.; Martini, B.; Corda, B.; Akselrod, P.; Culurciello, E.; LeCun, Y. NeuFlow: A runtime reconfigurable dataflow processor for vision. In Proceedings of the CVPR 2011 Workshops, Colorado Springs, CO, USA, 20–25 June 2011; pp. 109–116.
18. Gong, Y.; Liu, L.; Yang, M.; Bourdev, L. Compressing deep convolutional networks using vector quantization. arXiv **2014**, arXiv:1412.6115.
19. Denton, E.L.; Zaremba, W.; Bruna, J.; LeCun, Y.; Fergus, R. Exploiting linear structure within convolutional networks for efficient evaluation. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 1269–1277.
20. Choi, Y.; El-Khamy, M.; Lee, J. Towards the limit of network quantization. arXiv **2016**, arXiv:1612.01543.
21. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. FitNets: Hints for thin deep nets. arXiv **2014**, arXiv:1412.6550.
22. Koratana, A.; Kang, D.; Bailis, P.; Zaharia, M. LIT: Block-wise intermediate representation training for model compression. arXiv **2018**, arXiv:1810.01937.
23. Furlanello, T.; Lipton, Z.C.; Tschannen, M.; Itti, L.; Anandkumar, A. Born again neural networks. arXiv **2018**, arXiv:1805.04770.
24. Park, W.; Kim, D.; Lu, Y.; Cho, M. Relational knowledge distillation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3967–3976.
25. Lassance, C.; Bontonou, M.; Hacene, G.B.; Gripon, V.; Tang, J.; Ortega, A. Deep geometric knowledge distillation with graphs. arXiv **2019**, arXiv:1911.03080.
26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
27. Zagoruyko, S.; Komodakis, N. Wide residual networks. arXiv **2016**, arXiv:1605.07146.
28. Chen, W.; Wilson, J.; Tyree, S.; Weinberger, K.; Chen, Y. Compressing neural networks with the hashing trick. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2285–2294.
29. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258.
30. Zhang, L.; Schaeffer, H. Forward stability of ResNet and its variants. J. Math. Imaging Vis. **2020**, 62, 328–351.
31. Avelin, B.; Nyström, K. Neural ODEs as the deep limit of ResNets with constant weights. arXiv **2019**, arXiv:1906.12183.
32. Chen, T.Q.; Rubanova, Y.; Bettencourt, J.; Duvenaud, D.K. Neural ordinary differential equations. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 6571–6583.
33. Jacobsen, J.H.; Smeulders, A.; Oyallon, E. i-RevNet: Deep invertible networks. arXiv **2018**, arXiv:1802.07088.
34. Behrmann, J.; Grathwohl, W.; Chen, R.T.; Duvenaud, D.; Jacobsen, J.H. Invertible residual networks. arXiv **2018**, arXiv:1811.00995.
35. Liao, Q.; Poggio, T. Bridging the gaps between residual learning, recurrent neural networks and visual cortex. arXiv **2016**, arXiv:1604.03640.
36. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images. 2009. Available online: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.222.9220&rep=rep1&type=pdf (accessed on 30 March 2020).
37. Foret, P.; Kleiner, A.; Mobahi, H.; Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. arXiv **2020**, arXiv:2010.01412.
38. Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. AutoAugment: Learning augmentation policies from data. arXiv **2018**, arXiv:1805.09501.
39. DeVries, T.; Taylor, G.W. Improved regularization of convolutional neural networks with Cutout. arXiv **2017**, arXiv:1708.04552.
40. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv **2017**, arXiv:1710.09412.
41. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision, Thessaloniki, Greece, 23–25 September 2019; pp. 6023–6032.
42. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105.
43. Paupamah, K.; James, S.; Klein, R. Quantisation and pruning for neural network compression and regularisation. arXiv **2020**, arXiv:2001.04850.
44. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
45. Zhu, M.; Chang, B.; Fu, C. Convolutional neural networks combined with Runge–Kutta methods. arXiv **2018**, arXiv:1802.08831.

**Figure 1.** Flow diagram of our algorithm (**bottom**). The typical three-channel input is first padded with zeros to match a predetermined number of filters. ThriftyNet then performs a user-defined number of iterations $T$, each consisting of a convolution with the shared filter, a non-linear activation, a shortcut, and a batch normalization (**left** box). Alternatively, an augmented ThriftyNet performs the same operation, followed by a weighted sum of the result with previous iterations before the normalization step (**right** box). In both cases, the final tensor ${x}_{T}$ is fed into a global max pooling, extracting one feature per filter, and then into a fully connected layer mapping it to the output classes. The resulting architecture contains very few parameters, mostly determined by the number of feature maps in the convolutional layer.

**Figure 3.**Accuracy on CIFAR-10 of various ThriftyNets of 40 k parameters, as a function of the number of Macs they perform. In blue, the points are obtained by considering the irregular spacing of downsamplings. In red, the points are obtained by varying the number of iterations, while maintaining a regular spacing between downsamplings.

**Figure 4.**Accuracy on CIFAR-10 as a function of the number of iterations. The number of parameters is fixed at 40 k, and pooling is evenly spaced in order to be performed 4 times, plus one last time in the end. Experiments were performed 5 times and the mean accuracy is plotted.

**Figure 5.** Accuracy on CIFAR-10 (**left**) and SVHN (**right**) as a function of the number of parameters. Experiments were performed for different numbers of iterations ($T$) of an augmented ThriftyNet with $h=5$. Pooling was performed every $T/5$ iterations. The number of filters varied from 32 to 256 by increments of 32.

**Figure 6.** Accuracy on CIFAR-10 as a function of the number of downsamplings. The network performs 30 iterations, and downsamplings are evenly spaced, with a global max pooling at the very end.

**Table 1.**Summary of the number of parameters of the proposed models, as functions of the hyperparameters.

Model | Convolutions | Size |
---|---|---|
ThriftyNet | Classical | ${f}^{2}ab+2fT$ |
ThriftyNet | Grouped | $f(ab+f)+2fT$ |
Augmented ThriftyNet | Classical | ${f}^{2}ab+2fT+hT$ |
Augmented ThriftyNet | Grouped | $f(ab+f)+2fT+hT$ |

**Table 2.** Impact of the chosen data augmentation technique on test set accuracy, for augmented ThriftyNets with 40 k parameters and 20 iterations trained on CIFAR-10.

Data Augmentation | Test Accuracy |
---|---|
Standard (crops and horizontal flips) | 90.64% |
Standard + AutoAugment [38] | 91.00% |
Standard + Cutout size 8 [39] | 90.40% |
Standard + mixup [40] | 88.09% |
Standard + CutMix [41] | 88.47% |

**Table 3.**Comparative results on CIFAR-10. Experiments were performed 5 times and the interval is the observed standard deviation.

Model | Parameters | Macs | Test Accuracy |
---|---|---|---|
ResNet-110 [26] | 1.7 M | 250 M | 93.57% |
FitNet [21] | 1.6 M | - | 91.10% |
SANet [13] | 980 k | <42 M | 95.52% |
Pruned ShuffleNets [43] | 879 k | <4 M | 93.05% |
DenseNet-BC [44] | 800 k | 129 M | 95.49% |
Pruned MobileNet [43] | 671 k | <7.8 M | 91.53% |
Wide-ResNet [27] | 564 k | 84.3 M | 93.15% |
IRKNets [45] | 320 k | - | 92.82% |
ResNet-20 [26] | 270 k | 40 M | 91.25% |
3-state Recurrent ResNet [35] | 121 k | - | 92.53% |
Fully Recurrent ResNet [35] | 39.7 k | - | 87% |
Tiny ResNet (ours) | 43.6 k | 6.8 M | 86.72% |
Tiny DenseNet-BC (ours) | 39.6 k | 6 M | 87.81% |
ThriftyNet h = 5, T = 15 (ours) | 39.6 k | 130 M | 90.15 ± 0.42% |
ThriftyNet h = 5, T = 45 (ours) | 39.6 k | 300 M | 90.95 ± 0.45% |

**Table 4.** Comparative results on CIFAR-100. Experiments were performed 5 times and the interval is the observed standard deviation.

Model | Parameters | Macs | Test Accuracy |
---|---|---|---|
FitNet [21] | 2.5 M | - | 64.96% |
ResNet-164 [26] | 1.7 M | 257 M | 75.67% |
IRKNets [45] | 1.4 M | - | 79.15% |
SANet [13] | 1.01 M | 42 M | 77.39% |
DenseNet-BC [44] | 800 k | 129 M | 77.73% |
Wide-ResNet [27] | 600 k | 84.3 M | 69.11% |
ThriftyNet h = 5, T = 40 (ours) | 600 k | 2740 M | 74.37% |

**Table 5.**Comparative results on ImageNet ILSVRC 2012, considering Top 1 Accuracy (Top1A), Top 5 Accuracy (Top5A), number of parameters and number of operations (Macs).

Model | Parameters | Macs | Top1A | Top5A |
---|---|---|---|---|
ResNet18 | 11.6 M | 1816 M | 69.7% | 89.8% |
small-ResNet18 | 6.7 M | 1044 M | 66.7% | 87.5% |
ResNet18-ThriftyNet h = 1, T = 3, B = 2 (ours) | 6.84 M | 1810 M | 68.4% | 88.7% |
ResNet18-ThriftyNet h = 1, T = 7, B = 4 (ours) | 4.15 M | 4347 M | 67.1% | 87.5% |

**Table 6.** Test accuracy on CIFAR-10 when freezing the shortcut parameters in an augmented ThriftyNet, under settings (a–c) of Section 4.8.

Model | Test Accuracy |
---|---|
Baseline accuracy | 91.08% |
After binarization and fine tuning (a) | 88.50% |
After training from the same initialization (b) | 90.47% |
After training from another initialization (c) | 89.98% |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Coiffier, G.; Boukli Hacene, G.; Gripon, V.
ThriftyNets: Convolutional Neural Networks with Tiny Parameter Budget. *IoT* **2021**, *2*, 222-235.
https://doi.org/10.3390/iot2020012

**AMA Style**

Coiffier G, Boukli Hacene G, Gripon V.
ThriftyNets: Convolutional Neural Networks with Tiny Parameter Budget. *IoT*. 2021; 2(2):222-235.
https://doi.org/10.3390/iot2020012

**Chicago/Turabian Style**

Coiffier, Guillaume, Ghouthi Boukli Hacene, and Vincent Gripon.
2021. "ThriftyNets: Convolutional Neural Networks with Tiny Parameter Budget" *IoT* 2, no. 2: 222-235.
https://doi.org/10.3390/iot2020012