# Mutual Information Based Learning Rate Decay for Stochastic Gradient Descent Training of Deep Neural Networks

## Abstract


## 1. Introduction

## 2. Related Work

- An MI-based automation of decaying-LR SGD training of neural network models that adaptively sets the LR, layer-wise or for the whole network, throughout the training cycle.
- An LR Range Test that defines the broad LR bounds within which the proposed algorithm operates.
- Evaluation of the proposed algorithm against state-of-the-art alternatives on a range of data sets and models, demonstrating the viability of using MI to automate decaying-LR SGD training.
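The first contribution can be pictured with a small sketch. This is an illustration of the general idea only, not the paper's Algorithm 1: it assumes the measured MI between model outputs and true labels is mapped linearly onto an LR inside the bounds `[lr_min, lr_max]` produced by an LR Range Test; the function name `mi_to_lr` and the linear mapping are assumptions for illustration.

```python
# Hedged sketch of MI-driven LR decay (an illustration, not the paper's
# Algorithm 1): each epoch, an MI estimate between model outputs and true
# labels would be computed on a fixed sample, and rising MI is mapped to
# a decaying LR inside the bounds [lr_min, lr_max] from an LR Range Test.
import math

def mi_to_lr(mi, mi_max, lr_min, lr_max):
    """Decay the LR from lr_max toward lr_min as MI approaches its
    attainable maximum (e.g. log of the number of classes)."""
    frac = min(max(mi / mi_max, 0.0), 1.0)  # training-progress proxy
    return lr_max - frac * (lr_max - lr_min)

lr_min, lr_max = 1e-4, 0.2       # example bounds from an LR Range Test
mi_max = math.log(10)            # I(Y; Yhat) <= log K for K = 10 classes
for mi in (0.3, 1.2, 2.0, 2.3):  # MI estimates as training progresses
    print(f"MI={mi:.1f} -> LR={mi_to_lr(mi, mi_max, lr_min, lr_max):.4f}")
```

As MI saturates toward its upper bound, the schedule bottoms out at `lr_min`, which is the qualitative behaviour a decaying-LR policy needs.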

## 3. Approach

Algorithm 1: MI-based decaying LR SGD

Algorithm 2: LR Range Test
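The general shape of an LR Range Test (in the spirit of Smith's cyclical-LR work cited in the references) can be sketched as follows. This is a toy illustration, not the paper's Algorithm 2: the quadratic loss, the sweep bounds, and the 1% improvement margin are all assumptions.

```python
# Illustrative LR Range Test sketch: sweep the LR geometrically from a
# small to a large value over short runs, and keep the band of LRs for
# which the loss still decreases. Toy quadratic objective in plain NumPy;
# the loss model and margin are assumptions, not the paper's Algorithm 2.
import numpy as np

def loss_after_short_run(lr, steps=50):
    """Run SGD on f(w) = 0.5 * w^2 from w = 5; diverges for lr >= 2."""
    w = 5.0
    for _ in range(steps):
        w -= lr * w  # gradient of 0.5 * w^2 is w
    return 0.5 * w * w

lrs = np.geomspace(1e-4, 10.0, num=30)
losses = np.array([loss_after_short_run(lr) for lr in lrs])

# Keep LRs whose short-run loss beats the initial loss by a small margin;
# the min/max of that set give broad [lr_min, lr_max] bounds.
initial_loss = 0.5 * 5.0**2
usable = lrs[losses < 0.99 * initial_loss]
print(f"LR bounds: [{usable.min():.2e}, {usable.max():.2e}]")
```

On this toy objective the test correctly excludes the divergent regime (lr above 2) and very small LRs that barely move the loss, leaving a broad usable band.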

## 4. Experiments

## 5. Discussion

## 6. Conclusions

## Funding

## Acknowledgments

## Conflicts of Interest

## References

1. He, X.; Zhao, K.; Chu, X. AutoML: A Survey of the State-of-the-Art. arXiv **2019**, arXiv:1908.00709.
2. Bottou, L. Online Algorithms and Stochastic Approximations. In Online Learning and Neural Networks; Cambridge University Press: Cambridge, UK, 1998.
3. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity mappings in deep residual networks. arXiv **2016**, arXiv:1603.05027.
4. Goyal, P.; Dollár, P.; Girshick, R.; Noordhuis, P.; Wesolowski, L.; Kyrola, A.; Tulloch, A.; Jia, Y.; He, K. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv **2017**, arXiv:1706.02677.
5. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley: Hoboken, NJ, USA, 2006.
6. Ruder, S. An overview of gradient descent optimization algorithms. arXiv **2016**, arXiv:1609.04747.
7. Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. **2011**, 12, 2121–2159.
8. Zeiler, M. ADADELTA: An Adaptive Learning Rate Method. arXiv **2012**, arXiv:1212.5701.
9. Tieleman, T.; Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. In Neural Networks for Machine Learning; COURSERA: Mountain View, CA, USA, 2012.
10. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv **2014**, arXiv:1412.6980.
11. Rolinek, M.; Martius, G. L4: Practical loss-based stepsize adaptation for deep learning. arXiv **2018**, arXiv:1802.05074.
12. You, Y.; Gitman, I.; Ginsburg, B. Large Batch Training of Convolutional Networks. arXiv **2017**, arXiv:1708.03888.
13. Shamir, O.; Sabato, S.; Tishby, N. Learning and generalization with the Information Bottleneck. Theor. Comput. Sci. **2010**, 411, 2696–2711.
14. Hu, B.G.; He, R.; Yuan, X.T. Information-Theoretic Measures for Objective Evaluation of Classifications. arXiv **2011**, arXiv:1107.1837.
15. Meyen, S. Relation between Classification Accuracy and Mutual Information in Equally Weighted Classification Tasks. Master's Thesis, University of Hamburg, Hamburg, Germany, 2016.
16. Tishby, N.; Zaslavsky, N. Deep learning and the information bottleneck principle. In Proceedings of the IEEE Information Theory Workshop (ITW), Jerusalem, Israel, 26 April–1 May 2015.
17. Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. In Proceedings of the 37th Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 22–24 September 1999.
18. Vasudevan, S. Dynamic Learning Rate using Mutual Information. arXiv **2018**, arXiv:1805.07249.
19. Fang, H.; Wang, V.; Yamaguchi, M. Dissecting Deep Learning Networks–Visualising Mutual Information. Entropy **2018**, 20, 823.
20. Hjelm, R.D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Bachman, P.; Trischler, A.; Bengio, Y. Learning Deep Representations by Mutual Information estimation and maximization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
21. Smith, L.N. Cyclical Learning Rates for Training Neural Networks. arXiv **2015**, arXiv:1506.01186.
22. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE **1998**, 86, 2278–2324.
23. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009.
24. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. (IJCV) **2015**, 115, 211–252.
25. Springenberg, J.; Dosovitskiy, A.; Brox, T.; Riedmiller, M. Striving for Simplicity: The All Convolutional Net. arXiv **2014**, arXiv:1412.6806.
26. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
27. Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E **2004**, 69, 066138.
28. Zagoruyko, S.; Komodakis, N. Wide Residual Networks. arXiv **2016**, arXiv:1605.07146.
29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. arXiv **2015**, arXiv:1512.03385.
30. Wilson, A.; Roelofs, R.; Stern, M.; Srebro, N.; Recht, B. The marginal value of Adaptive Gradient methods in Machine Learning. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017.
31. Keskar, N.; Socher, R. Improving generalization performance by switching from Adam to SGD. arXiv **2017**, arXiv:1712.07628.
32. Cho, D.; Yoo, C.; Im, J.; Cha, D. Comparative assessment of various machine learning-based bias correction methods for numerical weather prediction model forecasts of extreme air temperatures in urban areas. Earth Space Sci. **2020**, 7, e2019EA000740.

**Figure 1.** Mutual Information (MI) (of input and output training data) vs. sample size for MNIST (**left**) and CIFAR-10 (**right**), as computed using the KSG estimator. The plots show the estimated mean and standard deviation (error bar) for each sample size tested. A sample size of 1000 was chosen for MI computation in the experiments of this paper, selected as a trade-off between the computational cost of computing MI and the variation in estimates. A sample-size sensitivity test using CIFAR-10 is described in the experiments.
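The sample-size trade-off above can be reproduced in miniature with a kNN-based MI estimator. The sketch below uses scikit-learn's `mutual_info_classif` (a nearest-neighbour estimator in the KSG family) on synthetic data; the synthetic features, the per-feature-sum proxy for total MI, and the sample sizes tried are assumptions standing in for the paper's MNIST/CIFAR-10 setup.

```python
# Sketch: estimate MI between inputs and class labels for increasing
# sample sizes, mirroring the sample-size trade-off study. Uses
# scikit-learn's kNN-based estimator (KSG family); the synthetic data
# below stands in for MNIST/CIFAR-10 and is an assumption.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n_total, n_features = 5000, 16
X = rng.normal(size=(n_total, n_features))
y = (X[:, 0] > 0).astype(int)  # labels correlated with one feature

for n in (100, 500, 1000, 2000):
    idx = rng.choice(n_total, size=n, replace=False)
    # MI per feature (nats); the sum is a crude proxy for I(X; Y)
    mi = mutual_info_classif(X[idx], y[idx], n_neighbors=3, random_state=0)
    print(f"sample size {n}: total MI estimate {mi.sum():.3f}")
```

Running a sweep like this at several random seeds gives the mean-and-error-bar picture of the figure: estimates stabilise as the sample size grows, while cost grows with it, motivating an intermediate choice such as 1000.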

**Figure 3.** MNIST (Model based on LeNet-5): Accuracy and LR plots using the proposed approach and a single model-level LR in [0.0001, 0.2]. The proposed approach produced a best test accuracy of 99.27% in 50 epochs, compared to the best alternative of 99.53% obtained using both Adam and RMSprop. Best outcomes from 3 random seed runs are reported.

**Figure 4.** MNIST (Model based on LeNet-5): Accuracy and LR plots using the proposed approach and layer-wise LR in [0.0001, 0.2]. The proposed approach produced a best test accuracy of 99.39% in 50 epochs, compared to the best alternative of 99.43% obtained using LARS. Best outcomes from 3 random seed runs are reported.

**Figure 5.** CIFAR10 (AllConvNet): Accuracy and LR plots using the proposed approach and a single model-level LR in [0.00075, 0.04]. The proposed approach produced a best test accuracy of 88.86% in 350 epochs, compared to the best alternative of 88.13% obtained using RMSprop. Best outcomes from 3 random seed runs are reported.

**Figure 6.** CIFAR10 (AllConvNet): Accuracy and LR plots using the proposed approach and layer-wise LR in [0.00075, 0.04]. The proposed approach produced a best test accuracy of 87.77% in 350 epochs, compared to the best alternative of 77.03% obtained using LARS. Best outcomes from 3 random seed runs are reported.

**Figure 7.** CIFAR10 (VGG16): Accuracy and LR plots using the proposed approach and a single model-level LR in [0.0003, 0.07]. The proposed approach produced a best test accuracy of 92.21% in 200 epochs, compared to the best alternative of 92.58% obtained using SGD with a fixed LR decay policy. Best outcomes from 3 random seed runs are reported.

**Figure 8.** CIFAR100 (Wide-Resnet-28-10): Accuracy and LR plots using the proposed approach and a single model-level LR in [0.0003, 0.07]. The proposed approach produced a best test accuracy of 81.25% in 200 epochs, compared to the best alternative of 81.76% obtained using SGD with a fixed LR decay policy. The proposed approach reached top-level accuracies 10–15% faster than the alternative. Best outcomes from 3 random seed runs are reported.

**Figure 9.** ImageNet-1K (Resnet-50): Accuracy and LR plots using the proposed approach and a single model-level LR in [0.0005, 0.1]. The proposed approach produced a best test accuracy of 76.05% in 100 epochs, compared to the best alternative of 75.57% obtained using SGD with a fixed LR decay policy. Best outcomes from 3 random seed runs are reported.

**Figure 10.** Results of the application of the proposed approach to a temperature prediction (regression problem) data set [32]. The proposed approach produced a competitive Mean Absolute Error (MAE) of 1.32 °C in comparison to the best alternative approach (Adam), which produced an MAE of 0.97 °C. Reported numbers are best outcomes of three random seed runs.

© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Vasudevan, S.
Mutual Information Based Learning Rate Decay for Stochastic Gradient Descent Training of Deep Neural Networks. *Entropy* **2020**, *22*, 560.
https://doi.org/10.3390/e22050560

**AMA Style**

Vasudevan S.
Mutual Information Based Learning Rate Decay for Stochastic Gradient Descent Training of Deep Neural Networks. *Entropy*. 2020; 22(5):560.
https://doi.org/10.3390/e22050560

**Chicago/Turabian Style**

Vasudevan, Shrihari.
2020. "Mutual Information Based Learning Rate Decay for Stochastic Gradient Descent Training of Deep Neural Networks" *Entropy* 22, no. 5: 560.
https://doi.org/10.3390/e22050560