A Bounded Scheduling Method for Adaptive Gradient Methods
Abstract
1. Introduction
- We investigate the factors that lead to the poor performance of Adam while training complex neural networks.
- We set effective bounds for the learning rate of Adam without manual tuning, which can improve the generalization capability.
- We schedule the bounds of the learning rate to improve the performance of Adam. First, we fix the upper bound and gradually increase the lower bound to find wide, flat minima. Then, we fix the lower bound and gradually decrease the upper bound to ensure that training converges. Finally, a fixed learning rate is used so that the algorithm converges to the optimal solution (see the sketch after this list).
- We evaluate the algorithm on multiple tasks and models: simple neural networks on MNIST [20]; ResNet [22], DenseNet [23], and SENet [24] on CIFAR-10 [21] and CIFAR-100 [21]; and LSTMs [26] on Penn Treebank [25]. All these experiments show that our method closes the generalization gap between Adam and SGD while maintaining a higher convergence speed in training.
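To make the three-phase schedule concrete, the following is a minimal Python sketch under stated assumptions: the helper name `bound_schedule`, the equal-length phases, and the linear interpolation toward a single target rate `lr_target` are illustrative choices, not the paper's exact settings.

```python
def bound_schedule(epoch, total_epochs, lower_0, upper_0, lr_target):
    """Return (lower, upper) bounds on the step size for the given epoch.

    Phase 1: keep the upper bound fixed and raise the lower bound
             from lower_0 toward lr_target (searching for flat minima).
    Phase 2: keep the lower bound fixed and decrease the upper bound
             from upper_0 toward lr_target (ensuring convergence).
    Phase 3: both bounds equal lr_target, i.e. a fixed learning rate.
    """
    p1, p2 = total_epochs // 3, 2 * total_epochs // 3   # assumed phase boundaries
    if epoch < p1:                                       # phase 1: linear rise of the lower bound
        t = epoch / p1
        return lower_0 + t * (lr_target - lower_0), upper_0
    if epoch < p2:                                       # phase 2: linear decrease of the upper bound
        t = (epoch - p1) / (p2 - p1)
        return lr_target, upper_0 + t * (lr_target - upper_0)
    return lr_target, lr_target                          # phase 3: fixed learning rate
```

For instance, `bound_schedule(0, 90, 0.01, 0.5, 0.1)` returns (0.01, 0.5), the same kind of (lower, upper) pair reported in the experiments table; here 0.1 is a hypothetical target rate.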
2. Background
2.1. Traditional Learning Rate Methods
2.2. Adaptive Gradient Methods
3. Bsadam: Bounded Scheduling Method for Adam
3.1. Preliminaries
- The learning rates adapted by Adam can, in some circumstances, become too small for effective convergence, causing the optimizer to follow a poor trajectory and converge to a suboptimal point [13].
3.2. Specify Bounds for Adam
3.3. Schedule Bounds for Adam
3.3.1. Finding Minima
- linear rise:
- exponential rise:
- trigonometric rise:
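The rise rules above are listed without their formulas in this outline; the snippet below sketches one plausible set of linear, exponential, and trigonometric interpolations of the lower bound from a start value `lo` to an end value `hi` over `T` steps. The function names and exact forms are assumptions for illustration.

```python
import math

def linear_rise(t, T, lo, hi):
    # lower bound grows at a constant rate from lo to hi
    return lo + (hi - lo) * t / T

def exponential_rise(t, T, lo, hi):
    # lower bound grows geometrically from lo to hi (requires lo, hi > 0)
    return lo * (hi / lo) ** (t / T)

def trigonometric_rise(t, T, lo, hi):
    # lower bound follows a half-cosine from lo to hi (slow start and finish)
    return lo + (hi - lo) * (1 - math.cos(math.pi * t / T)) / 2
```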
3.3.2. Converging
- linear decrease:
- exponential decrease:
- trigonometric decrease:
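Analogously, the decrease rules for the upper bound can be sketched as mirror images of the rise rules; again, these forms are illustrative assumptions rather than the paper's exact equations.

```python
import math

def linear_decrease(t, T, hi, lo):
    # upper bound shrinks at a constant rate from hi to lo
    return hi - (hi - lo) * t / T

def exponential_decrease(t, T, hi, lo):
    # upper bound shrinks geometrically from hi to lo (requires lo, hi > 0)
    return hi * (lo / hi) ** (t / T)

def trigonometric_decrease(t, T, hi, lo):
    # upper bound follows a half-cosine from hi down to lo
    return lo + (hi - lo) * (1 + math.cos(math.pi * t / T)) / 2
```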
3.3.3. Uniform Scaling
3.4. Algorithm Overview
Algorithm 1 Bsadam.
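Since the pseudocode is not reproduced here, the following is a rough Python sketch of what one Bsadam-style step could look like: an Adam update whose per-parameter step size is clipped into the scheduled interval [lower, upper], in the spirit of the dynamic-bound clipping of [13]. The function `bsadam_step` and its signature are illustrative, not the authors' implementation.

```python
import numpy as np

def bsadam_step(param, grad, m, v, t, lower, upper,
                base_lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Illustrative Bsadam-style update (not the authors' exact code):
    an Adam step whose effective per-parameter step size is clipped into
    [lower, upper], with the bounds supplied by the schedule of Section 3.3."""
    m = beta1 * m + (1 - beta1) * grad            # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                  # bias-corrected second moment
    step = base_lr / (np.sqrt(v_hat) + eps)       # Adam's per-parameter step size
    step = np.clip(step, lower, upper)            # apply the scheduled bounds
    param = param - step * m_hat
    return param, m, v
```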
4. Experiments
4.1. Experimental Setup
- SGD(M) We used SGDM for the image classification tasks and plain SGD for the language modeling tasks. When using SGDM, we set the momentum parameter to the default value of 0.9. We first roughly tuned the learning rate for SGD(M) on a logarithmic scale and then fine-tuned it.
- Adagrad The general learning rates used for Adagrad were chosen from {0.0005, 0.001, 0.005, 0.01, 0.1}.
- Adam The general learning rates used for Adam were chosen from {0.0005, 0.001, 0.005, 0.01, 0.05}. We set (β1, β2) to the recommended default values (0.9, 0.999). The perturbation value ε was kept at its default.
- Bsadam We used the same hyper-parameter settings as for Adam. The upper and lower bounds were determined by a learning rate range test (sketched below), and the bounds were increased and decreased linearly.
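For reference, a learning rate range test in the style of Smith [11] can be sketched as below. This is a generic PyTorch-flavored illustration, where `model`, `criterion`, and `loader` are hypothetical placeholders, not the authors' script.

```python
import torch

def lr_range_test(model, criterion, loader, lr_start=1e-5, lr_end=1.0, steps=100):
    """Sweep the learning rate geometrically from lr_start to lr_end, recording
    the loss at each step; the bounds are then read off the region of learning
    rates over which the loss still decreases. Assumes loader yields at least
    `steps` batches."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_start)
    gamma = (lr_end / lr_start) ** (1.0 / steps)       # multiplicative LR increment
    history = []
    data_iter = iter(loader)
    for _ in range(steps):
        inputs, targets = next(data_iter)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        lr = optimizer.param_groups[0]["lr"]
        history.append((lr, loss.item()))
        optimizer.param_groups[0]["lr"] = lr * gamma   # raise the LR for the next batch
    return history
```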
4.2. Image Classification
4.2.1. Simple Neural Network
4.2.2. Deep Convolutional Network
4.3. Language Modeling
4.4. Comparison of Different Scheduling Methods
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117.
- Seo, S.; Kim, J. Efficient Weights Quantization of Convolutional Neural Networks Using Kernel Density Estimation based Non-uniform Quantizer. Appl. Sci. 2019, 9, 2559.
- Song, K.; Yang, H.; Yin, Z. Multi-Scale Attention Deep Neural Network for Fast Accurate Object Detection. IEEE Trans. Circuits Syst. Video Technol. 2018, 1.
- Maas, A.L.; Qi, P.; Xie, Z.; Hannun, A.Y.; Lengerich, C.T.; Jurafsky, D.; Ng, A.Y. Building DNN acoustic models for large vocabulary speech recognition. Comput. Speech Lang. 2017, 41, 195–213.
- Hinton, G.; Deng, L.; Yu, D.; Dahl, G.; Mohamed, A.R.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Kingsbury, B.; et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Process. Mag. 2012, 29, 82–97.
- Violante, M.G.; Marcolin, F.; Vezzetti, E.; Ulrich, L.; Billia, G.; Di Grazia, L. 3D Facial Expression Recognition for Defining Users’ Inner Requirements—An Emotional Design Case Study. Appl. Sci. 2019, 9, 2218.
- Zhang, J.; Zong, C. Deep Neural Networks in Machine Translation: An Overview. IEEE Intell. Syst. 2015, 30, 16–25.
- Robbins, H.; Monro, S. A Stochastic Approximation Method. Ann. Math. Stat. 1951, 22, 400–407.
- Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159.
- Nocedal, J.; Wright, S. Numerical Optimization; Springer Science & Business Media: Berlin, Germany, 2006.
- Smith, L.N. Cyclical learning rates for training neural networks. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; pp. 464–472.
- Smith, L.N.; Topin, N. Super-convergence: Very fast training of neural networks using large learning rates. Artif. Intell. Mach. Learn. Multi-Domain Oper. Appl. 2019, 11006, 1100612.
- Luo, L.; Xiong, Y.; Liu, Y.; Sun, X. Adaptive gradient methods with dynamic bound of learning rate. arXiv 2019, arXiv:1902.09843.
- Zeiler, M.D. ADADELTA: An adaptive learning rate method. arXiv 2012, arXiv:1212.5701.
- Tieleman, T.; Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA Neural Netw. Mach. Learn. 2012, 4, 26–31.
- Kingma, D.P.; Ba, J.L. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015.
- Dozat, T. Incorporating Nesterov Momentum into Adam. ICLR Workshop 2016, 1, 2013–2016.
- Reddi, S.J.; Kale, S.; Kumar, S. On the Convergence of Adam and Beyond. arXiv 2019, arXiv:1904.09237.
- Keskar, N.S.; Socher, R. Improving generalization performance by switching from Adam to SGD. arXiv 2017, arXiv:1712.07628.
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
- Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
- Iandola, F.; Moskewicz, M.; Karayev, S.; Girshick, R.; Darrell, T.; Keutzer, K. Densenet: Implementing efficient convnet descriptor pyramids. arXiv 2014, arXiv:1404.1869.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
- Marcus, M.P.; Marcinkiewicz, M.A.; Santorini, B. Building a large annotated corpus of English: The Penn Treebank. Comput. Linguist. 1993, 19, 313–330.
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
- Roux, N.L.; Schmidt, M.; Bach, F. A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets. Adv. Neural Inf. Process. Syst. 2012, 4, 2663–2671.
- Fletcher, R. On the Barzilai-Borwein method. In Optimization and Control with Applications; Springer: Boston, MA, USA, 2005; pp. 235–256.
- Raydan, M. On the Barzilai and Borwein choice of steplength for the gradient method. IMA J. Numer. Anal. 1993, 13, 321–326.
- Massé, P.-Y.; Ollivier, Y. Speed learning on the fly. arXiv 2015, arXiv:1511.02540.
- Chen, X.; Liu, S.; Sun, R.; Hong, M. On the convergence of a class of Adam-type algorithms for non-convex optimization. arXiv 2018, arXiv:1808.02941.
- Wilson, A.C.; Roelofs, R.; Stern, M.; Srebro, N.; Recht, B. The marginal value of adaptive gradient methods in machine learning. Adv. Neural Inf. Process. Syst. 2017, 30, 4148–4158.
- Hardt, M.; Recht, B.; Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. arXiv 2015, arXiv:1509.01240.
- Zhang, R. Making convolutional networks shift-invariant again. arXiv 2019, arXiv:1904.11486.
Data Set | Model | Network Type | SGD(M) LR | Adagrad LR | Adam LR | Bsadam Bounds (Lower, Upper) |
---|---|---|---|---|---|---|
MNIST | 1-Layer Perceptron | Feedforward | 0.1 | 0.001 | 0.001 | (0.01,0.5) |
MNIST | 1-Layer Convolutional | Convolutional | 0.1 | 0.001 | 0.001 | (0.01,0.5) |
CIFAR-10 | ResNet | Deep Convolutional | 0.1 | 0.001 | 0.001 | (0.01,0.5) |
CIFAR-10 | DenseNet | Deep Convolutional | 0.1 | 0.001 | 0.001 | (0.01,0.5) |
CIFAR-10 | SENet | Deep Convolutional | 0.1 | 0.001 | 0.001 | (0.01,0.5) |
CIFAR-100 | ResNet | Deep Convolutional | 0.3 | 0.001 | 0.001 | (0.05,1) |
CIFAR-100 | DenseNet | Deep Convolutional | 0.1 | 0.001 | 0.001 | (0.05,1) |
CIFAR-100 | SENet | Deep Convolutional | 0.1 | 0.001 | 0.001 | (0.01,0.5) |
Penn Treebank | 1-Layer LSTM | Recurrent | 50 | 0.001 | 0.001 | (5,100) |
Penn Treebank | 2-Layer LSTM | Recurrent | 50 | 0.001 | 0.001 | (5,100) |