# Combining Optimization Methods Using an Adaptive Meta Optimizer


## Abstract


## 1. Introduction

- We show experimentally that the combination of two different optimizers in a new meta-optimizer leads to a better generalization capacity in different contexts.
- We describe ATMO using Adam and SGD but show experimentally that other types of optimizers can be profitably combined.
- We release the source code and setups of the experiments [12].

## 2. Related Work

**SWATS** (SWitching from Adam To Sgd) [13] starts training with an adaptive optimization method and switches to SGD when appropriate. The idea comes from the observation that, despite superior training results, adaptive optimization methods such as ADAM generalize poorly compared to SGD: they tend to work well in the early part of training but are overtaken by SGD in the later stages. In concrete terms, SWATS is a simple strategy that switches from ADAM to SGD when an activation condition is met. The experimental results reported in that paper are not very different from those of ADAM or SGD used individually, so the authors concluded that using SGD with well-tuned parameters is the best option. In our proposal, we instead combine two well-known optimizers into a new one that uses both simultaneously from the beginning to the end of the training process.

**ESGD** is a population-based Evolutionary Stochastic Gradient Descent framework for optimizing deep neural networks [14]. In this approach, the individuals in the population, optimized with various SGD-based optimizers using distinct hyper-parameters, are considered competing species in the context of coevolution. The authors experimented with optimizer pools consisting of SGD and ADAM variants, where it is often observed that ADAM tends to be aggressive early on but stabilizes quickly, while SGD starts slowly but can reach a better local minimum. ESGD automatically chooses the appropriate optimizers and their hyper-parameters based on the fitness value during the evolution process, so that the merits of SGD and ADAM can be combined to seek a better local optimum for the problem of interest. The method we propose does not need a separate mechanism, such as an evolutionary one, to decide which optimizer to use and with which hyper-parameters; the method itself determines the contributions of SGD and ADAM at each step.

**ADAMW** [15,16] (ADAM with decoupled Weight decay regularization) is a version of ADAM in which weight decay is decoupled from ${L}_{2}$ regularization. This optimizer offers good generalization performance, especially for text analysis; since we also run experiments on text classification, we compare our optimizer with ADAMW as well. In fact, ADAMW is often used with BERT [17] on well-known text classification datasets.

**Padam** [18] (Partially ADAM) is one of the recent ADAM derivatives that achieves very interesting results. It bridges the generalization gap of adaptive gradient methods by introducing a partial adaptive parameter that controls the level of adaptiveness of the optimization procedure. We mainly use ATMO with a combination of ADAM and SGD, but we also test how the method generalizes by combining Padam and SGD [12] and comparing the result with many other optimizers (Table 1).

## 3. Preliminaries

The **SGD** [1] optimizer can be described as:

**weight decay** [19] strategy, often used in many SGD implementations. The weight decay can be seen as a modification of the gradient $\nabla \mathcal{L}\left(w\right)$, and in particular, we describe it as follows:

**momentum**, and in this case, we refer to it as

**SGD(M)** [20] (Stochastic Gradient Descent with Momentum). SGD(M) almost always works better and faster than SGD because the momentum helps accelerate the gradient vectors in the right direction, thus leading to faster convergence. The iterations of SGD(M) can be described as follows:

**dampening** coefficient [21], which controls the rate at which the momentum vector decays. The dampening coefficient changes the momentum as follows:

**Nesterov** momentum [22] is an extension of the momentum method that approximates the future position of the parameters by taking the current movement into account. SGD with Nesterov momentum again transforms the ${v}_{k}$ of Equation (5); more precisely:

**Algorithm 1** Stochastic Gradient Descent (SGD).

**Input:** the weights ${w}_{k}$, learning rate $\eta$, weight decay $\gamma$, momentum $\mu$, dampening $d$, boolean $nesterov$

- ${v}_{0}=0$
- **function** ${\Delta}_{\mathrm{SGD}}({w}_{k},\nabla ,\gamma ,\mu ,d,nesterov)$
  - $\widehat{\nabla}=\nabla +{w}_{k}\cdot \gamma$
  - **if** $\mu \ne 0$ **then**
    - **if** $k=0$ **then** ${v}_{k}=\widehat{\nabla}$
    - **else** ${v}_{k}={v}_{k-1}\cdot \mu +\widehat{\nabla}\cdot (1-d)$
    - **end if**
    - **if** $nesterov=True$ **then** ${v}_{k}=\widehat{\nabla}+{v}_{k}\cdot \mu$
    - **end if**
  - **end if**
  - **return** ${v}_{k}$
- **end function**
- **for** batches **do**
  - ${w}_{k+1}={w}_{k}-\eta \cdot {\Delta}_{\mathrm{SGD}}({w}_{k},\nabla ,\gamma ,\mu ,d,nesterov)$
- **end for**
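The steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration, not the released implementation [12]: the function name `sgd_direction`, the `state` dictionary, and the list-of-floats representation of the weights are hypothetical choices made here. Two further assumptions are labeled in the comments: with zero momentum the function returns the decayed gradient, and with Nesterov momentum the stored buffer keeps its pre-Nesterov value, as in common SGD implementations.

```python
def sgd_direction(w, grad, state, weight_decay=0.0, momentum=0.0,
                  dampening=0.0, nesterov=False):
    """Return the SGD step direction for one list of weights."""
    # Weight decay folded into the gradient: g_hat = grad + w * gamma
    g_hat = [g + wi * weight_decay for g, wi in zip(grad, w)]
    if momentum == 0.0:
        # Assumption: without momentum the direction is the decayed gradient
        return g_hat
    if "v" not in state:
        # First call (k = 0): the momentum buffer starts as the decayed gradient
        v = list(g_hat)
    else:
        v = [vi * momentum + g * (1.0 - dampening)
             for vi, g in zip(state["v"], g_hat)]
    # Assumption: the buffer stores the pre-Nesterov value
    state["v"] = v
    if nesterov:
        return [g + vi * momentum for g, vi in zip(g_hat, v)]
    return v


# Toy usage: two steps with a constant, made-up gradient
state = {}
w = [1.0, -2.0]
grad = [0.5, 0.5]
lr = 0.1
for _ in range(2):
    step = sgd_direction(w, grad, state, momentum=0.9)
    w = [wi - lr * si for wi, si in zip(w, step)]
```

Keeping the direction separate from the learning-rate multiplication mirrors the split between ${\Delta}_{\mathrm{SGD}}$ and the weight update in the algorithm, which is what later allows the direction to be mixed with another optimizer's output.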

The **ADAM** [2] (ADAptive Moment estimation) optimization algorithm is an extension to SGD that has recently seen broader adoption for deep learning applications in computer vision and natural language processing. ADAM’s equation for updating the weights of a neural network by iterating over the training data can be represented as follows:

**AMSGrad** [23] is a stochastic optimization method that seeks to fix a convergence issue with ADAM-based optimizers. AMSGrad uses the maximum of past squared gradients ${v}_{k-1}$ rather than the exponential average to update the parameters:

**Algorithm 2** ADAptive Moment estimation (ADAM).

**Input:** the weights ${w}_{k}$, learning rate $\eta$, weight decay $\gamma$, ${\beta}_{1}$, ${\beta}_{2}$, $\epsilon$, boolean $amsgrad$

- ${m}_{0}=0$, ${v}_{0}^{a}=0$, ${\widehat{v}}_{0}=0$
- **function** ${\Delta}_{\mathrm{ADAM}}({w}_{k},\nabla ,\eta ,\gamma ,{\beta}_{1},{\beta}_{2},\epsilon ,amsgrad)$
  - $\widehat{\nabla}=\nabla +{w}_{k}\cdot \gamma$
  - ${m}_{k}={m}_{k-1}\cdot {\beta}_{1}+\widehat{\nabla}\cdot (1-{\beta}_{1})$
  - ${v}_{k}^{a}={v}_{k-1}^{a}\cdot {\beta}_{2}+\widehat{\nabla}\cdot \widehat{\nabla}\cdot (1-{\beta}_{2})$
  - **if** $amsgrad=True$ **then**
    - ${\widehat{v}}_{k}=\mathrm{max}({\widehat{v}}_{k-1},{v}_{k}^{a})$
    - $denom=\frac{\sqrt{{\widehat{v}}_{k}}}{\sqrt{1-{\beta}_{2}}+\epsilon}$
  - **else**
    - $denom=\frac{\sqrt{{v}_{k}^{a}}}{\sqrt{1-{\beta}_{2}}+\epsilon}$
  - **end if**
  - ${\eta}_{a}=\frac{\eta}{1-{\beta}_{1}}$
  - ${d}_{k}=\frac{{m}_{k}}{denom}$
  - **return** ${d}_{k},{\eta}_{a}$
- **end function**
- **for** batches **do**
  - ${d}_{k},{\eta}_{a}={\Delta}_{\mathrm{ADAM}}({w}_{k},\nabla ,\eta ,\gamma ,{\beta}_{1},{\beta}_{2},\epsilon ,amsgrad)$
  - ${w}_{k+1}={w}_{k}-{\eta}_{a}\cdot {d}_{k}$
- **end for**
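A plain-Python sketch in the spirit of Algorithm 2 is given below. It is an illustration under stated assumptions rather than the released code [12]: the name `adam_direction`, the `state` dictionary, and the list-of-floats weights are hypothetical. The bias-correction terms use the step exponent ($1-{\beta}_{1}^{k}$ and $1-{\beta}_{2}^{k}$) and add $\epsilon$ after the division, as in the standard ADAM formulation [2]; both choices are assumptions relative to the pseudocode above, which writes these terms without the exponent.

```python
import math


def adam_direction(w, grad, state, lr, weight_decay=0.0,
                   beta1=0.9, beta2=0.999, eps=1e-8, amsgrad=False):
    """Return (direction d_k, corrected learning rate eta_a) for one step."""
    k = state.get("step", 0) + 1
    state["step"] = k
    n = len(w)
    # Weight decay folded into the gradient, as in the SGD case
    g_hat = [g + wi * weight_decay for g, wi in zip(grad, w)]
    # Exponential moving averages of the gradient and of its square
    m = [mi * beta1 + g * (1 - beta1)
         for mi, g in zip(state.get("m", [0.0] * n), g_hat)]
    v = [vi * beta2 + g * g * (1 - beta2)
         for vi, g in zip(state.get("v", [0.0] * n), g_hat)]
    state["m"], state["v"] = m, v
    if amsgrad:
        # AMSGrad keeps the running maximum of the second-moment estimate
        v_max = [max(a, b) for a, b in zip(state.get("v_max", [0.0] * n), v)]
        state["v_max"] = v_max
        second = v_max
    else:
        second = v
    # Assumption: step-dependent bias correction, eps added after the division
    bias1 = 1 - beta1 ** k
    bias2 = 1 - beta2 ** k
    denom = [math.sqrt(s) / math.sqrt(bias2) + eps for s in second]
    d = [mi / di for mi, di in zip(m, denom)]
    return d, lr / bias1


# Toy usage: on the first step the update size is close to lr
d, eta_a = adam_direction([1.0], [0.5], {}, lr=0.001)
first_update = eta_a * d[0]
```

Returning the direction and the corrected learning rate separately, as the algorithm does, keeps the two quantities available for a later weighted combination.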

## 4. Proposed Approach

#### 4.1. ADAM Component

#### 4.2. SGD Component

#### 4.3. The ATMO Optimizer

**Algorithm 3** ATMO on mixing ADAM and SGD.

**Input:** the weights ${w}_{k}$, ${\lambda}_{a}$, ${\lambda}_{s}$, learning rate $\eta$, weight decay $\gamma$, other SGD and ADAM parameters …

- **for** batches **do**
  - ${d}_{k},{\eta}_{a}={\Delta}_{\mathrm{ADAM}}({w}_{k},\nabla ,\eta ,\gamma ,\dots )$
  - ${v}_{{n}_{k}}={\Delta}_{\mathrm{SGD}}({w}_{k},\nabla ,\gamma ,\dots )$
  - $merged={\lambda}_{s}\cdot {v}_{{n}_{k}}+{\lambda}_{a}\cdot {d}_{k}$
  - ${\eta}_{m}={\lambda}_{s}\cdot \eta +{\lambda}_{a}\cdot {\eta}_{a}$
  - ${w}_{k+1}={w}_{k}-{\eta}_{m}\cdot merged$
- **end for**
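The mixing step of Algorithm 3 can be sketched as below. The function `atmo_step` and its argument names are hypothetical; it takes the two directions and the ADAM learning rate as already-computed inputs so that the example stays self-contained, and the numbers in the usage lines are made up for illustration.

```python
def atmo_step(w, sgd_dir, adam_dir, lr, adam_lr, lam_s=0.5, lam_a=0.5):
    """Apply one ATMO update to a list of weights.

    sgd_dir  -- direction returned by the SGD component (v_{n_k})
    adam_dir -- direction returned by the ADAM component (d_k)
    lr       -- base learning rate (eta)
    adam_lr  -- learning rate returned by the ADAM component (eta_a)
    """
    # Weighted sum of the two directions
    merged = [lam_s * s + lam_a * a for s, a in zip(sgd_dir, adam_dir)]
    # Weighted sum of the two learning rates
    lr_m = lam_s * lr + lam_a * adam_lr
    return [wi - lr_m * mi for wi, mi in zip(w, merged)]


# With lam_s = 1 and lam_a = 0 the update reduces to a plain SGD step
only_sgd = atmo_step([1.0], [0.5], [0.1], lr=0.1, adam_lr=1.0,
                     lam_s=1.0, lam_a=0.0)
# Equal weights mix both directions and both learning rates
mixed = atmo_step([1.0], [0.5], [0.1], lr=0.1, adam_lr=1.0)
```

Setting $({\lambda}_{a},{\lambda}_{s})$ to $(1,0)$ or $(0,1)$ recovers the pure ADAM and pure SGD updates, which matches how the Adam and SGD rows are parameterized in Table 2.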

**Theorem 1.** If ADAM and SGD are two optimizers whose convergence is guaranteed, then the Cauchy necessary convergence condition is also true for ATMO.

**Proof.**

**Theorem 2.**

**Proof.**

#### 4.4. Geometric Explanation

#### 4.5. Toy Examples

#### 4.6. Dynamic $\lambda $

## 5. Datasets

The **Cifar10** [26] dataset consists of 60,000 images divided into 10 classes (6000 per class), with a training set of 50,000 images and a test set of 10,000 images. Each input sample is a low-resolution color image of size $32\times 32$. The 10 classes are airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks.

The **Cifar100** [26] dataset consists of 60,000 images divided into 100 classes (600 per class), with a training set of 50,000 images and a test set of 10,000 images. Each input sample is a low-resolution color image of size $32\times 32$.

The **Corpus of Linguistic Acceptability** (CoLA) [27] is another dataset; it contains 9594 sentences in the training and validation sets and excludes 1063 sentences belonging to a held-out test set. In our experiment, we only used the training and test sets.

## 6. Experiments

#### 6.1. Experiments with Images

#### 6.2. Experiments with Text Documents

#### 6.3. Time Analysis

## 7. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## References

1. Robbins, H.; Monro, S. A stochastic approximation method. Ann. Math. Stat. **1951**, 22, 400–407.
2. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv **2014**, arXiv:1412.6980.
3. Zaheer, M.; Reddi, S.; Sachan, D.; Kale, S.; Kumar, S. Adaptive methods for nonconvex optimization. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2018; pp. 9793–9803.
4. Luo, L.; Xiong, Y.; Liu, Y.; Sun, X. Adaptive Gradient Methods with Dynamic Bound of Learning Rate. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
5. Bera, S.; Shrivastava, V.K. Analysis of various optimizers on deep convolutional neural network model in the application of hyperspectral remote sensing image classification. Int. J. Remote Sens. **2020**, 41, 2664–2683.
6. Graves, A. Generating sequences with recurrent neural networks. arXiv **2013**, arXiv:1308.0850.
7. Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. **2011**, 12, 2121–2159.
8. Zeiler, M.D. Adadelta: An adaptive learning rate method. arXiv **2012**, arXiv:1212.5701.
9. Kobayashi, T. SCW-SGD: Stochastically Confidence-Weighted SGD. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 1746–1750.
10. Zhang, Z. Improved adam optimizer for deep neural networks. In Proceedings of the 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS), Banff, AB, Canada, 4–6 June 2018; pp. 1–2.
11. Pawełczyk, K.; Kawulok, M.; Nalepa, J. Genetically-trained deep neural networks. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, Kyoto, Japan, 15–19 July 2018; pp. 63–64.
12. Landro, N.; Gallo, I.; La Grassa, R. Mixing ADAM and SGD: A Combined Optimization Method with Pytorch. 2020. Available online: https://gitlab.com/nicolalandro/multi_optimizer (accessed on 18 June 2021).
13. Keskar, N.S.; Socher, R. Improving generalization performance by switching from adam to sgd. arXiv **2017**, arXiv:1712.07628.
14. Cui, X.; Zhang, W.; Tüske, Z.; Picheny, M. Evolutionary stochastic gradient descent for optimization of deep neural networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2018; pp. 6048–6058.
15. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv **2017**, arXiv:1711.05101.
16. Loshchilov, I.; Hutter, F. Fixing Weight Decay Regularization in Adam. 2018. Available online: https://openreview.net/forum?id=rk6qdGgCZ (accessed on 18 June 2020).
17. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv **2018**, arXiv:1810.04805.
18. Chen, J.; Zhou, D.; Tang, Y.; Yang, Z.; Gu, Q. Closing the generalization gap of adaptive gradient methods in training deep neural networks. arXiv **2018**, arXiv:1806.06763.
19. Krogh, A.; Hertz, J.A. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 1992; pp. 950–957.
20. Sutskever, I.; Martens, J.; Dahl, G.; Hinton, G. On the importance of initialization and momentum in deep learning. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 1139–1147.
21. Damaskinos, G.; Mhamdi, E.M.E.; Guerraoui, R.; Patra, R.; Taziki, M. Asynchronous Byzantine machine learning (the case of SGD). arXiv **2018**, arXiv:1802.07928.
22. Liu, C.; Belkin, M. Accelerating SGD with momentum for over-parameterized learning. arXiv **2018**, arXiv:1810.13395.
23. Reddi, S.J.; Kale, S.; Kumar, S. On the convergence of adam and beyond. arXiv **2019**, arXiv:1904.09237.
24. Lee, J.D.; Simchowitz, M.; Jordan, M.I.; Recht, B. Gradient descent only converges to minimizers. In Proceedings of the Conference on Learning Theory, New York, NY, USA, 23–26 June 2016; pp. 1246–1257.
25. Rosenbrock, H. An automatic method for finding the greatest or least value of a function. Comput. J. **1960**, 3, 175–184.
26. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Citeseer: University Park, PA, USA, 2009.
27. Warstadt, A.; Singh, A.; Bowman, S.R. Neural Network Acceptability Judgments. arXiv **2018**, arXiv:1805.12471.
28. Gulli, A. AG’s Corpus of News Articles. 2005. Available online: http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html (accessed on 15 October 2020).
29. Zhang, X.; Zhao, J.; LeCun, Y. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2015; pp. 649–657.
30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv **2015**, arXiv:1512.03385.
31. Targ, S.; Almeida, D.; Lyman, K. Resnet in resnet: Generalizing residual architectures. arXiv **2016**, arXiv:1603.08029.
32. Huggingface.co. Bert Base Uncased Pre-Trained Model. Available online: https://huggingface.co/bert-base-uncased (accessed on 15 October 2020).

**Figure 1.** Intuitive representation of the idea behind the proposed ATMO, which mixes the ADAM and SGD optimizers: the weights are modified simultaneously by both optimizers.

**Figure 2.** Graphical representation of the basic idea for the proposed ATMO optimizer. In (**a**), if the two translations ${\overrightarrow{w}}_{1}$ and ${\overrightarrow{w}}_{2}$ obtained from two different optimizers are similar, then the resulting translation ${\overrightarrow{w}}_{1}+{\overrightarrow{w}}_{2}$ is boosted. In (**b**), if the translations ${\overrightarrow{w}}_{1}$ and ${\overrightarrow{w}}_{2}$ go in two different directions, then the resulting translation is smaller. We also use two hyper-parameters ${\lambda}_{1}$ and ${\lambda}_{2}$ to weigh the contribution of the two optimizers.
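The effect described in the Figure 2 caption can be checked numerically with a small sketch. The vectors and the equal weights of $0.5$ are made-up values chosen only for illustration, and the helper names `combine` and `norm` are hypothetical.

```python
import math


def combine(w1, w2, lam1=0.5, lam2=0.5):
    """Weighted sum of two translation vectors."""
    return [lam1 * a + lam2 * b for a, b in zip(w1, w2)]


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


# Case (a): the two unit-length translations point in similar directions
similar = combine([1.0, 0.0], [0.8, 0.6])
# Case (b): the two unit-length translations point in different directions
different = combine([1.0, 0.0], [-0.8, 0.6])

print(norm(similar), norm(different))
```

Both input pairs have the same lengths, so the difference in the length of the combined translation comes only from how well the two directions agree.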

**Figure 3.** Behavior of the three optimizers ATMO, ADAM, and SGD on different surfaces. Subfigure (**a**) shows the surface defined in Equation (27); for better visualization, the SGD trajectory in this subfigure is shifted by $0.1$ along the X axis. Subfigure (**b**) shows Rosenbrock’s surface with $a=1$ and $b=100$. Subfigure (**c**) shows the surface $z=\frac{\left|x\right|}{10}+\left|y\right|$.

**Figure 4.** Resnet18 test accuracies with the best parameters of the best results obtained on Cifar10.

**Figure 5.** Resnet18 test accuracies min, max, and avg computed every 35 epochs using all experiments.

**Table 1.** Best accuracies (%) comparing ATMO (combining Padam and SGD) with the results published in [18], using a Resnet18 on the Cifar10 dataset.

SGD-Momentum | ADAM | Amsgrad | AdamW | Yogi | AdaBound | Padam | Dynamic ATMO |
---|---|---|---|---|---|---|---|
95.00 | 92.89 | 93.53 | 94.56 | 93.92 | 94.16 | 94.94 | 95.27 |

**Table 2.** Mean accuracies (avg acc.) and maximum accuracies (acc max) in percentage on Cifar10, after 7 runs and 350 epochs for each run. Dyn. ATMO shows the accuracies obtained with the use of dynamic lambdas.

Model | Name | $\lambda_a$ | $\lambda_s$ | avg acc. | acc max |
---|---|---|---|---|---|
Resnet18 | Adam | 1 | 0 | 91.42 | 91.55 |
Resnet18 | SGD | 0 | 1 | 89.12 | 89.52 |
Resnet18 | ATMO | 0.5 | 0.5 | 92.07 | 92.40 |
Resnet18 | ATMO | 0.4 | 0.6 | 92.12 | 92.35 |
Resnet18 | ATMO | 0.6 | 0.4 | 91.84 | 91.99 |
Resnet18 | ATMO | 0.7 | 0.3 | 91.87 | 92.14 |
Resnet18 | ATMO | 0.3 | 0.7 | 91.99 | 92.17 |
Resnet18 | Dyn. ATMO | $[1,0]$ | $1-\lambda_a$ | 94.02 | 94.22 |
Resnet34 | Adam | 1 | 0 | 91.39 | 91.68 |
Resnet34 | SGD | 0 | 1 | 90.94 | 91.43 |
Resnet34 | ATMO | 0.5 | 0.5 | 92.30 | 92.46 |
Resnet34 | ATMO | 0.4 | 0.6 | 92.24 | 92.63 |
Resnet34 | ATMO | 0.6 | 0.4 | 92.21 | 92.34 |
Resnet34 | ATMO | 0.7 | 0.3 | 92.01 | 92.18 |
Resnet34 | ATMO | 0.3 | 0.7 | 92.26 | 92.84 |
Resnet34 | Dyn. ATMO | $[1,0]$ | $1-\lambda_a$ | 93.88 | 94.11 |

Model | Name | avg acc. | acc max |
---|---|---|---|
Resnet18 | Adam | 67.52 | 68.11 |
Resnet18 | SGD | 63.26 | 63.74 |
Resnet18 | Dynamic ATMO | 74.27 | 74.48 |
Resnet34 | Adam | 68.26 | 68.64 |
Resnet34 | SGD | 66.38 | 67.26 |
Resnet34 | Dynamic ATMO | 73.85 | 73.89 |

**Table 4.** Accuracy results of BERT pre-trained on CoLA (50 epochs) and AG’s News (10 epochs) after 5 runs.

Dataset | Name | $\lambda_a$ | $\lambda_s$ | avg acc. | acc max |
---|---|---|---|---|---|
CoLA | AdamW | - | - | 78.59 | 85.96 |
CoLA | Adam | 1 | 0 | 79.85 | 83.30 |
CoLA | SGD | 0 | 1 | 81.48 | 81.78 |
CoLA | ATMO | 0.5 | 0.5 | 85.92 | 86.72 |
CoLA | ATMO | 0.4 | 0.6 | 86.18 | 87.66 |
CoLA | ATMO | 0.6 | 0.4 | 85.45 | 86.34 |
CoLA | ATMO | 0.7 | 0.3 | 84.66 | 85.78 |
CoLA | ATMO | 0.3 | 0.7 | 86.34 | 86.91 |
AG’s News | AdamW | - | - | 92.62 | 92.93 |
AG’s News | Adam | 1 | 0 | 92.55 | 92.67 |
AG’s News | SGD | 0 | 1 | 91.28 | 91.39 |
AG’s News | ATMO | 0.5 | 0.5 | 93.72 | 93.80 |
AG’s News | ATMO | 0.4 | 0.6 | 93.82 | 93.98 |
AG’s News | ATMO | 0.6 | 0.4 | 93.55 | 93.67 |
AG’s News | ATMO | 0.7 | 0.3 | 93.19 | 93.32 |
AG’s News | ATMO | 0.3 | 0.7 | 93.86 | 93.99 |

**Table 5.** Mean computational time for one epoch in seconds (s) for each experiment. We computed the train and test time together.

Dataset | Model | SGD | ADAM | ADAMW | ATMO |
---|---|---|---|---|---|
Cifar 10 | Resnet18 | 32.85 s | 33.05 s | - | 32.34 s |
Cifar 10 | Resnet34 | 52.54 s | 52.47 s | - | 52.07 s |
Cifar 100 | Resnet18 | 33.83 s | 34.42 s | - | 33.56 s |
Cifar 100 | Resnet34 | 52.72 s | 53.17 s | - | 52.17 s |
CoLA | BERT | 64.05 s | 63.67 s | 60.34 s | 62.6 s |
AG’s News | BERT | 916.54 s | 898.89 s | 867.98 s | 901.19 s |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Landro, N.; Gallo, I.; La Grassa, R.
Combining Optimization Methods Using an Adaptive Meta Optimizer. *Algorithms* **2021**, *14*, 186.
https://doi.org/10.3390/a14060186
