Learning-Rate Annealing Methods for Deep Neural Networks

Abstract: Deep neural networks (DNNs) have achieved great success over the last decades. DNNs are typically optimized using stochastic gradient descent (SGD) with learning rate annealing, which outperforms the adaptive methods in many tasks. However, there is no common choice of scheduled annealing for SGD. This paper presents an empirical analysis of learning rate annealing based on experimental results using the major data-sets for image classification, one of the key applications of DNNs. Our experiments involve recent deep neural network models in combination with a variety of learning rate annealing methods. We also propose an annealing that combines the sigmoid function with warmup, which is shown to outperform both the adaptive methods and the other existing schedules in accuracy in most cases with DNNs.


Introduction
Deep learning is a machine learning paradigm based on deep neural networks (DNNs), which have achieved great success in various fields including segmentation, recognition, and many others. However, training a DNN is a difficult problem: it is a global optimization in which the parameters are updated by stochastic gradient descent (SGD) [1][2][3][4][5] and its variants.
The learning rate annealing significantly affects the accuracy of the trained model. In gradient descent methods, the loss gradient is computed using the current model on the training set, and each model parameter is then updated by the loss gradient multiplied by the learning rate. In order to escape local minima and saddle points and converge to the global optimum, the learning rate should start at a large value and then shrink to zero. Ideally, the learning rate for each model parameter would be determined automatically based on the convergence of that parameter. To this aim, gradient-based adaptive learning rate methods, for example, RMSprop and Adam, were developed. Adaptive methods generally provide quick convergence. Unfortunately, the test accuracy of networks trained with adaptive methods is usually inferior to that of SGD with scheduled annealing. Scheduled annealing enables us to directly control the stochastic noise that helps the algorithm escape local minima and saddle points and converge to the globally optimal solution. Therefore, the hand-crafted schedule remains an essential approach in practice.
However, there is no common choice of scheduled annealing for the optimization of deep neural networks. On the one hand, classical annealing schedules, for example, the exponential function and the staircase, which were designed for shallow networks, may not be suitable for DNNs. On the other hand, the recent warmup strategy designed for DNNs is heuristic and does not provide a smooth decay of the step-size. Consequently, researchers in application fields must take time to test a number of annealing methods. More importantly, SGD has been performed with different annealing methods in different papers on DNN optimization. These facts motivate us to rethink the annealing strategy of SGD. This paper aims to present a comparative analysis of learning rate annealing methods based on experimental results using the major data-sets for image classification, one of the key applications of DNNs. Our experiments involve both shallow and recent deep models in combination with a variety of learning rate annealing methods. We also propose an annealing that combines the sigmoid function with warmup in order to leverage the decay and warmup strategies with a smooth learning rate curve. The proposed annealing is shown to outperform the adaptive methods and the other state-of-the-art schedules, in most cases with DNNs, in our experiments.
We summarize the background in Section 2. We study the related works and propose our method in Section 3. We then present and discuss our empirical results that compare annealing methods in Section 4 and conclude in Section 5.

Background
We overview the general background of deep networks and their optimization before studying related works on learning rate annealing in the next section.
Variance reduction: The variance of stochastic gradients is detrimental to SGD, motivating variance reduction techniques [22][23][24][25][26][27][28] that aim to reduce the variance incurred by the stochastic estimation process and to improve the convergence rate, mainly for convex optimization, while some are extended to non-convex problems [29][30][31]. Practical algorithms for better convergence rates include momentum [32], modified momentum for accelerated gradient [33], and stochastic estimation of accelerated gradient descent [34]. These algorithms focus more on the efficiency of convergence than on the generalization of models for accuracy. We focus on the baseline SGD with learning rate annealing.

Energy landscape:
The understanding of energy surface geometry is significant in deep optimization of highly complex non-convex problems. It is preferred to drive a solution toward a plateau in order to yield better generalization [35][36][37]. Entropy-SGD [36] is an optimization algorithm biased toward such a wide flat local minimum. In our approach, we do not attempt to explicitly measure geometric properties of the loss landscape with extra computational cost, but instead implicitly consider the variance determined by the learning rate annealing.

Batch size selection:
There is a trade-off between computational efficiency and the stability of gradient estimation; in practice a compromise batch size is chosen, generally as a constant, while the learning rate is scheduled to decrease for convergence. The generalization effect of stochastic gradient methods has been analyzed with constant batch size [38,39]. On the other hand, increasing the batch size per iteration with a fixed learning rate has been proposed in [40], where the equivalence of increasing the batch size to learning rate decay is demonstrated. A variety of algorithms that vary the batch size based on the variance of gradients have been proposed [41][42][43][44]. However, the batch size is usually fixed in practice, since increasing the batch size results in a huge computational cost.

Preliminary
Let us start with a review of the gradient descent method, which considers the minimization of an objective function F : R^m → R in a supervised learning framework: w* = arg min_w F(w), where F is associated with parameters w = (w_1, w_2, · · · , w_m) in the finite-sum form:

F(w) = (1/n) Σ_{i=1}^{n} f_i(w),

where h_w : X → Y is a prediction function defined with the associated model parameters w from a data space X to a label space Y, and f_i(w) := ℓ(h_w(x_i), y_i) is a differentiable loss function defined by the discrepancy between the prediction h_w(x_i) and the true label y_i. The objective is to find the optimal parameters w* by minimizing the empirical loss incurred on a given set of training data {(x_1, y_1), (x_2, y_2), · · · , (x_n, y_n)}. The optimization of supervised learning applications, which often require a large amount of training data, mainly uses stochastic gradient descent (SGD), which updates the solution w_t at each iteration t based on the gradient:

w_{t+1} = w_t − η_t (∇F(w_t) + ξ_t),

where η_t ∈ R is a learning rate and ξ_t is an independent noise process with zero mean. The computation of gradients over the entire training data is computationally expensive and often intractable, so the stochastic gradient is computed using a batch β_t at each iteration t:

∇F_{β_t}(w_t) = (1/B) Σ_{i ∈ β_t} ∇f_i(w_t),

where β_t is a subset of the index set [n] = {1, 2, · · · , n} for the training data and B := |β_t| is the batch size, which is usually fixed during the training process. Thus, the annealing of the learning rate (η_t) determines both the step-size and the stochastic noise in the model update.
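As a minimal illustration of the minibatch update above (not code from the paper; the function names and list-based parameters are ours), a single SGD step can be sketched in Python:

```python
import random

def sgd_step(w, grad_fn, data, batch_size, lr):
    """One SGD iteration: average per-example gradients over a random
    batch beta_t of size B, then step by the learning rate eta_t."""
    batch = random.sample(range(len(data)), batch_size)
    g = [0.0] * len(w)
    for i in batch:
        gi = grad_fn(w, data[i])  # gradient of f_i at the current w
        g = [gj + gij / batch_size for gj, gij in zip(g, gi)]
    # w_{t+1} = w_t - eta_t * stochastic gradient
    return [wj - lr * gj for wj, gj in zip(w, g)]
```

For example, with a one-parameter quadratic loss f_i(w) = (w − x_i)^2, repeated calls drive w toward the minimizer of the empirical loss.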

Adaptive Learning Rate
In gradient-based optimization, it is desirable to determine the step-size automatically based on the loss gradient, which reflects the convergence of each of the unknown parameters. To this aim, parameter-wise adaptive learning rate scheduling has been developed, such as AdaGrad [45], AdaDelta [46,47], RMSprop [48], and Adam [49], which provide quick convergence of the algorithm in practice. Recent works on adaptive methods include the combination of Adam with SGD [50], automatic selection of learning rate methods [51], and an efficient loss-based method [52]. However, the adaptive method is usually inferior to SGD in accuracy on unseen data in supervised learning, such as image classification with conventional shallow models [53].
In practice, SGD with scheduled annealing shows better results than the adaptive methods due to its generalization and training advantages. Therefore, the hand-crafted schedule is still an essential approach for optimization problems.
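To make the per-parameter adaptation concrete, a simplified RMSprop-style update can be sketched as follows (an illustrative toy with our own names, not the exact library implementation):

```python
def rmsprop_step(w, g, v, lr=0.001, rho=0.9, eps=1e-8):
    """RMSprop sketch: keep a running average v of squared gradients and
    divide each step by its square root, giving a per-parameter rate."""
    v = [rho * vj + (1 - rho) * gj * gj for vj, gj in zip(v, g)]
    w = [wj - lr * gj / (vj ** 0.5 + eps) for wj, gj, vj in zip(w, g, v)]
    return w, v
```

The effective step-size adapts to each parameter's gradient history, which explains the quick initial convergence, but no schedule directly controls the stochastic noise.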

Learning-Rate Decay
A schedule defines how the learning rate changes over time. In general, learning rate scheduling specifies a certain learning rate for each epoch and batch. There are two types of methods for scheduling global learning rates: decay methods and cyclical methods. The most preferred method is learning rate annealing, which gradually decays the learning rate during the training process. A relatively large step-size is preferred at the initial stages of training in order to obtain a better generalization effect [54]. The shrinkage of the learning rate then reduces the stochastic noise, which avoids oscillation near the optimal point and helps the algorithm converge.
The popular decay methods of the learning rate are the step (staircase) decay [40] and the exponential decay [55]. The staircase method drops the learning rate in several step intervals and achieves a pleasing result in practice. The exponential decay [55] attenuates the learning rate sequentially for each step and provides a smooth curve. The top row of Figure 1 shows the learning rate schedules, including these methods.
The other type is the cyclical method [56], in which a learning rate cycle between a lower and an upper bound is repeated over the epochs. The observation behind the cyclic method is that increasing the learning rate during the optimization process may have a short-term negative effect, but can result in a better generalization of the trained model [56]. The learning rate period can be a single decay, for example, the exponential decay or the step decay method [57], or a triangle function.

Learning-Rate Warmup
The learning rate warmup, for example, [58], is a recent approach that uses a relatively small step-size at the beginning of the training. The learning rate is increased linearly or non-linearly to a specific value in the first few epochs and then shrinks to zero. The observations behind the warmup are as follows: the model parameters are initialized using a random distribution, and thus the initial model is far from the ideal one, so an overly large learning rate causes numerical instability; and training the initial model carefully in the first few epochs may enable us to apply a larger learning rate in the middle stage of the training, resulting in a better regularization [59]. The bottom row of Figure 1 provides the learning rate schedules of the conventional annealing methods with warmup. Among them, the trapezoid [60] is a drastic approach that is designed to train the model using the upper-bound step-size as much as possible.
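A linear warmup ramp can be sketched as follows (the default values mirror the bounds used later in our experiments, but the function itself is only an illustration):

```python
def linear_warmup(t, warmup_steps, eta_init=0.01, eta_up=0.1):
    """Ramp linearly from eta_init to eta_up over the warmup phase;
    afterwards the chosen decay schedule takes over (here: hold eta_up)."""
    if t >= warmup_steps:
        return eta_up
    return eta_init + (eta_up - eta_init) * t / warmup_steps
```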

Proposed Sigmoid Decay with Warmup
We consider a simple variant of exponential decay, that is, sigmoid decay. We decay the learning rate using a sigmoid function during the training as follows:

η_t = η_low + (η_up − η_low) / (1 + exp(κ (t − T/2))),

where η_t is the learning rate at step t of T total steps (scaled into [0, 1] for numerical convenience), and η_up and η_low respectively define the upper and lower bounds of the desired learning rates. Note that these parameters are shared by the conventional schedules. κ is a coefficient that adjusts the slope of the learning rate curve, and we use κ = 1/5. Moreover, we propose a sigmoid decay with a warmup schedule, which is known as a good heuristic for training. The right-bottom panel of Figure 1 draws the curve of the proposed Sigmoid Decay with Warmup (sig+), which aims to leverage both the decay and the warmup while providing a smooth learning rate curve.
The proposed annealing (sig+) is designed to combine a smooth learning rate curve with a warmup strategy. Concretely, different from the conventional decay methods (exp, str, sig) shown in the top row of Figure 1, our method employs the warmup strategy, which enables us to use a large learning rate with deep neural network models that can be fragile at the initial stage. In contrast to the cyclic method and the existing warmup methods (rep, trap, str+) shown at the bottom of Figure 1, our sig+ provides a smooth learning rate curve that yields a desirable shrinkage of the stochastic noise in the optimization process.
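Under one reading of the schedule (a linear warmup followed by a sigmoid centred at the midpoint of the remaining steps; the centring and the exact composition are our assumptions), sig+ can be sketched as:

```python
import math

def sig_plus(t, T, warmup_frac=0.1, eta_init=0.01,
             eta_up=0.1, eta_low=0.001, kappa=0.2):
    """sig+ sketch: linear warmup to eta_up, then a smooth sigmoid
    decay down to eta_low (centring choice is an assumption)."""
    warmup = warmup_frac * T
    if t < warmup:
        return eta_init + (eta_up - eta_init) * t / warmup
    mid = warmup + (T - warmup) / 2  # centre of the decay phase
    return eta_low + (eta_up - eta_low) / (1 + math.exp(kappa * (t - mid)))
```

The curve stays near η_up through the early and middle epochs, then shrinks smoothly to η_low, combining the warmup benefit with a gradual reduction of the stochastic noise.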

Experimental Results and Discussions
We now evaluate the learning rate schedules shown in Figure 1 and the adaptive methods, RMSprop [48] and Adam [49], using both conventional shallow networks and deep neural networks (DNNs) on the major benchmarks shown in Table 1 for image classification, one of the important tasks of neural computing.

Experimental Set-Up
We employ fully connected (fc) networks with two hidden layers (NN-2) and with three hidden layers (NN-3), a network with two convolution layers and two fc layers (LeNet-4), and VGG-9 [65] as the conventional shallow networks. For these shallow networks, we use the MNIST [61] and Fashion-MNIST [62] data-sets shown in Table 1. We also conduct experiments using deep neural networks: VGG-19 [65], ResNet-18, ResNet-50 [66,67], DenseConv [68], and GoogLeNet [69]. We employ these models since the convolution kernel in VGG and the skip-connection in ResNet are two fundamental architectures of recent deep networks; DenseConv is one of the successful recent deep networks; and GoogLeNet shows superior performance in many practical tasks, as we demonstrate. For these deep networks, we use the SVHN [63] and CIFAR [64] benchmarks summarized in Table 1. We employ a batch size of B = 128, a momentum of 0.9, a weight-decay of 0.0005, and an epoch size of 100 as practical conditions.
Regarding the hyper-parameters of annealing, we experimented with SGD using three constant learning rates, η = 0.1, η = 0.01, and η = 0.001, and observed that η = 0.1 is too large, η = 0.001 is too small, and η = 0.01 tends to be good for the tested networks. Therefore, we set the upper and lower bounds of the learning rate curves as η_up = 0.1 and η_low = 0.001. The learning rate scale of RMSprop and Adam is set to 0.001 for the shallow models and 0.0001 for the deep models. We assign the warmup phase to 10% of the total epochs, and the initial learning rate of the warmup is set to 0.01.
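The optimizer settings above (momentum 0.9, weight decay 0.0005) correspond to the classic momentum-SGD update, which can be sketched as follows (an illustrative toy on parameter lists, not the framework code used in the experiments):

```python
def momentum_sgd_step(w, g, buf, lr, momentum=0.9, weight_decay=0.0005):
    """SGD with momentum and L2 weight decay on toy list-based state."""
    g = [gj + weight_decay * wj for gj, wj in zip(g, w)]   # add L2 term
    buf = [momentum * bj + gj for bj, gj in zip(buf, g)]   # velocity
    w = [wj - lr * bj for wj, bj in zip(w, buf)]           # parameter step
    return w, buf
```

The schedule supplies lr at each epoch; momentum and weight decay stay fixed throughout training.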
We ran each condition 10 times individually. We drew the average curve of test accuracy over epochs for qualitative evaluation, and also present the average and the maximum of test accuracy over the individual trials for quantitative evaluation. Figures 2 and 3 show accuracy curves for MNIST and Fashion-MNIST by SGD using learning rate decay and warmup, respectively. The constant η = 0.1 oscillates over the entire epochs, indicating that the learning rate was too large. The adaptive methods (rms and adam) converged faster than the others, but fell into local minima with lower accuracy compared to the decay annealing methods. In contrast to the exponential decay, the sigmoid decay (sig) keeps the learning rate high during the early stages of training, and its accuracy curve rose slowly. However, the sigmoid decay drew a better curve than the exponential decay in the latter half of the training phase due to the generalization effect of the larger step-size in the early phase. The accuracy curve of the sigmoid decay (sig) is further compared with the cyclic method (rep) and the warmup variants (trap, str+, and sig+) in Figure 3. The accuracy curves varied along with the designed learning rate curves. The proposed annealing (sig+) follows the original sigmoid but converges to a slightly better solution in most cases using the shallow models.

Effect of Annealing Methods for Shallow Networks
The accuracy of the shallow networks on the MNIST and Fashion-MNIST data-sets is summarized in Table 2. The step decay with warmup and the exponential decay achieved superior performance in the majority of networks for MNIST and Fashion-MNIST, respectively. It is difficult to observe the effects of the warmup strategy except for LeNet-4. The adaptive methods, that is, RMSprop and Adam, are fast, but their accuracy is lower than that of SGD with the constant η = 0.01. The decaying methods generally showed better accuracy than the constant learning rate η = 0.01.


Effect of Annealing Methods for Deep Neural Networks
We provide a comparative analysis of the annealing methods with deep neural networks based on the SVHN, CIFAR-10, and CIFAR-100 data-sets. Regarding the hyper-parameters of the annealing methods, we use the same learning rate values as for the shallow networks, except that the initial learning rate of the adaptive methods was tuned to 1/10 of the previous experiment. Figure 4 demonstrates that, just as with the shallow networks, the adaptive methods (rms and adam) converged faster than the others, but to local minima with low accuracy. Figure 5 shows that the accuracy curves of the DNNs also follow the learning rate curves, as with the shallow models. Among the tested annealing methods, the proposed annealing (sig+) successfully drew the best curves in the last half of the epochs. Table 3 summarizes the test accuracy using DNNs and provides the following observations and intuitions: employing a large learning rate in the first and middle stages of the training process (e.g., str and sig) results in better accuracy than the exponential one (exp); the smooth curve (sig) that avoids drastic changes of the step-size leads to better accuracy than the non-smooth step function (str); the warmup strategy further improves DNNs compared with the original schedule (cf. str and str+); and the proposed annealing using sigmoid and warmup together provides the best performance in most cases with the deep networks. Therefore, we conclude that the slope and smoothness of the learning rate curve have a significant influence on the training process, and the proposed method successfully improves the accuracy of DNNs with the employment of warmup, independent of the DNN architecture and data-sets.

Conclusions
We have studied learning rate annealing strategies and their impact on the trained networks, applying the annealing methods to shallow and deep networks using the major data-sets. We performed a comparative analysis of learning rate schedules and adaptive methods, and observed that SGD with a well-chosen schedule achieves better test accuracy than the adaptive methods and the currently preferred schedules. Additionally, our results showed that warmup improves the accuracy of deep models. Concretely, we have shown that our sigmoid decay with warmup as a learning rate policy leads to superior performance for deep neural networks.
The contribution of the proposed annealing is not limited to image classification. It will be directly applicable to other supervised learning tasks. Moreover, it will be useful for studying the characteristics of generative adversarial networks, since the proposed annealing enables us to control the learning rate while providing a pleasing result, leading to a better understanding of the adversarial losses.