A Scaling Transition Method from SGDM to SGD with 2ExpLR Strategy

In deep learning, the vanilla stochastic gradient descent (SGD) and SGD with heavy-ball momentum (SGDM) methods are widely used owing to their simplicity and strong generalization. This paper uses an exponential scaling method to realize a smooth and stable transition from SGDM to SGD (named TSGD), which combines the fast training speed of SGDM with the accurate convergence of SGD. We also provide theoretical results on the convergence of this algorithm. In addition, we exploit the stability of the learning rate warmup strategy and the high accuracy of the learning rate decay strategy, and propose a warmup-decay learning rate strategy with double exponential functions (named 2ExpLR). Experimental results on different datasets indicate that the proposed algorithms improve accuracy significantly and make training faster and more stable.


Introduction
In recent years, with the development of deep-learning technologies, many neural network models have been proposed, such as FCNN [1], LeNet [2], LSTM [3], ResNet [4], DenseNet [5], and so on. They also have an extensive range of applications [6][7][8]. Most networks, such as CNNs and RNNs, are trained with supervised learning, which requires a labeled training set to teach the model to yield the desired output. Gradient descent (GD) is the most common optimization algorithm in machine learning and deep learning, and it plays a significant role in training.
In practical applications, the SGD (a type of GD algorithm) is one of the most dominant algorithms because of its simplicity and low computational complexity. However, one disadvantage of SGD is that it updates parameters only with the current gradient, which leads to slow speeds and unstable training. Researchers have proposed new variant algorithms to address these deficiencies.
Polyak et al. [9] proposed a heavy-ball momentum method and used the exponential moving average (EMA) [10] to accumulate the gradient to speed up the training. Nesterov accelerated gradient (NAG) [11] is a modification of the momentum-based update, which uses a look-ahead step to improve the momentum term [12]. Subsequently, researchers improved the SGDM and proposed synthesized Nesterov variants [13], PID control [14], AccSGD [15], SGDP [16], and so on.
A series of adaptive gradient descent methods and their variants have also been proposed. These methods scale the gradient by some form of past squared gradients, making the learning rate automatically adapt to changes in the gradient, which can achieve a rapid training speed with an element-wise scaling term on learning rates [17].
Many algorithms produce excellent results, such as AdaGrad [18], Adam [19], and AMSGrad [20]. In neural-network training, these methods are more stable, faster, and perform well on noisy and sparse gradients. Although they have made significant progress, the generalization of adaptive gradient descent is often not as good as SGD [21]. There are some additional issues [22,23].
First, the solution of adaptive gradient methods with second moments may fall into a local minimum of poor generalization and even diverge in certain cases. Then, the practical learning rates of weight vectors tend to decrease during training, which leads to sharp local minima that do not generalize well. Therefore, some trade-off methods of transforming Adam to SGD are proposed to obtain the advantages of both, such as Adam-Clip (p, q) [24], AdaBound [25], AdaDB [26], GWDC [27], linear scaling AdaBound [28], and DSTAdam [29]. Can we transfer this idea to SGDM and SGD?
This study is strongly motivated by recent research on combining SGD and SGDM in the QHM algorithm [30]. The authors gave a straightforward alteration of momentum SGD, averaging a vanilla SGD step and a momentum step, and obtained good experimental results. QHM can be interpreted as a ν-weighted average of the momentum update step and the vanilla SGD update step, and the authors provided a recommended rule of thumb of ν = 0.7. However, in many scenarios it is not easy to find the optimal value of ν. On this basis, we conducted a more in-depth study of the combination of SGDM and SGD.
In this paper, first, we analyze the disadvantages of SGDM and SGD and propose a scaling method to achieve a smooth and stable transition from SGDM to SGD, combining the fast training speed of the SGDM and the high accuracy of the SGD. At the same time, we combine the advantages of the warmup and decay strategy and propose a warmup-decay learning rate strategy with double exponential functions. This makes the training algorithm more stable in the early stage and more accurate in the later stage. The experimental results show that our proposed algorithms had a faster training speed and better accuracy in the neural network model.

Preliminaries
Optimization problem. Let f: R^n → R be the objective function, where f(θ, ζ) ∈ C^1 is continuously differentiable and θ is the optimized parameter. Consider the following convex optimization problem [31]:

min_θ E_ζ[f(θ, ζ)],    (1)

where ζ is a stochastic variable. Applying the gradient descent method to solve the above minimization problem gives

θ_{t+1} = θ_t − η g_t,

where η is the constant step size, g_t is the descent direction, and θ is the parameter that needs to be updated.

SGD. SGD uses the current gradient of the loss function to update the parameters [32]:

θ_{t+1} = θ_t − η ∇f(θ_t, ζ_t),

where f is the loss function and ∇f(θ, ζ) is the gradient of the loss function.

SGDM. The momentum method considers the past gradient information and uses it to correct the direction, which speeds up training. The update rule of the heavy-ball momentum method [33] is

m_t = β m_{t−1} + g_t,
θ_{t+1} = θ_t − η m_t,

where m_t is the momentum and β is the momentum factor.

Efficiency of SGD and SGDM. In convex optimization, the gradient descent method can search for the global optimum of the objective function by steepest descent. However, it follows a series of zigzag paths during the iteration, which seriously affects the training speed. We can explore the root of this defect through theoretical analysis. Using exact line search to solve the convex optimization problem (1), in order to find the minimum point along the direction d^(t) from x^(t), let

φ(η) = f(x^(t) + η d^(t))  and set  φ′(η) = 0.

Thus,

∇f(x^(t+1))^T d^(t) = 0,

so consecutive search directions are orthogonal. This shows that the path of the iterative sequence {f(θ_t, ζ)} is a zigzag. When it is close to the minimum point, the step size of each movement is tiny, which seriously affects the speed of convergence, as shown in Figure 1a. Polyak et al. [9] proposed SGD with a heavy-ball momentum algorithm to mitigate the zigzag oscillation and accelerate training. The update direction of SGDM is the EMA [10] of the current gradient and the past gradients, which speeds up training and reduces the oscillation of SGD. The momentum of SGDM can be unrolled as

m_t = β m_{t−1} + g_t = Σ_{i=1}^{t} β^{t−i} g_i.

However, in the later stage of training, there is a defect in using the EMA of the gradient.
The accumulated momentum may move too fast and fail to stop in time when it is close to the optimum. This can cause oscillation in a region around the optimal solution θ* without stable convergence, as shown in Figure 1b. At the same time, gradients from too far in the past may carry little useful information, so the momentum can instead be computed from the gradients of the last n steps, as in AdaShift [34]. The negative gradient direction of the loss function is the optimal direction for the current parameter update; when momentum is used, the update direction is no longer the steepest descent direction. When it is close to the optimal solution, the stochastic gradient descent method is more accurate and more likely to find the optimal point.
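The two update rules above can be sketched side by side. A minimal NumPy illustration on an ill-conditioned quadratic (the objective, step sizes, and iteration count are our illustrative choices, not from the paper):

```python
import numpy as np

def sgd_step(theta, grad, lr):
    # Vanilla SGD: move along the current negative gradient only.
    return theta - lr * grad

def sgdm_step(theta, m, grad, lr, beta):
    # Heavy-ball momentum: m_t = beta * m_{t-1} + g_t, then step along m_t.
    m = beta * m + grad
    return theta - lr * m, m

# Ill-conditioned quadratic f(theta) = 0.5 * theta^T A theta (minimum at 0),
# where plain gradient descent zigzags along the steep axis.
A = np.diag([1.0, 25.0])
grad_f = lambda th: A @ th

theta_sgd = np.array([1.0, 1.0])
theta_mom = np.array([1.0, 1.0])
m = np.zeros(2)
for _ in range(200):
    theta_sgd = sgd_step(theta_sgd, grad_f(theta_sgd), lr=0.03)
    theta_mom, m = sgdm_step(theta_mom, m, grad_f(theta_mom), lr=0.01, beta=0.9)

# Both reach the optimum; momentum damps the oscillation along the steep axis.
print(np.linalg.norm(theta_sgd), np.linalg.norm(theta_mom))
```

Running the loop shows both iterates approaching the minimum, with the momentum version averaging out the alternating gradient components that cause the zigzag.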
For more insight, we trained ResNet18 on the CIFAR10 dataset, using the SGDM algorithm for the first 75 epochs (hyperparameter settings: weight_decay = 5 × 10^−3, lr = 0.1, β = 0.9). After the 75-th epoch, we changed only the update direction from momentum to gradient (descent direction: m_t if epoch < 76, else g_t; no other setting was changed). Figure 2a shows the accuracy curve on the test set during training. It can be seen that the accuracy increases rapidly from the start, then enters a plateau after about the 25-th epoch and no longer increases.
After the 75-th epoch, the current gradient is used to update the parameters, and the accuracy improves significantly. Figure 2b records the training loss; the loss under SGD decreases faster and reaches a smaller value. If we combine the advantages of SGDM and SGD in this way, the training algorithm can have both the fast training speed of SGDM and the high accuracy of SGD.
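The switch used in this diagnostic run boils down to one conditional in the update step. A minimal sketch (the helper names are ours, not from the paper's code):

```python
def update_direction(m_t, g_t, epoch):
    # The one-line change from the diagnostic run: momentum direction for the
    # first 75 epochs, raw gradient afterwards; nothing else changes.
    return m_t if epoch < 76 else g_t

def sgdm_with_switch(theta, m, grad, lr, beta, epoch):
    m = beta * m + grad                   # heavy-ball momentum accumulation
    d = update_direction(m, grad, epoch)  # hard switch at epoch 76
    return theta - lr * d, m
```

Note that the momentum buffer is still accumulated after the switch; only the direction actually applied to the parameters changes.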

TSGD Algorithm
Based on the above analysis and on the QHM algorithm, we provide a middle ground that combines preferable features of both: the fast training speed of SGDM and the high accuracy of SGD. A scaling transition method from SGDM to SGD, named TSGD, is proposed. A scaling function ρ_t is introduced to gradually scale the momentum direction to the gradient direction as iterations proceed, which achieves a smooth and stable transition from SGDM to SGD. The specific algorithm is presented in Algorithm 1.

Algorithm 1 A Scaling Transition Method from SGDM to SGD (TSGD)
Input: initial parameter θ_0, step size η, momentum factor β, scaling factor ρ; set m_0 = 0.
1: for t = 1, 2, . . . , T do
2:    g_t = ∇f(θ_{t−1}, ζ_t)
3:    m_t = β m_{t−1} + g_t
4:    m̂_t = ρ_t m_t + (1 − ρ_t) g_t, with ρ_t = ρ^t
5:    θ_t = θ_{t−1} − η m̂_t
6: end for

In Algorithm 1, g_t is the gradient of the loss function f in the t-th iteration, m_t is the momentum, m̂_t is the scaled momentum, θ_t is the optimized parameter, and ρ_t is the scaling function with ρ_t ∈ [0, 1]. We provide recommended rules for the hyperparameters ρ and β, namely ρ^T = 0.01 and β = 0.9, where T is the total number of iterations, T = ceil(samplesize/batchsize) * epochs.
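The update can be implemented in a few lines. A minimal, self-contained sketch under our reading that the scaled momentum is the convex combination m̂_t = ρ_t m_t + (1 − ρ_t) g_t with ρ_t = ρ^t (the class name and scalar-parameter interface are illustrative; the authors' implementation is in the linked repository):

```python
class TSGD:
    """Sketch of the TSGD update: scaling transition from SGDM to SGD.

    Assumes the scaled momentum is m_hat = rho_t * m + (1 - rho_t) * g
    with rho_t = rho**t, following the recommended rule rho**T = 0.01.
    """

    def __init__(self, lr=0.1, beta=0.9, num_iters=1000, rho_T=0.01):
        self.lr, self.beta = lr, beta
        self.rho = rho_T ** (1.0 / num_iters)  # so that rho**num_iters == rho_T
        self.m = 0.0
        self.t = 0

    def step(self, theta, grad):
        self.t += 1
        self.m = self.beta * self.m + grad             # heavy-ball momentum
        rho_t = self.rho ** self.t                     # ~1 early, ~rho_T late
        m_hat = rho_t * self.m + (1.0 - rho_t) * grad  # SGDM -> SGD transition
        return theta - self.lr * m_hat

# Usage: minimize f(theta) = theta**2 with gradient 2 * theta.
opt = TSGD(lr=0.05, beta=0.9, num_iters=500)
theta = 5.0
for _ in range(500):
    theta = opt.step(theta, 2.0 * theta)
```

Early in the run the step is dominated by the momentum term; by the final iterations it is essentially a plain SGD step.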
Gitman et al. [35] made a detailed convergence analysis on the general form of the QHM algorithm. The TSGD algorithm proposed in this section conforms to the general form of the QHM algorithm. Therefore, the TSGD algorithm has the same convergence conclusion as the general form of the QHM algorithm. Theorem 1 gives the convergence theory of the TSGD algorithm based on the literature [35].
Theorem 1. Let f satisfy the conditions in the literature [35]. Additionally, assume that 0 ≤ ρ_t ≤ 1 and that the sequences {η_t} and {β_t} satisfy

0 ≤ β_t ≤ β < 1,  η_t > 0,  Σ_{t=1}^∞ η_t = ∞,  Σ_{t=1}^∞ η_t^2 < ∞.

Then the sequence {θ_t} generated by the TSGD Algorithm 1 satisfies

lim inf_{t→∞} ||∇f(θ_t)|| = 0 almost surely.

Moreover, we have

lim_{t→∞} E[||∇f(θ_t)||^2] = 0.

Theorem 1 implies that TSGD converges, similarly to SGD-type optimizers. Through the exponential decay in step 4 of the TSGD algorithm, the update direction smoothly and stably transforms from the momentum direction of SGDM to the gradient direction of SGD as iterations proceed. In the early stage of training, the number of iterations is small, ρ_t is close to 1, and the update direction is m̂_t ≈ m_t, which speeds up training. In the later stage, as the number of iterations grows large, ρ_t approaches 0, and the update direction gradually transforms to m̂_t ≈ g_t. When the iterative sequence is close to the optimal solution, the gradient direction is used to update the parameters, which is more accurate. Thus, TSGD has both the faster speed of SGDM and the high accuracy of SGD. TSGD does not need to calculate the second moment of the gradient, which saves computational resources compared with adaptive gradient descent methods.

The Warmup Decay Learning Rate Strategy with Double Exponential Functions
The learning rate (LR) is a crucial hyperparameter to tune for practical training of deep neural networks and controls the rate or speed at which the model learns each iteration [36]. If the learning rate is too large, this may cause oscillation and divergence. On the contrary, training may progress slowly and even stop if the rate is too small. Thus, many learning rate strategies have been proposed and appear to work well, such as warmup [4], decay [37], restart techniques [38], and cyclic learning rates [39]. In this section, we mainly focus on the warmup and decay strategies.
On the one hand, the idea of warmup was proposed in ResNet [4]. They used 0.01 to warm up the training until the training error was below 80% (about 400 iterations) and then went back to 0.1 and continued training. In the early stage of training, due to many random parameters in the model, if a large learning rate is used, the model may be unstable and fall into a local optimum, which is challenging to fix. Therefore, we use a small learning rate to make the model learn certain prior knowledge and then use a large learning rate for training when the model is more stable. It can use less aggressive learning rates at the start of training [40]. Implementations of the warmup strategy include constant warmup [4], linear warmup [41], and gradual warmup [40].
On the other hand, decay is also a popular strategy in neural-network training. The decay can speed up the training using a larger learning rate at the beginning and can converge stably using a smaller learning rate after. Not only does this show good results [24,25,34,42] in practical applications but also the learning rate is required to decay in the theoretical convergence analysis, such as η t = 1/t [43], η t = 1/ √ t [42]. The specific implementation of learning rate decay is as follows: step decay [26], linear attenuation [44], exponential decay [45], etc.
To ensure better performance of the training algorithm, we combined the warmup and decay strategies. The warmup strategy was used for a small number of iterations in the early stage of training, and then the decay strategy was used to decrease the learning rate gradually. In this way, the model can learn stably early on, train fast in the middle, and converge accurately later.
First, an exponential function is used in the warmup stage to gradually increase the learning rate from zero to the upper learning rate lr_u, as shown in Figure 3a. The formula is

lr_warmup = lr_u (1 − ρ_1^t).
Second, an exponential function is also used in the decay stage to gradually decrease the learning rate from the upper learning rate to the lower learning rate lr_l, as shown in Figure 3b. The formula is

lr_decay = lr_l + (lr_u − lr_l) ρ_2^t.

The main idea of the warmup-decay learning rate strategy is that the learning rate increases in the warmup stage and decreases in the decay stage. Therefore, we combine the above two processes, and the warmup-decay learning rate strategy with double exponential functions (2ExpLR) is proposed, as shown in Figure 3c. The formula is

lr_t = (1 − ρ_1^t)(lr_l + (lr_u − lr_l) ρ_2^t).

Similarly, we provide recommended rules for the hyperparameters ρ_1 and ρ_2, as with ρ: ρ_1^T = ρ_2^T = lr_l × 10^−1. Thus, ρ_1 and ρ_2 can be easily calculated as ρ_1 = ρ_2 = 10^{(lg lr_l − 1)/T}.
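The schedule can be sketched as follows; the combined form is our reconstruction as the product of the warmup factor (1 − ρ_1^t) and the decay curve lr_l + (lr_u − lr_l)ρ_2^t, with ρ_1 = ρ_2 set by the recommended rule ρ^T = lr_l × 10^−1:

```python
import math

def two_exp_lr(t, T, lr_u=0.5, lr_l=0.005):
    """2ExpLR sketch: exponential warmup multiplied by exponential decay.

    The combined product form is our reconstruction; rho1 = rho2 follow the
    recommended rule rho**T = lr_l * 0.1, i.e. rho = 10**((lg(lr_l) - 1) / T).
    """
    rho = 10.0 ** ((math.log10(lr_l) - 1.0) / T)
    warmup = 1.0 - rho ** t                  # rises from 0 toward 1
    decay = lr_l + (lr_u - lr_l) * rho ** t  # falls from lr_u toward lr_l
    return warmup * decay

T = 25000
schedule = [two_exp_lr(t, T) for t in range(0, T + 1, 100)]
```

Under this form the schedule starts at zero, rises to an interior peak, and then decays smoothly toward lr_l, with no manually chosen switch point between the two stages.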
As Figure 3 shows, 2ExpLR does not require the transition point from warmup to decay to be set manually, in contrast to other strategies [24][25][26]. At the same time, 2ExpLR achieves a smooth and stable transition from warmup to decay. This gives the training algorithm a faster convergence speed and higher accuracy. The TSGD algorithm with the 2ExpLR strategy is described as Algorithm 2.

Extension
The scaling transition method on the gradient can be extended to other applications. We can abstract this scaling transition process to provide a general framework for transitioning a parameter from one state to another.
For example, neural networks have been widely used in many fields to solve problems. During the training process, the type and amount of input data directly influence the performance of the ANN model [46]. If the data is dirty, contains large amounts of noise, the dataset is too small, or the training time is too long, the model can suffer from overfitting [47]. Various regularization techniques have been proposed and developed for neural networks to address these problems. Regularization is used to avoid overfitting in networks that have more parameters than input data or that are trained with noisy inputs [48], such as the L1 and L2 regularization methods.
The model has not yet learned any knowledge in the early stage of training, and applying regularization at this time may harm the model. Therefore, we can train the model normally at the beginning and then apply the regularization strategy in the later stage. Thus, we can implement the scaling transition framework as follows:

L(θ) = (1/N) Σ_{i=1}^{N} ℓ(ŷ_i, y_i) + (1 − ρ_t) λ R(θ),

where ℓ(*) is the loss function, N is the number of samples, ŷ_i is the predicted value, and y_i is the actual value. λ is a non-negative regularization parameter, and R(*) is a regularization function, such as L2 = ||θ||_2^2 or L1 = ||θ||_1.
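As a concrete instance, the ramped penalty can be sketched like this, assuming the framework weights the regularization term by (1 − ρ_t) with ρ_t = ρ^t (the function name, λ value, and L2 choice are illustrative):

```python
def scaled_regularized_loss(data_loss, theta_sq_norm, t, T, lam=1e-4, rho_T=0.01):
    # Scaling-transition regularization: rho_t = rho**t decays from ~1 to rho_T,
    # so the penalty weight (1 - rho_t) * lam ramps from ~0 to ~lam over training.
    rho = rho_T ** (1.0 / T)
    rho_t = rho ** t
    return data_loss + (1.0 - rho_t) * lam * theta_sq_norm
```

Early in training the total loss is essentially the data loss alone; by the end, the full L2 penalty λ||θ||² is applied.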

Experiments
In this section, to verify the performance of the proposed TSGD and TSGD+2ExpLR, we compared them with other algorithms, including SGDM and Adam. Specifically, we used IRIS, CASIA-FaceV5, and CIFAR classification tasks in the experiments. We set the same random seed through PyTorch to ensure fairness. The architecture we chose for the experiments was ResNet-18. The computational setup is shown in Table 1. Our implementation is available at https://github.com/kunzeng/TSGD (accessed on 20 November 2022).

IRIS and BP neural network.
The IRIS dataset is a classic dataset for classification, machine learning, and data visualization. It includes three iris species, with 50 samples of each, and several properties of each flower. An excellent way to determine suitable learning rates is to experiment with a small but representative sample of the training set, and the IRIS dataset is a good fit. Due to the small number of samples, we used all 150 samples as the training set. The performance of each algorithm is represented by the accuracy and loss value on the training set. We set the relevant parameters using empirical values and a grid search: samplesize = 150, epoch = 200, batchsize = 1, lr (SGD: 0.003, SGDM: 0.05, TSGD: 0.03), up_lr = 0.5, low_lr = 0.005, rho = 0.977237, and rho1 = rho2 = 0.962708. The experimental results show that the method achieves its expected effect. In Figures 4 and 5, the training process of SGDM has more oscillation, and the convergence of SGD is slower. However, TSGD is more stable and faster than both. Notably, the TSGD+2ExpLR algorithm is stable during training and increases accuracy quickly.

CASIA-FaceV5 and ResNet18. CASIA-FaceV5 contains 2500 color facial images of 500 subjects. All face images are 16-bit color BMP files with a resolution of 640 × 480. Typical intra-class variations include illumination, pose, expression, eyeglasses, and imaging distance, which provides a remarkable ability to distinguish between different algorithms. We preprocessed the images for better training and reshaped them to 100 × 100. The parameters were set as: samplesize = 2500, epoch = 100, batchsize = 10, lr = 0.1, lr_u = 0.5, lr_l = 0.005; then T = ceil(samplesize/batchsize) * epoch = 25,000, and thus ρ = 10^{log10(ρ^T)/T} ≈ 0.999815. The same computation gives ρ_1 = ρ_2 ≈ 0.999696. The experimental results are shown in Figures 6 and 7.
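The ρ values above follow mechanically from the recommended rules. A small check reproducing them (ρ^T = 0.01 for TSGD, ρ_1^T = ρ_2^T = lr_l × 10^−1 for 2ExpLR):

```python
import math

samplesize, batchsize, epochs = 2500, 10, 100
T = math.ceil(samplesize / batchsize) * epochs  # 25,000 iterations

rho = 0.01 ** (1.0 / T)                         # TSGD rule: rho**T = 0.01
lr_l = 0.005
rho12 = 10.0 ** ((math.log10(lr_l) - 1.0) / T)  # 2ExpLR rule: rho**T = lr_l / 10

print(T, rho, rho12)  # T = 25000; rho ~ 0.99982, rho1 = rho2 ~ 0.999696
```

The same two lines with samplesize = 50,000, batchsize = 128, and epochs = 200 give the CIFAR values used below.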
Although the accuracy of the TSGD algorithm does not exceed that of Adam, it exceeds the SGD algorithm, and the training process is more stable. The main reason may be that the CASIA-FaceV5 dataset has few samples and many categories, each with only five samples; the Adam algorithm is more suitable for this scenario. The accuracy of the TSGD+2ExpLR algorithm is much higher than that of the other algorithms, its loss value is much lower, and its training process is more stable.

CIFAR10 and ResNet18. The CIFAR10 dataset was collected by Hinton et al. [49]. It contains 60,000 color images in 10 categories, with a training set of 50,000 images and a test set of 10,000 images. The parameters were adjusted as: epoch = 200, batchsize = 128, ρ = 0.9999411, ρ_1 = ρ_2 = 0.9999028, lr = 0.1, lr_u = 0.5, and lr_l = 0.005. The other parameters were the same as above or were the default values recommended by the model. The architecture was ResNet18, and the learning rate was divided by 10 at epoch 150 for SGDM, Adam, and TSGD. Figure 8 shows the test accuracy curves of each optimization method. As we can see, SGDM had the lowest accuracy and more oscillation before 150 epochs, and Adam was fast but had lower accuracy after 150 epochs. It is evident that TSGD and TSGD+2ExpLR had higher accuracy, and their accuracy curves were smoother and more stable. The speed of TSGD even exceeded that of Adam. Figure 9 shows the loss of the different algorithms on the training set. At the beginning of training, the loss of TSGD decreases slowly, possibly due to the small learning rate of the warmup stage making the model stable. However, the loss of the TSGD algorithm then decreases faster and is the smallest in the later stage.

CIFAR100 and ResNet18. We also present our results for CIFAR100.
The task is similar to CIFAR10; however, CIFAR100 has 100 categories (grouped into 20 superclasses), so each category has far fewer samples. The parameters are the same as for CIFAR10. We show the performance curves in Figures 10 and 11. It can be seen that the performance of each algorithm on CIFAR100 is similar to that on CIFAR10. The two algorithms proposed in this paper hold the top two spots overall, and their curves are still smoother and more stable. In particular, with TSGD+2ExpLR, the accuracy reaches its peak at about 100 epochs, which reflects the effect of the warmup-decay learning rate strategy with double exponential functions. The loss of the TSGD and TSGD+2ExpLR algorithms decreases faster and finally reaches a minor value, almost close to zero. TSGD and TSGD+2ExpLR are considerable improvements over SGDM and Adam, with faster speed and higher accuracy.

Conclusions
In this paper, we combined the advantages of SGDM (fast training speed) and SGD (high accuracy) and proposed a scaling transition method from SGDM to SGD. At the same time, we used two exponential functions to combine the warmup and decay strategies, thus proposing the 2ExpLR strategy. These methods allow the model to train more stably and reach higher accuracy. The experimental results showed that the TSGD and 2ExpLR algorithms perform well in terms of training speed and generalization ability.