An Adaptive Optimization Method Based on Learning Rate Schedule for Neural Networks

Artificial intelligence (AI) is achieved by optimizing a cost function constructed from learning data. Changing the parameters of the cost function is the AI learning process (or AI learning, for convenience). If AI learning is performed well, the value of the cost function reaches the global minimum. For AI learning to be completed well, the parameters should stop changing once the cost function reaches the global minimum. One useful optimization method is the momentum method; however, the momentum method has difficulty stopping the parameters when the cost function reaches the global minimum (the non-stop problem). The proposed method is based on the momentum method. To solve the non-stop problem of the momentum method, we incorporate the value of the cost function into our method. As learning proceeds, this mechanism reduces the amount of change in the parameters according to the value of the cost function. We verified the method through a proof of convergence and through numerical experiments comparing it with existing methods, to ensure that the learning works well.


Introduction
Artificial intelligence (AI) is completed by defining a cost function, constructed through an artificial neural network (ANN) from given learning data, and by determining the parameters that minimize this cost function. In other words, AI learning is the process of optimizing the cost function. AI learning therefore has two problems to solve: the definition and the optimization of the cost function. The first problem, the definition of the cost function, becomes harder as the amount of data grows and the structure of the ANN becomes more complicated. For this reason, the complexity of the cost function increases [1][2][3][4][5]. Complexity here means that the number of local minimums of the cost function increases. A local minimum of the cost function is a point where the first-order derivative of the cost function is zero. Therefore, if AI learning uses a learning method based on the first-order derivative of the cost function, learning does not progress at such a local minimum, and as the number of local minimums increases, AI learning does not proceed smoothly. As AI is applied to more problems, the amount of data increases and the structure of the ANN becomes more complicated, so AI learning must be conducted on a cost function containing many local minimums. That is, the second problem arises: the optimization problem must be solved for a cost function containing many local minimums [6][7][8][9][10].
The main purpose of this paper is to solve the second problem; that is, we want to complete AI learning based on the first derivative of the cost function when the cost function contains many local minimums. To this end, we add the first-order derivative of the cost function to the cost function itself, so that learning is driven toward the global minimum. For this purpose, the Lagrange multiplier method is used [11,12]. The most powerful of the existing methods based on the first derivative of the cost function is the momentum method [13,14]. The momentum method accumulates the first derivative of the cost function, multiplying the accumulated value by a certain ratio at each step; that is, the parameters are learned from the accumulated values of the first derivative [15][16][17][18][19][20]. Therefore, even if the first derivative of the cost function is zero at a local minimum, learning can continue based on the previously accumulated first derivative. The problem with this method is that, even when the parameters are already well learned (situations where learning needs to stop), learning continues based on the previously accumulated first derivative. Because learning does not stop, the result obtained when training ends may not be optimal. In order to solve this problem, methods that change the learning rate (step size) and add an adaptive property to the momentum method have been developed [21][22][23][24]. This paper is based on the momentum method and adds an adaptive property in which the step size changes with the value of the cost function. Our method maintains the power of the momentum method, adds adaptive properties to scale the amount of learning, and finally constructs a step size according to the value of the cost function, bringing the parameters as close as possible to the minimum of the cost function.
More specific ideas and methods are discussed in detail in the text. This paper also proves that the parameters defined in our method converge. In this paper, we use the usual notations found in mathematics for convenience.
This paper is organized as follows: Section 2 introduces the cost function and the momentum method; Section 3 explains our proposed method; Section 4 confirms our claim through numerical experiments; and the conclusions are presented in Section 5.

The Optimization Problem and the Momentum Method
From the learning data, we defined the cost function. The learning data were divided into the input data $\{x_j\}$ and the output data $\{y_j\}$. From the input data $\{x_j\}$ and using an ANN, we defined $y_{h,j} = y_h(w, x_j)$, where $w$ is a learning parameter. AI learning was completed by determining the learning parameter $w$ that satisfies $y_{h,j} = y_j$.

The Optimization Problem
In order to satisfy $y_{h,j} = y_j$, we constructed the cost function $C(w)$ as in Equation (1), where $l$ is an integer (the number of learning data), and we solved the optimization problem (2) of finding the parameter $w$ that minimizes $C(w)$. To solve the optimization problem (2), gradient-based methods are generally used [1,3]. Gradient-based methods find a parameter $w$ such that the gradient of the cost function is zero. However, the cost function $C(w)$ is not convex in general, and if $C(w)$ is not convex, the idea of zeroing the gradient may not work well. Therefore, in order to increase the efficiency of learning, we applied the Lagrange multiplier method to define $F(w)$ as in Equation (3), and sought the parameter $w$ that makes $F(w)$ zero.
In Equation (3), $\mu$ is a small constant. In order to obtain the $w$ for which $F(w)$ in Equation (3) is zero, we used the iterative parameter change rule in Equation (4), where $\lambda$ is a constant called the learning rate. Equation (4) is similar to the momentum method.
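The iterative rule can be illustrated with a minimal sketch. Both formulas in it are assumptions, because the text above does not state them explicitly: $F(w) = \partial C(w)/\partial w + \mu C(w)$ and $w_{i+1} = w_i - \lambda F(w_i)$. The quadratic cost and every name in the code are hypothetical choices made only for illustration.

```python
def cost(w):
    # Hypothetical toy cost whose minimum value 0 is attained at w = 3.
    return (w - 3.0) ** 2


def grad(w):
    # First derivative of the toy cost.
    return 2.0 * (w - 3.0)


def F(w, mu=1e-3):
    # Assumed form: first derivative plus a small multiple (mu) of the cost.
    return grad(w) + mu * cost(w)


def iterate(w0, lam=0.1, steps=200):
    # Assumed rule: w_{i+1} = w_i - lam * F(w_i), with learning rate lam.
    w = w0
    for _ in range(steps):
        w = w - lam * F(w)
    return w


w_final = iterate(0.0)
```

On this toy cost, both the derivative and the cost vanish at $w = 3$, so $F$ has a zero there and the iteration settles at that point.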

The Momentum Method
In order to solve $F(w) = 0$, the momentum method accumulates past values of $F(w)$, multiplying the accumulated term by a ratio $\beta_1$ at each step. Because of this accumulated term, $w$ keeps changing even when $F(w)$ becomes zero. Therefore, the momentum method needs a way to stop learning.
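The non-stop behavior can be reproduced with a small sketch. It assumes the common momentum form $m_i = \beta_1 m_{i-1} + F(w_i)$, $w_{i+1} = w_i - \lambda m_i$, which may differ in detail from the form intended here, and it uses a hypothetical toy $F$ that is exactly zero on the whole region $w \le 1$.

```python
def F(w):
    # Hypothetical toy target: zero everywhere on w <= 1, so any w <= 1 solves F(w) = 0.
    return w - 1.0 if w > 1.0 else 0.0


def momentum_path(w0, lam=0.1, beta1=0.9, steps=50):
    # Assumed momentum form: m accumulates past values of F, weighted by beta1.
    w, m = w0, 0.0
    path = [w]
    for _ in range(steps):
        m = beta1 * m + F(w)
        w = w - lam * m
        path.append(w)
    return path


path = momentum_path(3.0)
# Index of the first iterate at which F is already zero.
first_zero = next(k for k, w in enumerate(path) if F(w) == 0.0)
# Distance travelled after F became zero, caused only by the accumulated term.
drift = path[first_zero] - path[-1]
# Same rule without accumulation (beta1 = 0) for comparison.
plain = momentum_path(3.0, beta1=0.0)
```

With these settings, the parameter keeps moving after `first_zero` although `F` is zero at every later iterate, whereas the run with `beta1=0.0` approaches the boundary $w = 1$ without passing it.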

Our Proposed Method
Our method is based on the momentum method and stops the learning at the moment when the cost function reaches its global minimum value, which is the moment when learning is well executed. This is achieved by changing the learning rate: the parameter update in Equation (5) uses a step size $\eta_i$ in place of a constant learning rate, where $\eta_0$ is a constant, together with the quantities $m_i$ and $v_i$. In particular, $\epsilon$ is a constant introduced to prevent division by zero.
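The idea of a step size that shrinks with the cost can be sketched as follows. This is not the update rule of the proposed method, whose exact formulas are not reproduced above. The sketch assumes Adam-style accumulators $m_i$ and $v_i$ and a hypothetical step size $\eta_i = \eta_0 C(w_i)$, applied to a toy quadratic cost; all names are illustrative.

```python
import math


def cost(w):
    # Hypothetical toy cost with minimum value 0 at w = 3.
    return (w - 3.0) ** 2


def grad(w):
    return 2.0 * (w - 3.0)


def cost_scaled_momentum(w0, eta0=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    # Adam-style accumulators (an assumption) with a cost-dependent step size.
    w, m, v = w0, 0.0, 0.0
    step_sizes = []
    for _ in range(steps):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        eta = eta0 * cost(w)  # hypothetical choice: step size proportional to the cost
        w = w - eta * m / (math.sqrt(v) + eps)  # eps prevents division by zero
        step_sizes.append(eta)
    return w, step_sizes


w_final, step_sizes = cost_scaled_momentum(0.0)
```

Because the step size is tied to the cost, the change in the parameter fades as the cost approaches zero, even though the accumulated term `m` is still nonzero.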

Lemma 1.
The relationship between $m_i$ and $v_i$ is:
Proof. The proof can be found in [19].
Using the relationship between $m_i$ and $v_i$, Equation (5) bounds the change $w_{i+1} - w_i$ in terms of $\eta_i$. Here, since $\epsilon$ only prevents division by zero, it can be treated as zero in the calculation. Therefore, the sequence $\{w_i\}$ also converges when the sequence $\{\eta_i\}$ converges to zero.

Theorem 1.
We need to find the parameter that minimizes the cost function $C$. The limit of our constructed sequence $\{w_i\}$ gives a cost function value of zero. The closer the cost function $C$ is to zero, the closer $y_{h,j}$ is to $y_j$ (the value $y_{h,j}$ produced by the ANN approaches the output value $y_j$).
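Proof. For $i$ beyond a sufficiently large number $\tau$ and for a sufficiently small value $\eta_i$, Taylor's theorem gives $C(w_{i+1}) = C(w_i) + \frac{\partial C(w_i)}{\partial w}(w_{i+1} - w_i) + O\big((w_{i+1} - w_i)^2\big)$, where the term $O\big((w_{i+1} - w_i)^2\big)$ is of second order and is ignored. Substituting Equation (5) into Equation (8), and using $\beta_1 < 1$, $\epsilon = 0$, and suitably adjusted $\eta_0$ and $\mu$, the value $C(w_{i+1})$ is obtained from $C(w_i)$ through a factor $\delta$. Therefore, $\lim_{i \to \infty} C(w_i) = 0$.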
In Theorem 1, it is possible that $\delta$ is less than 1. Therefore, we explain in more detail that $\delta$ is less than 1 in the following corollary.

Corollary 1.
$\delta$ defined in Theorem 1 is less than 1.
Proof. First, $i$ is assumed to be larger than a sufficiently large number $\tau$. Under the condition of sufficiently small values $\eta_i$ and $\mu$, the expression for $\delta$ is divided into three parts, which are computed as Equations (9), (10), and (11). Equations (9) and (10) are obtained by easy calculation from the condition on the cost function ($C < 1$) and from the choice of $\mu$. It remains to show that Equation (11) is less than 1. Under the assumption of this corollary and from the definition of $m_i$, we have $\operatorname{sign}(m_i) = \operatorname{sign}(\partial C(w_i)/\partial w)$, because $\beta_1 < 1$ and because the sign does not change abruptly, since $\partial C(w)/\partial w$ is a continuous function. Therefore, $\operatorname{sign}(\partial C(w_i)/\partial w \cdot m_{i-1}) > 0$, and the required bound follows from Equation (11). If instead $\operatorname{sign}(\partial C(w_i)/\partial w \cdot m_{i-1}) < 0$, then there is a $\zeta \in [w_i, w_{i-1}]$ or $[w_{i-1}, w_i]$ such that $\partial C(\zeta)/\partial w \approx 0$. Therefore, $\partial C(w_i)/\partial w \approx 0$, and Equation (11) is a small value less than 1. By this process, and by setting the constant $\eta_0$ to a small value, $\delta$ is less than 1.

Numerical Tests
Since our method is based on the momentum method and includes its effect, this paper compares GD, Adam, Adagrad, and our proposed method. In Sections 4.1 and 4.2, the performance of each method was compared using a two-variable function as the cost function, which makes it easy to visualize how each method changes the parameters. In Section 4.3, we experimented with the problem of classifying handwritten digits from 0 to 9. This is the most basic dataset for comparing the performance of machine learning methods, and it is called the Modified National Institute of Standards and Technology (MNIST) dataset (see http://yann.lecun.com/exdb/mnist/). This experiment combined the MNIST dataset with a convolutional neural network (CNN), a well-known method that is widely used for image classification [6].

Three-Dimensional Surface with One Local Minimum: Weber's Function
This section visualizes how each method changes the parameters using a two-variable function. The two-variable function used in this experiment (i.e., Weber's function) is generally defined as $f(x) = \sum_{i=1}^{\kappa} \mu_i \lVert x - \nu_i \rVert_2$, where $\kappa$ is an integer, $\mu_i \in \mathbb{R}$, and $\nu_i$ is a column vector in $\mathbb{R}^2$. For convenience, we write $\mu = (\mu_1, \mu_2, \mu_3, \mu_4)$ and $\nu = (\nu_1, \nu_2, \nu_3, \nu_4)$. For the experiment with a local minimum, we used the following hyperparameters: $\kappa = 4$, $\mu = (2, -4, 2, 1)$, $\nu_1 = (-10, -10)$, $\nu_2 = (0, 0)$, $\nu_3 = (5, 8)$, and $\nu_4 = (25, 30)$. Figure 1a shows Weber's function determined by the given $\mu$ and $\nu$. This function has a global minimum at $(25, 30)$ and a local minimum at $(-10, -10)$, which are represented by the red and blue points in Figure 1a, respectively. In order to show the change of the parameters more easily, Figure 1b draws the contour lines of the given Weber's function, connecting points with the same function value as on a map. In this experiment, the initial value of the parameter was set to $(-10, -40)$ and the learning rate was $5 \times 10^{-1}$, in order to check whether any of the methods can escape the local minimum. Figure 2 shows the change of the parameters under each method. GD, Adagrad, and Adam moved toward the local minimum near the initial value and made no further progress; however, our proposed method avoided the local minimum and reached the global minimum.
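As a quick check of the two minimums quoted above, the following sketch evaluates Weber's function at both points. It assumes the standard weighted-distance form $f(x) = \sum_i \mu_i \lVert x - \nu_i \rVert_2$ with the hyperparameters listed in the text; the function and variable names are illustrative.

```python
import math

# Hyperparameters quoted in the text (kappa = 4).
MU = (2.0, -4.0, 2.0, 1.0)
NU = ((-10.0, -10.0), (0.0, 0.0), (5.0, 8.0), (25.0, 30.0))


def weber(x, y, mu=MU, nu=NU):
    # Assumed standard form: weighted sum of Euclidean distances to the points nu_i.
    return sum(m * math.hypot(x - a, y - b) for m, (a, b) in zip(mu, nu))


local_value = weber(-10.0, -10.0)   # value at the reported local minimum
global_value = weber(25.0, 30.0)    # value at the reported global minimum
```

Under this assumed form, the value at $(25, 30)$ is lower than the value at $(-10, -10)$, which agrees with the roles assigned to the two points.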

Three-Dimensional Surface with Three Local Minimums: The Styblinski-Tang Function
In this experiment, the performance of each method was compared using the Styblinski-Tang function, which has three local minimums, as the cost function. The two-variable Styblinski-Tang function is defined by $f(x) = \frac{1}{2}\sum_{i=1}^{2}\left(x_i^4 - 16x_i^2 + 5x_i\right)$ and has its global minimum at $(-2.903, -2.903)$. Figure 3 is presented in the same way as in the previous section. The initial value of the parameter is $(6, 0)$, the learning rate is $1 \times 10^{-2}$, and learning was conducted for 300 iterations. Figure 4 shows the change in the parameters learned by each method. In this experiment, GD, Adagrad, and Adam again moved toward the local minimum near the initial value; however, our proposed method avoided the local minimum and reached the global minimum.
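The behavior of plain GD in this setting can be reproduced with a short sketch. It assumes the standard Styblinski-Tang definition $f(x) = \frac{1}{2}\sum_i (x_i^4 - 16x_i^2 + 5x_i)$ and uses the initial value, learning rate, and iteration count quoted above; only GD is shown, and the function names are illustrative.

```python
def styblinski_tang(x):
    # Assumed standard definition of the Styblinski-Tang function.
    return 0.5 * sum(xi ** 4 - 16.0 * xi ** 2 + 5.0 * xi for xi in x)


def styblinski_tang_grad(x):
    # Component-wise first derivative of the function above.
    return [0.5 * (4.0 * xi ** 3 - 32.0 * xi + 5.0) for xi in x]


def gradient_descent(x0, lr=1e-2, steps=300):
    # Plain gradient descent with a constant learning rate.
    x = list(x0)
    for _ in range(steps):
        g = styblinski_tang_grad(x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


gd_end = gradient_descent((6.0, 0.0))
global_value = styblinski_tang((-2.903534, -2.903534))
```

In this sketch the first coordinate settles in the basin with positive $x_1$, so GD ends at a local minimum whose function value is higher than the value at the global minimum.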

MNIST with CNN
The MNIST dataset is popular in machine learning and consists of 10-class gray-scale images of size 28 × 28. Figure 5 shows part of the MNIST data. For a performance comparison of each method, a simply structured CNN consisting of two convolution layers and one hidden layer was used. For the convolutions, a filter of size 5 × 5 × 32 and a filter of size 5 × 5 × 32 × 64 were used, the batch size was set to 64, and a dropout of 50% was applied. Learning was conducted for 8800 iterations, and the learning rate was 0.001. Figure 6 shows the results of the experiment. Figure 6a shows the change in cost during learning for each method, and shows that all of the methods minimized the cost. Figure 6b shows the accuracy on the learning data as learning progresses, and Figure 6c shows the validation accuracy on the test data as learning progresses, to check for over-fitting. For a more detailed comparison, Figure 7 shows the accuracy after a certain number of iterations. Figure 7a shows the training accuracy after 1000 iterations, while Figure 7b shows the validation accuracy after 2000 iterations. The training accuracy was high for all methods; however, it was particularly high for Adam and our proposed method. The order of validation accuracy from high to low is: our proposed method > Adam > Adagrad > GD.

CIFAR-10
In this section, the Canadian Institute For Advanced Research-10 (CIFAR-10) dataset was trained with a residual network (RESNET) model to check the performance of each method in a deeper model. The CIFAR-10 dataset has 60,000 images. The images are in color (size 32 × 32) and belong to 10 different classes (i.e., airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks). The RESNET 44 model was used for the learning; the batch size was 128, the learning rate was 0.001, and the number of iterations was 80,000. Figure 8 shows the results of each method used in this experiment. Figure 8a shows the training cost of each method, while Figure 8b shows the verification accuracy of each method per 10,000 steps. In Figure 8a, Adam and the proposed method had the lowest costs, and the proposed method showed little oscillation. Therefore, it appears to be a stable learning method. In Figure 8b, the proposed method achieved the best performance, followed by the Adam method.

Conclusions
For AI learning, we introduced a method based on the existing momentum method that uses the first derivative of the cost function and the cost function itself at the same time. In addition, our proposed method stops learning in an optimal situation by adjusting the learning rate in response to changes in the cost function. Our proposed method was verified not only through mathematical proofs, but also through the results of numerical experiments. The numerical experiments confirmed that learning with this method is superior to the existing GD, Adagrad, and Adam methods. In particular, the experiments confirmed that stopping learning is also important for effective learning, because it results in better accuracy than the other methods. In other words, for machine learning to work well, it is important to stop learning at the right moment while still allowing the parameters to change substantially during learning.
In future work, we will study how to stop learning at a useful moment while achieving effective learning in ANNs with deeper structures.