A Zeroth-Order Adaptive Learning Rate Method to Reduce Cost of Hyperparameter Tuning for Deep Learning

Abstract: Due to its powerful data representation ability, deep learning has dramatically improved the state of the art in many practical applications. However, its utility highly depends on fine-tuning of hyper-parameters, including learning rate, batch size, and network initialization. Although many first-order adaptive methods (e.g., Adam, Adagrad) have been proposed to adjust the learning rate based on gradients, they are susceptible to the initial learning rate and network architecture. Therefore, the main challenge of using deep learning in practice is how to reduce the cost of tuning hyper-parameters. To address this, we propose a heuristic zeroth-order learning rate method, Adacomp, which adaptively adjusts the learning rate based only on values of the loss function. The main idea is that Adacomp penalizes large learning rates to ensure convergence and compensates small learning rates to accelerate the training process. Therefore, Adacomp is robust to the initial learning rate. Extensive experiments, including comparisons with six typical adaptive methods (Momentum, Adagrad, RMSprop, Adadelta, Adam, and Adamax) on several benchmark datasets for image classification tasks (MNIST, KMNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100), were conducted. Experimental results show that Adacomp is robust not only to the initial learning rate but also to the network architecture, network initialization, and batch size.


Introduction
Deep learning has been highly successful across a variety of applications, including speech recognition, visual object recognition, and object detection [1][2][3]. In a typical application, deep learning consists of a training phase and an inference phase. In the training phase, a predefined network is trained on a given dataset (known as the training set) to learn the underlying distribution characteristics. In the inference phase, the well-trained network is then used on unseen data (known as the test set) to implement specific tasks, such as regression and classification. One fundamental purpose of deep learning is to achieve as high accuracy as possible in the inference phase after learning only from the training set. In essence, training a deep learning network is equivalent to minimizing an unconstrained non-convex but smooth function.
min_{w∈R^n} f(w) := E_{ξ∼D} F(w, ξ), (1)

where D is the population distribution, F(w, ξ) is the loss function of sample ξ, and f(w) is the expectation of F(w, ξ) with respect to ξ. One effective method to solve problem (1) is mini-batch stochastic gradient descent (SGD) [4]. That is, in each iteration t, the model parameter w_t is updated to w_{t+1} following

w_{t+1} = w_t − γ_t · (1/|B_t|) ∑_{ξ∈B_t} ∇F(w_t, ξ), (2)

where γ_t is the learning rate and B_t is a random mini-batch of size b. Here, based on three advantages discussed below, we consider using problem Equations (1) and (2) to analyze deep learning. First, generality. Because most deep learning networks correspond to non-convex optimization, the derived results for problem (1) can be applied to general deep learning tasks in practice. Second, effectiveness. Because the data scale used in deep learning is usually huge, Equation (2) can achieve a better utility-efficiency tradeoff than SGD or batch GD. Third, simplicity. When using Equation (2), searching for a solution to problem (1) is reduced to setting a proper learning rate γ_t. Therefore, the research question is how to set a proper γ_t in Equation (2) to ensure the convergence of problem (1). Without loss of generality, we make the following assumptions.
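As a concrete illustration, the mini-batch update of Equation (2) can be sketched in a few lines of NumPy. The toy problem below (a scalar quadratic per-sample loss, illustrative data and constants, not from the paper) has the convenient property that the minimizer of the expected loss is the data mean:

```python
import numpy as np

def sgd_step(w, per_sample_grads, lr):
    """One step of Equation (2): w <- w - lr * mean of per-sample gradients."""
    return w - lr * np.mean(per_sample_grads, axis=0)

# Toy problem: F(w, xi) = 0.5 * (w - xi)^2, so grad_w F = w - xi, and the
# minimizer of f(w) = E[F(w, xi)] is the mean of the data distribution.
rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=1000)

w = np.zeros(1)
for t in range(200):
    batch = rng.choice(data, size=32)   # random mini-batch B_t of size b = 32
    grads = (w - batch)[:, None]        # per-sample gradients on the batch
    w = sgd_step(w, grads, lr=0.1)
# w is now close to the data mean (about 3.0).
```

The fixed learning rate 0.1 works here only because the toy curvature is known; the rest of the paper is about removing exactly that assumption.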
• Non-convex but smooth f(w). That is, f(w) is non-convex and its gradient is L-Lipschitz: ∥∇f(w) − ∇f(w′)∥ ≤ L∥w − w′∥ for all w, w′ ∈ R^n.
• Unbiased estimate and bounded variance of g(w) = ∇F(w, ξ). That is, E_ξ[g(w)] = ∇f(w) and E_ξ∥g(w) − ∇f(w)∥² ≤ σ².

It is well-known that searching for the global minima of Equation (1) is NP-hard [5]. Therefore, one usually aims to search for a first-order stationary point, the gradient of which satisfies ∥∇f(w)∥ ≤ ε, where ε is a given error bound. For simplicity, denote the average of gradients on mini-batch B_t as ḡ_t = (1/|B_t|) ∑_{ξ∈B_t} ∇F(w_t, ξ). It is well-known that SGD can converge for proper settings of γ_t [6][7][8][9]. Actually, based on the above assumptions and through direct calculation, we have

∑_{t=1}^{T} E∥∇f(w_t)∥² ≤ 2(f(w_1) − f*)/γ + LγTσ²,

where f* = min_{w∈R^n} f(w) and L, σ² are defined in the assumptions. Therefore, for any given constant γ = min{1/L, O(1/√(L + T))}, we can deduce that min_{t≤T} E∥∇f(w_t)∥² decays as T grows, which implies that the expected gradient norm can be driven arbitrarily close to zero as t → ∞. That is, SGD is convergent and one can output a stationary solution with high probability [6].
Nevertheless, one challenge of applying these theoretical results in practice is that the parameters f(w_1), L, and σ are unknown; they are related to the network architecture, network initialization, loss function, and data distribution. To avoid computing exact values that are network- and dataset-dependent, there are two common ways to set the learning rate in practice. One way is setting an initial level at the beginning and then adjusting it with a certain schedule [4], such as step, multi-step, exponential decay, or cosine annealing [10]. However, setting the learning rate typically involves a tuning procedure in which the highest possible learning rate is chosen by hand [11,12]. Besides, there are additional parameters in the schedule that also need to be tuned. To avoid this delicate and skillful parameter tuning, the other way is using adaptive learning rate methods, such as Adagrad [13], RMSprop [14], and Adam [15], in which only the initial learning rate needs to be predefined. However, as shown in our experiments and other studies [16][17][18], they are sensitive to the initial learning rate, and each of them has its own effective interval (refer to Figure 1). Usually, the setting of an initial learning rate is model- and dataset-dependent. This increases the cost of tuning the learning rate and the difficulty of selecting a proper adaptive method in practice.
This motivates us to design an adaptive method that can reduce the cost of tuning the initial learning rate. Furthermore, the method should achieve satisfactory accuracy no matter what the network architecture and data are. To achieve this, we propose a zeroth-order method, Adacomp, to adaptively tune the learning rate. Unlike existing first-order adaptive methods, which adjust the learning rate by exploiting gradients or additional model parameters, Adacomp only uses the values of the loss function and is derived from minimizing Equation (1) with respect to γ_t. Its original expression compares the observed loss decrease f(w_t) − f(w_{t+1}) with the term (γ/2)∥g(w_t)∥², (3) where θ is an undetermined parameter and θ∥g(w_t)∥² will be further substituted by other explicit factors. Refer to Equation (8) for details. Note that Equation (3) only uses the observable variables g(w_t) and f(w_t) − f(w_{t+1}) to adjust γ. It can be interpreted as follows: when f(w_t) − f(w_{t+1}) dominates (γ/2)∥g(w_t)∥², we use an aggressive learning rate to enhance progress; otherwise, we use an exponentially decaying learning rate to ensure convergence. Therefore, at a high level, γ is complementary to the loss difference f(w_t) − f(w_{t+1}), and we name the method Adacomp. It has the following two advantages. Firstly, Adacomp is insensitive to the learning rate, batch size, network architecture, and network initialization. Secondly, by only exploiting values of the loss function rather than high-dimensional gradients, Adacomp has high computational efficiency. In summary, our contributions are as follows.
• We propose a highly computation-efficient adaptive learning rate method, Adacomp, which only uses loss values rather than exploiting gradients as other adaptive methods do. Additionally, Adacomp is robust to the initial learning rate and other hyper-parameters, such as batch size and network architecture.
• Based on the analysis of Adacomp, we give new insight into why a diminishing learning rate is necessary when solving Equation (1) and why a gradient clipping strategy can outperform a fixed learning rate.
• We conduct extensive experiments to compare the proposed Adacomp with several first-order adaptive methods on MNIST, KMNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100 classification tasks. Additionally, we compare Adacomp with two evolutionary algorithms on the MNIST and CIFAR-10 datasets. Experimental results validate that Adacomp is robust not only to the initial learning rate and batch size, but also to network architecture and initialization, with high computational efficiency.
The remainder is organized as follows. Section 2 introduces related work about typical first-order adaptive methods. Section 3 presents the main idea and formulation of Adacomp. In Section 4, we conduct extensive experiments to validate Adacomp, in terms of robustness to learning rate, network architectures, and other hyperparameters. We conclude the paper and list future plans in Section 5.

Related Work
There are many modifications to the gradient descent method, and the most powerful is Newton's method [19]. However, computing the Hessian matrix and its inverse is prohibitively expensive in practice for large-scale models. Therefore, many first-order iterative methods have been proposed to either exploit gradients or approximate the inverse of the Hessian matrix.

Learning Rate Annealing
One simple extension of SGD is mini-batch SGD, which can reduce the variance by increasing the batch size. However, a proper learning rate is hard to set beforehand. A learning rate that is too small will slow down convergence, while one that is too large will cause large oscillations or even divergence. A common approach is to adjust the learning rate during training, such as by using a simulated annealing algorithm [20,21] or by decreasing the learning rate when the loss value falls below a given threshold [22,23]. However, the iteration number and threshold must be predefined. Therefore, the method does not adjust to different datasets.
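For concreteness, two schedules of the kind referred to above (step decay and cosine annealing) can be sketched as follows; the function names and constants are illustrative, not taken from any particular framework:

```python
import math

def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=10):
    """Step schedule: multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return lr0 * drop ** (epoch // epochs_per_drop)

def cosine_anneal(lr0, epoch, total_epochs):
    """Cosine annealing: decay smoothly from lr0 at epoch 0 down to 0 at total_epochs."""
    return 0.5 * lr0 * (1.0 + math.cos(math.pi * epoch / total_epochs))
```

Note that both schedules still require choosing lr0 and their own parameters (drop factor, period, horizon), which is exactly the tuning burden discussed above.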

Per-Dimension First-Order Adaptive Methods
In mini-batch SGD, a single global learning rate is set for all dimensions of the parameters, which may not be optimal when training data are sparse and different coordinates vary significantly. A per-dimension learning rate that can compensate for these differences is often advantageous.
Momentum [24,25] is one method of speeding up training per dimension. The main idea is to accelerate progress along dimensions in which gradients consistently point in the same direction and to slow progress elsewhere. This is done by keeping track of past updates with an exponential decay: v_t = ρ v_{t−1} + γ ḡ_t, w_{t+1} = w_t − v_t, where ρ is an undetermined parameter that controls the decay of previous updates. This gives an intuitive improvement over SGD when the cost surface is a long narrow valley.
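The update above can be sketched as a short NumPy run on a long narrow quadratic valley (the test function and constants are illustrative):

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, rho=0.9):
    """Momentum update: v_t = rho * v_{t-1} + lr * g_t, then w_{t+1} = w_t - v_t."""
    v = rho * v + lr * grad
    return w - v, v

# Long narrow valley: f(x, y) = 0.5 * x^2 + 12.5 * y^2 (curvatures 1 and 25).
w = np.array([1.0, 1.0])
v = np.zeros(2)
for _ in range(300):
    grad = np.array([w[0], 25.0 * w[1]])  # gradient of f at the current point
    w, v = momentum_step(w, v, grad)
# Velocity accumulates along the consistent low-curvature direction,
# and w approaches the minimum at the origin.
```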
Adagrad [13,26] adjusts the learning rate according to the accumulated gradients and has shown significant improvement on large-scale tasks in a distributed environment [27]. This method only uses gradient information, with the following update rule:

w_{t+1} = w_t − (γ / √(∑_{τ=1}^{t} ḡ_τ²)) ⊙ ḡ_t,

where the square, square root, and division are taken per dimension. Here the denominator computes the L2 norm of all previous gradients per dimension. Since the learning rate for each dimension shrinks as the gradient magnitudes accumulate, dimensions with large gradients have a small learning rate and vice versa. This has the nice property, as in second-order methods, that the progress along each dimension evens out over time. However, Adagrad is sensitive to the initial model parameters. When some dimensions of the gradient are too large, or as gradients accumulate, the learning rate will quickly tend to zero before achieving a good result.
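A per-dimension sketch of this rule (hypothetical NumPy code, with the usual small ε added for numerical safety) makes the evening-out effect visible: on the very first step, coordinates with very different gradient magnitudes move by the same amount:

```python
import numpy as np

def adagrad_step(w, g, accum, lr=0.1, eps=1e-8):
    """Adagrad: accumulate squared gradients and scale the step per dimension."""
    accum = accum + g * g
    w = w - lr * g / (np.sqrt(accum) + eps)
    return w, accum

w = np.zeros(2)
accum = np.zeros(2)
g = np.array([10.0, 0.1])  # very different per-dimension magnitudes
w, accum = adagrad_step(w, g, accum)
# First step: each coordinate moves by lr * g / |g|, i.e., 0.1 in magnitude.
```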
Subsequently, several adaptive methods have been proposed to overcome this drawback. RMSprop [14,28] is a modification of Adagrad that uses a root mean square (RMS) to replace the denominator in Adagrad, with the following update rule:

E[ḡ²]_t = ρ E[ḡ²]_{t−1} + (1 − ρ) ḡ_t², w_{t+1} = w_t − (γ / √(E[ḡ²]_t + ε)) ḡ_t.

This can mitigate the fast decay of the learning rate in Adagrad and damp oscillations. The parameter ρ controls the decay speed, and ε is a small positive constant that keeps the denominator meaningful.
Adadelta [29,30] is another improvement of Adagrad, from two aspects. On one hand, Adadelta replaces the denominator in Adagrad with an exponentially decaying average of the history of gradients over a window of some fixed size. On the other hand, Adadelta uses a Hessian approximation to correct the units of the updates. Adadelta uses the following update rule:

E[ḡ²]_t = ρ E[ḡ²]_{t−1} + (1 − ρ) ḡ_t², ∆w_t = −(RMS[∆w]_{t−1} / RMS[ḡ]_t) ḡ_t, w_{t+1} = w_t + ∆w_t,

where RMS[x]_t = √(E[x²]_t + ε). The advantage is that there is no need to manually set a global learning rate. In the beginning and middle phases, Adadelta achieves a good accelerating effect. However, in the late phase, it may oscillate around local minima.
Adam [15,31] can be viewed as a combination of Momentum and RMSprop, with additional bias correction. Adam uses the following update rule:

m_t = β₁ m_{t−1} + (1 − β₁) ḡ_t, v_t = β₂ v_{t−1} + (1 − β₂) ḡ_t², m̂_t = m_t / (1 − β₁^t), v̂_t = v_t / (1 − β₂^t), w_{t+1} = w_t − γ m̂_t / (√v̂_t + ε).

Here m_t, v_t are the momentum and root-mean-square factors, and m̂_t, v̂_t are the corresponding bias corrections.
Adamax [15] is an extension of Adam obtained by generalizing the L2 norm to the Lp norm and letting p → +∞, with the update rule:

m_t = β₁ m_{t−1} + (1 − β₁) ḡ_t, v_t = max(β₂ v_{t−1}, |ḡ_t|), w_{t+1} = w_t − (γ / (1 − β₁^t)) m_t / v_t.

Note that the difference between Adamax and Adam is the expression of v_t, and there is no bias correction in v_t.
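The contrast between the two second-moment terms can be sketched directly (hypothetical helper functions; β₂ = 0.999 as is conventional):

```python
import numpy as np

def adam_second_moment(v, g, beta2=0.999):
    """Adam: exponential moving average of squared gradients (later bias-corrected)."""
    return beta2 * v + (1.0 - beta2) * g * g

def adamax_second_moment(u, g, beta2=0.999):
    """Adamax: the p -> infinity limit replaces the average with a decayed maximum,
    u_t = max(beta2 * u_{t-1}, |g_t|), and needs no bias correction."""
    return np.maximum(beta2 * u, np.abs(g))

g = np.array([2.0])
v = adam_second_moment(np.zeros(1), g)    # small: (1 - beta2) * g^2 = 0.004
u = adamax_second_moment(np.zeros(1), g)  # immediately |g| = 2.0
```

This is why Adam's v_t needs the 1/(1 − β₂^t) correction at early steps, while Adamax's v_t does not.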
However, all these adaptive methods rely on gradients or additional model parameters, which makes them sensitive to the initial learning rate and network architecture.

Hyperparameter Optimization
Hyperparameter optimization aims to find optimal hyperparameter values, such as the learning rate, and includes experimental performance analysis [18,32] and Bayesian optimization [33,34] based on mathematical theory. Specifically, [35] combines Hyperband and Bayesian optimization, additionally utilizing the history of previously explored hyperparameter configurations to improve model utility. Besides, reinforcement learning (RL) and heuristic algorithms are also extensively applied to tune the hyperparameters of deep learning. With respect to RL, the focus is on how to alleviate the dependence on expert knowledge [16,36,37] or additionally improve computational efficiency [38,39]. For example, [16] uses RL to learn and adjust the learning rate at each training step, and [36] proposes an asynchronous RL algorithm to search for the optimal convolutional neural network (CNN) architecture. Ref. [39] reuses previously successful configurations to reshape the advantage function, and [38] adaptively adjusts the horizon of the model to improve the computational efficiency of RL. With respect to heuristic algorithms, Ref. [40] sets an adaptive learning rate for each layer of a neural network by simulating the cross-media propagation mechanism of light in the natural environment. Ref. [41] proposes to use a variable-length genetic algorithm to tune the hyperparameters of a CNN. Ref. [42] proposes a multi-level particle swarm optimization algorithm for CNNs, where an initial swarm at level 1 optimizes the architecture and multiple swarms at level 2 optimize the hyperparameters. Ref. [43] combines six optimization methods to improve the performance of short-term wind speed forecasting, where local search techniques are used to optimize the hyperparameters of a bidirectional long short-term memory network. However, these methods unavoidably suffer from time-consuming searches of the configuration space.

Zeroth-Order Adaptive Methods
Like our method, Refs. [23,44] propose two methods that adjust the learning rate based on the training loss. In [23], the learning rate at epoch s is set as γ_s = γ_0 Π_{t=1}^{s−1} r_{n_t}, where r_{n_t} is a scale factor. However, the selection of r_{n_t} is through multi-point searching (controlled by a beam size), which can incur computational overhead. For example, when the beam size is 4, the network has to be trained with 12 different learning rates to select the optimal one after the current epoch. In [44], the learning rate is updated according to the change in the reconstruction error E(·) of the WAE (wavelet auto-encoder), scaled by a tracking coefficient M. Note that this update rule is model-restricted and cannot be applied to general cases. For example, when E(t + 1) − E(t) > 0, γ_{t+1} may become less than zero and thus meaningless. Neither the computational overhead of [23] nor the invalid learning rates of [44] occur in Adacomp.

Adacomp Method
In this section, we describe our proposed Adacomp in two main steps. First, we deduce the optimal learning rate at each iteration based on theoretical analysis. Then, we design the expression of Adacomp to satisfy several restrictions relating the learning rate and the difference of the loss function. For convenience, we summarize the variables used above and below in Table 1.

Variables Explanations

T, t: Number of total iterations and index of the current iteration.
F(w, ξ), ∇F(w, ξ): Empirical loss function and gradient at model parameter w and sample ξ.
f(w), ∇f(w): Expected function and gradient of F(w, ξ) and ∇F(w, ξ) with respect to ξ ∼ D.
L, σ²: Smoothness constant of f(w) and variance of ∇F(w, ξ) with respect to ξ.
γ_t, β: Learning rate at the t-th iteration, and parameter of Adacomp used to adjust γ_t.

Idea 1: Search Optimal Learning Rate
For problem (1), in which gradients are Lipschitz continuous, we have

f(w_{t+1}) ≤ f(w_t) + ⟨∇f(w_t), w_{t+1} − w_t⟩ + (L/2)∥w_{t+1} − w_t∥².

By substituting Equation (2), we obtain

f(w_{t+1}) ≤ f(w_t) − γ_t⟨∇f(w_t), ḡ_t⟩ + (Lγ_t²/2)∥ḡ_t∥². (4)

To make progress at each iteration t, let the gradient-dependent terms on the r.h.s. of Equation (4) sum to less than zero, which yields γ_t ≤ 2⟨∇f(w_t), ḡ_t⟩/(L∥ḡ_t∥²). To greedily search for the optimal learning rate, we minimize the r.h.s. of Equation (4) with respect to γ_t and obtain

γ_t = ⟨∇f(w_t), ḡ_t⟩/(L∥ḡ_t∥²). (5)

This presents an explicit relation between the learning rate and the inner product of the expected gradient ∇f(w_t) and the observed gradient ḡ_t. When ⟨∇f(w_t), ḡ_t⟩ ≥ 0, Equation (5) is meaningful (γ_t ≥ 0) and using this γ_t makes the largest progress at the current iteration. Otherwise, we set γ_t = 0 when ⟨∇f(w_t), ḡ_t⟩ < 0. This means that when ḡ_t is far from the correct direction ∇f(w_t), we drop the incorrect direction and do not update the model parameters at the current iteration.
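In the deterministic case (ḡ_t = ∇f(w_t)), Equation (5) reduces to the classical 1/L step for a quadratic, which the following sketch verifies (the toy function and constants are illustrative):

```python
import numpy as np

L_smooth = 4.0  # smoothness constant of f below

def grad_f(w):
    """Gradient of f(w) = (L/2) * ||w||^2, whose minimizer is the origin."""
    return L_smooth * w

w = np.array([3.0, -2.0])
g = grad_f(w)  # deterministic setting: the observed gradient equals grad f
gamma = np.dot(grad_f(w), g) / (L_smooth * np.dot(g, g))  # Equation (5)
w_next = w - gamma * g
# gamma equals 1/L here, and a single step lands exactly on the minimizer.
```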
By substituting Equation (5) into Equation (4) and through simple calculation, we obtain Theorem 1, which indicates that the gradient tends to zero as t → ∞, i.e., the convergence of the non-convex problem (1).

Theorem 1. If the learning rate γ_t is set as in Equation (5), then

(1/T) ∑_{t=1}^{T} E[⟨∇f(w_t), ḡ_t⟩² / ∥ḡ_t∥²] ≤ 2L(f(w_1) − f*)/T. (6)

Proof. By substituting Equation (5) into Equation (4), we have

f(w_{t+1}) ≤ f(w_t) − ⟨∇f(w_t), ḡ_t⟩² / (2L∥ḡ_t∥²).

Taking expectation with respect to ξ conditioned on the current w_t, we have

E[f(w_{t+1})] ≤ f(w_t) − (1/(2L)) E[⟨∇f(w_t), ḡ_t⟩² / ∥ḡ_t∥²].

Taking summation on both sides of the above inequality from t = 1 to T and using the fact that f(w_{T+1}) ≥ f*, the claim follows.

Some remarks about Theorem 1 are in order. First, Theorem 1 achieves the optimal convergence rate of smooth non-convex optimization when using the SGD algorithm. In particular, under a deterministic setting (i.e., σ = 0), Nesterov [45] shows that after running the method for at most T = O(1/ε) steps, one can achieve min_{t=1,···,T} ∥∇f(w_t)∥² ≤ ε, where ε is the given error bound. Under a stochastic setting (i.e., σ > 0), the complexity increases to O(1/ε²), as derived by Ghadimi [6]. The optimal rate is improved to O(ε^{−7/4} log(1/ε)) by using accelerated mirror descent [7]. Here, we obtain that using Equation (5) achieves the optimal convergence rate, i.e., O(ε^{−2}), for stochastic optimization when using SGD.
Second, a diminishing learning rate is necessary to ensure convergence, i.e., the gradient ∇f(w_t) tends to zero as t → ∞. In this case, Equation (5) implies that the optimal learning rate γ_t also diminishes, since its numerator ⟨∇f(w_t), ḡ_t⟩ vanishes. Third, gradient clipping is a useful technique to ensure convergence. As proved in [46], using a clipped gradient can converge faster than a fixed learning rate. Here we offer an explicit explanation. Equation (5) can be viewed as clipping the gradient, which in turn means that clipping the gradient is equivalent to using a varying learning rate to update the model parameters. Therefore, for any fixed learning rate γ_t ≡ γ, we can adjust it based on Equation (5) to make faster progress. That is, proper gradient clipping can outperform any fixed learning rate update.
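The equivalence between gradient clipping and a varying learning rate noted above can be checked directly: clipping the gradient norm to a threshold c before a step of size γ is the same as taking an unclipped step with the adaptive rate γ·min(1, c/∥ḡ∥). A small sketch with hypothetical names:

```python
import numpy as np

def clipped_step(w, g, lr, c):
    """Step with the gradient rescaled to have norm at most c."""
    g_clip = g * min(1.0, c / np.linalg.norm(g))
    return w - lr * g_clip

def adaptive_lr_step(w, g, lr, c):
    """The same update, viewed as an adaptive learning rate lr * min(1, c / ||g||)."""
    eff_lr = lr * min(1.0, c / np.linalg.norm(g))
    return w - eff_lr * g

w = np.array([1.0, 2.0])
g = np.array([3.0, 4.0])  # norm 5, clipped to norm 1 below
a = clipped_step(w, g, lr=0.1, c=1.0)
b = adaptive_lr_step(w, g, lr=0.1, c=1.0)
# a and b are identical, and the step length is exactly lr * c = 0.1.
```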

Idea 2: Approximate Unknown Terms
To set the learning rate according to Equation (5), two terms, the expected gradient ∇f(w_t) and the smoothness constant L, are unknown in practice. Note that in the training process, we only have access to the stochastic gradient ḡ_t, the model parameters w_t, and the loss value f(w_t). Most first-order adaptive learning methods exploit the stochastic gradient and model parameters, which are usually high-dimensional vectors. Instead, we use the stochastic gradient ḡ_t and the loss value to reformulate Equation (5). In particular, based on Equation (4), we have

f(w_t) − f(w_{t+1}) ≥ γ_t⟨∇f(w_t), ḡ_t⟩ − (Lγ_t²/2)∥ḡ_t∥². (7)

In the above inequality, the squared gradient norm ∥ḡ_t∥² and the learning rate γ_t are known, and the expected loss value f(w_t) at the current iteration can be approximated by the empirical value F(w_t, ξ); however, the smoothness constant L and the loss value f(w_{t+1}) of the next iteration are unknown. From the basic analysis [6], it is known that when Lγ_t ≤ 1, SGD converges to a stationary point. Therefore, we introduce a new parameter θ := Lγ_t ∈ (0, 1] and use f(w_{t−1}) − f(w_t) to approximate f(w_t) − f(w_{t+1}), based on the assumption that f(w) is smooth. Then, for any given learning rate γ subject to θ = Lγ ∈ (0, 1], we adjust the learning rate based on Equation (8), which balances the loss decrease f(w_{t−1}) − f(w_t) against θ∥ḡ_t∥² in the numerator. Equation (8) has straightforward interpretations. First, when ∥ḡ_t∥² dominates f(w_{t−1}) − f(w_t) in the numerator, we use an exponentially decaying learning rate strategy to prevent divergence. This includes the case where f(w_{t−1}) − f(w_t) tends to zero, i.e., the model converges to a stationary point. In such a case, γ_t decays to zero to stabilize the process; as shown in [46], any fixed learning rate larger than a threshold will cause training to diverge. Second, when f(w_{t−1}) − f(w_t) is relatively large, which means the current model point is located on a rapidly descending surface of f(w), we can use a relatively large learning rate to accelerate descent. This usually happens in the initial phase of the training procedure.
Third, the parameter θ ∈ (0, 1] controls the tradeoff between ∥ḡ_t∥² and f(w_{t−1}) − f(w_t). Note that these two terms have different magnitudes, and ∥ḡ_t∥² is generally much larger than f(w_{t−1}) − f(w_t). In that case, Equation (8) with θ = 1 reduces to a purely exponential decaying strategy, which may slow down convergence. To address this, one can set an adaptive θ according to the relative magnitudes of the two terms. Although Equation (8) is meaningful, the remaining challenge is how to set a proper θ according to the observed gradients ∥ḡ_t∥² and the loss difference ∆_t = f(w_{t−1}) − f(w_t). To address this, we treat θ∥ḡ_t∥² as a whole and reformulate it in terms of observable quantities. Here, we propose an effective adaptive schedule, Adacomp, which only uses the information of loss values and satisfies the following requirements.
• When γ is too small, we should increase γ to prevent Adacomp from becoming trapped in a local minimum.
• When γ is too large, we should decrease γ to stabilize the training process.
These requirements motivate us to design Adacomp based on the arctan function, which is flexible for small values but robust and bounded for large values. The main principle is as follows. Decompose the adjustment of γ into three parts, Φ₁, Φ₂₁, and Φ₂₂. Φ₁ is used to compensate the learning rate to accelerate training when the loss function decreases; the compensation should be inverse to the learning rate and bounded. Φ₂₁ and Φ₂₂ are used to penalize the learning rate to stabilize training when the loss function increases, but with a bounded amplitude when the learning rate is too large or too small.
Based on the above principle, we reformulate Equation (8) as Equation (9), combining Φ₁, Φ₂₁, and Φ₂₂ through indicator functions of the sign of ∆_t, where I is the indicator function and ε is a small positive constant. Many functions satisfy the above discipline. To reduce the difficulty of designing Φ₁, Φ₂₁, and Φ₂₂, we define the following expressions, in which only one parameter β needs to be tuned. Furthermore, as the experimental results show (refer to Figure A1), Adacomp is not very sensitive to β.
Here, 1/2 < β ≤ 5 is a parameter used to control the adjustment amplitude of the learning rate. We first explain the meaning of ε and then Φ₁, Φ₂₁, and Φ₂₂. The meaning of ε. We replace the hard threshold ∆_t < 0 or ∆_t > 0 with the soft threshold ∆_t < −ε or ∆_t > ε based on two considerations. First, this alleviates the impact of the randomness of ∆_t on the learning rate. Second, when ∆_t ∈ [−ε, ε] for a small positive value such as ε = 10⁻⁵, we halve γ. In such a case the training has essentially converged, and halving γ makes training more stable. Now, we explain how to set the expressions of Φ₁, Φ₂₁, and Φ₂₂.
• Expression of Φ₁. Note that Φ₁ works only if ∆_t > 0. In this case, we should increase γ to speed up training. However, the increment should be bounded and inverse to the current γ to avoid divergence. Based on these considerations, we define Φ₁ using the arctan function, where the constant 1/2 is used to keep γ unchanged and the term arctan(·)/(5π) ensures that the increment amplitude is at most γ/10.
• Expressions of Φ₂₁ and Φ₂₂. Note that Φ₂₁ and Φ₂₂ work only if ∆_t < 0, where the two 1/4 constants are used to keep γ unchanged and the remaining terms control the decrement amplitude. In this case, we should decrease γ to prevent too much movement along the incorrect direction. However, the decrement should be bounded to satisfy the two following properties.

- When γ is small, the decrement from Φ₂₁ + Φ₂₂ should be less than the increment from Φ₁ given the same |∆_t|; otherwise γ would be forced to zero, which potentially leads to training stopping too early. Therefore, we use the term arctan(·)/(5.5π) in Φ₂₁, so that the decrement amplitude stays below the increment amplitude arctan(·)/(5π) when γ is small, which satisfies the requirement.
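The bounded-amplitude claims above all rest on arctan being bounded by π/2, so a term of the form arctan(·)/(5π) can never exceed 1/10, and arctan(·)/(5.5π) stays strictly below it for the same argument. The exact arguments of arctan in the compensation and penalty terms are not fully recoverable from this excerpt, so the quick numerical check below only verifies the bounds on the arctan terms themselves:

```python
import math

def compensation_gain(x):
    """arctan(x) / (5*pi): bounded by 1/10 in absolute value for all x."""
    return math.atan(x) / (5.0 * math.pi)

def penalty_gain(x):
    """arctan(x) / (5.5*pi): strictly smaller than compensation_gain for the
    same positive argument, so the decrement cannot outrun the increment."""
    return math.atan(x) / (5.5 * math.pi)
```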
In summary, Equation (9) penalizes a large learning rate, compensates a too-small learning rate, and presents an overall decreasing trend. These satisfy all the requirements. Since this adjusting strategy is complementary to ∆_t and γ, we name it Adacomp. Although Adacomp is robust to hyperparameters, it may fluctuate around a local minimum. The reason is that Adacomp increases the learning rate when the loss function decreases near a local minimum, and the increment potentially makes the model parameters skip the local minimum. Adacomp then in turn penalizes the learning rate to decrease the loss function. Thus, the alternating adaptive adjustment can make the model parameters fluctuate around a local minimum. To achieve both robustness and high utility, one can view Adacomp as a pre-processing step that reduces the cost of tuning hyperparameters, and combine Adacomp with other methods to improve the final model utility.

Algorithm 1 shows the pseudocode, which consists of two phases. Phase one uses Adacomp to quickly train the model while avoiding careful tuning of the initial learning rate. Phase two uses other methods (e.g., SGD) to improve the model utility. More details are as follows. At the beginning (Input), one should define four parameters: the number of total iterations T with phase threshold T₁, the initial learning rate γ₀, and the specific value β of Adacomp used to control the tradeoff between increment and decrement.
In phase 1 (Lines 2-8), two variables, loss_l and loss_c, are used to track the loss difference ∆_t in Equation (9), where loss_l = (1/b)∑_{ξ∈B_{t−1}} F(w_{t−1}, ξ) is the loss of the last iteration and loss_c = (1/b)∑_{ξ∈B_t} F(w_t, ξ) is the loss of the current iteration (Lines 3-5). Then ∆_t is used to update the learning rate based on Equation (9) (Line 6). The updated learning rate γ_t is used to update the model parameters w_t to w_{t+1} via w_{t+1} = w_t − γ_t ḡ_t (Line 7) until t reaches T₁. In phase 2, we set the average of the learning rates used in phase one, γ_avg, as the new initial learning rate (Line 9). Then, one can adopt a decaying scheduler, such as StepLR, in the mini-batch gradient descent update to stabilize the training process (Lines 11-13). The final model parameter is output when the stopping criterion is satisfied (Line 15).
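Since the full Equation (9) is not reproduced in this excerpt, the two-phase structure of Algorithm 1 can only be sketched with a simplified stand-in for the Adacomp rule. The function `adacomp_update` below is an illustrative approximation (compensate when the loss drops, penalize when it rises, halve near convergence), not the paper's exact formula:

```python
import math

def adacomp_update(gamma, delta, beta=0.6, eps=1e-5):
    """Illustrative stand-in for Equation (9); delta = loss_last - loss_current."""
    if delta > eps:    # loss decreased: bounded compensation, inverse to gamma
        return gamma * (1.0 + 2.0 * math.atan(beta * delta / gamma) / (5.0 * math.pi))
    if delta < -eps:   # loss increased: bounded penalty
        return gamma * (1.0 - 2.0 * math.atan(-beta * delta * gamma) / (5.0 * math.pi))
    return gamma / 2.0  # |delta| <= eps: training has stabilized, halve gamma

def train_two_phase(loss_and_grad, w, gamma0, T, T1, beta=0.6):
    """Two-phase structure of Algorithm 1: an Adacomp-style phase, then plain SGD
    with the phase-one average learning rate."""
    gamma, gammas, loss_last = gamma0, [], None
    for t in range(T1):                    # phase 1 (Lines 2-8)
        loss_cur, g = loss_and_grad(w)
        if loss_last is not None:
            gamma = adacomp_update(gamma, loss_last - loss_cur, beta)
        gammas.append(gamma)
        w = w - gamma * g                  # Line 7
        loss_last = loss_cur
    gamma_avg = sum(gammas) / len(gammas)  # Line 9
    for t in range(T1, T):                 # phase 2 (Lines 11-13)
        _, g = loss_and_grad(w)
        w = w - gamma_avg * g
    return w, gamma_avg

# Toy run on f(w) = 0.5 * w^2, deliberately starting from a tiny learning rate:
w_final, gamma_avg = train_two_phase(lambda w: (0.5 * w * w, w),
                                     w=5.0, gamma0=1e-3, T=120, T1=60)
```

Even with γ₀ three orders of magnitude below a reasonable value, the stand-in rule grows the learning rate out of the tiny initial setting and still converges, which mirrors the robustness claim; the real Equation (9) differs in its exact terms.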

Experiments
In this section, we describe extensive experiments that validate the performance of Adacomp on five classification datasets, in comparison with other first-order iterative methods, including SGD, Momentum, Adagrad, RMSprop, Adadelta, Adam, and Adamax. Experimental results show that Adacomp is more robust to hyper-parameters and network architectures. Code and experiments are publicly available at https://github.com/IoTDATALab/Adacomp (accessed on 26 October 2021).

Results on MNIST Dataset
We ran Adacomp (with β = 0.6 in Equation (9)) on the MNIST dataset to validate its robustness with respect to learning rate, batch size, and initial model parameters.
Notably, Adacomp works well over a wider range of learning rates than all other compared methods.

Robustness to Initial Learning Rate
We set the learning rate (LR) from 10⁻⁵ to 15, using fine granularity in the interval [0.01, 1] and loose granularity outside it. Figure 1 shows the test accuracy after 10 epochs when LR lies in [0.01, 1]. It is observed that only Adadelta and Adacomp work when LR is greater than one. Additionally, we illustrate results when LR lies in [1, 15], and the effective intervals of all methods are shown in Figure 2. The training batch size was fixed at 64 and the initial seed was set to 1 in all cases. Two conclusions can be drawn from Figures 1 and 2. First, Adacomp is much more robust than the other seven iterative methods. From Figure 1, Adacomp achieves test accuracy greater than 98% for all settings of the learning rate. In contrast, RMSprop, Adam, and Adamax are sensitive to the learning rate and only work well for very small learning rates. The remaining methods, except Adadelta, have an intermediate robustness between Adamax and Adacomp. However, when the learning rate is greater than one, only Adadelta and Adacomp still work. Furthermore, from the top of Figure 2, it is observed that Adadelta works well until the learning rate increases to 9, while Adacomp still works even when the learning rate is up to 15. The bottom of Figure 2 (logarithmic scale on the horizontal axis) illustrates that each method has its own effective interval. From RMSprop at the top to Adadelta second from the bottom, the effective interval gradually slides from left to right. However, Adacomp (ours) has a much wider interval, which means one can successfully train the model almost without tuning the learning rate. The reason for this strong robustness is that Adacomp adjusts the learning rate to a high value in the first few epochs before shrinking it to a small value, no matter what the initial setting is. Second, an adaptive strategy may not always outperform SGD, which uses a fixed learning rate.
As shown in Figure 1, the adaptive methods Momentum, Adagrad, RMSprop, Adam, and Adamax are more sensitive to the learning rate than SGD. This means that when using these adaptive strategies, a proper setting of the initial learning rate is necessary. However, we also observe that the adaptive methods Adadelta and Adacomp are much more robust to the initial learning rate than SGD.

Robustness to Other Hyperparameters
We conducted more experiments to compare the robustness of the eight iterative methods to batch sizes and initial model parameters. Figure 3a shows the impact of batch sizes of 16, 32, 64, and 128, and Figure 3b shows the impact of repeating the experiment four separate times, in which the network was initialized using the same random seed (1, 10, 30, and 50, respectively) for the different optimizers. For fairness, we set the learning rate to 0.01; at this level, all eight methods have roughly equivalent accuracy (refer to Figure 1). It is observed from Figure 3a that the robustness to batch size, from strong to weak, is in the order Momentum ≈ Adagrad ≈ Adamax ≈ Adacomp > Adadelta > SGD > RMSprop > Adam (note that Adam diverged when the batch size was set to 16). It is observed from Figure 3b that the robustness to initialization seeds is in the order Adagrad ≈ Momentum ≈ Adacomp ≈ Adamax > Adadelta > SGD > Adam > RMSprop (note that RMSprop diverged when the seed was set to 30 or 50). In summary, for a given setting of the learning rate, Momentum, Adagrad, Adamax, and Adacomp are the most robust to batch sizes and initialization seeds, while RMSprop and Adam are the least robust.

Convergence Speed and Efficiency
To compare convergence speed, we further recorded the number of epochs at which prediction accuracy first exceeds 98% (Column 3 in Table 2). The convergence speed is defined as the ratio of Epochs to #≥98%, the number of runs reaching 98%. Taking SGD as an example, 13 out of 25 runs reached a final prediction accuracy greater than 98% (refer to Figure 1). Among these 13 successful cases, SGD took a total of 19 epochs to achieve 98%, so its convergence speed is 19/13 = 1.46. It is observed from Table 2 that Adacomp has the fastest convergence speed, 0.92, which means Adacomp achieves an accuracy of at least 98% in no more than one epoch on average. Momentum follows with a convergence speed of 1.09, and Adamax has the slowest speed of 3.5.
To compare efficiency, we ran each method ten times and recorded the average training and total time in the last column of Table 2. Adacomp achieves the lowest time consumption (excluding SGD), while Momentum and RMSprop consume relatively more time. Although the improvement is slight, it can be deduced that Adacomp performs fewer computations than the other first-order adaptive methods to adjust the learning rate.

Results on CIFAR-10 Dataset
In this section, we conducted further experiments on CIFAR-10 to compare the robustness of the eight iterative methods. Each method was run with three learning rate settings (0.5, 0.05, and 0.005) and six network architectures (LeNet, VGG, ResNet, MobileNet, SENet, and SimpleDLA). In this setting, we chose β = 5 in Adacomp. Figure 4 compares the test accuracy on the six architectures when LR is 0.5, 0.05, and 0.005. Figure 5a,b show the changes in test accuracy when LR is reduced from 0.5 to 0.05 and from 0.05 to 0.005, respectively. From these results, we summarize the robustness of the eight methods with respect to network architecture. Firstly, we consider the model utility separately for each learning rate setting. (1) From Figure 4a, Adacomp performs similarly to the best method, Adadelta, whereas Momentum, RMSprop, Adam, and Adamax fail on almost all architectures when LR is 0.5. That is, these adaptive methods are not compatible with a large initial learning rate, which was also observed on MNIST (Figure 1). Among the remaining methods, Adagrad is much more sensitive to the network architecture than Adadelta, Adacomp, and SGD. Surprisingly, SGD achieves results comparable to Adadelta and Adacomp, except on the LeNet architecture. (2) From Figure 4b, all methods except RMSprop and Adam perform similarly on each architecture. Furthermore, except on MobileNet, Adacomp achieves the highest accuracy, followed by Adadelta and SGD (both degrade slightly on SENet and SimpleDLA). RMSprop and Adam are again more sensitive to the network architecture than the others. (3) From Figure 4c, no method diverged when LR was set to 0.005.
Specifically, the adaptive methods Momentum, RMSprop, Adam, Adamax, and Adacomp (ours) perform similarly on each of the six architectures, except that Adacomp has an advantage on SimpleDLA and a disadvantage on MobileNet. Adagrad achieves similar performance except on MobileNet. However, Adadelta and SGD show an overall degradation compared to the other methods. This indicates that a small initial learning rate is appropriate for almost all adaptive methods.
Secondly, we combine Figures 4 and 5a,b to assess the robustness of each method to network architecture and learning rate. (1) Figure 4c shows that when the learning rate is small, all methods are similarly robust to the network architecture; specifically, they perform worse on LeNet and MobileNet than on the other architectures, because these two architectures are relatively simple and more sensitive to their inputs. However, Figure 4a shows that when the learning rate is large, all methods except Adadelta, Adacomp, and SGD are sensitive to the network architecture. (2) Figure 5a,b show that Momentum, Adagrad, RMSprop, Adam, and Adamax are more sensitive to the learning rate than the others. The reason is that when the learning rate is large, the momentum term in Momentum grows so large that training diverges, as does the accumulation of gradients in Adagrad, RMSprop, Adam, and Adamax. In contrast, SGD, Adadelta, and Adacomp are relatively insensitive to the learning rate, although SGD and Adadelta show an overall degradation when the learning rate is small. In summary, the proposed Adacomp is robust to both the learning rate and the network architecture.
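The sweep described above is a full Cartesian grid over optimizers, learning rates, and architectures. A sketch of how such a grid can be enumerated (the training routine itself is not shown; names are placeholders):

```python
# Enumerate the (optimizer, learning rate, architecture) grid from the
# CIFAR-10 robustness experiment; each tuple corresponds to one training run.

learning_rates = [0.5, 0.05, 0.005]
architectures = ["LeNet", "VGG", "ResNet", "MobileNet", "SENet", "SimpleDLA"]
optimizers = ["SGD", "Momentum", "Adagrad", "RMSprop",
              "Adadelta", "Adam", "Adamax", "Adacomp"]

grid = [(opt, lr, arch)
        for opt in optimizers
        for lr in learning_rates
        for arch in architectures]
print(len(grid))  # 144 runs: 8 optimizers x 3 learning rates x 6 architectures
```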

Results on Other Datasets
We conducted further experiments to validate the robustness of Adacomp across datasets. In particular, we applied the network architecture used for MNIST in Section 4.2 to the KMNIST and Fashion-MNIST datasets, and employed a new network architecture (refer to https://github.com/junyuseu/pytorch-cifar-models.git, accessed on 20 August 2021) for the CIFAR-100 classification task. In each case, the learning rate was set to 0.001, 0.01, 0.1, and 1, and we recorded the average time on CIFAR-100 over 10 repetitions. The final prediction accuracy after 150 epochs and the average time on CIFAR-100 are shown in Table 3. Moreover, the mean and standard deviation of the prediction accuracy over the learning rates 0.001, 0.01, 0.1, and 1 are shown in Figure 6. Three conclusions can be drawn. First, each method has its own effective learning rate range. For example, as Table 3 shows, Adam and Adamax are more effective when the learning rate is relatively small (e.g., 0.001), whereas SGD and Adadelta are more effective when it is relatively large (e.g., 0.1). However, when the learning rate equals 1, all methods except Adadelta and Adacomp fail on at least one of the three datasets. Second, as Figure 6 shows, Adacomp is more robust than all other methods on the three datasets: points with a large horizontal value indicate high average prediction accuracy, and points with a small vertical value indicate low variance, making it evident that Adacomp is the most robust to the choice of dataset. Third, Adacomp is computationally efficient, as directly observed from the last column of Table 3, which reports the training time and total time. This is because Adacomp adjusts the learning rate using only the training loss, while the other adaptive methods exploit gradients or additional model parameters.
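Each point in Figure 6 summarizes one method by the mean and spread of its final accuracies over the four learning rates. A sketch of that computation; the accuracy values below are made-up placeholders, not values from Table 3:

```python
# Summarize one method's robustness to the learning rate: average final
# accuracy (horizontal axis of Figure 6) and its standard deviation
# (vertical axis). Lower spread at high mean indicates robustness.
from statistics import mean, pstdev

acc_by_lr = {0.001: 0.97, 0.01: 0.98, 0.1: 0.97, 1.0: 0.96}  # hypothetical
avg = mean(acc_by_lr.values())
spread = pstdev(acc_by_lr.values())
print(round(avg, 3), round(spread, 4))  # 0.97 0.0071
```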

Conclusions and Future Work
We proposed a method, Adacomp, which adaptively adjusts the learning rate by exploiting only the values of the loss function. Consequently, Adacomp has higher computational efficiency than gradient-based adaptive methods such as Adam and RMSprop. At a high level, Adacomp penalizes large learning rates to ensure convergence and compensates small learning rates to accelerate training; as a side effect, it can help escape from local minima with a certain probability. Extensive experimental results show that Adacomp is robust to the network architecture, network initialization, batch size, and learning rate. The results also show that Adacomp is inferior to Adadelta and some other methods in terms of the maximum validation accuracy over all learning rates, so the presented algorithm is not yet an alternative to the state of the art. Note, however, that the adaptive methods use different learning rates for different parameters, whereas Adacomp determines only a single global learning rate.
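The penalize/compensate idea can be illustrated with a toy zeroth-order controller. This is not Adacomp's actual update rule (Equation (9) is not reproduced here); the multiplicative factors and bounds below are invented for illustration only:

```python
# Toy loss-driven learning rate controller, illustrating the idea above:
# shrink the learning rate when the loss worsens (penalize large steps),
# grow it when the loss improves (compensate small steps). Only loss
# values are used -- no gradient information.

def adjust_lr(lr, prev_loss, curr_loss, up=1.1, down=0.5,
              lr_min=1e-6, lr_max=1.0):
    if curr_loss > prev_loss:   # loss got worse: the step was too large
        lr *= down
    else:                       # loss improved: try a slightly larger step
        lr *= up
    return min(max(lr, lr_min), lr_max)  # keep lr in a sane range

print(round(adjust_lr(0.1, prev_loss=1.0, curr_loss=1.2), 4))  # 0.05
print(round(adjust_lr(0.1, prev_loss=1.0, curr_loss=0.8), 4))  # 0.11
```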
In future work, we will validate Adacomp on more tasks (beyond convolutional neural networks) and extend it to a per-dimension first-order algorithm to improve the accuracy of SGD. Additionally, we will apply Adacomp to distributed environments; since Adacomp uses only the values of the loss function, it is well suited to distributed settings where communication overhead is a bottleneck. Furthermore, we will study how to set the parameters of Adacomp in an end-to-end manner, which may be achieved by introducing feedback and control modules.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Extended Experiments
More experiments were conducted in this section to validate Adacomp from three aspects.

• Setting different values of β in Adacomp to show its impact. As shown in Equation (9), β is a parameter that tunes the tradeoff between the competing terms of Equation (9). In Section 4, β was set to the constant 0.6 for all experiments, without presenting its impact. Here, experimental results on MNIST show that Adacomp is not very sensitive to β.
• Using more metrics to provide an overall validation. In Section 4, we used prediction accuracy to compare algorithms on classification tasks; a single metric may not provide a complete picture. Here, we provide complementary experiments with three more metrics: precision, recall, and F1-score. Experimental results on MNIST, Fashion-MNIST, and CIFAR-10 show that Adacomp performs stably under these additional metrics.
• Comparing with evolutionary algorithms to enrich the experiments. In Section 4, we compared Adacomp with six first-order adaptive algorithms, without comparison to some state-of-the-art approaches. For completeness, we compared Adacomp with two typical evolutionary methods: the genetic algorithm and the particle swarm optimization algorithm. Experimental results show that Adacomp significantly reduces time cost at the price of a small accuracy degradation.
Experimental details and results are as follows.
Appendix A.1. Impacts of β on Adacomp
As discussed for Equation (9), the range of β is (0.5, 5]. To present a comprehensive investigation, we selected β from [0.6, 1] with step size 0.1 and from [1, 5] with step size 0.5. Furthermore, the learning rate was set to 0.01, 0.1, and 1, and the other settings were the same as in Figure 2. Detailed results are shown in Figure A1, from which two conclusions are obtained. For brevity, lr abbreviates learning rate, and c in parentheses denotes clipping gradients with bound 10. First, Adacomp is more sensitive to β when the learning rate is small. When lr = 1, Adacomp fails only at the two settings β = 4 and β = 5, and the number of failing settings grows as lr decreases: when lr = 0.01, the effective β values lie almost entirely in [0.6, 1]. This also shows that Adacomp is insensitive to the learning rate when β ≤ 1. Second, clipping gradients can significantly improve the robustness of Adacomp. Inspecting the training process, we found that divergence happened in the first few iterations, where the gradient norm tended to infinity. To overcome this, we clipped gradients with bound 10 (larger or smaller bounds had no further impact). The results for lr = 0.01(c), 0.1(c), and 1(c) show that Adacomp is robust to both β and the learning rate when the gradient is additionally clipped.
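Gradient clipping by norm, as used above, rescales the gradient so its L2 norm does not exceed the bound (10 in these experiments). A minimal sketch in plain Python; in PyTorch the equivalent is `torch.nn.utils.clip_grad_norm_`:

```python
# Clip a gradient vector to a maximum L2 norm: if the norm exceeds the
# bound, rescale every component by bound/norm; otherwise leave it alone.
import math

def clip_by_norm(grad, bound=10.0):
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > bound:
        scale = bound / norm
        grad = [g * scale for g in grad]
    return grad

g = [30.0, 40.0]        # norm 50, exceeds the bound of 10
print(clip_by_norm(g))  # rescaled so the norm is exactly 10
```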

Appendix A.2. Comparison Using More Metrics
To present an overall validation, we adopted three additional metrics in this section: precision, recall (also called sensitivity), and F1-score (the harmonic mean of precision and recall). The datasets used were MNIST, Fashion-MNIST, and CIFAR-10, and the learning rate was set to 0.01, 0.1, and 1. For MNIST and Fashion-MNIST, the network architecture and parameter settings were the same as in the corresponding experiments in Section 4; for CIFAR-10, the network architecture was LeNet. Detailed results are shown in Table A2.
Figure A1. Impacts of β of Adacomp on prediction accuracy. The dataset used was MNIST and the network architecture was the same as in Section 4.2. Here, LR abbreviates learning rate and c in parentheses denotes clipped gradients.
First, each algorithm has similar precision, recall, and F1-score when its accuracy is high. For example, when accuracy is greater than 90% on MNIST (or 80% on Fashion-MNIST, 60% on CIFAR-10), the performance under the different metrics remains almost unchanged, mainly because these datasets have an even class distribution. Second, although Adacomp does not always achieve the highest accuracy or F1-score, it is the most robust to the learning rate. For a large learning rate (lr = 1), only Adadelta and Adacomp (ours) work well, and of the two, Adacomp is the more robust. Note that on CIFAR-10, Adadelta achieves F1-scores of 0.45, 0.61, and 0.65 for lr = 0.01, 0.1, and 1, respectively, whereas Adacomp achieves 0.65, 0.67, and 0.65. The same pattern is observed on Fashion-MNIST.
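For reference, the three extra metrics can be computed from scratch. A sketch for a binary toy example (the paper's tables report multi-class weighted averages, e.g., as produced by scikit-learn's `precision_recall_fscore_support`):

```python
# Precision, recall, and F1-score for one positive class, computed from
# true-positive (tp), false-positive (fp), and false-negative (fn) counts.
def precision_recall_f1(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)  # harmonic mean
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
p, r, f = precision_recall_f1(y_true, y_pred)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.667 0.667
```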

Appendix A.3. Comparison with Evolutionary Algorithms
We compared Adacomp with two evolutionary algorithms, the genetic algorithm (GA) and particle swarm optimization (PSO), on the MNIST, Fashion-MNIST, and CIFAR-10 datasets. The code for GA was modified from https://github.com/jishnup11/-Fast-CNN-Fast-Optimisation-of-CNN-Architecture-Using-Genetic-Algorithm (accessed on 24 October 2021) and the code for PSO was based on https://github.com/vinthony/pso-cnn (accessed on 24 October 2021). The parameter settings were as follows, and detailed results are shown in Table A1, where the results for Adacomp are the highest values from Table A2.
• For GA, we used 10 generations with 20 populations per generation. The mutation, random selection, and retain probabilities were set to 0.2, 0.1, and 0.4, respectively.
• For PSO, the swarm size and number of iterations were both set to 100, and the inertia weight and acceleration coefficients were all set to 0.5.
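The settings above can be collected into configuration dicts, which also make explicit why the evolutionary methods are costly: GA alone evaluates up to generations × population candidate networks. This is a restating of the text's parameters, not code from the referenced repositories:

```python
# GA and PSO settings from the comparison experiment, as plain configs.
ga_config = {
    "generations": 10,
    "population": 20,
    "mutation_prob": 0.2,
    "random_select_prob": 0.1,
    "retain_prob": 0.4,
}
pso_config = {
    "swarm_size": 100,
    "iterations": 100,
    "inertia_weight": 0.5,
    "acceleration_coeffs": (0.5, 0.5),
}

# Upper bound on candidate networks GA trains across the whole search.
total_ga_evals = ga_config["generations"] * ga_config["population"]
print(total_ga_evals)  # 200
```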
Two conclusions can be drawn from Table A1. First, GA and PSO outperform Adacomp on MNIST and Fashion-MNIST, because they explore far more parameter settings than Adacomp and thus have a high probability of finding better solutions in a larger search space; however, this does not hold on CIFAR-10.
Second, Adacomp is much more efficient than GA and PSO: where it performed similarly to them, it consumed only about one-tenth of their time.
Based on Table A2, it may be possible to combine GA or PSO with Adacomp to improve efficiency and utility simultaneously.