An Improvement of Adam Based on a Cyclic Exponential Decay Learning Rate and Gradient Norm Constraints

Abstract: To address limitations of the Adam algorithm, such as hyperparameter sensitivity and unstable convergence, this paper proposes an improved optimization algorithm, the Cycle-Norm-Adam (CN-Adam) algorithm. The algorithm integrates a cyclic exponential decay learning rate (CEDLR) with gradient norm constraints. By dynamically adjusting the learning rate, it accelerates the convergence of the Adam algorithm and improves its generalization performance. To verify the effectiveness of the CN-Adam algorithm, we conducted extensive experimental studies. The CN-Adam algorithm achieved significant performance improvements on both standard datasets: 98.54% accuracy on the MNIST dataset and 72.10% on the CIFAR10 dataset. Because medical images are complex and domain-specific, the algorithm was also tested on a medical dataset, where it achieved an accuracy of 78.80%, which was better than the other algorithms. The experimental results show that the CN-Adam optimization algorithm provides an effective optimization strategy for improving model performance and promoting medical research.


Introduction
With the rapid development of deep learning, the importance of optimization algorithms in training neural networks is increasing. Choosing an appropriate optimization algorithm is crucial for the training speed and performance of a model. In current deep learning practice, the Adam optimization algorithm is widely used because it converges quickly and generalizes well. However, optimization algorithms like Adam have some drawbacks, such as sensitivity to the learning rate and to the selection of hyperparameters. To solve this problem, researchers have improved the Adam algorithm in different respects. To optimize the learning rate, Yiming Jiang et al. [1] proposed the UAdam algorithm, which introduces a generalized second-order momentum form. An increase in the parameter β leads to a decrease in the convergence neighborhood, so the convergence performance of the algorithm can be controlled by adjusting β. Liu et al. [2] proposed the RAdam algorithm to enhance the optimization algorithm's performance by analyzing the impacts of changes and momentum during training. Wei Yuan [3] later proposed the EAdam algorithm, which outperforms the Adam algorithm by adjusting the position of the constant ε. Additionally, Mingrui Liu et al. [4] proposed the Adam-plus algorithm, which utilizes an adaptive step size based on the norm of the first-order momentum estimate, resulting in reduced [...] that may be caused by a gradient that is too large or too small, such as unstable training or failure to converge. Combining both techniques, the CN-Adam algorithm performs well in deep learning and provides a feasible optimization strategy for solving practical problems.

Adam Optimization Algorithm
The Adam optimizer is an optimization algorithm that combines the momentum method and the adaptive learning rate property [23] to train neural networks and optimize objective functions. The core idea is to dynamically adjust the learning rate based on the gradient profile of each parameter, as well as to accelerate convergence using the momentum term. The Adam algorithm optimizes the parameters by maintaining first-order and second-order moment estimates for each parameter. When updating the parameters, Adam combines the two estimates and performs bias corrections to ensure that the initial stage estimates are unbiased. The Adam optimizer also employs a technique known as exponential moving average to compute the first-order and second-order moment estimates of the gradient. By using exponential moving averages, Adam is able to efficiently aggregate and update historical gradient information to better reflect the optimization direction of the parameters. In addition, the Adam algorithm is adaptive in that it can automatically adjust the size of the learning rate to the specifics of each parameter, which is particularly useful when dealing with parameters with different gradient distributions.
In each iteration, the gradient of the loss function with respect to the parameters is calculated using Equation (1).
The parameter vector θ_t is updated in each iteration. f_t represents the loss function of the neural network in the t-th iteration, while g_t = ∇_θ f_t(θ_t) represents the gradient of the loss function with respect to the parameters in the t-th iteration. The first-order and second-order moment estimates in the Adam algorithm are moving averages of the gradient and of the squared gradient, respectively. These estimates are calculated as shown in Equations (2) and (3).
The variable β_1 controls the contribution of past gradients to the current estimate. Similarly, β_2 controls the contribution of past squared gradients to the current estimate.
The Adam optimization algorithm introduces a bias correction mechanism to address potential bias in the first-order and second-order moment estimates, which are initialized as zero vectors and are therefore biased toward zero in the initial stage, especially when the decay rate is very small. The computation of the bias correction is shown in Equations (4) and (5).
The update of parameter θ after each iteration is computed according to Equation (6).
where lr is the learning rate, and ε is a small positive number to avoid a zero denominator. One of the drawbacks of the Adam optimizer is its sensitivity to the learning rate. Although it is able to adaptively adjust the learning rate for each parameter, sometimes this adaptivity may cause the learning rate to decay too quickly during training, thus affecting the convergence performance. In addition, because the Adam algorithm relies on second-order moment estimates of the gradient, it may underperform when dealing with non-smooth objective functions, especially when the gradient varies over a wide range or the objective function is highly non-convex. Therefore, when applying the Adam optimizer, attention needs to be paid to its sensitivity to the learning rate and gradient, as well as the instability and degradation of the convergence performance that may occur in specific scenarios.
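The update described in this section can be illustrated with a minimal sketch of a single Adam step on plain Python lists. It follows the standard Adam recurrences that the text describes (moving averages, bias correction, parameter update). The function name `adam_step` is hypothetical, and the default values of `beta1`, `beta2`, and `eps` are the commonly used Adam defaults rather than values stated in this paper.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update; t is the 1-based iteration count.

    theta, grad, m, v are lists of floats of equal length.
    Returns the updated (theta, m, v) as new lists.
    """
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        # Exponential moving averages of the gradient and squared gradient.
        m_i = beta1 * m_i + (1 - beta1) * g
        v_i = beta2 * v_i + (1 - beta2) * g * g
        # Bias correction compensates for the zero initialization of m and v.
        m_hat = m_i / (1 - beta1 ** t)
        v_hat = v_i / (1 - beta2 ** t)
        # Parameter update; eps avoids a zero denominator.
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v


# First step from zero-initialized moments: each parameter moves by about lr
# in the direction opposite to the sign of its gradient.
theta, m, v = adam_step([1.0, -2.0], [0.5, -3.0], [0.0, 0.0], [0.0, 0.0], t=1)
```

On the first iteration the bias-corrected ratio is close to the sign of the gradient, so the step size is roughly lr regardless of the gradient magnitude, which shows why the choice of lr matters so much for Adam.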

Cyclic Exponential Decay Learning Rate
The cyclical learning rate (CLR) is a method for optimizing the training process of neural networks, originally proposed by Leslie N. Smith [24]. The algorithm introduces the concept of periodically adjusting the learning rate. Unlike traditional learning rate scheduling methods, the cyclical learning rate allows the learning rate to fluctuate periodically during the training process, rather than remaining fixed or decaying monotonically. This periodically adjusted learning rate strategy adds greater flexibility and dynamics to the model training.
During training, the cyclical learning rate causes the learning rate to fluctuate periodically within a predefined range, thus allowing the model to use different learning rates at different training stages. This flexibility allows the model to better adapt to different data distributions and complexities, thus improving the generalization ability of the model. In addition, the cyclical learning rate can help the model escape local optima and approach a better solution faster. Although the cyclical learning rate requires some hyperparameters to be tuned to define the period and range of the learning rate, it has proven its effectiveness in many practical applications.
Triangular mode is the most commonly used mode in cyclical learning rate algorithms, where the learning rate periodically fluctuates over a fixed range, forming a triangular waveform. In this mode, the learning rate is first increased to a maximum value and then gradually decreased to a minimum value before repeating the process. The calculation of the learning rate in triangular mode is shown in Equation (7).
where lr_delta is the current learning rate, lr_base is the minimum value of the learning rate during the change process, lr_max is the maximum value of the learning rate during the change process, step_size is half the cycle length, and iteration is the number of parameter updates.
The cyclic exponential decay learning rate (CEDLR) method used in this paper calculates the learning rate differently from the traditional cyclical learning rate method. In this study, the calculation of the algorithm was modified, and the formulas are shown in Equations (8)-(11).
where lr_delta is the current learning rate, lr_base is the minimum value of the learning rate during the change process, gamma is the decay coefficient, lr_max is the maximum value of the learning rate during the change process, cycle indicates the current cycle of the loop, state['step'] is the total number of parameter updates throughout the training process of the model, and step_size is half the cycle length. Comparing Equation (7) with Equations (8)-(11) shows that iteration and state['step'] are calculated differently. In the original paper, the number of iterations per epoch is obtained by dividing the total number of samples by the batch size and rounding up. This means that multiple parameter update operations are performed in each epoch, and each update is considered a new iteration. In this paper, the value of state['step'] determines the current position in the training process, and the learning rate is adjusted accordingly. The important advantage of using state['step'] as the global step count is that it is an internal property of the optimizer, directly associated with the number of parameter updates. This makes it more suitable as a reference for learning rate scheduling, as it ensures that the adjustment of the learning rate is consistent with the parameter updates.
By periodically adjusting the learning rate, the model can explore and exploit at different learning rate levels during the training process, which helps to avoid falling into local optimal solutions. The exponential decay allows the learning rate to gradually decrease in the later stages of training, which helps to improve the convergence and generalization ability of the training. Taken together, this learning rate calculation method is able to provide appropriate learning rates for the model at different stages, which speeds up the training process and improves the model's performance.
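As an illustration of triangular mode, the following sketch implements the triangular policy in the form given in Smith's cyclical learning rate paper, using the variable names defined above. The function name `triangular_lr` is hypothetical, and the notation of Equation (7) may differ in detail. The example values for `step_size`, `lr_base`, and `lr_max` are the ones reported elsewhere in this paper for CN-Adam.

```python
import math


def triangular_lr(iteration, step_size, lr_base, lr_max):
    """Triangular cyclical learning rate for a given parameter-update count."""
    # Index of the current cycle (1-based); one cycle spans 2 * step_size updates.
    cycle = math.floor(1 + iteration / (2 * step_size))
    # Distance from the peak of the current cycle, in the range [0, 1].
    x = abs(iteration / step_size - 2 * cycle + 1)
    # Linear interpolation between lr_base (x = 1) and lr_max (x = 0).
    return lr_base + (lr_max - lr_base) * max(0.0, 1.0 - x)


# The rate starts at lr_base, peaks after step_size updates,
# and returns to lr_base after 2 * step_size updates.
start = triangular_lr(0, 1400, 1e-4, 1e-2)
peak = triangular_lr(1400, 1400, 1e-4, 1e-2)
end = triangular_lr(2800, 1400, 1e-4, 1e-2)
```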
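The exact form of Equations (8)-(11) is not reproduced here, so the sketch below is only one plausible reading of the description: a triangular wave driven by the global step count, with its amplitude multiplied by gamma raised to the step count, similar to the exp_range policy in Smith's paper. This decay form, the function names `cedlr` and `steps_per_epoch`, and the value of `gamma` are all assumptions made for illustration, not the authors' confirmed formulas.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Parameter updates per epoch: samples divided by batch size, rounded up."""
    return math.ceil(num_samples / batch_size)


def cedlr(step, step_size, lr_base, lr_max, gamma):
    """Cyclic learning rate with exponential decay (assumed form).

    step plays the role of state['step'], the global count of parameter updates.
    """
    cycle = math.floor(1 + step / (2 * step_size))
    x = abs(step / step_size - 2 * cycle + 1)
    # Assumption: the triangular amplitude shrinks by a factor gamma per update.
    scale = gamma ** step
    return lr_base + (lr_max - lr_base) * max(0.0, 1.0 - x) * scale


# With gamma < 1, each cycle peaks lower than the previous one,
# while the rate never drops below lr_base.
first_peak = cedlr(1400, 1400, 1e-4, 1e-2, gamma=0.9999)
second_peak = cedlr(4200, 1400, 1e-4, 1e-2, gamma=0.9999)
```

Driving the schedule with the optimizer's own step counter means the learning rate changes once per parameter update, independent of how the training loop counts epochs or batches.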

Gradient Norm Constraint Strategy
The gradient norm constraint strategy limits the size of the overall gradient vector by setting a threshold for the maximum gradient norm, thereby controlling the magnitude of parameter changes during each update. This constraint helps to prevent the excessive adjustment of model parameters and improves the stability and generalization ability of the model during training. It plays a significant role in the entire algorithm.
One of the advantages of a gradient norm constraint is that it prevents gradient explosion and gradient vanishing. In deep neural networks, due to the large number of network layers and the choice of activation function, the gradient may grow or decay exponentially during back-propagation, which affects the stability and convergence of the model. The size of the gradient can be effectively controlled by the gradient norm constraint, thus stabilizing the training process of the model. The gradient norm constraint is defined as shown in Equation (12).
where grad denotes the original gradient vector, ||grad|| denotes the norm of the gradient vector, and grad_norm_constraint denotes the threshold of the gradient norm. Equation (12) states that before performing the gradient update, the magnitude of the gradient vector is calculated. If the magnitude exceeds the set threshold, the vector is scaled so that its magnitude matches the threshold. Otherwise, the original gradient vector remains unchanged.
Gradient norm constraints also help improve a model's ability to generalize. Excessive gradients can lead to the over-fitting of the model to the training data, and the gradient norm constraint defined in this article limits the size of model parameter updates. When combined with the learning rate driven by state['step'], the experimental results demonstrate accuracies that are 3.61% and 6.4% higher than those of the Adam algorithm in the CIFAR10 and medical datasets, respectively.
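The rule stated for Equation (12) can be sketched as follows. The text does not say which norm is used, so the L2 (Euclidean) norm is an assumption here, and the function name `constrain_grad_norm` is hypothetical.

```python
import math


def constrain_grad_norm(grad, grad_norm_constraint):
    """Rescale grad so that its L2 norm does not exceed grad_norm_constraint."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > grad_norm_constraint:
        # Scale every component by the same factor, preserving the direction.
        scale = grad_norm_constraint / norm
        return [g * scale for g in grad]
    # Gradients within the threshold are returned unchanged.
    return list(grad)


clipped = constrain_grad_norm([3.0, 4.0], 0.9)    # norm 5.0 is scaled down to 0.9
unchanged = constrain_grad_norm([0.3, 0.4], 0.9)  # norm 0.5 is left as is
```

Because all components share one scaling factor, the constraint changes only the length of the update, not its direction.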

CN-Adam Algorithm
The Adam optimizer may lead to unstable training or slower convergence at higher learning rates. This is mainly because its learning rate adjustment mechanism has limited flexibility and cannot adapt well to the learning rate needs of different parameters. To solve this problem, an improved CN-Adam algorithm is proposed with the following procedure.
Initialization: The initial learning rate lr, the minimum learning rate lr_base, and the maximum learning rate lr_max are set; the first-order momentum and the second-order momentum are initialized as zero vectors; and step_size is set to 1400.
Cyclic exponential decay learning rate phase: The learning rate gradually increases to a maximum value over a specified number of steps and then gradually decreases. The current number of training steps and the cycle size determine the current position in the cycle, and the learning rate is adjusted according to this position and the preset minimum and maximum learning rates, following Equations (8)-(11).
Gradient norm constraint phase: The first step is to check whether the gradient norm exceeds the set threshold. If it does, training may become unstable. To stabilize the training, the gradient is constrained by a scaling operation that ensures it does not exceed the threshold. The scaling is performed by multiplying the gradient by a factor that controls its size, improving the training stability and convergence speed.
Parameter update phase: The first-order and second-order moment estimates are computed from the constrained gradient, following the Adam update rules with bias correction. The parameter values are then updated using the constrained gradient and the bias-corrected estimates. Once the parameters are updated, the current gradient is saved as the last gradient to be used in the next iteration.
The cyclic exponential decay learning rate algorithm automatically adjusts the learning rate over the course of training to promote faster convergence and better generalization of the model. Combined with the proposed gradient norm constraint, which controls the size of the gradients and prevents gradient explosion, it makes the optimizer more stable, as shown in Algorithm 1.
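The phases described above can be combined into a minimal sketch of one CN-Adam-style update. This is not a reproduction of Algorithm 1: the decay form of the learning rate (triangular wave scaled by gamma to the power of the step count), the use of the L2 norm, the value of `gamma`, and the `beta1`, `beta2`, and `eps` defaults are assumptions, and the step that saves the last gradient is omitted. The learning rate range, `step_size`, and the threshold of 0.9 are taken from values reported in this paper. The function name `cn_adam_step` and the layout of `state` are hypothetical.

```python
import math


def cn_adam_step(theta, grad, state, lr_base=1e-4, lr_max=1e-2, step_size=1400,
                 gamma=0.9999, grad_norm_constraint=0.9,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """Update theta in place and return the learning rate used for this step."""
    state["step"] += 1
    t = state["step"]

    # Phase 1: cyclic exponential decay learning rate (assumed form).
    cycle = math.floor(1 + t / (2 * step_size))
    x = abs(t / step_size - 2 * cycle + 1)
    lr = lr_base + (lr_max - lr_base) * max(0.0, 1.0 - x) * gamma ** t

    # Phase 2: gradient norm constraint (L2 norm assumed).
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > grad_norm_constraint:
        grad = [g * grad_norm_constraint / norm for g in grad]

    # Phase 3: Adam moment estimates with bias correction, then the update.
    for i, g in enumerate(grad):
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return lr


# Usage on a toy problem: minimize f(theta) = sum of squares, gradient 2 * theta.
theta = [3.0, -4.0]
state = {"step": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
for _ in range(2000):
    grad = [2.0 * p for p in theta]
    last_lr = cn_adam_step(theta, grad, state)
```

Keeping the step counter inside the optimizer state means the schedule and the parameter updates cannot drift apart, which is the design point the text makes about state['step'].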

Configuration of the Experimental Environment
The main software versions applied in the experiment are shown in Table 1. The experiments used the PyTorch Lightning deep learning framework, and all experimental code was written with it. The algorithm under test was CN-Adam, an improvement of the Adam algorithm that combines the cyclic exponential decay learning rate algorithm with the customized gradient norm constraint method. The programming language was Python 3.10, the torch version was 2.0.1, the torchvision version was 0.15.0, the CUDA version was cu118, the Lightning version was 2.1.2, and the wandb version was 0.16.0.
The CN-Adam algorithm's performance was tested on three datasets: two commonly used datasets, MNIST and CIFAR10, and a medical image classification dataset. MNIST is a grayscale image dataset of handwritten digits with an image size of 28 × 28. CIFAR10 is a color image dataset containing different types of items with an image size of 32 × 32. The medical dataset is divided into two parts: the upper and lower gastrointestinal tracts. For this experiment, we used the dataset for the upper gastrointestinal system. The experiment involved eight classes, primarily consisting of various grades of hemorrhoids, polyps, and ulcerative colitis. The images were resized to 224 × 224. The experimental datasets are presented in Table 2.

Experimental Results and Analysis
The CN-Adam algorithm extends the Adam algorithm with the cyclic exponential decay learning rate algorithm and a gradient norm constraint strategy, making it more comprehensive and flexible. These improvements enhance the generalization ability and convergence speed of the model. To evaluate the advantages of the CN-Adam algorithm over other optimization algorithms, we conducted experimental comparisons with SGD, AdaGrad, Adadelta, Adam, and two variants of Adam: the NAdam and StochGradAdam algorithms. Several experiments were conducted to compare the accuracy and loss values in the test set, and the best results are shown in bold.
The CN-Adam algorithm aims to quickly discover locally optimal solutions and thereby achieve the best test accuracy and loss values. To evaluate its performance in various neural networks, we tested it extensively on multiple datasets and neural network models. For the MNIST dataset, we chose a simple fully connected neural network, and for the CIFAR10 and medical datasets, we chose a lightweight neural network, MobileNetV2.
The first experiment compared the performance of the CN-Adam algorithm in the MNIST dataset using three different learning rate ranges and three different sets of hyperparameters. The first step was to determine the range of the learning rate, which allowed for a better determination of the values of the other hyperparameters. Figure 1 shows the performance comparison for the MNIST dataset at different learning rate ranges. After analyzing Figure 1, it was determined that the combination of a minimum learning rate of 1 × 10^-4 and a maximum learning rate of 1 × 10^-2 was optimal. This combination converged quickly and maintained a stable performance. The early performance was poorer when the minimum learning rate was 1 × 10^-5, which may have been due to the learning rate being too small to adequately perform parameter updates. For the combination with a minimum learning rate of 1 × 10^-1, fast convergence could not be achieved early on. This may have been due to the learning rate being too large, resulting in unstable fluctuations in the model during training. Therefore, the optimal learning rate range was between 1 × 10^-4 and 1 × 10^-2.
After determining the optimal learning rate range, appropriate hyperparameter settings can improve the model stability and highlight the CN-Adam algorithm's effect. For example, the effects of different values of the gradient norm constraint on the performance in CIFAR10 are shown in Figure 2. These results verify the effectiveness of the CN-Adam algorithm and provide guidance for its practical application.
Figure 2 shows that adjusting the gradient norm constraint value led to significant performance differences, even with the same number of steps in the cyclic learning rate. The best accuracy and lowest loss values were achieved when the gradient norm constraint value was set to 0.9. This setting resulted in an average accuracy improvement of approximately 1% compared with the other values tested. This result underlines the importance of selecting an appropriate value for the gradient norm constraint, both for the algorithm's convergence speed and for the model performance.
Figure 3 shows a performance comparison of the seven optimization algorithms in the MNIST dataset. The performance comparison of the seven optimization algorithms for the CIFAR10 dataset is shown in Figure 4.
As shown in Figure 4, the CN-Adam algorithm could quickly reach the optimal solution in this dataset. Compared with the Adam algorithm and its two variants, the NAdam and StochGradAdam algorithms, the accuracy was improved by 3.61%, 3.43%, and 4.03%, and the loss value was reduced by 0.308, 0.736, and 0.138, respectively. The fast convergence of the CN-Adam algorithm not only improves the training efficiency of the model but also enables the model to reach the desired performance level faster. The periodic learning rate adjustment strategy of the cyclic exponential decay learning rate algorithm brings a broader search capability to the optimization algorithm, which helps it to jump out of local optimal solutions and accelerates convergence, thus further improving the model performance. Meanwhile, the gradient norm constraint effectively controls the size of the gradient, avoids the exploding or vanishing gradient problem, and enhances the stability of the optimization algorithm.
Due to the specificity of the medical dataset, a cross-validation strategy was used in this dataset, and the validation and test sets were used to compare the effectiveness of the optimization algorithms in terms of accuracy and loss values. Figure 5 shows a comparison of the accuracy of the seven optimization algorithms in the validation set. The accuracies of the SGD and Adadelta algorithms are significantly lower than those of the other optimization algorithms. Although the accuracy of the CN-Adam algorithm was slightly lower than that of the Adam and NAdam algorithms in the early stage of training, it increased significantly in the late stage of training as the number of training epochs increased, and it achieved good results in the test set. The accuracy was higher than that of the Adam algorithm by 6.4%, the NAdam algorithm by 6%, and the StochGradAdam algorithm by 8.6%. This observation shows the strong performance of the CN-Adam algorithm in dealing with complex data such as medical datasets, and indicates that the problems with the Adam optimization algorithm have been addressed to some extent. Although performance may be lower in the initial phase, it gradually improves as training proceeds. The cyclic adjustment of the learning rate allows the CN-Adam algorithm to keep refining the model parameters, ultimately achieving results that outperform the other optimization algorithms in the later training stages. Its robustness and adaptability enable it to better cope with the challenges of this data domain, providing strong support for the modeling and analysis of complex data.
The loss values of the seven optimization algorithms in the test set are compared in Figure 6. The results show that the loss value of the Adadelta algorithm was significantly higher than that of the other six optimization algorithms. The loss value of the CN-Adam algorithm was significantly lower than those of the other six algorithms. It was lower by 0.1475, 0.1325, and 0.3305 than the loss values of the Adam, NAdam, and StochGradAdam algorithms, respectively. Combined with the validation set results, these test set results show that the CN-Adam algorithm performs better than the compared algorithms in the medical domain. This highlights the need to select an appropriate optimization algorithm for the specific application domain.
Figure 7 compares the GPU power draw in watts for the CIFAR10 dataset, showing that the CN-Adam algorithm consumed less power than the other algorithms. However, its training time was slightly longer due to the dynamic adjustment of the learning rate. Despite this, the CN-Adam algorithm demonstrated its strength in resource utilization by maintaining the lowest level of power consumption. This emphasizes the significance of taking into account both the power consumption and the training time when choosing an optimization algorithm to attain optimal performance and efficiency in real-world applications.

Conclusions
This study proposed a new optimization algorithm called CN-Adam, which aims to address the shortcomings of the Adam optimization algorithm. The experimental results demonstrate that the CN-Adam algorithm outperformed other methods, including the StochGradAdam algorithm, in multiple datasets. The CN-Adam algorithm enhances the stability and convergence speed of the Adam algorithm while achieving more precise results by combining the cyclic exponential decay learning rate algorithm with gradient norm constraints. Although it may require more computational resources than other optimization algorithms, the CN-Adam algorithm improves the model performance and is applicable to areas such as medical image processing. This approach improves the model performance and generalization ability, providing new momentum to the development of deep learning.

(1) Application domain: Three datasets were used in this study, namely, the MNIST dataset on handwritten digit recognition, the CIFAR10 dataset of color images with 10 classes, and the medical dataset in healthcare. The download paths of the corresponding datasets can be found in the author's GitHub code and the data availability statement of this article.
(2) Optimization algorithms: The experiments covered seven optimization algorithms, namely, SGD, AdaGrad, Adadelta, Adam, NAdam, StochGradAdam, and CN-Adam, aiming at comparing the performance differences between them.
(3) Batch size: The batch size used in each experiment was 128 to ensure the consistency and fairness of the experiment.
(4) Learning rate setting: The initial learning rate for all three datasets was 0.001. For the CN-Adam algorithm, the maximum learning rate was 0.01, and the minimum learning rate was 0.0001.
(5) Epoch size: To accurately assess the performance of each optimizer, all experiments were conducted with 100 epochs to ensure adequate and accurate model training.
(6) Adjustment of key parameters: Key parameters in the algorithm were fine-tuned based on different datasets to ensure the comparability and accuracy of the experimental results.
(7) Data preprocessing: Before conducting the experiments, necessary data processing operations, such as normalization, standardization, and data augmentation, were performed to ensure the quality and consistency of the input data.
(8) Experimental results: Several comparison experiments were conducted on the MNIST, CIFAR10, and medical datasets, taking into account factors such as accuracy, loss, and GPU power consumption to fully demonstrate the advantages of the CN-Adam algorithm.

Figure 1. Comparison of learning rates in the MNIST dataset: (a) comparison of accuracy; (b) comparison of loss values.


Figure 2. Comparison of different key parameter values in the CIFAR10 dataset: (a) comparison of accuracy; (b) comparison of loss values.


Figure 3. Comparison of performance of seven algorithms in the MNIST dataset: (a) comparison of accuracy; (b) comparison of loss values.


Figure 4. Comparison of performance of seven algorithms in the CIFAR10 dataset: (a) comparison of accuracy; (b) comparison of loss values.


Figure 5. Comparison of validation accuracies of seven optimization algorithms in medical dataset.


Figure 6. Comparison of test loss values of seven optimization algorithms in medical dataset.

Figure 7. Comparison of GPU power wattage for the CIFAR10 dataset.


Figure 7 compares the GPU power wattage for the CIFAR10 dataset, showing that the CN-Adam algorithm consumed less power than the other algorithms. However, its training time was slightly longer due to the dynamic adjustment of the learning rate. Despite this, the CN-Adam algorithm demonstrated its strength in resource utilization by maintaining the lowest level of power consumption. This emphasizes the significance of taking into account both the power consumption and training time when choosing an optimization algorithm to attain optimal performance and efficiency in real-world applications.
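Because energy is the product of power and time, a lower power draw combined with a longer run does not by itself settle which optimizer is cheaper overall. The sketch below makes that arithmetic explicit. The function name `energy_wh` and any numbers passed to it are hypothetical and are not measurements from these experiments.

```python
def energy_wh(avg_power_w: float, duration_s: float) -> float:
    """Energy in watt-hours for a run at a given average power.

    avg_power_w: average GPU power draw in watts (hypothetical input).
    duration_s:  wall-clock training time in seconds (hypothetical input).
    """
    # 1 watt-hour equals 3600 joules, so divide watt-seconds by 3600.
    return avg_power_w * duration_s / 3600.0
```

For instance, a run at 90 W for 4000 s uses the same energy as a run at 100 W for 3600 s, so the two quantities have to be read together.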


Figure 8 displays the GPU power consumption percentages for the medical dataset. It is evident that the CN-Adam algorithm quickly recognized the medical dataset and had a shorter training time than all the other algorithms. This indicates its strong performance in the medical domain, emphasizing its efficiency in processing medical data. Selecting the appropriate optimization algorithm in the medical field is essential for achieving the prompt recognition and efficient processing of medical data.

Electronics 2024, 13, x FOR PEER REVIEW

… do
14: If lr > lr_max or lr < lr_base
15: end while
16: If ||grad|| > grad_norm_constraint
17: grad ← grad · …
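The norm check in steps 16 and 17 can be sketched as follows. Since the update in step 17 is cut off in the text, the rescaling factor used here (the threshold divided by the gradient norm, that is, standard norm clipping) is an assumption and not necessarily the rule used by CN-Adam. The function name `constrain_grad_norm` is also hypothetical.

```python
import math

def constrain_grad_norm(grad, grad_norm_constraint):
    """Rescale a gradient whose L2 norm exceeds a threshold.

    Assumption: standard norm clipping, grad * (threshold / ||grad||).
    The paper's exact factor is not recoverable from the truncated step 17.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > grad_norm_constraint:
        scale = grad_norm_constraint / norm
        # After scaling, the gradient norm equals the threshold.
        return [g * scale for g in grad]
    # Gradients within the constraint are returned unchanged (as a copy).
    return list(grad)
```

Under this assumption the direction of the gradient is preserved and only its magnitude is bounded, which limits the size of any single update.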

Table 1. Main software versions.
Table 3 compares the experimental results of the different optimization algorithms to demonstrate the superiority of the CN-Adam algorithm.

Table 3. Comparison of experimental results of different optimization algorithms.
… algorithm was significantly lower than those of the other six algorithms: it was reduced by 0.1475, 0.1325, and 0.3305 compared with the Adam algorithm, the NAdam algorithm (an improvement based on the Adam algorithm), and the StochGradAdam algorithm, respectively. Combined with the validation set and test set results, these results show that the CN-Adam algorithm demonstrates a superior performance in the medical domain. This highlights the importance of optimization algorithms when applied in specific domains and emphasizes the need to select an appropriate optimization algorithm for optimal performance in the medical domain.