NadamClip: A Novel Optimization Algorithm for Improving Prediction Accuracy and Training Stability
Abstract
1. Introduction
2. Materials and Methods
2.1. Proposed Algorithm
2.1.1. Nadam Formula
Algorithm 1: Nesterov-accelerated Adaptive Moment Estimation (Nadam). Inputs: step-size and decay hyperparameters; initialize the first and second moment vectors; while the parameters have not converged, repeat the update loop (while ... do ... end while).
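Because the algorithm box above preserves only its caption and loop structure, the following NumPy sketch restates the standard Nadam update of Dozat (2016) for a single parameter vector. Bias-correction conventions differ slightly between implementations, so treat this as illustrative rather than the paper's exact pseudocode.

```python
import numpy as np

def nadam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Nadam update: Adam-style moments plus a Nesterov look-ahead on the first moment."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    # Nesterov correction: blend the corrected momentum with the current (corrected) gradient.
    m_bar = beta1 * m_hat + (1 - beta1) * grad / (1 - beta1 ** t)
    theta = theta - lr * m_bar / (np.sqrt(v_hat) + eps)
    return theta, m, v
```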
2.1.2. Gradient Clipping
2.1.3. Proposed New Algorithm
Algorithm 2: Nesterov-accelerated Adaptive Moment Estimation Clipping (NadamClip). Inputs: clipping threshold (clipping region), step-size and decay hyperparameters; initialize the first and second moment vectors; while the parameters have not converged, repeat the update loop (while ... do ... end while).
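The NadamClip box likewise preserves only its caption. Sections 2.1.1-2.1.3 describe the method as Nadam with gradient clipping built into the update, and Section 3.1.2 clips gradient components whose absolute value exceeds the threshold, so the sketch below applies elementwise value clipping before the moment updates. The exact placement of the clipping step inside the authors' algorithm is an assumption.

```python
import numpy as np

def nadamclip_step(theta, grad, m, v, t, tau=0.24, lr=0.001,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    """One NadamClip update: the gradient is clipped to [-tau, tau] before the Nadam moments."""
    grad = np.clip(grad, -tau, tau)             # gradient clipping with threshold tau
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    m_bar = beta1 * m_hat + (1 - beta1) * grad / (1 - beta1 ** t)
    theta = theta - lr * m_bar / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

The default `tau=0.24` mirrors the best-performing threshold reported in Section 3.1.2.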
2.2. Experimental Setup
2.2.1. Data Collection and Preprocessing
- Data Correlation Analysis
- Data Normalization (a minimal min-max sketch follows this list)
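The normalized data table later in the article shows every variable scaled into [0, 1], which is consistent with min-max normalization. A minimal pandas sketch under that assumption; the column names (TP, DO, pH, Ammonia) follow the extracted tables, and the exact scaler used by the authors is not stated.

```python
import pandas as pd

def min_max_normalize(df: pd.DataFrame, cols=("TP", "DO", "pH", "Ammonia")) -> pd.DataFrame:
    """Scale each water-quality variable to the [0, 1] range, column by column."""
    out = df.copy()
    for c in cols:
        lo, hi = out[c].min(), out[c].max()
        out[c] = (out[c] - lo) / (hi - lo)      # min-max scaling
    return out
```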
2.2.2. Model Settings
2.2.3. Gradient Clipping Threshold Range Design
2.2.4. Performance Metrics
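The metrics reported throughout Section 3 are MAE, RMSE, and R2, plus the maximum gradient norm observed during training. A minimal NumPy sketch of the three regression metrics using their standard definitions; this is generic code, not taken from the paper.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and the coefficient of determination R^2."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mae = np.mean(np.abs(y_true - y_pred))
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2
```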
3. Results
3.1. Model Training
3.1.1. Initial Training
3.1.2. Gradient Clipping Range Experiment
- The gradient threshold was first set to 0.25, a relatively loose bound under which a gradient component is clipped only when its absolute value exceeds 0.25. Evaluation after 100 training epochs gave an MAE of 0.2723, an RMSE of 0.6708, an R2 of 0.9735, and a maximum gradient norm of 0.2626.
- To probe the effect of a stricter constraint on update magnitudes, the threshold was tightened from 0.25 to 0.2. After 100 training epochs, evaluation gave an MAE of 0.2631, an RMSE of 0.6622, an R2 of 0.9741, and a maximum gradient norm of 0.2619.
- The threshold was then reduced to 0.15 to further stabilize the training dynamics. After 100 epochs, evaluation gave an MAE of 0.2679, an RMSE of 0.6687, an R2 of 0.9736, and a maximum gradient norm of 0.3125.
- With the threshold set to 0.21, evaluation after 100 epochs recorded an MAE of 0.2711, an RMSE of 0.6688, and an R2 of 0.9736.
- With the threshold set to 0.22, evaluation after 100 epochs gave an MAE of 0.2675, an RMSE of 0.6717, an R2 of 0.9734, and a maximum gradient norm of 0.2548.
- With the threshold set to 0.23, evaluation after 100 epochs gave an MAE of 0.2649, an RMSE of 0.6645, an R2 of 0.9740, and a maximum gradient norm of 0.2582.
- With the threshold set to 0.24, evaluation after 100 epochs gave an MAE of 0.2644, an RMSE of 0.6595, an R2 of 0.9743, and a maximum gradient norm of 0.2894 (a sketch of how the clipped training loop and the maximum gradient norm can be logged follows this list).
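The paper does not show how the per-threshold "gradient norm max" was recorded. The loop below is one plausible way to reproduce it in TensorFlow/Keras, assuming elementwise value clipping at the threshold under test and that the reported value is the largest global gradient norm seen over all training batches; it is a sketch, not the authors' training code.

```python
import tensorflow as tf

def train_with_clipping(model, dataset, threshold, epochs=100, lr=1e-3):
    """Train with elementwise gradient clipping and record the largest global gradient norm."""
    opt = tf.keras.optimizers.Nadam(learning_rate=lr)
    loss_fn = tf.keras.losses.MeanSquaredError()
    grad_norm_max = 0.0
    for _ in range(epochs):
        for x, y in dataset:                     # dataset yields (features, target) batches
            with tf.GradientTape() as tape:
                loss = loss_fn(y, model(x, training=True))
            grads = tape.gradient(loss, model.trainable_variables)
            grads = [tf.clip_by_value(g, -threshold, threshold) for g in grads]
            grad_norm_max = max(grad_norm_max, float(tf.linalg.global_norm(grads)))
            opt.apply_gradients(zip(grads, model.trainable_variables))
    return grad_norm_max
```

Calling `train_with_clipping(model, dataset, threshold=0.24)` once per candidate threshold reproduces the structure of the experiments above.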
3.2. Comparison with Stochastic Gradient Descent Algorithm
3.2.1. SGD Algorithm
3.2.2. AdaGrad Algorithm
3.2.3. AdaDelta Algorithm
3.2.4. RMSProp Algorithm
3.2.5. Adam Algorithm
3.2.6. AdamW Algorithm
3.2.7. Nadam Algorithm
3.2.8. SGD with U-Clip Algorithm
3.2.9. DPSGD
3.2.10. Clip-RAdaGradD
3.2.11. CGAdam
3.2.12. AdamW_AGC
3.3. Comparison of Measurement Results
3.4. Comparison of Loss Results
3.4.1. Comparison of Training Loss
3.4.2. Comparison of Validation Loss
3.5. Comparison of Computing Resource Consumption
3.6. Applications in Other Datasets
4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Proof of Formula Correctness
Appendix A.1. Variational Interpretation of Gradient Clipping
- Non-expansiveness: $\lVert \operatorname{clip}_{\tau}(x) - \operatorname{clip}_{\tau}(y) \rVert \le \lVert x - y \rVert$ for all $x, y$.
- The sub-gradient relation: $\operatorname{clip}_{\tau}(g) \in \partial h_{\tau}(g)$, where $h_{\tau}$ is the Huber-type envelope associated with the clipping threshold $\tau$ (spelled out below).
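Stated for the elementwise (value) clipping used in Section 3.1.2, the two bullets correspond to the following standard convex-analysis facts. The notation $\operatorname{clip}_\tau$ and $h_\tau$ is assumed here; the original appendix equations are not reproduced above.

```latex
% Elementwise clipping is the Euclidean projection onto the box [-tau, tau]^d.
\[
  \bigl(\operatorname{clip}_{\tau}(g)\bigr)_i
  \;=\; \max\bigl(-\tau,\ \min(\tau,\ g_i)\bigr)
  \;=\; \bigl(\Pi_{[-\tau,\tau]^{d}}(g)\bigr)_i .
\]
% Projections onto convex sets are non-expansive, which gives the first bullet.
\[
  \lVert \operatorname{clip}_{\tau}(x)-\operatorname{clip}_{\tau}(y)\rVert \;\le\; \lVert x-y\rVert .
\]
% The clipped gradient is the gradient of a separable Huber-type envelope,
% which gives the (sub)gradient relation in the second bullet.
\[
  \operatorname{clip}_{\tau}(g) \;=\; \nabla h_{\tau}(g),
  \qquad
  h_{\tau}(g) \;=\; \sum_{i}
  \begin{cases}
    \tfrac12 g_i^{2}, & |g_i|\le\tau,\\[2pt]
    \tau\,|g_i|-\tfrac{\tau^{2}}{2}, & |g_i|>\tau.
  \end{cases}
\]
```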
Appendix A.2. Lyapunov Analysis of Momentum System
Appendix A.3. Stochastic Stability of the Second Moment
Appendix A.4. Spectral Analysis of Bias Correction
Appendix A.5. Contractivity of Update Operator
Appendix A.6. Convergence Proof
Appendix A.7. Error Propagation vs. Original Nadam
Appendix B. Detailed Loss Curve Comparison
References
- Prakash, A.; Khanam, S. Nitrogen Pollution Threat to Mariculture and Other Aquatic Ecosystems: An Overview. J. Pharm. Pharmacol. 2021, 9, 428–433.
- Kaur, G.; Basak, N.; Kumar, S. State-of-the-Art Techniques to Enhance Biomethane/Biogas Production in Thermophilic Anaerobic Digestion. Process Saf. Environ. Prot. 2024, 186, 104–117.
- Wang, S.; Lin, Y.; Jia, Y.; Sun, J.; Yang, Z. Unveiling the Multi-Dimensional Spatio-Temporal Fusion Transformer (MDSTFT): A Revolutionary Deep Learning Framework for Enhanced Multi-Variate Time Series Forecasting. IEEE Access 2024, 12, 115895–115904.
- He, Y.; Huang, P.; Hong, W.; Luo, Q.; Li, L.; Tsui, K.-L. In-Depth Insights into the Application of Recurrent Neural Networks (RNNs) in Traffic Prediction: A Comprehensive Review. Algorithms 2024, 17, 398.
- Rosindell, J.; Wong, Y. Biodiversity, the Tree of Life, and Science Communication. In Phylogenetic Diversity: Applications and Challenges in Biodiversity Science; Springer: Cham, Switzerland, 2018; pp. 41–71.
- Marshall, N.; Xiao, K.L.; Agarwala, A.; Paquette, E. To Clip or Not to Clip: The Dynamics of SGD with Gradient Clipping in High-Dimensions. arXiv 2024, arXiv:2406.11733.
- Mai, V.V.; Johansson, M. Stability and Convergence of Stochastic Gradient Clipping: Beyond Lipschitz Continuity and Smoothness. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Volume 139, pp. 7325–7335.
- Zhang, J.; Karimireddy, S.P.; Veit, A.; Kim, S.; Reddi, S.; Kumar, S.; Sra, S. Why Are Adaptive Methods Good for Attention Models? Adv. Neural Inf. Process. Syst. 2020, 33, 15383–15393.
- Seetharaman, P.; Wichern, G.; Pardo, B.; Le Roux, J. AutoClip: Adaptive Gradient Clipping for Source Separation Networks. In Proceedings of the 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP), Espoo, Finland, 21–24 September 2020; pp. 1–6.
- Liu, M.; Zhuang, Z.; Lei, Y.; Liao, C. A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks. Adv. Neural Inf. Process. Syst. 2022, 35, 26204–26217.
- Tang, X.; Panda, A.; Sehwag, V.; Mittal, P. Differentially Private Image Classification by Learning Priors from Random Processes. Adv. Neural Inf. Process. Syst. 2023, 36, 35855–35877.
- Qian, J.; Wu, Y.; Zhuang, B.; Wang, S.; Xiao, J. Understanding Gradient Clipping in Incremental Gradient Methods. Proc. Mach. Learn. Res. 2021, 130, 1504–1512.
- Ramaswamy, A. Gradient Clipping in Deep Learning: A Dynamical Systems Perspective. Int. Conf. Pattern Recognit. Appl. Methods 2023, 1, 107–114.
- Dozat, T. Incorporating Nesterov Momentum into Adam. In ICLR 2016 Workshop Track. Available online: https://openreview.net/forum?id=OM0jvwB8jIp57ZJjtNEZ (accessed on 1 January 2025).
- Haji, S.H.; Abdulazeez, A.M. Comparison of Optimization Techniques Based on Gradient Descent Algorithm: A Review. PalArch's J. Archaeol. Egypt/Egyptol. 2021, 18, 2715–2743.
- Praharsha, C.H.; Poulose, A.; Badgujar, C. Comprehensive Investigation of Machine Learning and Deep Learning Networks for Identifying Multispecies Tomato Insect Images. Sensors 2024, 24, 7858.
- Kuppusamy, P.; Raga Siri, P.; Harshitha, P.; Dhanyasri, M.; Iwendi, C. Customized CNN with Adam and Nadam Optimizers for Emotion Recognition Using Facial Expressions. In Proceedings of the 2023 International Conference on Wireless Communications Signal Processing and Networking (WiSPNET), Chennai, India, 29–31 March 2023.
- Pang, B.; Nijkamp, E.; Wu, Y.N. Deep Learning with TensorFlow: A Review. J. Educ. Behav. Stat. 2020, 45, 227–248.
- Kanai, S.; Fujiwara, Y.; Iwamura, S. Preventing Gradient Explosions in Gated Recurrent Units. Adv. Neural Inf. Process. Syst. 2017, 30, 436–445.
- Botchkarev, A. Performance Metrics (Error Measures) in Machine Learning Regression, Forecasting and Prognostics: Properties and Typology. arXiv 2022, arXiv:1809.03006.
- Jongjaraunsuk, R.; Taparhudee, W.; Suwannasing, P. Comparison of Water Quality Prediction for Red Tilapia Aquaculture in an Outdoor Recirculation System Using Deep Learning and a Hybrid Model. Water 2024, 16, 907.
- Elesedy, B.; Hutter, M. U-Clip: On-Average Unbiased Stochastic Gradient Clipping. arXiv 2023, arXiv:2302.02971.
- Hodson, T.O. Root-Mean-Square Error (RMSE) or Mean Absolute Error (MAE): When to Use Them or Not. Geosci. Model Dev. 2022, 15, 5481–5487.
- Piepho, H.P. An Adjusted Coefficient of Determination (R2) for Generalized Linear Mixed Models in One Go. Biom. J. 2023, 65, 2200290.
- Duchi, J.; Hazan, E.; Singer, Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159.
- Zeiler, M.D. ADADELTA: An Adaptive Learning Rate Method. arXiv 2012, arXiv:1212.5701.
- Kingma, D.P.; Ba, J.L. Adam: A Method for Stochastic Optimization. arXiv 2015, arXiv:1412.6980.
- Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2017, arXiv:1711.05101.
- Nguyen, T.N.; Nguyen, P.H.; Nguyen, L.M.; Van Dijk, M. Batch Clipping and Adaptive Layerwise Clipping for Differential Private Stochastic Gradient Descent. arXiv 2023, arXiv:2307.11939.
- Chezhegov, S.; Klyukin, Y.; Semenov, A.; Beznosikov, A.; Gasnikov, A.; Horváth, S.; Takáč, M.; Gorbunov, E. Gradient Clipping Improves AdaGrad When the Noise Is Heavy-Tailed. arXiv 2024, arXiv:2406.04443.
- Sun, H.; Cui, J.; Shao, Y.; Yang, J.; Xing, L.; Zhao, Q.; Zhang, L. A Gastrointestinal Image Classification Method Based on Improved Adam Algorithm. Mathematics 2024, 12, 2452.
- Sun, H.; Yu, H.; Shao, Y.; Wang, J.; Xing, L.; Zhang, L.; Zhao, Q. An Improved Adam's Algorithm for Stomach Image Classification. Algorithms 2024, 17, 272.
Date | TP (°C) | DO (mg/L) | pH | Ammonia (mg/L) |
---|---|---|---|---|
2021-06-19 00:00:05 CET | 22.94504 | 4.219706 | 12.75389 | 0.45842 |
2021-06-19 00:01:02 CET | 22.99544 | 6.313918 | 12.79817 | 0.44741 |
2021-06-19 00:01:22 CET | 22.8894 | 15.50348 | 14.79148 | 0.40778 |
Created_at | TP (normalized) | DO (normalized) | pH (normalized) | Ammonia (normalized) |
---|---|---|---|---|
2021/6/19 0:00 | 0.152859 | 0.105493 | 0.203842 | 0.021545 |
2021/6/19 0:01 | 0.156401 | 0.157848 | 0.206831 | 0.02102 |
2021/6/19 0:01 | 0.148949 | 0.387587 | 0.341339 | 0.01913 |
2021/6/19 0:01 | 0.156271 | 0.118966 | 0.235081 | 0.020939 |
2021/6/19 0:02 | 0.157251 | 0.953043 | 0.641651 | 0.021545 |
Parameter | Value |
---|---|
Number of layers | 2
Units (layer 1) | 144
Units (layer 2) | 124
Learning rate | 0.001 |
Batch size | 32 |
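The settings table above fixes two recurrent layers of 144 and 124 units, a learning rate of 0.001, and a batch size of 32. A minimal Keras sketch of such a configuration follows; the layer type (LSTM), the input shape, the single regression output, and the use of Keras' `clipvalue` argument to emulate the optimizer-level clipping of Section 2.1.3 are all assumptions, since the extracted table does not state them.

```python
import tensorflow as tf

def build_model(timesteps, n_features, tau=0.24):
    """Two stacked recurrent layers (144 and 124 units) with a single regression output."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(timesteps, n_features)),
        tf.keras.layers.LSTM(144, return_sequences=True),
        tf.keras.layers.LSTM(124),
        tf.keras.layers.Dense(1),
    ])
    # Keras applies `clipvalue` inside the optimizer, which emulates optimizer-level clipping.
    model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=0.001, clipvalue=tau),
                  loss="mse", metrics=["mae"])
    return model
```

Training would then follow `model.fit(X_train, y_train, epochs=100, batch_size=32)`, matching the 100-epoch budget used in Section 3.1.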
Experiment No. | Epoch | Clipping Threshold | RMSE | MAE | R2 |
---|---|---|---|---|---|
1 | 100 | 0.1 | | | |
2 | 100 | 0.2 | | | |
3 | 100 | 0.3 | | | |
4 | 100 | 0.4 | | | |
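The design table keeps the epoch budget at 100 and varies only the clipping threshold. The loop below sketches such a sweep, reusing the hypothetical `build_model` and `regression_metrics` helpers from the earlier sketches; the validation split and the evaluation procedure are assumptions.

```python
def threshold_sweep(X_train, y_train, X_val, y_val, thresholds=(0.1, 0.2, 0.3, 0.4)):
    """Retrain the same architecture once per clipping threshold and collect the metrics."""
    results = {}
    for tau in thresholds:
        model = build_model(X_train.shape[1], X_train.shape[2], tau=tau)
        model.fit(X_train, y_train, epochs=100, batch_size=32, verbose=0)
        mae, rmse, r2 = regression_metrics(y_val, model.predict(X_val).ravel())
        results[tau] = {"MAE": mae, "RMSE": rmse, "R2": r2}
    return results
```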
Metric | Value |
---|---|
MAE | 0.2771 |
RMSE | 0.6751 |
R2 | 0.9731 |
Gradient norm max | 0.2886 |
No. | Epoch | Threshold | RMSE | MAE | R2 |
---|---|---|---|---|---|
1 | 100 | 100 | 0.6752 | 0.2772 | 0.9731 |
2 | 100 | 0.25 | 0.6708 | 0.2723 | 0.9735 |
3 | 100 | 0.24 | 0.6595 | 0.2644 | 0.9743 |
4 | 100 | 0.23 | 0.6645 | 0.2649 | 0.9740 |
5 | 100 | 0.22 | 0.6717 | 0.2675 | 0.9734 |
6 | 100 | 0.21 | 0.6688 | 0.2711 | 0.9736 |
7 | 100 | 0.2 | 0.6622 | 0.2631 | 0.9741 |
8 | 100 | 0.15 | 0.6687 | 0.2679 | 0.9736 |
Algorithm | MAE | RMSE | R2 |
---|---|---|---|
NadamClip | 0.2644 | 0.6595 | 0.9743 |
SGD | 0.7506 | 1.1712 | 0.9191 |
AdaGrad | 0.7821 | 1.2047 | 0.9144 |
AdaDelta | 0.9211 | 1.4274 | 0.8798 |
RMSProp | 0.3221 | 0.6785 | 0.9729 |
Adam | 0.2652 | 0.6520 | 0.9749 |
Nadam | 0.2685 | 0.6679 | 0.9737 |
AdamW | 0.2858 | 0.6771 | 0.9730 |
SGD with U-Clip | 0.8108 | 1.2470 | 0.9083 |
DPSGD | 0.8108 | 1.2594 | 0.9064 |
Clip-RAdaGradD | 0.6615 | 1.0721 | 0.9322 |
CGAdam | 0.2619 | 0.6582 | 0.9745 |
AdamW_AGC | 0.2731 | 0.6829 | 0.9725 |
Algorithm | Time | Current Memory Usage | Peak Memory Usage |
---|---|---|---|
NadamClip | 12,114.70 s | 10.74 MB | 10.97 MB |
SGD | 1022.55 s | 13.84 MB | 14.24 MB |
AdaGrad | 1065.48 s | 16.98 MB | 17.27 MB |
AdaDelta | 1183.29 s | 12.39 MB | 12.80 MB |
RMSProp | 1079.64 s | 12.60 MB | 12.92 MB |
Adam | 1069.97 s | 12.00 MB | 12.40 MB |
Nadam | 1123.18 s | 13.90 MB | 14.30 MB |
AdamW | 642.65 s | 11.79 MB | 12.19 MB |
SGD with U-Clip | 10,245.24 s | 4.75 MB | 4.96 MB |
DPSGD | 15,721.67 s | 13.61 MB | 42.53 MB |
Clip-RAdaGradD | 488.26 s | 14.53 MB | 16.55 MB |
CGAdam | 12,567.70 s | 7.14 MB | 7.55 MB |
AdamW_AGC | 7856.51 s | 28.81 MB | 32.00 MB |
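The "current" and "peak" memory columns correspond to the pair of values returned by Python's tracemalloc module, although the paper does not state which profiler was used. The sketch below shows one way such figures can be collected; it is an assumption about the measurement setup, not the authors' code.

```python
import time
import tracemalloc

def measure_run(train_fn):
    """Time a training run and report current/peak traced memory in MB."""
    tracemalloc.start()
    t0 = time.perf_counter()
    train_fn()                                   # e.g. lambda: model.fit(X, y, epochs=100)
    elapsed = time.perf_counter() - t0
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, current / 2**20, peak / 2**20
```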