The Module Gradient Descent Algorithm via L2 Regularization for Wavelet Neural Networks
Abstract
1. Introduction
- (a) A linear-head regime over a fixed wavelet dictionary. We establish global linear convergence to the unique ridge minimizer, with explicit rates controlled by the eigenvalues of the regularized Hessian H = (1/n)ΦᵀΦ + λI. This yields practical guidelines for treating λ as a conditioning lever rather than merely an anti-overfitting knob.
- (b) Fully trainable WNNs (nonconvex). Under natural smoothness and boundedness conditions on the wavelets (over a restricted dilation/shift domain), GD converges to stationary points and enjoys linear rates inside regions satisfying a Polyak–Łojasiewicz (PL) inequality. We give implementable step-size bounds and show how L2 damps flat directions, widening PL basins.
- (c) An over-parameterized (NTK) regime. We extract rate constants tied to the spectrum of the kernel induced by the wavelet dictionary and show that L2 biases GD toward the minimum-RKHS-norm interpolant associated with the WNN-specific NTK.
2. Related Work
3. Preliminaries and Problem Setup
3.1. Wavelet Neural Network (WNN) Model
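Since the model equation is not reproduced in this outline, the following is a minimal sketch of a standard single-hidden-layer WNN forward pass, assuming a Mexican hat mother wavelet and per-unit dilations/translations; the function names and parameter ranges are illustrative, not taken from the paper.

```python
# Minimal sketch of a single-hidden-layer WNN (Section 3.1).
# Assumptions: scalar inputs, Mexican hat mother wavelet, per-unit
# dilation a_j and translation b_j, linear head w.
import numpy as np

def mexican_hat(t):
    """Mexican hat (Ricker) wavelet: psi(t) = (1 - t^2) exp(-t^2 / 2)."""
    return (1.0 - t**2) * np.exp(-0.5 * t**2)

def wnn_forward(x, w, a, b):
    """f(x) = sum_j w_j * psi((x - b_j) / a_j) for a batch of inputs x."""
    t = (x[:, None] - b[None, :]) / a[None, :]   # (n, m) wavelet arguments
    return mexican_hat(t) @ w                    # (n,) predictions

# Example: m = 8 wavelet units evaluated on n = 5 inputs.
rng = np.random.default_rng(0)
m = 8
w = rng.normal(size=m)           # linear head ("w" in Section 3.2)
a = rng.uniform(0.5, 2.0, m)     # bounded dilations, cf. (A2)
b = rng.uniform(-1.0, 1.0, m)    # bounded translations, cf. (A2)
x = np.linspace(-1.0, 1.0, 5)
print(wnn_forward(x, w, a, b))
```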
3.2. Assumptions (Wavelets, Data, and Loss)
- (A1) the mother wavelet ψ is twice continuously differentiable with bounded value, gradient, and Hessian;
- (A2) dilations and translations are bounded (restricted to a compact dilation/shift domain);
- (A3) inputs lie in a compact set;
- (A4) the loss is smooth (and strongly convex in its first argument for squared loss);
- (A5) feature Jacobians w.r.t. the wavelet parameters Θ are uniformly bounded. Under (A1–A5), the objective gradient is Lipschitz on the feasible set; in particular, the head-only objective in w is strongly convex due to the (λ/2)‖w‖² term (made explicit in the display below).
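For concreteness, and assuming the squared loss with Φ ∈ ℝ^{n×m} denoting the frozen wavelet feature matrix (notation ours where the outline leaves it implicit), the head-only objective and its Hessian take the standard ridge form:

```latex
L_{\lambda}(w) = \frac{1}{2n}\,\lVert \Phi w - y \rVert_2^2 + \frac{\lambda}{2}\,\lVert w \rVert_2^2,
\qquad
\nabla^2 L_{\lambda}(w) = \underbrace{\tfrac{1}{n}\,\Phi^{\top}\Phi}_{H_0} + \lambda I \;\succeq\; \lambda I,
```

so L_λ is λ-strongly convex for every λ > 0, which is precisely the role the assumptions assign to the regularizer.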
3.3. Training Algorithms (GD/SGD with Weight Decay)
3.4. Problem Decompositions: Three Regimes
- (R1) Fixed-feature (linear head/ridge). Freezing Θ reduces the problem to ridge regression with Hessian H = (1/n)ΦᵀΦ + λI, and GD enjoys global linear convergence for η < 2/λ_max(H) (a worked sketch follows this list).
- (R2) Fully trainable WNN (nonconvex). Both w and Θ are updated; we obtain convergence to stationary points and linear phases under a Polyak–Łojasiewicz (PL) inequality on the regularized objective.
- (R3) Over-parameterized (NTK/linearization). For large m and small η, the dynamics linearize around initialization; function updates follow kernel GD with a WNN-specific NTK K:
  f_{t+1}(x) = f_t(x) − (η/n) Σ_{i=1}^{n} K(x, x_i) (f_t(x_i) − y_i),
  with L2 contributing an additional shrinkage of the parameters toward zero.
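As a worked illustration of R1 (referenced in the list above), the sketch below runs GD on a ridge objective with a random stand-in for the frozen feature matrix Φ and checks the predicted linear rate; it implements only the standard ridge gradient, not any paper-specific routine.

```python
# Sketch of the R1 regime: GD on the head-only (ridge) objective
# converges linearly for eta < 2 / lambda_max(H), H = Phi^T Phi / n + lam*I.
import numpy as np

rng = np.random.default_rng(1)
n, m, lam = 200, 30, 1e-2
Phi = rng.normal(size=(n, m))   # stand-in for the frozen wavelet features
y = rng.normal(size=n)

H = Phi.T @ Phi / n + lam * np.eye(m)       # regularized Hessian
mu, L = np.linalg.eigvalsh(H)[[0, -1]]      # lambda_min, lambda_max
eta = 1.0 / L                               # any eta < 2/L is stable
w_star = np.linalg.solve(H, Phi.T @ y / n)  # unique ridge minimizer

w = np.zeros(m)
for _ in range(200):
    grad = Phi.T @ (Phi @ w - y) / n + lam * w   # ridge gradient
    w -= eta * grad

print("distance to ridge minimizer:", np.linalg.norm(w - w_star))
print("per-step contraction factor:", 1 - eta * mu)
```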
3.5. PL Inequality and Its Role
3.6. NTK for WNNs (Overview)
3.7. Step-Size and Regularization Prescriptions (Preview)
- Head-only (R1): choose η < 2/λ_max(H) with H = (1/n)ΦᵀΦ + λI; increasing λ raises λ_min(H) and improves the condition number κ(λ) = (λ_max(H₀) + λ)/(λ_min(H₀) + λ) (a power-iteration sketch follows this list).
- Full WNN (R2): use a conservative η ≤ 1/L̂, where L̂ is an empirical Lipschitz proxy; increase λ if gradient-norm contraction stalls.
- NTK (R3): stable for η < 2/λ_max(K); L2 controls norm growth and selects the minimum-RKHS-norm interpolant.
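In practice λ_max(H) need not come from a full eigendecomposition; a power-iteration estimate built on Hessian-vector products suffices. The sketch below (helper names and the R1 Hessian-vector product are our illustrative choices) shows how the resulting safe η shifts with λ.

```python
# Estimate lambda_max of a symmetric PSD operator by power iteration,
# then set eta just below the 2 / lambda_max stability limit.
import numpy as np

def power_iteration(hvp, dim, iters=100, seed=0):
    """Rayleigh-quotient estimate of the largest eigenvalue."""
    v = np.random.default_rng(seed).normal(size=dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = hvp(v)
        v = hv / np.linalg.norm(hv)
    return float(v @ hvp(v))

rng = np.random.default_rng(2)
Phi = rng.normal(size=(200, 30))
for lam in [0.0, 1e-3, 1e-1]:
    # R1 Hessian-vector product: v -> Phi^T (Phi v) / n + lam * v
    hvp = lambda v, lam=lam: Phi.T @ (Phi @ v) / Phi.shape[0] + lam * v
    lam_max = power_iteration(hvp, Phi.shape[1])
    print(f"lam={lam:g}: lambda_max~{lam_max:.3f}, safe eta~{0.9 * 2 / lam_max:.4f}")
```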
4. Methodology
4.1. Objective and Gradient Updates
4.2. Fixed-Feature (Ridge) Training of the Linear Head
4.3. Fully Trainable WNN: Block GD, Schedules, and Stability
4.4. Choosing η and λ: Prescriptions and Diagnostics
- R1: choose η < 2/λ_max(H) and sweep λ logarithmically.
- R2: use η ≤ 1/L̂ (empirical Lipschitz proxy) and increase λ if gradient-norm contraction stalls (a diagnostic sketch follows this list).
- R3: ensure η < 2/λ_max(K); L2 controls norm growth and selects the minimum-RKHS-norm interpolant.
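A sketch of the R2 diagnostic above: monitor the gradient-norm contraction ratio over a sliding window and double λ when contraction stalls. Here grad_fn, the window length, and the stall threshold are illustrative assumptions rather than prescriptions from the paper.

```python
# If ||g_{t+1}|| / ||g_t|| hovers near 1, increase L2 to damp flat
# directions (the Section 3.7 / 4.4 heuristic, sketched freely).
import numpy as np

def gd_with_lambda_bump(grad_fn, params, eta=1e-3, lam=1e-4,
                        steps=2000, window=50, stall_ratio=0.999):
    ratios, g = [], grad_fn(params, lam)
    for _ in range(steps):
        params = params - eta * g
        g_new = grad_fn(params, lam)
        ratios.append(np.linalg.norm(g_new) / (np.linalg.norm(g) + 1e-12))
        g = g_new
        if len(ratios) >= window and np.mean(ratios[-window:]) > stall_ratio:
            lam *= 2.0          # contraction stalled: strengthen L2
            ratios.clear()
    return params, lam

# Toy usage on a quadratic with one nearly flat direction.
A = np.diag([1.0, 1e-4])
params, lam = gd_with_lambda_bump(lambda p, lam: A @ p + lam * p, np.ones(2))
print("final lambda after bumps:", lam)
```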
5. Theoretical Results
5.1. Linear Convergence for the Fixed-Feature (Ridge) Regime
5.2. Fully Trainable WNN Under a PL Inequality
6. Experiments and Evaluation
6.1. Datasets and Tasks
- (1) Smooth Low-Frequency Signal
- (2) Localized Bump Function
- (3) Ridge/Non-Smooth Function
6.2. Metrics
6.3. Synthetic Regression (Approximation)
6.4. Denoising Robustness
6.5. Sensitivity to Learning Rate and Weight Decay
6.6. Learning Dynamics
6.7. Prediction Fidelity
6.8. Reproducibility Checklist
7. Discussion and Limitations
7.1. Practical Implications
7.2. Sensitivity and Stability
7.3. Robustness Under Distribution Shift
7.4. Limitations
7.5. Future Work
8. Conclusions and Future Directions
Future Directions
- (a) For canonical wavelet families (Mexican hat, Morlet, and Daubechies), we will derive closed-form WNN-specific NTKs and examine their spectra under realistic initializations.
- (b) We will determine the conditions under which L2 induces global or broader PL regions for trainable dilations/translations, quantifying the expansion as a function of λ.
- (c) We will design adaptive controllers with theoretical stability guarantees that jointly adjust η and λ using real-time spectral/gradient diagnostics.
- (d) We will extend the analysis to structured outputs (such as graphs and sequences) and classification losses (logistic and cross-entropy) using wavelet priors.
- (e) We will examine robustness under adversarial perturbations and covariate shift, where wavelet localization may provide demonstrable stability benefits.
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
- If H₀ = (1/n)ΦᵀΦ is singular, then λ_min(H₀) = 0. In particular, any λ > 0 makes H(λ) = H₀ + λI strictly positive definite.
- κ(λ) = (λ_max(H₀) + λ)/(λ_min(H₀) + λ) is monotone nonincreasing in λ; moreover, κ′(λ) = (λ_min(H₀) − λ_max(H₀))/(λ_min(H₀) + λ)² ≤ 0.
- As λ → 0⁺, κ(λ) → κ(H₀) when λ_min(H₀) > 0, and κ(λ) → ∞ when λ_min(H₀) = 0; as λ → ∞, κ(λ) → 1 (a numerical check follows this list).
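These eigenvalue facts are easy to verify numerically; a minimal check using a random rank-deficient H₀ as a stand-in is:

```python
# Check Appendix A: H(lam) = H0 + lam*I is positive definite for any
# lam > 0 even when H0 is singular, and kappa(lam) decreases toward 1.
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.normal(size=(20, 50))      # n < m, so H0 = Phi^T Phi / n is singular
H0 = Phi.T @ Phi / Phi.shape[0]
eigs = np.linalg.eigvalsh(H0)
e_min, e_max = eigs[0], eigs[-1]
print("lambda_min(H0) (numerically ~0):", e_min)

prev = np.inf
for lam in [1e-4, 1e-2, 1e0, 1e2]:
    kappa = (e_max + lam) / (e_min + lam)
    assert kappa <= prev             # monotone nonincreasing in lam
    prev = kappa
    print(f"lam={lam:g}: kappa={kappa:.3e}")   # -> 1 as lam grows
```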
Sensitivity of validation MSE and PSNR to learning rate η and weight decay λ (cf. Section 6.5):

| η | λ | Val MSE | PSNR (dB) |
|---|---|---|---|
| 3 × 10⁻³ | 3 × 10⁻⁴ | 0.032 | 31.2 |
| 1 × 10⁻³ | 1 × 10⁻⁴ | 0.036 | 30.8 |
| 5 × 10⁻³ | 1 × 10⁻³ | 0.038 | 30.1 |