Axioms | Open Access Article | 4 December 2025
The Module Gradient Descent Algorithm via L2 Regularization for Wavelet Neural Networks

1 Department of Mathematics, College of Sciences, Qassim University, Buraydah 51452, Saudi Arabia
2 General Department of College of Technical Engineering, Bright Star University, Al-Brega P.O. Box 858, Libya
3 Department of Computer Engineering, College of Computer, Qassim University, Buraydah 51452, Saudi Arabia
4 Department of Physics, College of Science, Qassim University, Buraydah 51452, Saudi Arabia
Axioms 2025, 14(12), 899; https://doi.org/10.3390/axioms14120899

Abstract

Although wavelet neural networks (WNNs) combine the expressive capability of neural models with multiscale localization, there are currently few theoretical guarantees for their training. We investigate the optimization dynamics of gradient descent (GD) with weight decay (L2 regularization) for WNNs. First, using explicit rates controlled by the spectrum of the regularized Gram matrix, we demonstrate global linear convergence to the unique ridge solution in the fixed-feature regime, where the wavelet atoms are frozen and only the linear head is trained. Second, for fully trainable WNNs, we establish convergence of GD to stationary points under standard smoothness and boundedness of the wavelet parameters, and we prove linear rates in regions satisfying a Polyak–Łojasiewicz (PL) inequality; weight decay enlarges these regions by suppressing flat directions. Third, we characterize the implicit bias in the over-parameterized neural tangent kernel (NTK) regime: with L2 regularization, GD converges to the minimum reproducing kernel Hilbert space (RKHS) norm interpolant associated with the WNN kernel. We supplement the theory with an evaluation protocol on synthetic regression, denoising, and ablations across λ and step size, together with practical recommendations on initialization, step-size schedules, and regularization scales. Together, our findings give a principled prescription for dependable training, shed light on when and why L2-regularized GD is stable and fast for WNNs, and apply broadly to signal processing tasks.

1. Introduction

Wavelet neural networks (WNNs) are an appealing family of models because they combine multiscale, well-localized representations with the universal approximation capability of neural structures. Recent work ranges from graph-based wavelet neural networks to “decompose-then-learn” pipelines to deep models enhanced with wavelet transforms, demonstrating that wavelet structure can improve stability, efficiency, and interpretability across applications in vision, signal processing, and energy systems [1,2,3,4]. The optimization theory governing the training dynamics of WNNs is significantly less studied than that of other architectures, especially when gradient descent (GD) with L2 regularization (weight decay) is employed. This gap between theoretical knowledge and real progress is the main topic of this research. Wavelet advantages in practice manifest along two axes: (a) multiscale localization, which captures local phenomena at various frequencies without losing broader context, and (b) computational efficiency when employing learnable or fixed wavelet filters, which reduces sensitivity to small perturbations in the data and speeds up training.
Baharlouei et al. categorize retinal optical coherence tomography anomalies using wavelet scattering, which has a lower processing cost and greater robustness than more complex deep alternatives. Graph wavelet neural networks efficiently capture spatiotemporal dependencies, while other recent studies integrate wavelet decompositions into deep models for complex time series, such as wind power [5,6]. These empirical findings pose a fundamental question: when and why does GD with L2 converge steadily and rapidly for WNNs?
Two regimes make the interaction of L2 with GD particularly salient: (1) the fixed-feature regime, in which only a linear head is trained and the wavelet dictionary is frozen; this reduces the objective to ridge regression, which is strongly convex and hence admits global linear convergence rates; and (2) the fully trainable regime, in which the wavelet parameters (translations/dilations) and the weights are tuned jointly; in this case, conditions such as the Polyak–Łojasiewicz (PL) inequality provide generic tools for establishing linear rates within appropriate landscape regions [7,8]. From an optimization perspective, L2 regularization is one of the most widely used tools. Beyond its traditional role in conditioning and in imparting effective curvature (for example, for a linear head on fixed features), L2 also has an implicit bias in deep networks. As recent works demonstrate, weight decay encourages low-rank tendencies and selects smaller-norm solutions in parameter or function space, and these phenomena are closely related to convergence speed and stability [9].
Another theoretical viewpoint is provided by the neural tangent kernel (NTK). Between 2022 and 2024, NTK theory became a solid foundation for understanding over-parameterized training dynamics. In this regime, a nonlinear network behaves linearly in function space around initialization with respect to an effective kernel, and GD with L2 becomes equivalent to GD in the corresponding RKHS. Surveys and practical evaluations suggest that this perspective accounts for both the minimum-norm bias and the convergence rate (via the kernel spectrum): under interpolation, L2-regularized GD selects the minimum-RKHS-norm solution among all interpolants [10]. Our contribution is to specialize this picture to WNNs, and we therefore ask: what does the WNN-specific NTK look like under realistic initialization and bounded dilation/shift ranges? How do wavelet choices affect the spectral constant governing the rate?
In-depth analyses of wavelet–deep integration reveal that training dynamics guarantees, especially with L2, remain absent despite steady progress, and most contributions concentrate on architectural design or empirical advantages. Even with forward-looking hybrids like Wav-KAN (2024) [11], sharp statements concerning rates, step-size/regularization requirements, and stability bounds are still uncommon compared to what is available for linear models or for neural networks under restrictive assumptions. This disparity has practical implications: without principled guidance, practitioners are forced to rely on costly trial-and-error to tune the learning rate (η) and regularization strength (λ), and it becomes harder to ascertain why training is effective or ineffective.
This paper’s contributions are as follows. We provide a unified analysis of convergence for gradient descent on WNNs in three regimes with L2 regularization:
(a)
A linear-head regime over a fixed wavelet dictionary. With explicit rates controlled by the eigenvalues of $\Phi^\top\Phi/n + \lambda I$, we demonstrate global linear convergence to the unique ridge minimizer. This yields practical guidance for treating $\lambda$ as a conditioning lever rather than just an anti-overfitting knob.
(b)
Fully trainable (nonconvex) WNNs. GD converges to stationary points and enjoys linear rates within regions satisfying a PL inequality, under natural smoothness and boundedness conditions on the wavelets (over a restricted dilation/shift domain). We give implementable step-size bounds and demonstrate how L2 dampens flat directions to widen PL basins.
(c)
An over-parameterized (NTK) regime. We extract rate constants from the kernel spectrum induced by the wavelet dictionary and demonstrate that L2 directs GD toward the minimum-RKHS-norm interpolant associated with the WNN-specific NTK.
To make our methodology concrete, our theory suggests the following training regimen: sample translations from the empirical input distribution and select dilations log-uniformly over a bounded range to ensure stability without sacrificing fit; estimate a Lipschitz proxy to set η; sweep λ over practical ranges; and monitor the near-constant contraction of gradient norms on a semi-log scale to track the onset of a linear convergence phase. We also propose an evaluation protocol (synthetic approximation and denoising benchmarks) to support our theoretical claims and provide insight into regime transitions as λ and η change, in line with existing guidance on convergence diagnostics and implicit regularization [12,13,14]. In doing so, we bridge a persistent gap between the empirical richness of wavelet-based models and the theoretical understanding of their training dynamics under L2-regularized gradient descent.
We organize the remainder of this paper as follows. Section 2 reviews the related literature. Section 3 sets up the WNN model, assumptions, and the three training regimes. Section 4 describes the training methodology (GD/SGD with weight decay) and the associated prescriptions. Section 5 presents the main convergence results. Section 6 provides numerical experiments that support the theory. Section 7 discusses practical implications and limitations, and Section 8 concludes. Proofs of the theorems are relegated to Appendix A.

3. Preliminaries and Problem Setup

3.1. Wavelet Neural Network (WNN) Model

Let $\psi : \mathbb{R}^d \to \mathbb{R}$ be a mother wavelet. For parameters $\theta_j = (a_j, b_j)$ with dilation $a_j > 0$ and translation $b_j \in \mathbb{R}^d$, define the atom
$$\phi_j(x; \theta_j) = \psi\!\left(\frac{x - b_j}{a_j}\right), \qquad \theta_j = (a_j, b_j)$$
A single-hidden-layer WNN with $m$ atoms outputs
$$z(x; W, \Theta) = \sum_{j=1}^{m} w_j\,\phi_j(x; \theta_j)$$
Given data $\{(x_i, y_i)\}_{i=1}^{n}$, let $\Phi \in \mathbb{R}^{n \times m}$ with $\Phi_{ij} = \phi_j(x_i; \theta_j)$.
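To make the model concrete, the following minimal NumPy sketch builds the feature matrix $\Phi$ and the WNN output for a Mexican hat mother wavelet (the wavelet adopted in Section 3.2); the dictionary size, dilation range, and weight scale in the example are illustrative choices rather than the paper's prescribed values.

```python
import numpy as np

def mexican_hat(u):
    """Mother wavelet psi(u) = (1 - u^2) * exp(-u^2 / 2)."""
    return (1.0 - u**2) * np.exp(-u**2 / 2.0)

def wnn_features(x, a, b):
    """Feature matrix Phi with Phi[i, j] = psi((x_i - b_j) / a_j)."""
    U = (x[:, None] - b[None, :]) / a[None, :]   # n x m matrix of rescaled shifts
    return mexican_hat(U)

def wnn_forward(x, w, a, b):
    """Single-hidden-layer WNN: z(x; W, Theta) = sum_j w_j * psi((x - b_j) / a_j)."""
    return wnn_features(x, a, b) @ w

# Example with m = 8 atoms on n = 200 scalar inputs (all sizes illustrative)
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
a = np.exp(rng.uniform(np.log(0.1), np.log(1.0), 8))  # dilations, log-uniform over a bounded range
b = rng.choice(x, size=8)                             # translations drawn from the input sample
w = rng.normal(0.0, 0.05, 8)
z = wnn_forward(x, w, a, b)                           # network output on all inputs
```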

3.2. Assumptions (Wavelets, Data, and Loss)

Wavelet atoms at multiple dilations are obtained by rescaling a single mother wavelet, so that the same waveform analyzes a signal at several resolutions. For example, for the Mexican hat (Marr) wavelet (see Figure 1), the atoms are obtained by scaling the Mexican hat mother wavelet
$$\psi(x) = (1 - x^2)\,e^{-x^2/2}$$
The curves show $\psi_a(x) = a^{-1/2}\,\psi(x/a)$ for $x \in [-5, 5]$ and dilations $a \in \{0.5,\ 1,\ 2,\ 4\}$. The number of points (N = 2000) allows for the drawing of smooth, high-precision curves. The amplitude normalization $a^{-1/2}$ preserves the L2 energy across scales.
Figure 1. Wavelet atoms at multiple dilations (Mexican hat).
Here, we adopt standard assumptions:
  • (A1) ψ is twice continuously differentiable with bounded value, gradient, and Hessian;
  • (A2) dilations and translations are bounded ($0 < a_{\min} \le a_j \le a_{\max}$ and $\|b_j\| \le B$);
  • (A3) inputs lie in a compact set;
  • (A4) the loss is smooth (and strongly convex in its first argument for squared loss);
  • (A5) feature Jacobians w.r.t. $\theta$ are uniformly bounded. Under (A1)–(A5), $\nabla L$ is Lipschitz on the feasible set; in particular, the head-only objective in $W$ is strongly convex due to $\lambda$.

3.3. Training Algorithms (GD/SGD with Weight Decay)

We minimize the L2-regularized empirical risk:
$$L(W, \Theta) = \frac{1}{2n}\sum_{i=1}^{n}\big(z(x_i; W, \Theta) - y_i\big)^2 + \frac{\lambda}{2}\big(\|W\|_2^2 + \|\Theta\|_2^2\big)$$
With squared loss, the gradient w.r.t. W is
$$\nabla_W L = \frac{1}{n}\Phi^\top(\Phi W - y) + \lambda W$$
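For fixed $\Theta$ (with feature matrix $\Phi$), a minimal sketch of this gradient and of one weight-decayed GD step on the linear head, in the same NumPy style, is:

```python
import numpy as np

def grad_W(Phi, W, y, lam):
    """Gradient of (1/2n)||Phi W - y||^2 + (lam/2)||W||^2 with respect to W."""
    n = Phi.shape[0]
    return Phi.T @ (Phi @ W - y) / n + lam * W

def gd_step_W(Phi, W, y, lam, eta):
    """One gradient-descent step on the linear head: W <- W - eta * grad."""
    return W - eta * grad_W(Phi, W, y, lam)
```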

3.4. Problem Decompositions: Three Regimes

  • (R1) Fixed-feature (linear head/ridge). Freezing $\Theta$ reduces the problem to ridge regression, with Hessian $H = \Phi^\top\Phi/n + \lambda I \succeq \lambda I$, and GD enjoys global linear convergence for $0 < \eta < 2/\lambda_{\max}(H)$:
$$\min_{W}\ \frac{1}{2n}\|\Phi W - y\|_2^2 + \frac{\lambda}{2}\|W\|_2^2$$
  • (R2) Fully trainable WNN (nonconvex). Both $W$ and $\Theta$ are updated; we obtain convergence to stationary points and linear phases under a Polyak–Łojasiewicz (PL) inequality on $L$:
$$\frac{1}{2}\|\nabla L(\theta)\|_2^2 \ \ge\ \mu_{PL}\,\big(L(\theta) - L^\ast\big)$$
  • (R3) Over-parameterized (NTK/linearization). For large $m$ and small $\eta$, dynamics linearize around initialization; function updates follow kernel GD with a WNN-specific NTK $K$:
$$z_{t+1} = z_t - \eta\,K\,(z_t - y)$$

3.5. PL Inequality and Its Role

If $L$ satisfies a PL inequality with constant $\mu_{PL}$ on a domain $D$ and $\nabla L$ is $L$-Lipschitz, then for $0 < \eta \le 1/L$, gradient descent yields geometric decay $L(\theta_t) - L^\ast \le (1 - \eta\mu_{PL})^t\,\big(L(\theta_0) - L^\ast\big)$. In WNNs, L2 enlarges PL regions by damping flat directions along the scale/shift parameters.
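For completeness, the standard one-step argument behind this geometric decay combines the descent lemma for an $L$-smooth objective (with $0 < \eta \le 1/L$) with the PL inequality:
$$L(\theta_{t+1}) \ \le\ L(\theta_t) - \eta\|\nabla L(\theta_t)\|^2 + \frac{L\eta^2}{2}\|\nabla L(\theta_t)\|^2 \ \le\ L(\theta_t) - \frac{\eta}{2}\|\nabla L(\theta_t)\|^2$$
Subtracting $L^\ast$ and applying $\frac{1}{2}\|\nabla L(\theta_t)\|^2 \ge \mu_{PL}\big(L(\theta_t) - L^\ast\big)$ gives $L(\theta_{t+1}) - L^\ast \le (1 - \eta\mu_{PL})\big(L(\theta_t) - L^\ast\big)$; iterating over $t$ yields the stated geometric rate.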

3.6. NTK for WNNs (Overview)

The neural tangent kernel (NTK) is a kernel function that describes how a neural network’s output changes during training with gradient descent. It allows for the theoretical analysis of neural network training by reframing the problem as a kernel method, especially in the limit of infinite width where the kernel becomes constant during training. The NTK is calculated as the dot product of the gradients of the network’s output with respect to its parameters for two different inputs. Let K denote the NTK at initialization. With small learning rates and large width, training follows kernel GD in the RKHS induced by K; convergence speed is governed by the spectrum { λ i ( K ) } . With L2, GD converges to the minimum-RKHS-norm solution among all interpolants.
Let the WNN output be parameterized by $\theta$ (all trainable parameters). For an input $x$, the network output is $f(x; \theta)$. At initialization $\theta_0$, the NTK matrix $K \in \mathbb{R}^{N \times N}$ on inputs $\{x_i\}_{i=1}^{N}$ is
$$K_{ij} = \nabla_\theta f(x_i; \theta_0)^\top\,\nabla_\theta f(x_j; \theta_0)$$
Equivalently, if $J \in \mathbb{R}^{N \times P}$ is the Jacobian whose $i$-th row is $J_i = \nabla_\theta f(x_i; \theta_0)^\top$, then $K = J J^\top$.
Using Figure 2, we can visualize the NTK spectrum at initialization. The input grid has N = 150 points uniformly spaced on [0, 1]. The WNN architecture has a single hidden layer with M = 30 wavelet atoms. Each atom contributes three trainable parameters (amplitude, scale, translation), so the parameter vector $\theta$ has length P = 3M = 90. The initialization used to compute the NTK draws amplitudes $c_j \sim \mathcal{N}(0, 0.5^2)$, scales $a_j \sim \mathcal{N}(1.0, 0.1^2)$, and translations $b_j \sim \mathcal{U}(0, 1)$. The full N × N NTK matrix is used to produce Figure 2, which plots the eigenvalues of the NTK for the WNN on a log scale, showing how the spectrum decays across modes.
Figure 2. WNN-NTK eigenvalue decay (illustrative).
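For concreteness, a minimal NumPy sketch of this NTK construction for the setup just described (Mexican hat atoms, with the per-parameter derivatives written out analytically) is shown below; it is an illustrative reconstruction, not the exact script used to produce Figure 2.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 150, 30                            # input grid size and number of wavelet atoms
x = np.linspace(0.0, 1.0, N)

c = rng.normal(0.0, 0.5, M)               # amplitudes   c_j ~ N(0, 0.5^2)
a = rng.normal(1.0, 0.1, M)               # scales       a_j ~ N(1.0, 0.1^2)
b = rng.uniform(0.0, 1.0, M)              # translations b_j ~ U(0, 1)

def psi(u):                               # Mexican hat: (1 - u^2) exp(-u^2/2)
    return (1.0 - u**2) * np.exp(-u**2 / 2.0)

def dpsi(u):                              # derivative: u (u^2 - 3) exp(-u^2/2)
    return u * (u**2 - 3.0) * np.exp(-u**2 / 2.0)

# Jacobian J (N x 3M) of f(x) = sum_j c_j psi((x - b_j)/a_j) w.r.t. (c, a, b)
U = (x[:, None] - b[None, :]) / a[None, :]
J = np.hstack([
    psi(U),                               # d f / d c_j = psi(u_j)
    -c * dpsi(U) * U / a,                 # d f / d a_j = -c_j psi'(u_j) u_j / a_j
    -c * dpsi(U) / a,                     # d f / d b_j = -c_j psi'(u_j) / a_j
])

K = J @ J.T                               # NTK at initialization, K = J J^T
eigs = np.linalg.eigvalsh(K)[::-1]        # decaying eigenvalue spectrum (cf. Figure 2)
print(eigs[:5])
```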
Figure 2 highlights why NTK eigenvalue behavior matters for wavelet neural networks, especially under L2 regularization. It helps explain, both visually and mathematically, why WNNs can train quickly, converge consistently, and benefit from L2 regularization, by linking optimization theory, NTK spectral properties, convergence behavior, and multiscale feature learning.
A small toy NTK matrix with this kind of off-diagonal decay is, for illustration,
$$K_{\mathrm{toy}} = \begin{pmatrix} 1.20 & 0.83 & 0.45 & 0.12 & 0.05 & 0.02\\ 0.83 & 1.30 & 0.76 & 0.22 & 0.07 & 0.03\\ 0.45 & 0.76 & 1.10 & 0.48 & 0.18 & 0.06\\ 0.12 & 0.22 & 0.48 & 0.98 & 0.54 & 0.02\\ 0.05 & 0.07 & 0.18 & 0.54 & 0.95 & 0.48\\ 0.02 & 0.03 & 0.06 & 0.20 & 0.48 & 0.88 \end{pmatrix}$$

3.7. Step-Size and Regularization Prescriptions (Preview)

  • Head-only (R1): choose $\eta \in (0, 2/L)$ with $L = \lambda_{\max}(H)$; increasing $\lambda$ raises $\mu = \lambda_{\min}(H)$ and improves the condition number $L/\mu$ (a minimal sketch for estimating these spectral quantities follows this list).
  • Full WNN (R2): use a conservative $\eta \lesssim 1/L_{\mathrm{emp}}$ (empirical Lipschitz proxy); increase $\lambda$ if gradient-norm contraction stalls.
  • NTK (R3): stable for $\eta < 2/\lambda_{\max}(K)$; L2 controls norm growth and selects the minimum-RKHS-norm interpolant.
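A minimal sketch for estimating these spectral quantities from a fixed feature matrix $\Phi$ (the function names and the safety factor are illustrative):

```python
import numpy as np

def ridge_hessian(Phi, lam):
    """H = Phi^T Phi / n + lam * I for the head-only (R1) objective."""
    n, m = Phi.shape
    return Phi.T @ Phi / n + lam * np.eye(m)

def stepsize_and_condition(Phi, lam, safety=0.9):
    """Stable step size eta = safety * 2 / lambda_max(H) and the condition number L / mu."""
    eigs = np.linalg.eigvalsh(ridge_hessian(Phi, lam))
    L, mu = eigs[-1], eigs[0]
    return safety * 2.0 / L, L / mu
```

Increasing lam lowers the returned condition number, which is the conditioning effect quantified in Theorem A3.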
Figure 3 shows the gradient-norm decay, with a linear phase evident on a semi-log scale: $\|\nabla L(\theta_t)\|$ decays approximately exponentially in $t$.
Figure 3. Gradient-norm decay (linear phase evident on semi-log).
Target function:
$$f(x) = \sin(2\pi x)$$
with N = 100 samples uniformly spaced on $[0, 1]$. We use a random Gaussian initialization $W_0[i] \sim \mathcal{N}(0, 0.05^2)$, learning rate $\eta = 0.1$, regularization $\lambda \in \{0,\ 1\times10^{-4},\ 1\times10^{-3}\}$, and 200 epochs. A sketch of this experiment is given below.
The curves in Figure 3 show how L2 regularization improves the optimization dynamics. A straight-line decline on a semi-log plot indicates exponential convergence. With L2 regularization, gradient magnitudes decrease more quickly, the slope of $\log\|\nabla L\|$ becomes steeper, and optimization enters the linear-convergence phase earlier. In the unregularized case ($\lambda = 0$), plateaus are frequent, gradient norms often oscillate, and weak curvature amplifies noise. When $\lambda > 0$, the curves exhibit steadier descent, no abrupt oscillations, and smooth monotonic decay. This illustrates how L2 regularization reduces saddle-point sensitivity, removes flat regions, and enforces strong convexity in the local basin, stabilizing the loss landscape.

4. Methodology

4.1. Objective and Gradient Updates

We consider the L2-regularized objective and its gradients w.r.t. W and Θ .
$$L(W, \Theta) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(z(x_i; W, \Theta),\, y_i\big) + \frac{\lambda}{2}\big(\|W\|_2^2 + \|\Theta\|_2^2\big)$$
The first-order updates (vanilla GD) read
$$W_{t+1} = W_t - \eta\,\nabla_W L(W_t, \Theta_t)$$
$$\Theta_{t+1} = \Theta_t - \eta\,\nabla_\Theta L(W_t, \Theta_t)$$

4.2. Fixed-Feature (Ridge) Training of the Linear Head

Freezing $\Theta$ reduces training to ridge regression on $\Phi$; the regularized Hessian $H$ is well-conditioned for $\lambda > 0$:
$$\min_{W}\ \frac{1}{2n}\|\Phi W - y\|_2^2 + \frac{\lambda}{2}\|W\|_2^2$$
$$\|W_t - W^\ast\|_2 \ \le\ \rho^{\,t}\,\|W_0 - W^\ast\|_2, \qquad \rho = \max_i \big|1 - \eta\lambda_i(H)\big|$$
$$\kappa(H) = \frac{\lambda_{\max}(H)}{\lambda_{\min}(H)} = \frac{\sigma_{\max}^2/n + \lambda}{\sigma_{\min}^2/n + \lambda}$$
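A minimal NumPy check of this contraction against the theoretical factor $\rho$, on random stand-in features with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, lam = 200, 10, 1e-2
Phi = rng.normal(size=(n, m))                     # stand-in feature matrix
y = rng.normal(size=n)

H = Phi.T @ Phi / n + lam * np.eye(m)             # regularized Hessian
W_star = np.linalg.solve(H, Phi.T @ y / n)        # unique ridge minimizer
eigs = np.linalg.eigvalsh(H)
eta = 1.0 / eigs[-1]                              # any eta in (0, 2 / lambda_max) works
rho = np.max(np.abs(1.0 - eta * eigs))            # theoretical contraction factor

W = np.zeros(m)
err0 = np.linalg.norm(W - W_star)
for t in range(1, 101):
    W -= eta * (Phi.T @ (Phi @ W - y) / n + lam * W)
    if t % 25 == 0:
        print(t, np.linalg.norm(W - W_star), rho**t * err0)   # observed error vs. rho^t bound
```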

4.3. Fully Trainable WNN: Block GD, Schedules, and Stability

Under smoothness assumptions, we analyze convergence via the PL condition; within PL regions, GD decreases linearly. We adopt practical schedules for $\eta$ (constant, step, cosine) and a Polyak-type step when a reliable target value is available. We use the same target function as in Section 3.7, with 100 points. The training parameters are as follows: a constant rate $\eta = 0.01$; a step schedule with decay at steps 50 and 100; and a cosine schedule given by
$$\eta_t = \frac{\eta_0}{2}\left(1 + \cos\frac{\pi t}{T}\right)$$
Stability and convergence under the different learning-rate schedules are plotted in Figure 4.
Figure 4. Learning rate schedules (constant, step, cosine).
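A minimal sketch of the three schedules (the step-decay milestones and decay factor are illustrative):

```python
import numpy as np

def lr_schedule(kind, eta0, t, T, milestones=(50, 100), gamma=0.1):
    """Learning-rate schedules for GD on the full WNN."""
    if kind == "constant":
        return eta0
    if kind == "step":                            # multiply by gamma at each milestone
        return eta0 * gamma ** sum(t >= s for s in milestones)
    if kind == "cosine":                          # eta_t = (eta0 / 2) * (1 + cos(pi * t / T))
        return 0.5 * eta0 * (1.0 + np.cos(np.pi * t / T))
    raise ValueError(f"unknown schedule: {kind}")
```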

4.4. Choosing η and λ: Prescriptions and Diagnostics

  • R1: choose $\eta \in (0, 2/\lambda_{\max}(H))$ and sweep $\lambda$ logarithmically.
  • R2: use $\eta \lesssim 1/L_{\mathrm{emp}}$ and increase $\lambda$ if gradient-norm contraction stalls.
  • R3: ensure $0 < \eta < 2/\lambda_{\max}(K)$; L2 controls norm growth and selects the minimum-RKHS-norm interpolant.

5. Theoretical Results

5.1. Linear Convergence for the Fixed-Feature (Ridge) Regime

The rate and order of convergence of a sequence describe how rapidly it approaches its limit. Broadly, asymptotic rates describe behavior once the sequence is already close to its limit, while non-asymptotic rates describe progress from starting points that need not be close to the limit. Training reduces to ridge regression on the wavelet features $\Phi$ when $\Theta$ is fixed. Below, let $H$ denote the regularized Hessian:
$$H = \frac{1}{n}\Phi^\top\Phi + \lambda I$$
For step sizes $0 < \eta < 2/\lambda_{\max}(H)$, gradient descent converges linearly to the unique minimizer $W^\ast$. The error contracts at a rate controlled by the spectrum of $H$, as shown in Figure 5. The model used is linear (ridge regression; cf. $f_3(x)$):
$$y = X W^\ast$$
where
$$W^\ast = (2,\ 0.5,\ 0.5)^\top$$
Figure 5. Linear convergence in the ridge: theoretical ρ t vs. observed residual decay (semilog).
The error contracts at a rate controlled by the spectrum of $H$:
$$\|W_t - W^\ast\|_2 \ \le\ \rho^{\,t}\,\|W_0 - W^\ast\|_2, \qquad \rho = \max_i \big|1 - \eta\lambda_i(H)\big|$$
Figure 5 evaluates linear convergence using a simple linear model ($f(x) = 2x + 1$) with $\eta = 0.01$, $\lambda = 0.1$, and $\rho = 0.998$.

5.2. Fully Trainable WNN Under a PL Inequality

We assume smoothness and boundedness of the wavelet atoms over a constrained dilation/shift domain. If $L$ satisfies a PL inequality with constant $\mu_{PL}$ and $\nabla L$ is $L$-Lipschitz, then GD with $0 < \eta \le 1/L$ enjoys a linear decrease in the objective:
$$\frac{1}{2}\|\nabla L(\theta)\|_2^2 \ \ge\ \mu_{PL}\big(L(\theta) - L^\ast\big) \quad\Longrightarrow\quad L(\theta_t) - L^\ast \ \le\ (1 - \eta\mu_{PL})^t\big(L(\theta_0) - L^\ast\big)$$
PL-like regions are enlarged, and stability is enhanced by L2 regularization, which intuitively dampens flat directions linked to scale/shift redundancy. The impact of λ on landscape smoothness is shown in Figure 6, where larger λ increases the size of benign (PL-like) areas. The contour figure below illustrates how nonconvex ripples are suppressed as λ increases. This type of figure visualizes how the loss landscape changes as the ridge (L2) regularization parameter λ increases, using a small WNN and a simple 1-D regression function:
$$f(x) = \sin(2\pi x) + 0.3\cos(4\pi x)$$
where
$$x_i \sim \mathrm{Uniform}(0, 1), \quad i = 1, \ldots, 100, \qquad y_i = f(x_i)$$
Figure 6. Effect of λ on Landscape Smoothness.
A small wavelet network is used to make landscape plots feasible: there is one input dimension and six wavelet neurons, and the mother wavelet is Mexican hat. To show how L2 regularization smooths the loss landscape, we vary λ values: λ 0 ,   0.01 ,   0.1 ,   1 .
Figure 6 shows how the shape of the optimization landscape for WNNs is affected by changing the L2 regularization parameter λ. The loss surface is shown in the picture, along with areas where the function acts in accordance with a PL inequality—a requirement that is known to ensure global linear convergence of gradient descent.
A 2-D slice or contour of the loss over a small parameter plane (two directions in parameter space) is shown in each panel of Figure 7; the only difference between panels is the L2 regularization strength λ. Typical observations (comparing small to large λ) are as follows. For λ = 0 (no regularization), the surface is rough, with several small ridges and valleys and locally strong curvature; contours are uneven and anisotropic, and there are saddle-like features and small basins. As λ increases (e.g., from 1 × 10⁻² to 0.1), the loss surface becomes notably rounder and more convex-looking in this slice; narrow wells are filled in, the central basin grows, and the spacing between contours becomes more regular (gradients vary gently). For large λ (e.g., 1.0), the regularization term dominates; the surface is very smooth and globally close to a quadratic (the $\frac{\lambda}{2}\|W\|_2^2$ term), with a large basin that steepens toward the center, and the small nonconvex ripples are mostly gone.
Figure 7. 2D Loss Landscape with three different values of λ.

6. Experiments and Evaluation

6.1. Datasets and Tasks

We evaluate wavelet neural networks (WNNs) with L2-regularized gradient descent across three complementary experimental settings, each designed to isolate a different aspect of optimization, generalization, and landscape behavior.
(A) One-Dimensional Synthetic Function Approximation (Optimization Analysis)
To study optimization dynamics under full control of ground truth and curvature, we employ three canonical synthetic regression tasks, each selected to emphasize a different structural property relevant to wavelets and convergence theory.
(1)
Smooth Low-Frequency Signal
$$f_1(x) = \sin(2\pi x) + 0.3\cos(4\pi x)$$
Samples: N = 500 uniformly spaced points
(2)
Localized Bump Function
$$f_2(x) = \exp\!\left(-\frac{(x - 0.3)^2}{0.02}\right), \qquad x \in [-1, 1]$$
Samples: N = 400.
(3)
Ridge/Non-Smooth Function
$$f_3(x) = |x|, \qquad x \in [-1, 1]$$
Samples: N = 300.
(B) Image-Like Denoising Task (PSNR vs. Noise Level σ)
To evaluate practical generalization and robustness, we construct image-like signals using 2D tensorized wavelets.
The dataset consists of the following:
Synthetic Image Function
$$f_4(x_1, x_2) = \sin(\pi x_1) + 0.3\cos(\pi x_2), \qquad (x_1, x_2) \in [-1, 1]^2$$
where the grid has 50 × 50 points and the noise model is
$$y = f_4 + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
Noise levels are σ = 0.01 ,   0.05 ,   0.1 ,   0.2 , and PSNR (peak signal-to-noise ratio) is the chosen metric.

6.2. Metrics

We report mean squared error (MSE) and peak signal-to-noise ratio (PSNR).
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{y}_i\big)^2$$
$$\mathrm{PSNR} = 10\log_{10}\!\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}$$
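Minimal implementations of the two metrics (using the clean signal's peak value as $\mathrm{MAX}_I$ is an assumption appropriate for the synthetic, non-8-bit signals used here):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error between targets y and predictions y_hat."""
    return float(np.mean((y - y_hat) ** 2))

def psnr(y, y_hat, max_val=None):
    """Peak signal-to-noise ratio in dB; MAX_I defaults to the clean signal's peak magnitude."""
    if max_val is None:
        max_val = float(np.max(np.abs(y)))
    return 10.0 * np.log10(max_val ** 2 / mse(y, y_hat))
```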

6.3. Synthetic Regression (Approximation)

The WNN captures the target function with small bias and controlled variance. Using the localized bump function $f_2(x)$, the overlay in Figure 8 compares the ground truth with the prediction. Figure 8 shows that L2 regularization improves training stability and prevents overfitting, that WNNs provide a good approximation of functions with localized and multiscale behavior, and that the WNN learns a smooth, accurate, and well-conditioned fit while preserving the structure of the underlying function.
Figure 8. Synthetic regression: ground truth vs. WNN prediction.

6.4. Denoising Robustness

Denoising robustness refers to a model's capacity to sustain consistent performance in the face of noise, whether from adversarial perturbations or natural sources such as image noise. It can be improved by employing adversarial examples or denoisers to reduce noise in the inputs, which yields more accurate predictions. Figure 9 shows the PSNR after simulating additive noise with standard deviation σ using the smooth low-frequency signal $f_1(x)$. The WNN's higher PSNR across σ levels indicates stronger structure preservation under noise.
Figure 9. PSNR vs. σ: WNN vs. baseline CNN.

6.5. Sensitivity to Learning Rate and Weight Decay

Sensitivity to the learning rate and weight decay characterizes how these hyperparameters affect training: a high learning rate can cause overshooting, whereas a low learning rate slows convergence or leaves the optimizer stuck. Weight decay is a regularization strategy that penalizes large weights to prevent overfitting, and it interacts with the effective learning rate. The ideal values are problem-dependent, and techniques such as adaptive weight decay and learning-rate decay are used to manage this sensitivity.
In Figure 10, we sweep $\eta \in [1\times10^{-4},\ 1\times10^{-1}]$ and $\lambda \in [1\times10^{-6},\ 1\times10^{-2}]$ logarithmically. A broad valley of low validation MSE appears near $(\eta \approx 3\times10^{-3},\ \lambda \approx 3\times10^{-4})$. Figure 10 shows a heatmap created using the smooth low-frequency signal $f_1(x)$.
Figure 10. Sensitivity heatmap: validation MSE vs. ( η ,   λ ).
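A sketch of a sweep of this kind, assuming head-only training over a fixed Mexican hat dictionary for simplicity (dictionary size, dilation range, and iteration budget are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Smooth low-frequency target f1, split into training and validation sets
x = np.linspace(0.0, 1.0, 500)
y = np.sin(2*np.pi*x) + 0.3*np.cos(4*np.pi*x)
idx = rng.permutation(len(x)); tr, va = idx[:400], idx[400:]

# Fixed Mexican hat dictionary
m = 30
a = np.exp(rng.uniform(np.log(0.05), np.log(0.5), m))
b = rng.uniform(0.0, 1.0, m)
def features(xs):
    U = (xs[:, None] - b) / a
    return (1.0 - U**2) * np.exp(-U**2 / 2.0)
Phi_tr, Phi_va = features(x[tr]), features(x[va])

etas, lams = np.logspace(-4, -1, 8), np.logspace(-6, -2, 8)
val_mse = np.zeros((len(etas), len(lams)))
for i, eta in enumerate(etas):
    for j, lam in enumerate(lams):
        W = rng.normal(0.0, 0.05, m)
        for _ in range(300):                                   # head-only GD with weight decay
            W -= eta * (Phi_tr.T @ (Phi_tr @ W - y[tr]) / len(tr) + lam * W)
        val_mse[i, j] = np.mean((Phi_va @ W - y[va]) ** 2)
# val_mse can be rendered as a heatmap over (eta, lambda), cf. Figure 10
```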

6.6. Learning Dynamics

The purpose of this subsection is to evaluate the denoising ability of the WNN; to do so, we use the following noisy observation model:
$$y = f_4 + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
Learning curves (train/val) show a clear linear phase on a semi-log scale, consistent with PL-based analysis (see Figure 11). No overfitting is observed over the epochs that are shown.
Figure 11. Learning curves (train/val).

6.7. Prediction Fidelity

The degree to which the model's predictions match the actual data is known as fidelity; a forecast's fidelity is computed similarly to its acuity, but with the roles of observations and forecasts interchanged. To carry out a side-by-side comparison of the noisy, ground-truth, and reconstructed signals, we use the same observation model:
$$y = f_4 + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
It is evident from Figure 12 that there is little bias and controlled variance because the distribution of predictions versus ground truth is clustered close to the identity line.
Figure 12. Scatter: predictions vs. ground truth.

6.8. Reproducibility Checklist

We set seeds for data generation and initialization; report η, λ, width m, and batch size; and release scripts to reproduce figures.

7. Discussion and Limitations

7.1. Practical Implications

Our analysis advocates viewing λ as a conditioning lever rather than only a regularization knob: a small λ risks ill-conditioning and slow convergence, while an overly large λ biases solutions excessively and harms accuracy. A U-shaped validation error curve is expected as λ varies (see Figure 13). The target contains a sinusoid with small high-frequency spikes, and we compare a constant learning rate against a cosine schedule while tracking gradient norms over iterations.
Figure 13. Validation error vs. λ (bias–variance trade-off).

7.2. Sensitivity and Stability

Stability is the ability of a system to return to an equilibrium state after a disturbance, whereas sensitivity is the amount by which a system's output changes in response to a change in its input. In engineering, stability means a bounded input yields a bounded output, and sensitivity analysis measures the impact of parameter changes on this stability. The two are related because slight parameter changes in a highly sensitive system can produce large, destabilizing output shifts. Early-phase dynamics are affected by initialization, but training stabilizes once PL-like behavior appears. Weight decay dampens flat directions, reducing variance across trajectories; a slight spread among random seeds is normal. Figure 14 illustrates this initialization sensitivity via the validation MSE across runs.
Figure 14. Initialization sensitivity: validation MSE across runs.

7.3. Robustness Under Distribution Shift

To show how WNN captures sharp features compared to MLP/CNN, we use a highly localized “peak”:
$$f(x) = \exp\!\big(-100\,(x - 0.5)^2\big)$$
Robustness under distribution shift is the capacity of a model to keep performing as the statistical characteristics of the data it encounters in deployment drift away from those of its training data. Wavelet localization preserves important patterns and increases robustness to moderate covariate fluctuations (Figure 15). All models deteriorate under extreme shifts, but the WNN's performance decline is less pronounced than that of a baseline without multiscale priors.
Figure 15. Robustness under shift (performance vs. severity).

7.4. Limitations

Although they simplify the analysis, assumptions such as smooth mother wavelets and bounded dilations/translations can also be restrictive. A global PL condition need not hold in general WNNs; our PL-based results guarantee linear phases only within regions where PL holds. The NTK interpretation is accurate near initialization and at large width; finite-width effects and far-from-initialization dynamics may differ.

7.5. Future Work

Future work will include deriving explicit WNN-specific NTKs for common wavelet families; tighter conditions under which L2 induces PL globally; adaptive schedules that jointly tune η and λ (see Table 1); and extensions to classification losses and structured outputs.
Table 1. Ablations (illustrative).

8. Conclusions and Future Directions

This study used controlled hyperparameter sweeps, NTK-based interpretation, and synthetic benchmarks to provide a thorough examination of wavelet neural network training under L2 regularization. We confirmed the L2 penalty's shrinkage effect through function approximation tasks, and we consistently observed improvements in gradient stability and convergence behavior. The image-like denoising studies further showed that L2 regularization achieves competitive PSNR while avoiding overfitting, effectively balancing reconstruction accuracy and model compactness. Together, our theoretical and empirical analyses of loss decay, gradient-norm evolution, weight-norm trajectories, and NTK eigenvalues demonstrated that the L2 penalty encourages compact, low-norm solutions without compromising the model's expressive power. However, the method requires careful calibration to ensure stability because it is sensitive to the choice of regularization weight λ and learning rate η. Overall, the findings support L2 regularization as a viable method for constructing compact, effective neural networks with well-behaved optimization landscapes. Future studies may extend these results to deeper architectures, adaptive regularization schedules, and higher-dimensional data, and further investigate the theoretical relationships between weight decay and NTK dynamics.

Future Directions

(a)
For canonical wavelet families (Mexican hat, Morlet, and Daubechies), we will derive closed-form WNN-specific NTKs and examine their spectra with realistic initializations.
(b)
In order to quantify expansion as a function of λ, we will determine the conditions under which L2 causes global or broader PL regions for trainable dilations/translations.
(c)
We will look to create adaptive controllers with theoretical stability guarantees that simultaneously adjust η and λ utilizing real-time spectral/gradient diagnostics.
(d)
We will use wavelet priors to expand the analysis to structured outputs (such as graphs and sequences) and classification losses (logistic and cross-entropy).
(e)
We will examine robustness in the presence of adversarial perturbations and covariate shift, when wavelet localization might provide demonstrable stability benefits.

Author Contributions

K.S.M.: Conceptualization, methodology, investigation, software, resources, project administration, Writing—original draft. I.M.A.S.: Writing—original draft, writing—review and editing. A.A. (Abdalilah Alhalangy): Formal analysis, investigation, writing—review and editing. A.A. (Alawia Adam): Writing—original draft, writing—review and editing. M.S.: Formal analysis, writing—review and editing. H.I.: Writing—original draft, writing—review and editing. M.A.M.: Formal analysis, investigation, writing—review and editing. S.A.A.S.: Investigation, writing—review and editing. Y.S.M.: Writing—original draft, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

The Researchers would like to thank the Deanship of Graduate Studies and Scientific Research at Qassim University for financial support (QU-APC-2025).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to thank the reviewers for their valuable comments and suggestions, which have improved the presentation of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Main Theoretical Results
All norms are Euclidean. For any matrix $A$, $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ denote its minimal and maximal eigenvalues. We need the following assumptions in our proofs.
Assumption A1 (Wavelet Feature Boundedness).
Let
$$\phi(x) = \big(\psi((x - b_1)/a_1),\ \ldots,\ \psi((x - b_m)/a_m)\big)$$
be the wavelet activation vector. There exists $M_\psi > 0$ such that for all $x \in \mathcal{X} \subseteq \mathbb{R}^d$:
$$\|\phi(x)\|_2 \le M_\psi$$
Assumption A2 (Lipschitz Gradient/Smoothness).
The empirical loss with L2 regularization
$$F(W) = \frac{1}{2n}\|\Phi W - y\|^2 + \frac{\lambda}{2}\|W\|_2^2$$
has an $L$-Lipschitz continuous gradient:
$$\|\nabla F(W) - \nabla F(W')\| \ \le\ L\,\|W - W'\|$$
Equivalently, the Hessian $H = \frac{1}{n}\Phi^\top\Phi + \lambda I$ satisfies
$$0 < \lambda_{\min}(H) \le \lambda_{\max}(H) = L$$
Assumption A3 (Strong Convexity).
The Hessian satisfies
$$\mu \equiv \lambda_{\min}(H) > 0$$
This holds automatically for any  λ > 0 , even if  Φ T Φ  is singular.
Lemma A1 (Closed-Form Minimizer of L2-regularized WNN).
Under A1–A3, the unique minimizer of
$$\min_{W}\ \frac{1}{2n}\|\Phi W - y\|^2 + \frac{\lambda}{2}\|W\|_2^2$$
is
$$W^\ast = \left(\frac{1}{n}\Phi^\top\Phi + \lambda I\right)^{-1}\frac{1}{n}\Phi^\top y$$
Thus  W  exists, is unique, and depends smoothly on  λ .
Proof. 
The function $F$ is continuously differentiable, and its gradient with respect to $W \in \mathbb{R}^m$ is
$$\nabla F(W) = \frac{1}{n}\Phi^\top(\Phi W - y) + \lambda W$$
Any critical point $W$ satisfies $\nabla F(W) = 0$. Therefore, any stationary point must solve the linear equation
$$\left(\frac{1}{n}\Phi^\top\Phi + \lambda I\right) W = \frac{1}{n}\Phi^\top y$$
Let $G = \frac{1}{n}\Phi^\top\Phi$. Then $G$ is symmetric positive semidefinite. Because $\lambda > 0$, the matrix
$$H \equiv G + \lambda I = \frac{1}{n}\Phi^\top\Phi + \lambda I$$
is symmetric and strictly positive definite: for any nonzero $v \in \mathbb{R}^m$,
$$v^\top H v = v^\top G v + \lambda\|v\|_2^2 \ \ge\ \lambda\|v\|_2^2 > 0$$
Hence $H$ is invertible, and the linear equation has the unique solution
$$W^\ast = H^{-1}\,\frac{1}{n}\Phi^\top y = \left(\frac{1}{n}\Phi^\top\Phi + \lambda I\right)^{-1}\frac{1}{n}\Phi^\top y$$
The uniqueness of the stationary point implies uniqueness of the global minimizer provided the function is convex; we show convexity next.
The Hessian of $F$ is constant and equals $H$ (because the objective is quadratic in $W$). Since $H$ is positive definite, $F$ is strongly convex (with strong-convexity constant $\mu = \lambda_{\min}(H) \ge \lambda > 0$). A strongly convex function has a single global minimizer, and the unique stationary point is that minimizer. Therefore, the $W^\ast$ above is the unique global minimizer.
From $W^\ast = H^{-1}\frac{1}{n}\Phi^\top y$ and $\|H^{-1}\|_{\mathrm{op}} = 1/\lambda_{\min}(H) \le 1/\lambda$, we obtain
$$\|W^\ast\|_2 \ \le\ \frac{1}{\lambda_{\min}(H)}\left\|\frac{1}{n}\Phi^\top y\right\|_2 \ \le\ \frac{1}{\lambda}\left\|\frac{1}{n}\Phi^\top y\right\|_2$$
This shows the explicit dependence of the minimizer's magnitude on $\lambda$. The map $\lambda \mapsto H^{-1}$ is continuous on $(0, \infty)$ (indeed differentiable), so $W^\ast(\lambda) = H(\lambda)^{-1}\frac{1}{n}\Phi^\top y$ depends smoothly on $\lambda$ (this follows from standard matrix perturbation theory and the continuity of the inverse map).□
Lemma A2 (Gradient Descent Update as a Linear Dynamical System).
Gradient descent with step size $\eta > 0$ follows
$$W_{t+1} = W_t - \eta\,\nabla F(W_t) = W_t - \eta H (W_t - W^\ast)$$
Hence
$$W_{t+1} - W^\ast = (I - \eta H)\,(W_t - W^\ast)$$
The error evolves linearly with transition matrix $I - \eta H$.
Proof. 
From standard differentiation of the quadratic objective,
$$\nabla F(W) = \frac{1}{n}\Phi^\top(\Phi W - y) + \lambda W = \left(\frac{1}{n}\Phi^\top\Phi + \lambda I\right) W - \frac{1}{n}\Phi^\top y = H W - \frac{1}{n}\Phi^\top y$$
Using Lemma A1, the minimizer $W^\ast$ satisfies the first-order condition $\nabla F(W^\ast) = 0$, i.e.,
$$H W^\ast - \frac{1}{n}\Phi^\top y = 0 \quad\Longleftrightarrow\quad \frac{1}{n}\Phi^\top y = H W^\ast$$
The GD step becomes
$$W_{t+1} = W_t - \eta\left(H W_t - \frac{1}{n}\Phi^\top y\right)$$
Replacing $\frac{1}{n}\Phi^\top y$ with $H W^\ast$:
$$W_{t+1} = W_t - \eta\big(H W_t - H W^\ast\big) = W_t - \eta H (W_t - W^\ast)$$
Subtracting $W^\ast$ from both sides:
$$W_{t+1} - W^\ast = (I - \eta H)\,(W_t - W^\ast)$$
This completes the algebraic part of the proof; the error evolves linearly under the constant matrix $I - \eta H$. □
Theorem A1 (Spectral Convergence Condition).
Under A2–A3, gradient descent converges if and only if
$$0 < \eta < \frac{2}{\lambda_{\max}(H)}$$
In this regime, for all $t \ge 0$:
$$\|W_t - W^\ast\| \ \le\ \rho^{\,t}\,\|W_0 - W^\ast\|$$
where $\rho = \max_i \big|1 - \eta\lambda_i(H)\big| < 1$.
Proof. 
From Lemma A2, we have the exact linear recursion for the error $e_t \equiv W_t - W^\ast$:
$$e_{t+1} = (I - \eta H)\,e_t$$
Iterating,
$$e_t = (I - \eta H)^t\,e_0$$
Because $H$ is symmetric positive definite, it has the eigen-decomposition $H = Q\Lambda Q^\top$ with orthogonal $Q$ and
$$\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_m), \qquad 0 < \mu = \lambda_1 \le \cdots \le \lambda_m = L$$
Conjugating the recursion decouples the coordinates:
$$z_{t+1} \equiv Q^\top e_{t+1} = (I - \eta\Lambda)\,z_t$$
where $z_t \equiv Q^\top e_t$. Hence, for each coordinate $i$,
$$z_{i,t+1} = (1 - \eta\lambda_i)\,z_{i,t}$$
The matrix $I - \eta H$ is diagonalizable with eigenvalues $\{1 - \eta\lambda_i\}_{i=1}^{m}$. The iterates $e_t = (I - \eta H)^t e_0$ converge to the zero vector for every initial $e_0$ if and only if the spectral radius satisfies
$$\rho(I - \eta H) = \max_i |1 - \eta\lambda_i| < 1$$
Because each $\lambda_i > 0$, the inequality $|1 - \eta\lambda_i| < 1$ is equivalent (for each $i$) to
$$-1 < 1 - \eta\lambda_i < 1 \quad\Longleftrightarrow\quad 0 < \eta\lambda_i < 2$$
Since this must hold for every eigenvalue, it reduces to the single requirement
$$0 < \eta\,\lambda_{\max}(H) < 2 \quad\Longleftrightarrow\quad 0 < \eta < \frac{2}{\lambda_{\max}(H)}$$
Thus, the interval $0 < \eta < 2/\lambda_{\max}(H)$ is necessary and sufficient for $\rho(I - \eta H) < 1$.
Under this condition, $\rho = \max_i |1 - \eta\lambda_i| < 1$. Using the orthogonality of $Q$,
$$\|e_t\|_2 = \big\|(I - \eta H)^t e_0\big\|_2 = \big\|Q (I - \eta\Lambda)^t Q^\top e_0\big\|_2 = \big\|(I - \eta\Lambda)^t z_0\big\|_2 \ \le\ \rho^{\,t}\,\|z_0\|_2 = \rho^{\,t}\,\|e_0\|_2$$
proving the stated linear (geometric) contraction.
Remarks about the boundary and outside the interval: If $\eta = 0$, the iterates are constant (no progress). If $\eta = 2/\lambda_{\max}(H)$, then for an index $i$ achieving $\lambda_i = \lambda_{\max}$ we have $|1 - \eta\lambda_i| = 1$, so the corresponding coordinate oscillates with constant magnitude and convergence generally fails (unless that coordinate is zero initially). If $\eta > 2/\lambda_{\max}(H)$ (or $\eta \le 0$), then some eigen-coordinate satisfies $|1 - \eta\lambda_i| \ge 1$, hence there exists an initial $e_0$ for which $e_t$ does not go to zero (it either diverges or fails to converge).
This completes the proof. □
Theorem A2 (Linear/Exponential Convergence Rate).
Under A1–A3 with  0 < η < 2 / L , the iterates converge linearly:
$$\|W_t - W^\ast\|_2 \ \le\ (1 - \eta\mu)^t\,\|W_0 - W^\ast\|_2$$
Equivalently,
$$F(W_t) - F(W^\ast) \ \le\ (1 - \eta\mu)^{2t}\,\big(F(W_0) - F(W^\ast)\big)$$
Proof. 
Since $F$ is quadratic and $H$ is positive definite ($\lambda > 0$), the minimizer $W^\ast$ is unique and satisfies $\nabla F(W^\ast) = 0$. From Lemma A2, we have the error recursion for $e_t \equiv W_t - W^\ast$:
$$e_{t+1} = (I - \eta H)\,e_t$$
Because $H$ is symmetric positive definite, it has the eigen-decomposition $H = Q\Lambda Q^\top$ with orthogonal $Q$ and
$$\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_m), \qquad 0 < \mu = \lambda_1 \le \cdots \le \lambda_m = L$$
We express the recursion in eigen-coordinates:
$$z_t \equiv Q^\top e_t, \qquad z_{t+1} = (I - \eta\Lambda)\,z_t$$
Thus, each coordinate evolves as
$$z_{i,t} = (1 - \eta\lambda_i)^t\,z_{i,0}$$
Therefore,
$$\|e_t\|_2 = \|z_t\|_2 \ \le\ \max_i |1 - \eta\lambda_i|^t\,\|z_0\|_2 = \rho(\eta)^t\,\|e_0\|_2$$
The condition $0 < \eta < 2/L$ ensures
$$|1 - \eta\lambda_i| < 1 \quad \text{for all } i$$
Thus $\rho(\eta) < 1$, proving linear convergence. If, moreover, $0 < \eta \le 1/L$, then for all $i$ we have
$$0 \ \le\ 1 - \eta\lambda_i \ \le\ 1 - \eta\mu$$
Thus
$$\rho(\eta) = 1 - \eta\mu$$
and hence
$$\|W_t - W^\ast\|_2 \ \le\ (1 - \eta\mu)^t\,\|W_0 - W^\ast\|_2$$
Since $F$ is quadratic,
$$F(W_t) - F(W^\ast) = \frac{1}{2}(W_t - W^\ast)^\top H (W_t - W^\ast)$$
Using $\|H\| = L$ and the previous contraction,
$$F(W_t) - F(W^\ast) \ \le\ \frac{L}{2}\|W_t - W^\ast\|_2^2 \ \le\ \frac{L}{2}(1 - \eta\mu)^{2t}\,\|W_0 - W^\ast\|_2^2$$
Comparison with the corresponding expression for $F(W_0) - F(W^\ast)$ yields
$$F(W_t) - F(W^\ast) \ \le\ (1 - \eta\mu)^{2t}\,\big(F(W_0) - F(W^\ast)\big)$$
This completes the proof. □
Theorem A3 (Effect of L2 Regularization on Conditioning).
Let $G = \frac{1}{n}\Phi^\top\Phi$ be the (symmetric) Gram matrix of the wavelet features and fix $\lambda > 0$. Define
$$H_\lambda = G + \lambda I$$
Then the condition number of $H_\lambda$ satisfies
$$\kappa(H_\lambda) = \frac{\lambda_{\max}(G) + \lambda}{\lambda_{\min}(G) + \lambda}$$
The consequences are as follows:
  • If $G$ is singular ($\lambda_{\min}(G) = 0$), then $\kappa(H_\lambda) = \big(\lambda_{\max}(G) + \lambda\big)/\lambda$. In particular, any $\lambda > 0$ makes $H_\lambda$ strictly positive definite.
  • $\kappa(H_\lambda)$ is monotone nonincreasing in $\lambda > 0$; moreover, $\lim_{\lambda\to\infty} \kappa(H_\lambda) = 1$.
  • As $\lambda \to 0^+$, $\kappa(H_\lambda) \to \kappa(G)$ when $\lambda_{\min}(G) > 0$, and $\kappa(H_\lambda) \to \infty$ when $\lambda_{\min}(G) = 0$.
Proof. 
Step 1: Eigenvalue shift identity.
Because $G$ is symmetric, it has a real eigen-decomposition $G = Q\Gamma Q^\top$ with $\Gamma = \mathrm{diag}(\gamma_1, \ldots, \gamma_m)$ and $\gamma_1 \le \cdots \le \gamma_m$ (each $\gamma_i \ge 0$ since $G \succeq 0$). Then
$$H_\lambda = G + \lambda I = Q\,(\Gamma + \lambda I)\,Q^\top$$
Hence the eigenvalues of $H_\lambda$ are $\gamma_i + \lambda$ for $i = 1, \ldots, m$. In particular,
$$\lambda_{\min}(H_\lambda) = \gamma_1 + \lambda = \lambda_{\min}(G) + \lambda$$
$$\lambda_{\max}(H_\lambda) = \gamma_m + \lambda = \lambda_{\max}(G) + \lambda$$
Therefore, the condition number is
$$\kappa(H_\lambda) = \frac{\lambda_{\max}(H_\lambda)}{\lambda_{\min}(H_\lambda)} = \frac{\lambda_{\max}(G) + \lambda}{\lambda_{\min}(G) + \lambda}$$
which proves the identity in the theorem.
Step 2: Singular G case.
If $\lambda_{\min}(G) = 0$ (i.e., $G$ is singular), then the identity above yields
$$\kappa(H_\lambda) = \frac{\lambda_{\max}(G) + \lambda}{0 + \lambda} = \frac{\lambda_{\max}(G) + \lambda}{\lambda}$$
Since λ > 0 , the denominator is positive; hence H λ is strictly positive definite for every λ > 0 . This shows any positive λ regularizes away singular directions.
Step 3: Monotonicity in λ.
Set $a = \lambda_{\max}(G)$ and $b = \lambda_{\min}(G)$, so $0 \le b \le a$. Define
$$\kappa(\lambda) = \frac{a + \lambda}{b + \lambda}, \qquad \lambda > 0$$
Differentiating with respect to $\lambda$:
$$\kappa'(\lambda) = \frac{(b + \lambda) - (a + \lambda)}{(b + \lambda)^2} = \frac{b - a}{(b + \lambda)^2}$$
Because $a \ge b$, the numerator $b - a \le 0$; hence $\kappa'(\lambda) \le 0$ for all $\lambda > 0$. If $a > b$, then $\kappa'(\lambda) < 0$, so $\kappa$ is strictly decreasing; if $a = b$ (i.e., $G$ is a scalar multiple of the identity), then $\kappa'(\lambda) = 0$ and $\kappa$ is constant (equal to 1). Thus, $\kappa(\lambda)$ is monotone nonincreasing in $\lambda$.
Step 4: Limit as λ → ∞:
Compute
$$\lim_{\lambda\to\infty} \kappa(\lambda) = \lim_{\lambda\to\infty} \frac{a + \lambda}{b + \lambda} = \lim_{\lambda\to\infty} \frac{1 + a/\lambda}{1 + b/\lambda} = 1$$
Thus, conditioning improves (approaches optimal value 1) as λ becomes very large.
Step 5: Limit as $\lambda \to 0^+$.
If $b > 0$ (i.e., $G$ is full-rank), then
$$\lim_{\lambda\to 0^+} \kappa(\lambda) = \frac{a}{b} = \kappa(G)$$
If $b = 0$ (singular $G$), then
$$\kappa(\lambda) = \frac{a + \lambda}{\lambda} = 1 + \frac{a}{\lambda} \ \longrightarrow\ +\infty$$
Thus, the condition number tends to the unregularized condition number when G is invertible and diverges to + when G is singular.
Step 6: Practical interpretation (optional short remark).
Because the convergence rate of gradient descent (with the optimal fixed step) depends monotonically on $\kappa(H)$, e.g., the optimal contraction factor $\rho^\ast = \frac{\kappa - 1}{\kappa + 1}$ improves as $\kappa$ decreases, the monotone decrease in $\kappa(\lambda)$ with $\lambda$ shows that increasing $\lambda$ (weight decay) strictly improves the spectral conditioning of the optimization problem and thus can accelerate linear convergence phases. The trade-off is that a large $\lambda$ also biases the estimator (adds shrinkage).
This completes the proof of Theorem A3 and the derived consequences. □
Theorem A4 (Polyak–Łojasiewicz Inequality for Wavelet Neural Networks with L2 Regularization).
Under A1–A3, the loss satisfies the Polyak–Łojasiewicz (PL) inequality:
$$\frac{1}{2}\|\nabla F(W)\|^2 \ \ge\ \mu\,\big(F(W) - F(W^\ast)\big)$$
Thus, WNNs with L2 regularization behave like strictly convex quadratics in optimization, even though the underlying model is nonlinear in input space.
Proof. 
The proof proceeds in two parts.
Part A—Exact PL for the ridge (fixed-feature) case
Set $v = W - W^\ast$. Then
$$\nabla F(W) = H v, \qquad F(W) - F(W^\ast) = \frac{1}{2} v^\top H v$$
Computing the squared gradient norm,
$$\|\nabla F(W)\|_2^2 = v^\top H^2 v$$
Because $H$ is symmetric positive definite, we may compare $H^2$ and $H$. Write
$$H^2 - \mu H = H^{1/2}\big(H - \mu I\big)H^{1/2} \ \succeq\ 0$$
since $H - \mu I \succeq 0$ by the definition of $\mu = \lambda_{\min}(H)$. Therefore $H^2 \succeq \mu H$, and hence
$$v^\top H^2 v \ \ge\ \mu\, v^\top H v = 2\mu\,\big(F(W) - F(W^\ast)\big)$$
Dividing both sides by two yields the PL inequality:
$$\frac{1}{2}\|\nabla F(W)\|^2 \ \ge\ \mu\,\big(F(W) - F(W^\ast)\big)$$
Part B—PL from a uniform Hessian lower bound (sufficient condition for trainable WNN)
Fix any $\theta \in D$. Using the fundamental theorem of calculus for gradients (mean value form), we can write
$$\nabla F(\theta) - \nabla F(\theta^\ast) = \left(\int_0^1 \nabla^2 F\big(\theta^\ast + t(\theta - \theta^\ast)\big)\,dt\right)(\theta - \theta^\ast)$$
But $\nabla F(\theta^\ast) = 0$ (first-order optimality). We define the averaged Hessian
$$\bar{H} = \int_0^1 \nabla^2 F\big(\theta^\ast + t(\theta - \theta^\ast)\big)\,dt$$
Using the pointwise lower bound $\nabla^2 F(\cdot) \succeq \mu I$, we have $\bar{H} \succeq \mu I$. Hence
$$\|\nabla F(\theta)\|_2^2 = (\theta - \theta^\ast)^\top \bar{H}^2 (\theta - \theta^\ast) \ \ge\ \mu\,(\theta - \theta^\ast)^\top \bar{H} (\theta - \theta^\ast)$$
But by the same integral representation,
$$F(\theta) - F(\theta^\ast) = \frac{1}{2}(\theta - \theta^\ast)^\top \bar{H} (\theta - \theta^\ast)$$
Combining these two, we have
$$\frac{1}{2}\|\nabla F(\theta)\|_2^2 \ \ge\ \mu\cdot\frac{1}{2}(\theta - \theta^\ast)^\top \bar{H} (\theta - \theta^\ast) = \mu\,\big(F(\theta) - F(\theta^\ast)\big)$$
This proves the PL inequality on $D$. □

References

  1. Wu, J.; Li, J.; Yang, J.; Mei, S. Wavelet-integrated deep neural networks: A systematic review of applications and synergistic architectures. Neurocomputing 2025, 657, 131648. [Google Scholar] [CrossRef]
  2. Kio, A.E.; Xu, J.; Gautam, N.; Ding, Y. Wavelet decomposition and neural networks: A potent combination for short term wind speed and power forecasting. Front. Energy Res. 2024, 12, 1277464. [Google Scholar] [CrossRef]
  3. Cui, Z.; Ke, R.; Pu, Z.; Ma, X.; Wang, Y. Learning traffic as a graph: A gated graph wavelet recurrent neural network for network-scale traffic prediction. Transp. Res. Part C Emerg. Technol. 2020, 115, 102620. [Google Scholar] [CrossRef]
  4. Lucas, F.; Costa, P.; Batalha, R.; Leite, D.; Škrjanc, I. Fault detection in smart grids with time-varying distributed generation using wavelet energy and evolving neural networks. Evol. Syst. 2020, 11, 165–180. [Google Scholar] [CrossRef]
  5. Baharlouei, Z.; Rabbani, H.; Plonka, G. Wavelet scattering transform application in classification of retinal abnormalities using OCT images. Sci. Rep. 2023, 13, 19013. [Google Scholar] [CrossRef] [PubMed]
  6. Wang, J.; Wang, Z.; Li, J.; Wu, J. Multilevel wavelet decomposition network for interpretable time series analysis. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 2437–2446. [Google Scholar]
  7. Grieshammer, M.; Pflug, L.; Stingl, M.; Uihlein, A. The continuous stochastic gradient method: Part I–convergence theory. Comput. Optim. Appl. 2024, 87, 935–976. [Google Scholar] [CrossRef]
  8. Xia, L.; Massei, S.; Hochstenbach, M.E. On the convergence of the gradient descent method with stochastic fixed-point rounding errors under the Polyak–Łojasiewicz inequality. Comput. Optim. Appl. 2025, 90, 753–799. [Google Scholar] [CrossRef]
  9. Galanti, T.; Siegel, Z.S.; Gupte, A.; Poggio, T. SGD and Weight Decay Provably Induce a Low-Rank Bias in Neural Networks. 2023. Available online: https://hdl.handle.net/1721.1/148231 (accessed on 28 November 2025).
  10. Jacot, A.; Gabriel, F.; Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. Adv. Neural Inf. Process. Syst. 2018, 31, 1–10. [Google Scholar]
  11. Somvanshi, S.; Javed, S.A.; Islam, M.M.; Pandit, D.; Das, S. A survey on kolmogorov-arnold network. ACM Comput. Surv. 2025, 58, 1–35. [Google Scholar] [CrossRef]
  12. Medvedev, M.; Vardi, G.; Srebro, N. Overfitting behaviour of gaussian kernel ridgeless regression: Varying bandwidth or dimensionality. Adv. Neural Inf. Process. Syst. 2024, 37, 52624–52669. [Google Scholar]
  13. Shang, Z.; Zhao, Z.; Yan, R. Denoising fault-aware wavelet network: A signal processing informed neural network for fault diagnosis. Chin. J. Mech. Eng. 2023, 36, 9. [Google Scholar] [CrossRef]
  14. Saragadam, V.; LeJeune, D.; Tan, J.; Balakrishnan, G.; Veeraraghavan, A.; Baraniuk, R.G. Wire: Wavelet implicit neural representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18507–18516. [Google Scholar]
  15. Sadoon, G.A.A.S.; Almohammed, E.; Al-Behadili, H.A. Wavelet neural networks in signal parameter estimation: A comprehensive review for next-generation wireless systems. In Proceedings of the AIP Conference Proceedings, Pune, India, 18–19 May 2024; AIP Publishing LLC.: College Park, MD, USA, 2025; Volume 3255, p. 020014. [Google Scholar]
  16. Akujuobi, C.M. Wavelets and Wavelet Transform Systems and Their Applications; Springer International Publishing: Berlin/Heidelberg, Germany, 2022. [Google Scholar]
  17. Wang, P.; Wen, Z. A spatiotemporal graph wavelet neural network (ST-GWNN) for association mining in timely social media data. Sci. Rep. 2024, 14, 31155. [Google Scholar] [CrossRef]
  18. Uddin, Z.; Ganga, S.; Asthana, R.; Ibrahim, W. Wavelets based physics informed neural networks to solve non-linear differential equations. Sci. Rep. 2023, 13, 2882. [Google Scholar] [CrossRef] [PubMed]
  19. Jung, H.; Lodhi, B.; Kang, J. An automatic nuclei segmentation method based on deep convolutional neural networks for histopathology images. BMC Biomed. Eng. 2019, 1, 24. [Google Scholar] [CrossRef]
  20. Xiao, Q.; Lu, S.; Chen, T. An alternating optimization method for bilevel problems under the Polyak-Łojasiewicz condition. Adv. Neural Inf. Process. Syst. 2023, 36, 63847–63873. [Google Scholar]
  21. Yazdani, K.; Hale, M. Asynchronous parallel nonconvex optimization under the polyak-łojasiewicz condition. IEEE Control Syst. Lett. 2021, 6, 524–529. [Google Scholar] [CrossRef]
  22. Chen, K.; Yi, C.; Yang, H. Towards Better Generalization: Weight Decay Induces Low-rank Bias for Neural Networks. arXiv 2024, arXiv:2410.02176. [Google Scholar] [CrossRef]
  23. Kobayashi, S.; Akram, Y.; Von Oswald, J. Weight decay induces low-rank attention layers. Adv. Neural Inf. Process. Syst. 2024, 37, 4481–4510. [Google Scholar]
  24. Seleznova, M.; Kutyniok, G. Analyzing finite neural networks: Can we trust neural tangent kernel theory? In Mathematical and Scientific Machine Learning; PMLR: Birmingham, UK, 2022; pp. 868–895. [Google Scholar]
  25. Tan, Y.; Liu, H. How does a kernel based on gradients of infinite-width neural networks come to be widely used: A review of the neural tangent kernel. Int. J. Multimed. Inf. Retr. 2024, 13, 8. [Google Scholar] [CrossRef]
  26. Tang, A.; Wang, J.B.; Pan, Y.; Wu, T.; Chen, Y.; Yu, H.; Elkashlan, M. Revisiting XL-MIMO channel estimation: When dual-wideband effects meet near field. IEEE Trans. Wirel. Commun. 2025. [Google Scholar] [CrossRef]
  27. Cui, Z.X.; Zhu, Q.; Cheng, J.; Zhang, B.; Liang, D. Deep unfolding as iterative regularization for imaging inverse problems. Inverse Probl. 2024, 40, 025011. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
