Axioms | Open Access Article | 4 December 2025
The Module Gradient Descent Algorithm via L2 Regularization for Wavelet Neural Networks

1 Department of Mathematics, College of Sciences, Qassim University, Buraydah 51452, Saudi Arabia
2 General Department of College of Technical Engineering, Bright Star University, Al-Brega P.O. Box 858, Libya
3 Department of Computer Engineering, College of Computer, Qassim University, Buraydah 51452, Saudi Arabia
4 Department of Physics, College of Science, Qassim University, Buraydah 51452, Saudi Arabia
Axioms 2025, 14(12), 899; https://doi.org/10.3390/axioms14120899

Abstract

Although wavelet neural networks (WNNs) combine the expressive capability of neural models with multiscale localization, there are currently few theoretical guarantees for their training. We investigate the optimization dynamics of gradient descent (GD) with weight decay (L2 regularization) for WNNs. First, using explicit rates controlled by the spectrum of the regularized Gram matrix, we demonstrate global linear convergence to the unique ridge solution in the fixed-feature regime, where the wavelet atoms are frozen and only the linear head is trained. Second, for fully trainable WNNs, we establish convergence of GD to stationary points under standard smoothness and boundedness of the wavelet parameters, and we prove linear rates in regions satisfying a Polyak–Łojasiewicz (PL) inequality; weight decay enlarges these regions by suppressing flat directions. Third, we characterize the implicit bias in the over-parameterized neural tangent kernel (NTK) regime: with L2 regularization, GD converges to the minimum reproducing kernel Hilbert space (RKHS) norm interpolant associated with the WNN kernel. We supplement the theory with an evaluation protocol on synthetic regression, denoising, and ablations across λ and step size, together with practical recommendations on initialization, step-size schedules, and regularization scales. Together, our findings give a principled prescription for dependable training, shed light on when and why L2-regularized GD is stable and fast for WNNs, and apply broadly to signal processing tasks.

1. Introduction

Wavelet neural networks (WNNs) are an appealing family of models because they combine multiscale, well-localized representations with the universal approximation capability of neural structures. Recent work ranges from graph-based wavelet neural networks to “decompose-then-learn” pipelines to deep models enhanced with wavelet transforms, demonstrating that wavelet structure can improve stability, efficiency, and interpretability across applications in vision, signal processing, and energy systems [1,2,3,4]. The optimization theory governing the training dynamics of WNNs is significantly less studied than that of other architectures, especially when gradient descent (GD) with L2 regularization (weight decay) is employed. This gap between theoretical knowledge and real progress is the main topic of this research. Wavelet advantages in practice manifest along two axes: (a) multiscale localization, which captures local phenomena at various frequencies without losing broader context, and (b) computational efficiency when employing learnable or fixed wavelet filters, which reduces sensitivity to small perturbations in the data and speeds up training.
Baharlouei et al. categorize retinal optical coherence tomography anomalies using wavelet scattering, which has a lower processing cost and greater robustness than more complex deep alternatives. Graph wavelet neural networks efficiently capture spatiotemporal dependencies, while other recent studies integrate wavelet decompositions into deep models for complex time series, such as wind power [5,6]. These empirical findings pose a fundamental question: when and why does GD with L2 converge steadily and rapidly for WNNs?
Two regimes make the interaction of L2 with GD particularly salient: (1) the fixed-feature regime, in which only a linear head is trained and the wavelet dictionary is frozen; this reduces the objective to ridge regression, which is strongly convex and hence admits global linear convergence rates; and (2) the fully trainable regime, in which the wavelet parameters (translations/dilations) and the weights are tuned jointly; in this case, conditions such as the Polyak–Łojasiewicz (PL) inequality provide generic tools for establishing linear rates within appropriate landscape regions [7,8]. From an optimization perspective, L2 regularization is one of the most widely used tools. Beyond its traditional role in conditioning and in imparting effective curvature (for example, for a linear head on fixed features), L2 also has an implicit bias in deep networks. As recent works demonstrate, weight decay encourages low-rank tendencies and selects smaller-norm solutions in parameter or function space, and these phenomena are closely related to convergence speed and stability [9].
Another theoretical viewpoint is provided by the neural tangent kernel (NTK). Between 2022 and 2024, NTK theory became a solid foundation for understanding over-parameterized training dynamics. In this regime, a nonlinear network behaves linearly in function space around initialization with respect to an effective kernel, and GD with L2 becomes equivalent to GD in the corresponding RKHS. Surveys and practical evaluations suggest that this perspective accounts for both the minimum-norm bias and the convergence rate (via the kernel spectrum): under interpolation, L2-regularized GD selects the minimum-RKHS-norm solution among all interpolants [10]. Our contribution is to specialize this picture to WNNs, and we therefore ask: what does the WNN-specific NTK look like under realistic initialization and bounded dilation/shift ranges? How do wavelet choices affect the spectral constant governing the rate?
In-depth analyses of wavelet–deep integration reveal that training dynamics guarantees, especially with L2, remain absent despite steady progress, and most contributions concentrate on architectural design or empirical advantages. Even with forward-looking hybrids like Wav-KAN (2024) [11], sharp statements concerning rates, step-size/regularization requirements, and stability bounds are still uncommon compared to what is available for linear models or for neural networks under restrictive assumptions. This disparity has practical implications: without principled guidance, practitioners are forced to rely on costly trial-and-error to tune the learning rate (η) and regularization strength (λ), and it becomes harder to ascertain why training is effective or ineffective.
This paper’s contributions are as follows. We provide a unified analysis of convergence for gradient descent on WNNs in three regimes with L2 regularization:
(a)
A linear-head regime over a fixed wavelet dictionary. With explicit rates controlled by the eigenvalues of $\Phi^\top\Phi/n + \lambda I$, we demonstrate global linear convergence to the unique ridge minimizer. This yields practical guidance for treating $\lambda$ as a conditioning lever rather than just an anti-overfitting knob.
(b)
Fully trainable (nonconvex) WNNs. GD converges to stationary points and enjoys linear rates within regions satisfying a PL inequality, under natural smoothness and boundedness conditions on the wavelets (over a restricted dilation/shift domain). We give implementable step-size bounds and demonstrate how L2 dampens flat directions to widen PL basins.
(c)
An over-parameterized (NTK) regime. We extract rate constants from the kernel spectrum induced by the wavelet dictionary and demonstrate that L2 directs GD toward the minimum-RKHS-norm interpolant associated with the WNN-specific NTK.
To make our methodology concrete, our theory suggests the following training regimen: sample translations from the empirical input distribution and select dilations log-uniformly over a bounded range to ensure stability without sacrificing fit; estimate a Lipschitz proxy to set η; sweep λ over practical ranges; and monitor the near-constant contraction of gradient norms on a semi-log scale to track the onset of a linear convergence phase. We also propose an evaluation protocol (synthetic approximation and denoising benchmarks) to support our theoretical claims and provide insight into regime transitions as λ and η change, in line with existing guidance on convergence diagnostics and implicit regularization [12,13,14]. In doing so, we bridge a persistent gap between the empirical richness of wavelet-based models and the theoretical understanding of their training dynamics under L2-regularized gradient descent.
We organize the remainder of this paper as follows. Section 2 reviews the related literature. Section 3 sets up the WNN model, assumptions, and the three training regimes. Section 4 describes the training methodology (GD/SGD with weight decay) and the associated prescriptions. Section 5 presents the main convergence results. Section 6 provides numerical experiments that support the theory. Section 7 discusses practical implications and limitations, and Section 8 concludes. Proofs of the theorems are relegated to Appendix A.

3. Preliminaries and Problem Setup

3.1. Wavelet Neural Network (WNN) Model

Let $\psi : \mathbb{R}^d \to \mathbb{R}$ be a mother wavelet. For parameters $\theta_j = (a_j, b_j)$ with dilation $a_j > 0$ and translation $b_j \in \mathbb{R}^d$, define the atom
$$\phi_j(x; \theta_j) = \psi\!\left(\frac{x - b_j}{a_j}\right), \qquad \theta_j = (a_j, b_j)$$
A single-hidden-layer WNN with $m$ atoms outputs
$$z(x; W, \Theta) = \sum_{j=1}^{m} w_j\,\phi_j(x; \theta_j)$$
Given data $\{(x_i, y_i)\}_{i=1}^{n}$, let $\Phi \in \mathbb{R}^{n \times m}$ with $\Phi_{ij} = \phi_j(x_i; \theta_j)$.
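To make the model concrete, the following minimal NumPy sketch builds the feature matrix $\Phi$ and the WNN output for a Mexican hat mother wavelet (the wavelet adopted in Section 3.2); the dictionary size, dilation range, and weight scale in the example are illustrative choices rather than the paper's prescribed values.

```python
import numpy as np

def mexican_hat(u):
    """Mother wavelet psi(u) = (1 - u^2) * exp(-u^2 / 2)."""
    return (1.0 - u**2) * np.exp(-u**2 / 2.0)

def wnn_features(x, a, b):
    """Feature matrix Phi with Phi[i, j] = psi((x_i - b_j) / a_j)."""
    U = (x[:, None] - b[None, :]) / a[None, :]   # n x m matrix of rescaled shifts
    return mexican_hat(U)

def wnn_forward(x, w, a, b):
    """Single-hidden-layer WNN: z(x; W, Theta) = sum_j w_j * psi((x - b_j) / a_j)."""
    return wnn_features(x, a, b) @ w

# Example with m = 8 atoms on n = 200 scalar inputs (all sizes illustrative)
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
a = np.exp(rng.uniform(np.log(0.1), np.log(1.0), 8))  # dilations, log-uniform over a bounded range
b = rng.choice(x, size=8)                             # translations drawn from the input sample
w = rng.normal(0.0, 0.05, 8)
z = wnn_forward(x, w, a, b)                           # network output on all inputs
```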

3.2. Assumptions (Wavelets, Data, and Loss)

Wavelet atoms at multiple dilations are obtained by rescaling a single mother wavelet, so that the same waveform analyzes a signal at several resolutions. For example, for the Mexican hat (Marr) wavelet (see Figure 1), the atoms are obtained by scaling the Mexican hat mother wavelet
$$\psi(x) = (1 - x^2)\,e^{-x^2/2}$$
The curves show $\psi_a(x) = a^{-1/2}\,\psi(x/a)$ for $x \in [-5, 5]$ and dilations $a \in \{0.5,\ 1,\ 2,\ 4\}$. The number of points (N = 2000) allows for the drawing of smooth, high-precision curves. The amplitude normalization $a^{-1/2}$ preserves the L2 energy across scales.
Figure 1. Wavelet atoms at multiple dilations (Mexican hat).
Here, we adopt standard assumptions:
  • (A1) ψ is twice continuously differentiable with bounded value, gradient, and Hessian;
  • (A2) dilations and translations are bounded ($0 < a_{\min} \le a_j \le a_{\max}$ and $\|b_j\| \le B$);
  • (A3) inputs lie in a compact set;
  • (A4) the loss is smooth (and strongly convex in its first argument for squared loss);
  • (A5) feature Jacobians w.r.t. $\theta$ are uniformly bounded. Under (A1)–(A5), $\nabla L$ is Lipschitz on the feasible set; in particular, the head-only objective in $W$ is strongly convex due to $\lambda$.

3.3. Training Algorithms (GD/SGD with Weight Decay)

We minimize the L2-regularized empirical risk:
$$L(W, \Theta) = \frac{1}{2n}\sum_{i=1}^{n}\big(z(x_i; W, \Theta) - y_i\big)^2 + \frac{\lambda}{2}\big(\|W\|_2^2 + \|\Theta\|_2^2\big)$$
With squared loss, the gradient w.r.t. W is
$$\nabla_W L = \frac{1}{n}\Phi^\top(\Phi W - y) + \lambda W$$
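For fixed $\Theta$ (with feature matrix $\Phi$), a minimal sketch of this gradient and of one weight-decayed GD step on the linear head, in the same NumPy style, is:

```python
import numpy as np

def grad_W(Phi, W, y, lam):
    """Gradient of (1/2n)||Phi W - y||^2 + (lam/2)||W||^2 with respect to W."""
    n = Phi.shape[0]
    return Phi.T @ (Phi @ W - y) / n + lam * W

def gd_step_W(Phi, W, y, lam, eta):
    """One gradient-descent step on the linear head: W <- W - eta * grad."""
    return W - eta * grad_W(Phi, W, y, lam)
```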

3.4. Problem Decompositions: Three Regimes

  • (R1) Fixed-feature (linear head/ridge). Freezing $\Theta$ reduces the problem to ridge regression, with Hessian $H = \Phi^\top\Phi/n + \lambda I \succeq \lambda I$, and GD enjoys global linear convergence for $0 < \eta < 2/\lambda_{\max}(H)$:
$$\min_{W}\ \frac{1}{2n}\|\Phi W - y\|_2^2 + \frac{\lambda}{2}\|W\|_2^2$$
  • (R2) Fully trainable WNN (nonconvex). Both $W$ and $\Theta$ are updated; we obtain convergence to stationary points and linear phases under a Polyak–Łojasiewicz (PL) inequality on $L$:
$$\frac{1}{2}\|\nabla L(\theta)\|_2^2 \ \ge\ \mu_{PL}\,\big(L(\theta) - L^\ast\big)$$
  • (R3) Over-parameterized (NTK/linearization). For large $m$ and small $\eta$, dynamics linearize around initialization; function updates follow kernel GD with a WNN-specific NTK $K$:
$$z_{t+1} = z_t - \eta\,K\,(z_t - y)$$

3.5. PL Inequality and Its Role

If $L$ satisfies a PL inequality with constant $\mu_{PL}$ on a domain $D$ and $\nabla L$ is $L$-Lipschitz, then for $0 < \eta \le 1/L$, gradient descent yields geometric decay $L(\theta_t) - L^\ast \le (1 - \eta\mu_{PL})^t\,\big(L(\theta_0) - L^\ast\big)$. In WNNs, L2 enlarges PL regions by damping flat directions along the scale/shift parameters.
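For completeness, the standard one-step argument behind this geometric decay combines the descent lemma for an $L$-smooth objective (with $0 < \eta \le 1/L$) with the PL inequality:
$$L(\theta_{t+1}) \ \le\ L(\theta_t) - \eta\|\nabla L(\theta_t)\|^2 + \frac{L\eta^2}{2}\|\nabla L(\theta_t)\|^2 \ \le\ L(\theta_t) - \frac{\eta}{2}\|\nabla L(\theta_t)\|^2$$
Subtracting $L^\ast$ and applying $\frac{1}{2}\|\nabla L(\theta_t)\|^2 \ge \mu_{PL}\big(L(\theta_t) - L^\ast\big)$ gives $L(\theta_{t+1}) - L^\ast \le (1 - \eta\mu_{PL})\big(L(\theta_t) - L^\ast\big)$; iterating over $t$ yields the stated geometric rate.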

3.6. NTK for WNNs (Overview)

The neural tangent kernel (NTK) is a kernel function that describes how a neural network’s output changes during training with gradient descent. It allows for the theoretical analysis of neural network training by reframing the problem as a kernel method, especially in the limit of infinite width where the kernel becomes constant during training. The NTK is calculated as the dot product of the gradients of the network’s output with respect to its parameters for two different inputs. Let K denote the NTK at initialization. With small learning rates and large width, training follows kernel GD in the RKHS induced by K; convergence speed is governed by the spectrum { λ i ( K ) } . With L2, GD converges to the minimum-RKHS-norm solution among all interpolants.
Let the WNN output be parameterized by $\theta$ (all trainable parameters). For an input $x$, the network output is $f(x; \theta)$. At initialization $\theta_0$, the NTK matrix $K \in \mathbb{R}^{N \times N}$ on inputs $\{x_i\}_{i=1}^{N}$ is
$$K_{ij} = \nabla_\theta f(x_i; \theta_0)^\top\,\nabla_\theta f(x_j; \theta_0)$$
Equivalently, if $J \in \mathbb{R}^{N \times P}$ is the Jacobian whose $i$-th row is $J_i = \nabla_\theta f(x_i; \theta_0)^\top$, then $K = J J^\top$.
Using Figure 2, we can visualize the NTK spectrum at initialization. The input grid has N = 150 points uniformly spaced on [0, 1]. The WNN architecture has a single hidden layer with M = 30 wavelet atoms. Each atom contributes three trainable parameters (amplitude, scale, translation), so the parameter vector $\theta$ has length P = 3M = 90. The initialization used to compute the NTK draws amplitudes $c_j \sim \mathcal{N}(0, 0.5^2)$, scales $a_j \sim \mathcal{N}(1.0, 0.1^2)$, and translations $b_j \sim \mathcal{U}(0, 1)$. The full N × N NTK matrix is used to produce Figure 2, which plots the eigenvalues of the NTK for the WNN on a log scale, showing how the spectrum decays across modes.
Figure 2. WNN-NTK eigenvalue decay (illustrative).
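For concreteness, a minimal NumPy sketch of this NTK construction for the setup just described (Mexican hat atoms, with the per-parameter derivatives written out analytically) is shown below; it is an illustrative reconstruction, not the exact script used to produce Figure 2.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 150, 30                            # input grid size and number of wavelet atoms
x = np.linspace(0.0, 1.0, N)

c = rng.normal(0.0, 0.5, M)               # amplitudes   c_j ~ N(0, 0.5^2)
a = rng.normal(1.0, 0.1, M)               # scales       a_j ~ N(1.0, 0.1^2)
b = rng.uniform(0.0, 1.0, M)              # translations b_j ~ U(0, 1)

def psi(u):                               # Mexican hat: (1 - u^2) exp(-u^2/2)
    return (1.0 - u**2) * np.exp(-u**2 / 2.0)

def dpsi(u):                              # derivative: u (u^2 - 3) exp(-u^2/2)
    return u * (u**2 - 3.0) * np.exp(-u**2 / 2.0)

# Jacobian J (N x 3M) of f(x) = sum_j c_j psi((x - b_j)/a_j) w.r.t. (c, a, b)
U = (x[:, None] - b[None, :]) / a[None, :]
J = np.hstack([
    psi(U),                               # d f / d c_j = psi(u_j)
    -c * dpsi(U) * U / a,                 # d f / d a_j = -c_j psi'(u_j) u_j / a_j
    -c * dpsi(U) / a,                     # d f / d b_j = -c_j psi'(u_j) / a_j
])

K = J @ J.T                               # NTK at initialization, K = J J^T
eigs = np.linalg.eigvalsh(K)[::-1]        # decaying eigenvalue spectrum (cf. Figure 2)
print(eigs[:5])
```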
Figure 2 highlights why NTK eigenvalue behavior matters for wavelet neural networks, especially under L2 regularization. It helps explain, both visually and mathematically, why WNNs can train quickly, converge consistently, and benefit from L2 regularization, by linking optimization theory, NTK spectral properties, convergence behavior, and multiscale feature learning.
A small toy NTK matrix with this kind of off-diagonal decay is, for illustration,
$$K_{\mathrm{toy}} = \begin{pmatrix} 1.20 & 0.83 & 0.45 & 0.12 & 0.05 & 0.02\\ 0.83 & 1.30 & 0.76 & 0.22 & 0.07 & 0.03\\ 0.45 & 0.76 & 1.10 & 0.48 & 0.18 & 0.06\\ 0.12 & 0.22 & 0.48 & 0.98 & 0.54 & 0.02\\ 0.05 & 0.07 & 0.18 & 0.54 & 0.95 & 0.48\\ 0.02 & 0.03 & 0.06 & 0.20 & 0.48 & 0.88 \end{pmatrix}$$

3.7. Step-Size and Regularization Prescriptions (Preview)

  • Head-only (R1): choose $\eta \in (0, 2/L)$ with $L = \lambda_{\max}(H)$; increasing $\lambda$ raises $\mu = \lambda_{\min}(H)$ and improves the condition number $L/\mu$ (a minimal sketch for estimating these spectral quantities follows this list).
  • Full WNN (R2): use a conservative $\eta \lesssim 1/L_{\mathrm{emp}}$ (empirical Lipschitz proxy); increase $\lambda$ if gradient-norm contraction stalls.
  • NTK (R3): stable for $\eta < 2/\lambda_{\max}(K)$; L2 controls norm growth and selects the minimum-RKHS-norm interpolant.
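A minimal sketch for estimating these spectral quantities from a fixed feature matrix $\Phi$ (the function names and the safety factor are illustrative):

```python
import numpy as np

def ridge_hessian(Phi, lam):
    """H = Phi^T Phi / n + lam * I for the head-only (R1) objective."""
    n, m = Phi.shape
    return Phi.T @ Phi / n + lam * np.eye(m)

def stepsize_and_condition(Phi, lam, safety=0.9):
    """Stable step size eta = safety * 2 / lambda_max(H) and the condition number L / mu."""
    eigs = np.linalg.eigvalsh(ridge_hessian(Phi, lam))
    L, mu = eigs[-1], eigs[0]
    return safety * 2.0 / L, L / mu
```

Increasing lam lowers the returned condition number, which is the conditioning effect quantified in Theorem A3.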
Figure 3 shows the gradient-norm decay, with a linear phase evident on a semi-log scale: $\|\nabla L(\theta_t)\|$ decays approximately exponentially in $t$.
Figure 3. Gradient-norm decay (linear phase evident on semi-log).
Target function:
$$f(x) = \sin(2\pi x)$$
with N = 100 samples uniformly spaced on $[0, 1]$. We use a random Gaussian initialization $W_0[i] \sim \mathcal{N}(0, 0.05^2)$, learning rate $\eta = 0.1$, regularization $\lambda \in \{0,\ 1\times10^{-4},\ 1\times10^{-3}\}$, and 200 epochs. A sketch of this experiment is given below.
The curves in Figure 3 show how L2 regularization improves the optimization dynamics. A straight-line decline on a semi-log plot indicates exponential convergence. With L2 regularization, gradient magnitudes decrease more quickly, the slope of $\log\|\nabla L\|$ becomes steeper, and optimization enters the linear-convergence phase earlier. In the unregularized case ($\lambda = 0$), plateaus are frequent, gradient norms often oscillate, and weak curvature amplifies noise. When $\lambda > 0$, the curves exhibit steadier descent, no abrupt oscillations, and smooth monotonic decay. This illustrates how L2 regularization reduces saddle-point sensitivity, removes flat regions, and enforces strong convexity in the local basin, stabilizing the loss landscape.

4. Methodology

4.1. Objective and Gradient Updates

We consider the L2-regularized objective and its gradients w.r.t. W and Θ .
$$L(W, \Theta) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(z(x_i; W, \Theta),\, y_i\big) + \frac{\lambda}{2}\big(\|W\|_2^2 + \|\Theta\|_2^2\big)$$
The first-order updates (vanilla GD) read
$$W_{t+1} = W_t - \eta\,\nabla_W L(W_t, \Theta_t)$$
$$\Theta_{t+1} = \Theta_t - \eta\,\nabla_\Theta L(W_t, \Theta_t)$$

4.2. Fixed-Feature (Ridge) Training of the Linear Head

Freezing $\Theta$ reduces training to ridge regression on $\Phi$; the regularized Hessian $H$ is well-conditioned for $\lambda > 0$:
$$\min_{W}\ \frac{1}{2n}\|\Phi W - y\|_2^2 + \frac{\lambda}{2}\|W\|_2^2$$
$$\|W_t - W^\ast\|_2 \ \le\ \rho^{\,t}\,\|W_0 - W^\ast\|_2, \qquad \rho = \max_i \big|1 - \eta\lambda_i(H)\big|$$
$$\kappa(H) = \frac{\lambda_{\max}(H)}{\lambda_{\min}(H)} = \frac{\sigma_{\max}^2/n + \lambda}{\sigma_{\min}^2/n + \lambda}$$
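A minimal NumPy check of this contraction against the theoretical factor $\rho$, on random stand-in features with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, lam = 200, 10, 1e-2
Phi = rng.normal(size=(n, m))                     # stand-in feature matrix
y = rng.normal(size=n)

H = Phi.T @ Phi / n + lam * np.eye(m)             # regularized Hessian
W_star = np.linalg.solve(H, Phi.T @ y / n)        # unique ridge minimizer
eigs = np.linalg.eigvalsh(H)
eta = 1.0 / eigs[-1]                              # any eta in (0, 2 / lambda_max) works
rho = np.max(np.abs(1.0 - eta * eigs))            # theoretical contraction factor

W = np.zeros(m)
err0 = np.linalg.norm(W - W_star)
for t in range(1, 101):
    W -= eta * (Phi.T @ (Phi @ W - y) / n + lam * W)
    if t % 25 == 0:
        print(t, np.linalg.norm(W - W_star), rho**t * err0)   # observed error vs. rho^t bound
```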

4.3. Fully Trainable WNN: Block GD, Schedules, and Stability

Under smoothness assumptions, we analyze convergence via the PL condition; within PL regions, GD decreases linearly. We adopt practical schedules for $\eta$ (constant, step, cosine) and a Polyak-type step when a reliable target value is available. We use the same target function as in Section 3.7, with 100 points. The training parameters are as follows: a constant rate $\eta = 0.01$; a step schedule with decay at steps 50 and 100; and a cosine schedule given by
$$\eta_t = \frac{\eta_0}{2}\left(1 + \cos\frac{\pi t}{T}\right)$$
Stability and convergence under the different learning-rate schedules are plotted in Figure 4.
Figure 4. Learning rate schedules (constant, step, cosine).
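A minimal sketch of the three schedules (the step-decay milestones and decay factor are illustrative):

```python
import numpy as np

def lr_schedule(kind, eta0, t, T, milestones=(50, 100), gamma=0.1):
    """Learning-rate schedules for GD on the full WNN."""
    if kind == "constant":
        return eta0
    if kind == "step":                            # multiply by gamma at each milestone
        return eta0 * gamma ** sum(t >= s for s in milestones)
    if kind == "cosine":                          # eta_t = (eta0 / 2) * (1 + cos(pi * t / T))
        return 0.5 * eta0 * (1.0 + np.cos(np.pi * t / T))
    raise ValueError(f"unknown schedule: {kind}")
```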

4.4. Choosing η and λ: Prescriptions and Diagnostics

  • R1: choose $\eta \in (0, 2/\lambda_{\max}(H))$ and sweep $\lambda$ logarithmically.
  • R2: use $\eta \lesssim 1/L_{\mathrm{emp}}$ and increase $\lambda$ if gradient-norm contraction stalls.
  • R3: ensure $0 < \eta < 2/\lambda_{\max}(K)$; L2 controls norm growth and selects the minimum-RKHS-norm interpolant.

5. Theoretical Results

5.1. Linear Convergence for the Fixed-Feature (Ridge) Regime

The rate and order of convergence of a sequence describe how rapidly it approaches its limit. Broadly, asymptotic rates describe behavior once the sequence is already close to its limit, while non-asymptotic rates describe progress from starting points that need not be close to the limit. Training reduces to ridge regression on the wavelet features $\Phi$ when $\Theta$ is fixed. Below, let $H$ denote the regularized Hessian:
$$H = \frac{1}{n}\Phi^\top\Phi + \lambda I$$
For step sizes $0 < \eta < 2/\lambda_{\max}(H)$, gradient descent converges linearly to the unique minimizer $W^\ast$. The error contracts at a rate controlled by the spectrum of $H$, as shown in Figure 5. The model used is linear (ridge regression; cf. $f_3(x)$):
$$y = X W^\ast$$
where
$$W^\ast = (2,\ 0.5,\ 0.5)^\top$$
Figure 5. Linear convergence in the ridge: theoretical ρ t vs. observed residual decay (semilog).
The error contracts at a rate controlled by the spectrum of $H$:
$$\|W_t - W^\ast\|_2 \ \le\ \rho^{\,t}\,\|W_0 - W^\ast\|_2, \qquad \rho = \max_i \big|1 - \eta\lambda_i(H)\big|$$
Figure 5 evaluates linear convergence using a simple linear model ($f(x) = 2x + 1$) with $\eta = 0.01$, $\lambda = 0.1$, and $\rho = 0.998$.

5.2. Fully Trainable WNN Under a PL Inequality

We assume smoothness and boundedness of the wavelet atoms over a constrained dilation/shift domain. If $L$ satisfies a PL inequality with constant $\mu_{PL}$ and $\nabla L$ is $L$-Lipschitz, then GD with $0 < \eta \le 1/L$ enjoys a linear decrease in the objective:
$$\frac{1}{2}\|\nabla L(\theta)\|_2^2 \ \ge\ \mu_{PL}\big(L(\theta) - L^\ast\big) \quad\Longrightarrow\quad L(\theta_t) - L^\ast \ \le\ (1 - \eta\mu_{PL})^t\big(L(\theta_0) - L^\ast\big)$$
PL-like regions are enlarged, and stability is enhanced by L2 regularization, which intuitively dampens flat directions linked to scale/shift redundancy. The impact of λ on landscape smoothness is shown in Figure 6, where larger λ increases the size of benign (PL-like) areas. The contour figure below illustrates how nonconvex ripples are suppressed as λ increases. This type of figure visualizes how the loss landscape changes as the ridge (L2) regularization parameter λ increases, using a small WNN and a simple 1-D regression function:
$$f(x) = \sin(2\pi x) + 0.3\cos(4\pi x)$$
where
$$x_i \sim \mathrm{Uniform}(0, 1), \quad i = 1, \ldots, 100, \qquad y_i = f(x_i)$$
Figure 6. Effect of λ on Landscape Smoothness.
A small wavelet network is used to make landscape plots feasible: there is one input dimension and six wavelet neurons, and the mother wavelet is Mexican hat. To show how L2 regularization smooths the loss landscape, we vary λ values: λ 0 ,   0.01 ,   0.1 ,   1 .
Figure 6 shows how the shape of the optimization landscape for WNNs is affected by changing the L2 regularization parameter λ. The loss surface is shown in the picture, along with areas where the function acts in accordance with a PL inequality—a requirement that is known to ensure global linear convergence of gradient descent.
A 2-D slice or contour of the loss over a small parameter plane (two directions in parameter space) is shown in each panel of Figure 7; the only difference between panels is the L2 regularization strength λ. Typical observations (comparing small to large λ) are as follows. For λ = 0 (no regularization), the surface is rough, with several small ridges and valleys and locally strong curvature; contours are uneven and anisotropic, and there are saddle-like features and small basins. As λ increases (e.g., from 1 × 10⁻² to 0.1), the loss surface becomes notably rounder and more convex-looking in this slice; narrow wells are filled in, the central basin grows, and the spacing between contours becomes more regular (gradients vary gently). For large λ (e.g., 1.0), the regularization term dominates; the surface is very smooth and globally close to a quadratic (the $\frac{\lambda}{2}\|W\|_2^2$ term), with a large basin that steepens toward the center, and the small nonconvex ripples are mostly gone.
Figure 7. 2D Loss Landscape with three different values of λ.

6. Experiments and Evaluation

6.1. Datasets and Tasks

We evaluate wavelet neural networks (WNNs) with L2-regularized gradient descent across three complementary experimental settings, each designed to isolate a different aspect of optimization, generalization, and landscape behavior.
(A) One-Dimensional Synthetic Function Approximation (Optimization Analysis)
To study optimization dynamics under full control of ground truth and curvature, we employ three canonical synthetic regression tasks, each selected to emphasize a different structural property relevant to wavelets and convergence theory.
(1)
Smooth Low-Frequency Signal
$$f_1(x) = \sin(2\pi x) + 0.3\cos(4\pi x)$$
Samples: N = 500 uniformly spaced points
(2)
Localized Bump Function
$$f_2(x) = \exp\!\left(-\frac{(x - 0.3)^2}{0.02}\right), \qquad x \in [-1, 1]$$
Samples: N = 400.
(3)
Ridge/Non-Smooth Function
$$f_3(x) = |x|, \qquad x \in [-1, 1]$$
Samples: N = 300.
(B) Image-Like Denoising Task (PSNR vs. Noise Level σ)
To evaluate practical generalization and robustness, we construct image-like signals using 2D tensorized wavelets.
The dataset consists of the following:
Synthetic Image Function
$$f_4(x_1, x_2) = \sin(\pi x_1) + 0.3\cos(\pi x_2), \qquad (x_1, x_2) \in [-1, 1]^2$$
where the grid has 50 × 50 points and the noise model is
$$y = f_4 + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
Noise levels are σ = 0.01 ,   0.05 ,   0.1 ,   0.2 , and PSNR (peak signal-to-noise ratio) is the chosen metric.

6.2. Metrics

We report mean squared error (MSE) and peak signal-to-noise ratio (PSNR).
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{y}_i\big)^2$$
$$\mathrm{PSNR} = 10\log_{10}\!\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}$$
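Minimal implementations of the two metrics (using the clean signal's peak value as $\mathrm{MAX}_I$ is an assumption appropriate for the synthetic, non-8-bit signals used here):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error between targets y and predictions y_hat."""
    return float(np.mean((y - y_hat) ** 2))

def psnr(y, y_hat, max_val=None):
    """Peak signal-to-noise ratio in dB; MAX_I defaults to the clean signal's peak magnitude."""
    if max_val is None:
        max_val = float(np.max(np.abs(y)))
    return 10.0 * np.log10(max_val ** 2 / mse(y, y_hat))
```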

6.3. Synthetic Regression (Approximation)

The WNN captures the target function with small bias and controlled variance. Using the localized bump function $f_2(x)$, the overlay in Figure 8 compares the ground truth with the prediction. Figure 8 shows that L2 regularization improves training stability and prevents overfitting, that WNNs provide a good approximation of functions with localized and multiscale behavior, and that the WNN learns a smooth, accurate, and well-conditioned fit while preserving the structure of the underlying function.
Figure 8. Synthetic regression: ground truth vs. WNN prediction.

6.4. Denoising Robustness

Denoising robustness refers to a model's capacity to sustain consistent performance in the face of noise, whether from adversarial perturbations or natural sources such as image noise. It can be improved by employing adversarial examples or denoisers to reduce noise in the inputs, which yields more accurate predictions. Figure 9 shows the PSNR after simulating additive noise with standard deviation σ using the smooth low-frequency signal $f_1(x)$. The WNN's higher PSNR across σ levels indicates stronger structure preservation under noise.
Figure 9. PSNR vs. σ: WNN vs. baseline CNN.

6.5. Sensitivity to Learning Rate and Weight Decay

Sensitivity to the learning rate and weight decay characterizes how these hyperparameters affect training: a high learning rate can cause overshooting, whereas a low learning rate slows convergence or leaves the optimizer stuck. Weight decay is a regularization strategy that penalizes large weights to prevent overfitting, and it interacts with the effective learning rate. The ideal values are problem-dependent, and techniques such as adaptive weight decay and learning-rate decay are used to manage this sensitivity.
In Figure 10, we sweep $\eta \in [1\times10^{-4},\ 1\times10^{-1}]$ and $\lambda \in [1\times10^{-6},\ 1\times10^{-2}]$ logarithmically. A broad valley of low validation MSE appears near $(\eta \approx 3\times10^{-3},\ \lambda \approx 3\times10^{-4})$. Figure 10 shows a heatmap created using the smooth low-frequency signal $f_1(x)$.
Figure 10. Sensitivity heatmap: validation MSE vs. ( η ,   λ ).
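A sketch of a sweep of this kind, assuming head-only training over a fixed Mexican hat dictionary for simplicity (dictionary size, dilation range, and iteration budget are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Smooth low-frequency target f1, split into training and validation sets
x = np.linspace(0.0, 1.0, 500)
y = np.sin(2*np.pi*x) + 0.3*np.cos(4*np.pi*x)
idx = rng.permutation(len(x)); tr, va = idx[:400], idx[400:]

# Fixed Mexican hat dictionary
m = 30
a = np.exp(rng.uniform(np.log(0.05), np.log(0.5), m))
b = rng.uniform(0.0, 1.0, m)
def features(xs):
    U = (xs[:, None] - b) / a
    return (1.0 - U**2) * np.exp(-U**2 / 2.0)
Phi_tr, Phi_va = features(x[tr]), features(x[va])

etas, lams = np.logspace(-4, -1, 8), np.logspace(-6, -2, 8)
val_mse = np.zeros((len(etas), len(lams)))
for i, eta in enumerate(etas):
    for j, lam in enumerate(lams):
        W = rng.normal(0.0, 0.05, m)
        for _ in range(300):                                   # head-only GD with weight decay
            W -= eta * (Phi_tr.T @ (Phi_tr @ W - y[tr]) / len(tr) + lam * W)
        val_mse[i, j] = np.mean((Phi_va @ W - y[va]) ** 2)
# val_mse can be rendered as a heatmap over (eta, lambda), cf. Figure 10
```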

6.6. Learning Dynamics

The purpose of this subsection is to evaluate the denoising ability of the WNN; to do so, we use the following noisy observation model:
$$y = f_4 + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
Learning curves (train/val) show a clear linear phase on a semi-log scale, consistent with PL-based analysis (see Figure 11). No overfitting is observed over the epochs that are shown.
Figure 11. Learning curves (train/val).

6.7. Prediction Fidelity

The degree to which the model's predictions match the actual data is known as fidelity; a forecast's fidelity is computed similarly to its acuity, but with the roles of observations and forecasts interchanged. To carry out a side-by-side comparison of the noisy, ground-truth, and reconstructed signals, we use the same observation model:
$$y = f_4 + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
It is evident from Figure 12 that there is little bias and controlled variance because the distribution of predictions versus ground truth is clustered close to the identity line.
Figure 12. Scatter: predictions vs. ground truth.

6.8. Reproducibility Checklist

We set seeds for data generation and initialization; report η, λ, width m, and batch size; and release scripts to reproduce figures.

7. Discussion and Limitations

7.1. Practical Implications

Our analysis advocates viewing λ as a conditioning lever rather than only a regularization knob: a small λ risks ill-conditioning and slow convergence, while an overly large λ biases solutions excessively and harms accuracy. A U-shaped validation error curve is expected as λ varies (see Figure 13). The target contains a sinusoid with small high-frequency spikes, and we compare a constant learning rate against a cosine schedule while tracking gradient norms over iterations.
Figure 13. Validation error vs. λ (bias–variance trade-off).

7.2. Sensitivity and Stability

Stability is the ability of a system to return to an equilibrium state after a disturbance, whereas sensitivity is the amount by which a system's output changes in response to a change in its input. In engineering, stability means a bounded input yields a bounded output, and sensitivity analysis measures the impact of parameter changes on this stability. The two are related because slight parameter changes in a highly sensitive system can produce large, destabilizing output shifts. Early-phase dynamics are affected by initialization, but training stabilizes once PL-like behavior appears. Weight decay dampens flat directions, reducing variance across trajectories; a slight spread among random seeds is normal. Figure 14 illustrates this initialization sensitivity via the validation MSE across runs.
Figure 14. Initialization sensitivity: validation MSE across runs.

7.3. Robustness Under Distribution Shift

To show how WNN captures sharp features compared to MLP/CNN, we use a highly localized “peak”:
$$f(x) = \exp\!\big(-100\,(x - 0.5)^2\big)$$
Robustness under distribution shift is the capacity of a model to keep performing as the statistical characteristics of the data it encounters in deployment drift away from those of its training data. Wavelet localization preserves important patterns and increases robustness to moderate covariate fluctuations (Figure 15). All models deteriorate under extreme shifts, but the WNN's performance decline is less pronounced than that of a baseline without multiscale priors.
Figure 15. Robustness under shift (performance vs. severity).

7.4. Limitations

Although they simplify the analysis, assumptions such as smooth mother wavelets and bounded dilations/translations can also be restrictive. A global PL condition need not hold in general WNNs; our PL-based results guarantee linear phases only within regions where PL holds. The NTK interpretation is accurate near initialization and at large width; finite-width effects and far-from-initialization dynamics may differ.

7.5. Future Work

Future work will include deriving explicit WNN-specific NTKs for common wavelet families; tighter conditions under which L2 induces PL globally; adaptive schedules that jointly tune η and λ (see Table 1); and extensions to classification losses and structured outputs.
Table 1. Ablations (illustrative).

8. Conclusions and Future Directions

This study used controlled hyperparameter sweeps, NTK-based interpretation, and synthetic benchmarks to provide a thorough examination of wavelet neural network training under L2 regularization. We confirmed the L2 penalty's shrinkage effect through function approximation tasks, and we consistently observed improvements in gradient stability and convergence behavior. The image-like denoising studies further showed that L2 regularization achieves competitive PSNR while avoiding overfitting, effectively balancing reconstruction accuracy and model compactness. Together, our theoretical and empirical analyses of loss decay, gradient-norm evolution, weight-norm trajectories, and NTK eigenvalues demonstrated that the L2 penalty encourages compact, low-norm solutions without compromising the model's expressive power. However, the method requires careful calibration to ensure stability because it is sensitive to the choice of regularization weight λ and learning rate η. Overall, the findings support L2 regularization as a viable method for constructing compact, effective neural networks with well-behaved optimization landscapes. Future studies may extend these results to deeper architectures, adaptive regularization schedules, and higher-dimensional data, and further investigate the theoretical relationships between weight decay and NTK dynamics.

Future Directions

(a)
For canonical wavelet families (Mexican hat, Morlet, and Daubechies), we will derive closed-form WNN-specific NTKs and examine their spectra with realistic initializations.
(b)
In order to quantify expansion as a function of λ, we will determine the conditions under which L2 causes global or broader PL regions for trainable dilations/translations.
(c)
We will look to create adaptive controllers with theoretical stability guarantees that simultaneously adjust η and λ utilizing real-time spectral/gradient diagnostics.
(d)
We will use wavelet priors to expand the analysis to structured outputs (such as graphs and sequences) and classification losses (logistic and cross-entropy).
(e)
We will examine robustness in the presence of adversarial perturbations and covariate shift, when wavelet localization might provide demonstrable stability benefits.

Author Contributions

K.S.M.: Conceptualization, methodology, investigation, software, resources, project administration, Writing—original draft. I.M.A.S.: Writing—original draft, writing—review and editing. A.A. (Abdalilah Alhalangy): Formal analysis, investigation, writing—review and editing. A.A. (Alawia Adam): Writing—original draft, writing—review and editing. M.S.: Formal analysis, writing—review and editing. H.I.: Writing—original draft, writing—review and editing. M.A.M.: Formal analysis, investigation, writing—review and editing. S.A.A.S.: Investigation, writing—review and editing. Y.S.M.: Writing—original draft, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

The Researchers would like to thank the Deanship of Graduate Studies and Scientific Research at Qassim University for financial support (QU-APC-2025).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to thank the reviewers for their valuable comments and suggestions, which have improved the presentation of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Main Theoretical Results
All norms are Euclidean. For any matrix $A$, $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ denote its minimal and maximal eigenvalues. We need the following assumptions in our proofs.
Assumption A1 (Wavelet Feature Boundedness).
Let
$$\phi(x) = \big(\psi((x - b_1)/a_1),\ \ldots,\ \psi((x - b_m)/a_m)\big)$$
be the wavelet activation vector. There exists $M_\psi > 0$ such that for all $x \in \mathcal{X} \subseteq \mathbb{R}^d$:
$$\|\phi(x)\|_2 \le M_\psi$$
Assumption A2 (Lipschitz Gradient/Smoothness).
The empirical loss with L2 regularization
$$F(W) = \frac{1}{2n}\|\Phi W - y\|^2 + \frac{\lambda}{2}\|W\|_2^2$$
has an $L$-Lipschitz continuous gradient:
$$\|\nabla F(W) - \nabla F(W')\| \ \le\ L\,\|W - W'\|$$
Equivalently, the Hessian $H = \frac{1}{n}\Phi^\top\Phi + \lambda I$ satisfies
$$0 < \lambda_{\min}(H) \le \lambda_{\max}(H) = L$$
Assumption A3 (Strong Convexity).
The Hessian satisfies
$$\mu \equiv \lambda_{\min}(H) > 0$$
This holds automatically for any  λ > 0 , even if  Φ T Φ  is singular.
Lemma A1 (Closed-Form Minimizer of L2-regularized WNN).
Under A1–A3, the unique minimizer of
$$\min_{W}\ \frac{1}{2n}\|\Phi W - y\|^2 + \frac{\lambda}{2}\|W\|_2^2$$
is
$$W^\ast = \left(\frac{1}{n}\Phi^\top\Phi + \lambda I\right)^{-1}\frac{1}{n}\Phi^\top y$$
Thus  W  exists, is unique, and depends smoothly on  λ .
Proof. 
The function $F$ is continuously differentiable, and its gradient with respect to $W \in \mathbb{R}^m$ is
$$\nabla F(W) = \frac{1}{n}\Phi^\top(\Phi W - y) + \lambda W$$
Any critical point $W$ satisfies $\nabla F(W) = 0$. Therefore, any stationary point must solve the linear equation
$$\left(\frac{1}{n}\Phi^\top\Phi + \lambda I\right) W = \frac{1}{n}\Phi^\top y$$
Let $G = \frac{1}{n}\Phi^\top\Phi$. Then $G$ is symmetric positive semidefinite. Because $\lambda > 0$, the matrix
$$H \equiv G + \lambda I = \frac{1}{n}\Phi^\top\Phi + \lambda I$$
is symmetric and strictly positive definite: for any nonzero $v \in \mathbb{R}^m$,
$$v^\top H v = v^\top G v + \lambda\|v\|_2^2 \ \ge\ \lambda\|v\|_2^2 > 0$$
Hence $H$ is invertible, and the linear equation has the unique solution
$$W^\ast = H^{-1}\,\frac{1}{n}\Phi^\top y = \left(\frac{1}{n}\Phi^\top\Phi + \lambda I\right)^{-1}\frac{1}{n}\Phi^\top y$$
The uniqueness of the stationary point implies uniqueness of the global minimizer provided the function is convex; we show convexity next.
The Hessian of $F$ is constant and equals $H$ (because the objective is quadratic in $W$). Since $H$ is positive definite, $F$ is strongly convex (with strong-convexity constant $\mu = \lambda_{\min}(H) \ge \lambda > 0$). A strongly convex function has a single global minimizer, and the unique stationary point is that minimizer. Therefore, the $W^\ast$ above is the unique global minimizer.
From $W^\ast = H^{-1}\frac{1}{n}\Phi^\top y$ and $\|H^{-1}\|_{\mathrm{op}} = 1/\lambda_{\min}(H) \le 1/\lambda$, we obtain
$$\|W^\ast\|_2 \ \le\ \frac{1}{\lambda_{\min}(H)}\left\|\frac{1}{n}\Phi^\top y\right\|_2 \ \le\ \frac{1}{\lambda}\left\|\frac{1}{n}\Phi^\top y\right\|_2$$
This shows the explicit dependence of the minimizer's magnitude on $\lambda$. The map $\lambda \mapsto H^{-1}$ is continuous on $(0, \infty)$ (indeed differentiable), so $W^\ast(\lambda) = H(\lambda)^{-1}\frac{1}{n}\Phi^\top y$ depends smoothly on $\lambda$ (this follows from standard matrix perturbation theory and the continuity of the inverse map).□
Lemma A2 (Gradient Descent Update as a Linear Dynamical System).
Gradient descent with step size $\eta > 0$ follows
$$W_{t+1} = W_t - \eta\,\nabla F(W_t) = W_t - \eta H (W_t - W^\ast)$$
Hence
$$W_{t+1} - W^\ast = (I - \eta H)\,(W_t - W^\ast)$$
The error evolves linearly with transition matrix $I - \eta H$.
Proof. 
From standard differentiation of the quadratic objective,
$$\nabla F(W) = \frac{1}{n}\Phi^\top(\Phi W - y) + \lambda W = \left(\frac{1}{n}\Phi^\top\Phi + \lambda I\right) W - \frac{1}{n}\Phi^\top y = H W - \frac{1}{n}\Phi^\top y$$
Using Lemma A1, the minimizer $W^\ast$ satisfies the first-order condition $\nabla F(W^\ast) = 0$, i.e.,
$$H W^\ast - \frac{1}{n}\Phi^\top y = 0 \quad\Longleftrightarrow\quad \frac{1}{n}\Phi^\top y = H W^\ast$$
The GD step becomes
$$W_{t+1} = W_t - \eta\left(H W_t - \frac{1}{n}\Phi^\top y\right)$$
Replacing $\frac{1}{n}\Phi^\top y$ with $H W^\ast$:
$$W_{t+1} = W_t - \eta\big(H W_t - H W^\ast\big) = W_t - \eta H (W_t - W^\ast)$$
Subtracting $W^\ast$ from both sides:
$$W_{t+1} - W^\ast = (I - \eta H)\,(W_t - W^\ast)$$
This completes the algebraic part of the proof; the error evolves linearly under the constant matrix $I - \eta H$. □
Theorem A1 (Spectral Convergence Condition).
Under A2–A3, gradient descent converges if and only if
$$0 < \eta < \frac{2}{\lambda_{\max}(H)}$$
In this regime, for all $t \ge 0$:
$$\|W_t - W^\ast\| \ \le\ \rho^{\,t}\,\|W_0 - W^\ast\|$$
where $\rho = \max_i \big|1 - \eta\lambda_i(H)\big| < 1$.
Proof. 
From Lemma A2, we have the exact linear recursion for the error $e_t \equiv W_t - W^\ast$:
$$e_{t+1} = (I - \eta H)\,e_t$$
Iterating,
$$e_t = (I - \eta H)^t\,e_0$$
Because $H$ is symmetric positive definite, it has the eigen-decomposition $H = Q\Lambda Q^\top$ with orthogonal $Q$ and
$$\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_m), \qquad 0 < \mu = \lambda_1 \le \cdots \le \lambda_m = L$$
Conjugating the recursion decouples the coordinates:
$$z_{t+1} \equiv Q^\top e_{t+1} = (I - \eta\Lambda)\,z_t$$
where $z_t \equiv Q^\top e_t$. Hence, for each coordinate $i$,
$$z_{i,t+1} = (1 - \eta\lambda_i)\,z_{i,t}$$
The matrix $I - \eta H$ is diagonalizable with eigenvalues $\{1 - \eta\lambda_i\}_{i=1}^{m}$. The iterates $e_t = (I - \eta H)^t e_0$ converge to the zero vector for every initial $e_0$ if and only if the spectral radius satisfies
$$\rho(I - \eta H) = \max_i |1 - \eta\lambda_i| < 1$$
Because each $\lambda_i > 0$, the inequality $|1 - \eta\lambda_i| < 1$ is equivalent (for each $i$) to
$$-1 < 1 - \eta\lambda_i < 1 \quad\Longleftrightarrow\quad 0 < \eta\lambda_i < 2$$
Since this must hold for every eigenvalue, it reduces to the single requirement
$$0 < \eta\,\lambda_{\max}(H) < 2 \quad\Longleftrightarrow\quad 0 < \eta < \frac{2}{\lambda_{\max}(H)}$$
Thus, the interval $0 < \eta < 2/\lambda_{\max}(H)$ is necessary and sufficient for $\rho(I - \eta H) < 1$.
Under this condition, $\rho = \max_i |1 - \eta\lambda_i| < 1$. Using the orthogonality of $Q$,
$$\|e_t\|_2 = \big\|(I - \eta H)^t e_0\big\|_2 = \big\|Q (I - \eta\Lambda)^t Q^\top e_0\big\|_2 = \big\|(I - \eta\Lambda)^t z_0\big\|_2 \ \le\ \rho^{\,t}\,\|z_0\|_2 = \rho^{\,t}\,\|e_0\|_2$$
proving the stated linear (geometric) contraction.
Remarks about the boundary and outside the interval: If $\eta = 0$, the iterates are constant (no progress). If $\eta = 2/\lambda_{\max}(H)$, then for an index $i$ achieving $\lambda_i = \lambda_{\max}$ we have $|1 - \eta\lambda_i| = 1$, so the corresponding coordinate oscillates with constant magnitude and convergence generally fails (unless that coordinate is zero initially). If $\eta > 2/\lambda_{\max}(H)$ (or $\eta \le 0$), then some eigen-coordinate satisfies $|1 - \eta\lambda_i| \ge 1$, hence there exists an initial $e_0$ for which $e_t$ does not go to zero (it either diverges or fails to converge).
This completes the proof. □
Theorem A2 (Linear/Exponential Convergence Rate).
Under A1–A3 with  0 < η < 2 / L , the iterates converge linearly:
$$\|W_t - W^\ast\|_2 \ \le\ (1 - \eta\mu)^t\,\|W_0 - W^\ast\|_2$$
Equivalently,
$$F(W_t) - F(W^\ast) \ \le\ (1 - \eta\mu)^{2t}\,\big(F(W_0) - F(W^\ast)\big)$$
Proof. 
Since $F$ is quadratic and $H$ is positive definite ($\lambda > 0$), the minimizer $W^\ast$ is unique and satisfies $\nabla F(W^\ast) = 0$. From Lemma A2, we have the error recursion for $e_t \equiv W_t - W^\ast$:
$$e_{t+1} = (I - \eta H)\,e_t$$
Because $H$ is symmetric positive definite, it has the eigen-decomposition $H = Q\Lambda Q^\top$ with orthogonal $Q$ and
$$\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_m), \qquad 0 < \mu = \lambda_1 \le \cdots \le \lambda_m = L$$
We express the recursion in eigen-coordinates:
$$z_t \equiv Q^\top e_t, \qquad z_{t+1} = (I - \eta\Lambda)\,z_t$$
Thus, each coordinate evolves as
$$z_{i,t} = (1 - \eta\lambda_i)^t\,z_{i,0}$$
Therefore,
$$\|e_t\|_2 = \|z_t\|_2 \ \le\ \max_i |1 - \eta\lambda_i|^t\,\|z_0\|_2 = \rho(\eta)^t\,\|e_0\|_2$$
The condition $0 < \eta < 2/L$ ensures
$$|1 - \eta\lambda_i| < 1 \quad \text{for all } i$$
Thus $\rho(\eta) < 1$, proving linear convergence. If, moreover, $0 < \eta \le 1/L$, then for all $i$ we have
$$0 \ \le\ 1 - \eta\lambda_i \ \le\ 1 - \eta\mu$$
Thus
$$\rho(\eta) = 1 - \eta\mu$$
and hence
$$\|W_t - W^\ast\|_2 \ \le\ (1 - \eta\mu)^t\,\|W_0 - W^\ast\|_2$$
Since $F$ is quadratic,
$$F(W_t) - F(W^\ast) = \frac{1}{2}(W_t - W^\ast)^\top H (W_t - W^\ast)$$
Using $\|H\| = L$ and the previous contraction,
$$F(W_t) - F(W^\ast) \ \le\ \frac{L}{2}\|W_t - W^\ast\|_2^2 \ \le\ \frac{L}{2}(1 - \eta\mu)^{2t}\,\|W_0 - W^\ast\|_2^2$$
Comparison with the corresponding expression for $F(W_0) - F(W^\ast)$ yields
$$F(W_t) - F(W^\ast) \ \le\ (1 - \eta\mu)^{2t}\,\big(F(W_0) - F(W^\ast)\big)$$
This completes the proof. □
Theorem A3 (Effect of L2 Regularization on Conditioning).
Let $G = \frac{1}{n}\Phi^\top\Phi$ be the (symmetric) Gram matrix of the wavelet features and fix $\lambda > 0$. Define
$$H_\lambda = G + \lambda I$$
Then the condition number of $H_\lambda$ satisfies
$$\kappa(H_\lambda) = \frac{\lambda_{\max}(G) + \lambda}{\lambda_{\min}(G) + \lambda}$$
The consequences are as follows:
  • If $G$ is singular ($\lambda_{\min}(G) = 0$), then $\kappa(H_\lambda) = \big(\lambda_{\max}(G) + \lambda\big)/\lambda$. In particular, any $\lambda > 0$ makes $H_\lambda$ strictly positive definite.
  • $\kappa(H_\lambda)$ is monotone nonincreasing in $\lambda > 0$; moreover, $\lim_{\lambda\to\infty} \kappa(H_\lambda) = 1$.
  • As $\lambda \to 0^+$, $\kappa(H_\lambda) \to \kappa(G)$ when $\lambda_{\min}(G) > 0$, and $\kappa(H_\lambda) \to \infty$ when $\lambda_{\min}(G) = 0$.
Proof. 
Step 1: Eigenvalue shift identity.
Because $G$ is symmetric, it has a real eigen-decomposition $G = Q\Gamma Q^\top$ with $\Gamma = \mathrm{diag}(\gamma_1, \ldots, \gamma_m)$ and $\gamma_1 \le \cdots \le \gamma_m$ (each $\gamma_i \ge 0$ since $G \succeq 0$). Then
$$H_\lambda = G + \lambda I = Q\,(\Gamma + \lambda I)\,Q^\top$$
Hence the eigenvalues of $H_\lambda$ are $\gamma_i + \lambda$ for $i = 1, \ldots, m$. In particular,
$$\lambda_{\min}(H_\lambda) = \gamma_1 + \lambda = \lambda_{\min}(G) + \lambda$$
$$\lambda_{\max}(H_\lambda) = \gamma_m + \lambda = \lambda_{\max}(G) + \lambda$$
Therefore, the condition number is
$$\kappa(H_\lambda) = \frac{\lambda_{\max}(H_\lambda)}{\lambda_{\min}(H_\lambda)} = \frac{\lambda_{\max}(G) + \lambda}{\lambda_{\min}(G) + \lambda}$$
which proves the identity in the theorem.
Step 2: Singular G case.
If $\lambda_{\min}(G) = 0$ (i.e., $G$ is singular), then the identity above yields
$$\kappa(H_\lambda) = \frac{\lambda_{\max}(G) + \lambda}{0 + \lambda} = \frac{\lambda_{\max}(G) + \lambda}{\lambda}$$
Since λ > 0 , the denominator is positive; hence H λ is strictly positive definite for every λ > 0 . This shows any positive λ regularizes away singular directions.
Step 3: Monotonicity in λ.
Set $a = \lambda_{\max}(G)$ and $b = \lambda_{\min}(G)$, so $0 \le b \le a$. Define
$$\kappa(\lambda) = \frac{a + \lambda}{b + \lambda}, \qquad \lambda > 0$$
Differentiating with respect to $\lambda$:
$$\kappa'(\lambda) = \frac{(b + \lambda) - (a + \lambda)}{(b + \lambda)^2} = \frac{b - a}{(b + \lambda)^2}$$
Because $a \ge b$, the numerator $b - a \le 0$; hence $\kappa'(\lambda) \le 0$ for all $\lambda > 0$. If $a > b$, then $\kappa'(\lambda) < 0$, so $\kappa$ is strictly decreasing; if $a = b$ (i.e., $G$ is a scalar multiple of the identity), then $\kappa'(\lambda) = 0$ and $\kappa$ is constant (equal to 1). Thus, $\kappa(\lambda)$ is monotone nonincreasing in $\lambda$.
Step 4: Limit as λ → ∞:
Compute
$$\lim_{\lambda\to\infty} \kappa(\lambda) = \lim_{\lambda\to\infty} \frac{a + \lambda}{b + \lambda} = \lim_{\lambda\to\infty} \frac{1 + a/\lambda}{1 + b/\lambda} = 1$$
Thus, conditioning improves (approaches optimal value 1) as λ becomes very large.
Step 5: Limit as $\lambda \to 0^+$.
If $b > 0$ (i.e., $G$ is full-rank), then
$$\lim_{\lambda\to 0^+} \kappa(\lambda) = \frac{a}{b} = \kappa(G)$$
If $b = 0$ (singular $G$), then
$$\kappa(\lambda) = \frac{a + \lambda}{\lambda} = 1 + \frac{a}{\lambda} \ \longrightarrow\ +\infty$$
Thus, the condition number tends to the unregularized condition number when G is invertible and diverges to + when G is singular.
Step 6: Practical interpretation (optional short remark).
Because the convergence rate of gradient descent (with the optimal fixed step) depends monotonically on $\kappa(H)$, e.g., the optimal contraction factor $\rho^\ast = \frac{\kappa - 1}{\kappa + 1}$ improves as $\kappa$ decreases, the monotone decrease in $\kappa(\lambda)$ with $\lambda$ shows that increasing $\lambda$ (weight decay) strictly improves the spectral conditioning of the optimization problem and thus can accelerate linear convergence phases. The trade-off is that a large $\lambda$ also biases the estimator (adds shrinkage).
This completes the proof of Theorem A3 and the derived consequences. □
Theorem A4 (Polyak–Łojasiewicz Inequality for Wavelet Neural Networks with L2 Regularization).
Under A1–A3, the loss satisfies the Polyak–Łojasiewicz (PL) inequality:
$$\frac{1}{2}\|\nabla F(W)\|^2 \ \ge\ \mu\,\big(F(W) - F(W^\ast)\big)$$
Thus, WNNs with L2 regularization behave like strictly convex quadratics in optimization, even though the underlying model is nonlinear in input space.
Proof. 
The proof proceeds in two parts.
Part A—Exact PL for the ridge (fixed-feature) case
Set $v = W - W^\ast$. Then
$$\nabla F(W) = H v, \qquad F(W) - F(W^\ast) = \frac{1}{2} v^\top H v$$
Computing the squared gradient norm,
$$\|\nabla F(W)\|_2^2 = v^\top H^2 v$$
Because $H$ is symmetric positive definite, we may compare $H^2$ and $H$. Write
$$H^2 - \mu H = H^{1/2}\big(H - \mu I\big)H^{1/2} \ \succeq\ 0$$
since $H - \mu I \succeq 0$ by the definition of $\mu = \lambda_{\min}(H)$. Therefore $H^2 \succeq \mu H$, and hence
$$v^\top H^2 v \ \ge\ \mu\, v^\top H v = 2\mu\,\big(F(W) - F(W^\ast)\big)$$
Dividing both sides by two yields the PL inequality:
$$\frac{1}{2}\|\nabla F(W)\|^2 \ \ge\ \mu\,\big(F(W) - F(W^\ast)\big)$$
Part B—PL from a uniform Hessian lower bound (sufficient condition for trainable WNN)
Fix any $\theta \in D$. Using the fundamental theorem of calculus for gradients (mean value form), we can write
$$\nabla F(\theta) - \nabla F(\theta^\ast) = \left(\int_0^1 \nabla^2 F\big(\theta^\ast + t(\theta - \theta^\ast)\big)\,dt\right)(\theta - \theta^\ast)$$
But $\nabla F(\theta^\ast) = 0$ (first-order optimality). We define the averaged Hessian
$$\bar{H} = \int_0^1 \nabla^2 F\big(\theta^\ast + t(\theta - \theta^\ast)\big)\,dt$$
Using the pointwise lower bound $\nabla^2 F(\cdot) \succeq \mu I$, we have $\bar{H} \succeq \mu I$. Hence
$$\|\nabla F(\theta)\|_2^2 = (\theta - \theta^\ast)^\top \bar{H}^2 (\theta - \theta^\ast) \ \ge\ \mu\,(\theta - \theta^\ast)^\top \bar{H} (\theta - \theta^\ast)$$
But by the same integral representation,
$$F(\theta) - F(\theta^\ast) = \frac{1}{2}(\theta - \theta^\ast)^\top \bar{H} (\theta - \theta^\ast)$$
Combining these two, we have
$$\frac{1}{2}\|\nabla F(\theta)\|_2^2 \ \ge\ \mu\cdot\frac{1}{2}(\theta - \theta^\ast)^\top \bar{H} (\theta - \theta^\ast) = \mu\,\big(F(\theta) - F(\theta^\ast)\big)$$
This proves the PL inequality on $D$. □

References

  1. Wu, J.; Li, J.; Yang, J.; Mei, S. Wavelet-integrated deep neural networks: A systematic review of applications and synergistic architectures. Neurocomputing 2025, 657, 131648. [Google Scholar] [CrossRef]
  2. Kio, A.E.; Xu, J.; Gautam, N.; Ding, Y. Wavelet decomposition and neural networks: A potent combination for short term wind speed and power forecasting. Front. Energy Res. 2024, 12, 1277464. [Google Scholar] [CrossRef]
  3. Cui, Z.; Ke, R.; Pu, Z.; Ma, X.; Wang, Y. Learning traffic as a graph: A gated graph wavelet recurrent neural network for network-scale traffic prediction. Transp. Res. Part C Emerg. Technol. 2020, 115, 102620. [Google Scholar] [CrossRef]
  4. Lucas, F.; Costa, P.; Batalha, R.; Leite, D.; Škrjanc, I. Fault detection in smart grids with time-varying distributed generation using wavelet energy and evolving neural networks. Evol. Syst. 2020, 11, 165–180. [Google Scholar] [CrossRef]
  5. Baharlouei, Z.; Rabbani, H.; Plonka, G. Wavelet scattering transform application in classification of retinal abnormalities using OCT images. Sci. Rep. 2023, 13, 19013. [Google Scholar] [CrossRef] [PubMed]
  6. Wang, J.; Wang, Z.; Li, J.; Wu, J. Multilevel wavelet decomposition network for interpretable time series analysis. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 2437–2446. [Google Scholar]
  7. Grieshammer, M.; Pflug, L.; Stingl, M.; Uihlein, A. The continuous stochastic gradient method: Part I–convergence theory. Comput. Optim. Appl. 2024, 87, 935–976. [Google Scholar] [CrossRef]
  8. Xia, L.; Massei, S.; Hochstenbach, M.E. On the convergence of the gradient descent method with stochastic fixed-point rounding errors under the Polyak–Łojasiewicz inequality. Comput. Optim. Appl. 2025, 90, 753–799. [Google Scholar] [CrossRef]
  9. Galanti, T.; Siegel, Z.S.; Gupte, A.; Poggio, T. SGD and Weight Decay Provably Induce a Low-Rank Bias in Neural Networks. 2023. Available online: https://hdl.handle.net/1721.1/148231 (accessed on 28 November 2025).
  10. Jacot, A.; Gabriel, F.; Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. Adv. Neural Inf. Process. Syst. 2018, 31, 1–10. [Google Scholar]
  11. Somvanshi, S.; Javed, S.A.; Islam, M.M.; Pandit, D.; Das, S. A survey on kolmogorov-arnold network. ACM Comput. Surv. 2025, 58, 1–35. [Google Scholar] [CrossRef]
  12. Medvedev, M.; Vardi, G.; Srebro, N. Overfitting behaviour of gaussian kernel ridgeless regression: Varying bandwidth or dimensionality. Adv. Neural Inf. Process. Syst. 2024, 37, 52624–52669. [Google Scholar]
  13. Shang, Z.; Zhao, Z.; Yan, R. Denoising fault-aware wavelet network: A signal processing informed neural network for fault diagnosis. Chin. J. Mech. Eng. 2023, 36, 9. [Google Scholar] [CrossRef]
  14. Saragadam, V.; LeJeune, D.; Tan, J.; Balakrishnan, G.; Veeraraghavan, A.; Baraniuk, R.G. Wire: Wavelet implicit neural representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18507–18516. [Google Scholar]
  15. Sadoon, G.A.A.S.; Almohammed, E.; Al-Behadili, H.A. Wavelet neural networks in signal parameter estimation: A comprehensive review for next-generation wireless systems. In Proceedings of the AIP Conference Proceedings, Pune, India, 18–19 May 2024; AIP Publishing LLC.: College Park, MD, USA, 2025; Volume 3255, p. 020014. [Google Scholar]
  16. Akujuobi, C.M. Wavelets and Wavelet Transform Systems and Their Applications; Springer International Publishing: Berlin/Heidelberg, Germany, 2022. [Google Scholar]
  17. Wang, P.; Wen, Z. A spatiotemporal graph wavelet neural network (ST-GWNN) for association mining in timely social media data. Sci. Rep. 2024, 14, 31155. [Google Scholar] [CrossRef]
  18. Uddin, Z.; Ganga, S.; Asthana, R.; Ibrahim, W. Wavelets based physics informed neural networks to solve non-linear differential equations. Sci. Rep. 2023, 13, 2882. [Google Scholar] [CrossRef] [PubMed]
  19. Jung, H.; Lodhi, B.; Kang, J. An automatic nuclei segmentation method based on deep convolutional neural networks for histopathology images. BMC Biomed. Eng. 2019, 1, 24. [Google Scholar] [CrossRef]
  20. Xiao, Q.; Lu, S.; Chen, T. An alternating optimization method for bilevel problems under the Polyak-Łojasiewicz condition. Adv. Neural Inf. Process. Syst. 2023, 36, 63847–63873. [Google Scholar]
  21. Yazdani, K.; Hale, M. Asynchronous parallel nonconvex optimization under the polyak-łojasiewicz condition. IEEE Control Syst. Lett. 2021, 6, 524–529. [Google Scholar] [CrossRef]
  22. Chen, K.; Yi, C.; Yang, H. Towards Better Generalization: Weight Decay Induces Low-rank Bias for Neural Networks. arXiv 2024, arXiv:2410.02176. [Google Scholar] [CrossRef]
  23. Kobayashi, S.; Akram, Y.; Von Oswald, J. Weight decay induces low-rank attention layers. Adv. Neural Inf. Process. Syst. 2024, 37, 4481–4510. [Google Scholar]
  24. Seleznova, M.; Kutyniok, G. Analyzing finite neural networks: Can we trust neural tangent kernel theory? In Mathematical and Scientific Machine Learning; PMLR: Birmingham, UK, 2022; pp. 868–895. [Google Scholar]
  25. Tan, Y.; Liu, H. How does a kernel based on gradients of infinite-width neural networks come to be widely used: A review of the neural tangent kernel. Int. J. Multimed. Inf. Retr. 2024, 13, 8. [Google Scholar] [CrossRef]
  26. Tang, A.; Wang, J.B.; Pan, Y.; Wu, T.; Chen, Y.; Yu, H.; Elkashlan, M. Revisiting XL-MIMO channel estimation: When dual-wideband effects meet near field. IEEE Trans. Wirel. Commun. 2025. [Google Scholar] [CrossRef]
  27. Cui, Z.X.; Zhu, Q.; Cheng, J.; Zhang, B.; Liang, D. Deep unfolding as iterative regularization for imaging inverse problems. Inverse Probl. 2024, 40, 025011. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
