Article

Stable and Efficient Gaussian-Based Kolmogorov–Arnold Networks

Department of Science and Technology, Parthenope University of Naples, Centro Direzionale C4, I-80143 Naples, Italy
*
Authors to whom correspondence should be addressed.
Mathematics 2026, 14(3), 513; https://doi.org/10.3390/math14030513
Submission received: 24 December 2025 / Revised: 25 January 2026 / Accepted: 29 January 2026 / Published: 31 January 2026
(This article belongs to the Special Issue Advances in High-Performance Computing, Optimization and Simulation)

Abstract

Kolmogorov–Arnold Networks employ learnable univariate activation functions on edges rather than fixed node nonlinearities. Standard B-spline implementations require $O(3KW)$ parameters per layer ($K$ basis functions, $W$ connections). We introduce shared Gaussian radial basis functions with learnable centers $\mu_k^{(l)}$ and widths $\sigma_k^{(l)}$ maintained globally per layer, reducing parameter complexity to $O(KW + 2LK)$ for $L$ layers, a threefold reduction, while preserving Sobolev convergence rates $O(h_\Omega^s)$. Width clamping at $\sigma_{\min} = 10^{-6}$ and tripartite regularization ensure numerical stability. On MNIST with architecture $[784, 128, 10]$ and $K = 5$, RBF-KAN achieves $87.8\%$ test accuracy versus $89.1\%$ for B-spline KAN with $1.4\times$ speedup and 33% memory reduction, though generalization gap increases from $1.1\%$ to $2.7\%$ due to global Gaussian support. Physics-informed neural networks demonstrate substantial improvements on partial differential equations: elliptic problems exhibit a $45\times$ reduction in PDE residual and maximum pointwise error, decreasing from $1.32$ to $0.18$; parabolic problems achieve a $2.1\times$ accuracy gain; hyperbolic wave equations show a $19.3\times$ improvement in maximum error and a $6.25\times$ reduction in $L^2$ norm. Superior hyperbolic performance derives from infinite differentiability of Gaussian bases, enabling accurate high-order derivatives without polynomial dissipation. Ablation studies confirm that coefficient regularization reduces mean error by 40%, while center diversity prevents basis collapse. Optimal basis count $K \in [3, 5]$ balances expressiveness and overfitting. The architecture establishes Gaussian RBFs as efficient alternatives to B-splines for learnable activation networks with advantages in scientific computing.

1. Introduction

Modern neural network architectures face a fundamental trade-off between expressive power and computational efficiency. The capacity to approximate arbitrary continuous functions, formalized through the Universal Approximation Theorem [1,2,3,4], demands sufficient parameters to represent complex decision boundaries, yet excessive parameterization incurs prohibitive training costs and overfitting risks. Multilayer perceptrons (MLPs) address this challenge through depth and fixed activation functions (ReLU, Swish/SiLU, GELU [5,6]), but representational complexity remains proportional to network width, scaling as $O(W^2)$ for $W$ connections per layer. Recurrent architectures mitigate temporal dependencies through memory mechanisms [7,8,9], while autoencoders achieve compact representations for dimensionality reduction [10,11] and data generation [12,13,14], yet all approaches fundamentally share the parameter scaling challenge. Kolmogorov–Arnold Networks (KANs) offer an alternative architectural paradigm wherein learnable univariate functions replace fixed activations, placing adaptability on network edges rather than nodes. Grounded in the Kolmogorov–Arnold representation theorem [15], KANs decompose multivariate functions into superpositions of univariate mappings, theoretically requiring exponentially fewer parameters than equivalent-capacity MLPs for low-dimensional smooth functions. While the original constructive proof yields fractal-like, nowhere differentiable universal functions unsuitable for practical computation, Liu et al. [16] introduced the first trainable KAN architecture by parameterizing univariate functions via B-spline basis expansions with adaptive grid refinement, achieving superior performance on symbolic regression and physics-informed neural network benchmarks.
However, B-spline parameterization introduces two critical limitations that constrain practical scalability: (1) per-connection parameter overhead: each edge requires independent storage of $3K$ parameters (centers, widths, coefficients) for $K$ basis functions, yielding total complexity $O(3KW)$ for a network with $W$ connections; (2) polynomial smoothness constraints: B-splines of order $k$ possess only $C^{k-2}$ continuity, potentially introducing spurious numerical artifacts in applications requiring high-order derivatives, such as solving hyperbolic partial differential equations governed by wave propagation physics. This work addresses these limitations through a shared-basis architecture employing Gaussian radial basis functions (RBFs). Classical RBF networks have approximation-theoretic foundations, with universal approximation guarantees established by Park and Sandberg [17] and convergence rate analysis in Sobolev spaces provided by Schaback [18] and Wendland [19], demonstrating error decay scaling as $O(h^s)$, where $s$ quantifies target function regularity and $h$ represents fill distance. Gaussian RBFs provide $C^\infty$ smoothness, ensuring stable computation of arbitrarily high-order derivatives required for physics-informed applications. The infinite differentiability of Gaussian kernels proves particularly advantageous for hyperbolic PDEs, where accurate representation of oscillatory solutions demands accurate higher-order spatial derivatives without polynomial truncation error. We introduce width clamping at threshold $\sigma_{\min} = 10^{-6}$ combined with tripartite regularization to mitigate numerical conditioning pathologies inherent to RBF interpolation [20] while preserving approximation capacity. Theoretical analysis establishes that the shared-basis design maintains Sobolev space convergence rates $O(h_\Omega^s)$ under mesh refinement, where $\Omega$ denotes the approximation domain. The rest of this paper is structured as follows.
Section 2 reviews related work on Kolmogorov–Arnold Networks, radial basis function approximation theory, and recent architectural innovations in learnable activation networks, positioning our contribution within the broader context of adaptive neural architectures. Section 3 establishes the mathematical foundations, including the Kolmogorov–Arnold representation theorem and Gaussian RBF approximation in Sobolev spaces. Section 4 presents the shared-basis RBF-KAN architecture, detailing the forward propagation algorithm and backpropagation with automatic differentiation for learnable centers and widths. Section 5 shows experimental validation on MNIST classification and three canonical physics-informed neural network benchmarks. Section 6 concludes with a discussion of trade-offs between parameter efficiency and generalization performance, limitations of global Gaussian support structures, and directions for future research, including hybrid local-global basis strategies and extensions to high-dimensional domains.

2. Related Work

The theoretical foundations of Kolmogorov–Arnold Networks originate from the representation theorem established independently by Kolmogorov [15] and Arnold [21], demonstrating that arbitrary continuous multivariate functions admit decomposition into superpositions of continuous univariate functions. While mathematically elegant, the constructive proof yields universal inner functions possessing fractal-like, nowhere differentiable structure unsuitable for practical computation [22,23]. Liu et al. [16] introduced the first trainable KAN architecture, parameterizing univariate activation functions via B-spline basis expansions with adaptive grid refinement, achieving superior parameter efficiency on symbolic regression and partial differential equation benchmarks compared with multilayer perceptrons of equivalent width.
Classical radial basis function networks constitute a well-established paradigm with rigorous approximation-theoretic foundations. Park and Sandberg [17] established universal approximation guarantees for Gaussian RBF networks under mild regularity conditions, while subsequent work by Schaback [18] and Wendland [19] provided convergence rate analysis in Sobolev spaces, demonstrating error decay scaling as $O(h^s)$, where $s$ denotes target function regularity and $h$ represents fill distance. Buhmann [24] synthesized theoretical developments in his comprehensive monograph, establishing connections between RBF interpolation, kernel methods, and reproducing kernel Hilbert spaces. Numerical conditioning challenges inherent to Gaussian RBF interpolation have motivated stable algorithms incorporating variable basis placement and shape parameter optimization [20].
Recent architectural innovations in learnable activation networks include FastKAN [25], which replaces B-splines with rational function parameterizations to reduce computational overhead associated with grid operations, trading approximation fidelity against inference speed. Concurrent work has explored Chebyshev polynomial bases, Fourier features, and wavelet decompositions as alternative parameterization schemes, each offering distinct trade-offs between locality, smoothness, and computational efficiency. The broader context of adaptive activation functions encompasses PReLU [26], Swish/SiLU [5,6], and GELU [27], which learn scalar parameters modulating fixed nonlinearities rather than arbitrary univariate functions.
Physics-informed neural networks leverage automatic differentiation to embed differential equation constraints into loss functions, enabling mesh-free solution of forward and inverse problems [28,29,30]. The framework accommodates diverse equation types, including Navier-Stokes [31], Schrödinger [32], and Maxwell equations [33], with recent extensions addressing multi-scale phenomena [34], inverse design [35], and operator learning [36]. Theoretical analysis of PINN convergence properties remains active, with recent work establishing error bounds under regularity assumptions [37,38] and investigating failure modes in high-frequency or singularity-laden problems [39].
The neural tangent kernel framework introduced by Jacot et al. [40] provides analytical tools for studying infinite-width neural networks through kernel methods, establishing connections between gradient descent dynamics and kernel regression. Subsequent theoretical developments by Allen-Zhu et al. [41] and Du et al. [42] formalized global convergence guarantees under overparameterization, demonstrating exponential loss decay rates when the minimum eigenvalue of the NTK matrix remains bounded away from zero. Arora et al. [43] extended NTK analysis to finite-width networks, characterizing the lazy training regime where parameters remain near initialization throughout optimization.
The present contribution synthesizes these developments through a parameter-efficient shared-basis RBF-KAN architecture preserving approximation-theoretic guarantees while addressing computational tractability. Unlike the rational parameterization of FastKAN or the local B-spline support of the Standard KAN, the Gaussian RBF approach provides infinitely differentiable global basis functions naturally suited to physics-informed applications requiring high-order derivatives. The width regularization mechanism addresses conditioning pathologies inherent to RBF interpolation through gradient-based optimization with explicit stability constraints, avoiding classical techniques like variable shape parameter selection or greedy basis placement.

3. Mathematical Background

The theoretical foundation of Kolmogorov–Arnold Networks rests upon the fundamental representation theorem established independently by Kolmogorov and Arnold in 1957, which provides the conceptual basis for replacing traditional multilayer perceptrons with networks employing learnable univariate activation functions. We begin by recalling this classical result before developing the approximation-theoretic framework for Gaussian radial basis functions that underlies the proposed architectural design.

3.1. Kolmogorov–Arnold Representation Theorem

The possibility of representing arbitrary continuous multivariate functions through superpositions of continuous univariate functions was established through the following celebrated result.
Theorem 1
(Kolmogorov [15], Arnold [21]). Let $f : [0,1]^n \to \mathbb{R}$ be any continuous multivariate function. Then there exist continuous univariate functions $\varphi_{q,p} : \mathbb{R} \to \mathbb{R}$ for $q = 0, 1, \ldots, 2n$ and $p = 1, 2, \ldots, n$, and continuous outer functions $\Phi_q : \mathbb{R} \to \mathbb{R}$ such that
$$f(x_1, x_2, \ldots, x_n) = \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \varphi_{q,p}(x_p) \right),$$
where the inner functions $\{\varphi_{q,p}\}_{q,p}$ are universal in the sense that they depend only on the dimension $n$, not on the particular function $f$.
The proof of Theorem 1, while constructive, yields universal functions $\varphi_{q,p}$ that are typically fractal-like, nowhere differentiable, and computationally intractable [22]. The compositional structure of the representation nevertheless guides practical neural network designs. A KAN with $L$ layers and widths $n_0, n_1, \ldots, n_L$ implements a composite mapping $F : \mathbb{R}^{n_0} \to \mathbb{R}^{n_L}$ through the recursive transformation
$$x_j^{(l+1)} = \sum_{i=1}^{n_l} \varphi_{i,j}^{(l)}\big(x_i^{(l)}\big), \quad j = 1, 2, \ldots, n_{l+1}, \quad l = 0, 1, \ldots, L-1,$$
where $x^{(l)} = (x_1^{(l)}, x_2^{(l)}, \ldots, x_{n_l}^{(l)}) \in \mathbb{R}^{n_l}$ denotes the activation vector at layer $l$, and each $\varphi_{i,j}^{(l)} : \mathbb{R} \to \mathbb{R}$ represents a learnable univariate function governing the connection from neuron $i$ in layer $l$ to neuron $j$ in layer $l+1$.
The architectural distinction from conventional multilayer perceptrons is fundamental: whereas MLPs apply a fixed activation function $\sigma : \mathbb{R} \to \mathbb{R}$ uniformly across all neurons, KAN architectures learn connection-specific functions $\varphi_{i,j}^{(l)}$ that can adapt to local input-output relationships. This flexibility necessitates the development of parameterization schemes that balance expressiveness with computational tractability. We address this challenge through Gaussian radial basis function expansions.
In more detail, radial basis functions provide a natural parameterization framework for univariate function approximation. A Gaussian radial basis function centered at $\mu \in \mathbb{R}$ with width parameter $\sigma \in \mathbb{R}_+ = (0, \infty)$ is defined by
$$\psi(x; \mu, \sigma) = \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).$$
The universal approximation capability of Gaussian RBFs is established by the following classical result.
Theorem 2
(Park and Sandberg [17]). Let $\Omega \subset \mathbb{R}$ be a compact set and let $f \in C(\Omega)$ be continuous. For any $\varepsilon > 0$, there exist $N \in \mathbb{N}$, coefficients $\{w_i\}_{i=1}^N \subset \mathbb{R}$, centers $\{\mu_i\}_{i=1}^N \subset \mathbb{R}$, and shape parameters $\{\sigma_i\}_{i=1}^N \subset \mathbb{R}_+$ such that
$$\sup_{x \in \Omega} \left| f(x) - \sum_{i=1}^{N} w_i \exp\!\left( -\frac{(x - \mu_i)^2}{2\sigma_i^2} \right) \right| < \varepsilon.$$
Theorem 2 guarantees existence but provides no constructive guidance on the number of basis functions $N$ required to achieve tolerance $\varepsilon$, nor on the selection of centers and widths. In order to obtain quantitative convergence rates, we must impose regularity conditions on the target function and appeal to approximation theory in Sobolev spaces.
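To make the form of the expansion in Theorem 2 concrete, the following is a minimal NumPy sketch that fits the weights $w_i$ by linear least squares for fixed centers and widths. The equispaced centers, the common width equal to the center spacing, the sine target, and the function names `gaussian_rbf` and `fit_rbf_expansion` are illustrative assumptions, not choices taken from the paper.

```python
import numpy as np

def gaussian_rbf(x, mu, sigma):
    """Evaluate psi(x; mu_i, sigma_i) = exp(-(x - mu_i)^2 / (2 sigma_i^2)).

    x: (M,) evaluation points; mu, sigma: (N,) centers and widths.
    Returns the (M, N) design matrix.
    """
    x = np.asarray(x, dtype=float)[:, None]
    return np.exp(-((x - np.asarray(mu)) ** 2) / (2.0 * np.asarray(sigma) ** 2))

def fit_rbf_expansion(f, n_basis, domain=(0.0, 1.0), n_samples=400):
    """Least-squares fit of sum_i w_i psi(x; mu_i, sigma_i) to f on a sample grid.

    Assumed setup (hypothetical): equispaced centers and a shared width
    equal to the center spacing.
    """
    a, b = domain
    x = np.linspace(a, b, n_samples)
    mu = np.linspace(a, b, n_basis)
    sigma = np.full(n_basis, (b - a) / (n_basis - 1))
    design = gaussian_rbf(x, mu, sigma)
    w, *_ = np.linalg.lstsq(design, f(x), rcond=None)
    sup_err = np.max(np.abs(design @ w - f(x)))  # sup-norm error on the grid only
    return w, mu, sigma, sup_err

target = lambda x: np.sin(2.0 * np.pi * x)
for n in (4, 8, 16):
    *_, err = fit_rbf_expansion(target, n)
    print(f"N = {n:2d}  sup-error on grid = {err:.2e}")
```

The sketch only measures the error on the sampling grid; it illustrates the structure of the approximant and does not by itself verify the theorem.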

3.2. Approximation Theory in Sobolev Spaces

Throughout this paper, we adopt the convention
$$\hat{f}(\xi) = \int_{\mathbb{R}} f(x)\, e^{-i x \xi}\, dx, \qquad f(x) = \frac{1}{2\pi} \int_{\mathbb{R}} \hat{f}(\xi)\, e^{i x \xi}\, d\xi$$
for the Fourier transform and its inverse. Under this convention, Parseval's identity takes the form
$$\| f \|_{L^2(\mathbb{R})}^2 = \frac{1}{2\pi} \| \hat{f} \|_{L^2(\mathbb{R})}^2.$$
For integer regularity $s \in \mathbb{N}$, the Sobolev norm admits the equivalent definition
$$\| f \|_{H^s(\Omega)} = \left( \sum_{k=0}^{s} \| f^{(k)} \|_{L^2(\Omega)}^2 \right)^{1/2}, \tag{7}$$
where $f^{(k)}$ denotes the $k$-th weak derivative of $f$. The Sobolev norm for general $s > 0$ is defined as
$$\| f \|_{H^s(\mathbb{R})} = \left( \frac{1}{2\pi} \int_{\mathbb{R}} (1 + |\xi|^2)^s\, |\hat{f}(\xi)|^2\, d\xi \right)^{1/2}. \tag{8}$$
This Fourier-based definition (8) coincides with (7) when $s \in \mathbb{N}$ and provides a consistent extension to non-integer regularity indices. The Sobolev embedding theorem [44] establishes the fundamental relationship between weak differentiability and pointwise continuity: functions in $H^s(\mathbb{R})$ with $s > \frac{1}{2}$ admit continuous representatives with supremum norm controlled by $\| f \|_{H^s(\mathbb{R})}$.
Theorem 3
(RBF Approximation on Compact Domains [19]). Let $\Omega \subset \mathbb{R}$ be a compact interval and let $f \in H^s(\Omega)$ be the restriction of a function in $H^s(\mathbb{R})$ to $\Omega$, with $s > 1/2$. Let $\{\mu_i\}_{i=1}^N \subset \Omega$ be a set of centers satisfying the quasi-uniformity condition: there exists a constant $c_{qu} \geq 1$ such that
$$h_\Omega \leq c_{qu} \cdot \min_{i \neq j} |\mu_i - \mu_j|,$$
where the fill distance relative to $\Omega$ is defined as
$$h_\Omega = \sup_{x \in \Omega} \min_{1 \leq i \leq N} |x - \mu_i|.$$
This quasi-uniformity condition ensures that centers are neither too clustered nor too sparse within $\Omega$. Then there exists a Gaussian RBF interpolant of the form
$$I_{h_\Omega} f(x) = \sum_{i=1}^{N} \lambda_i \exp\!\left( -\frac{(x - \mu_i)^2}{2\sigma^2} \right)$$
with shape parameter $\sigma \sim h_\Omega$ (meaning there exist constants $0 < c_1 < c_2$ such that $c_1 h_\Omega \leq \sigma \leq c_2 h_\Omega$) satisfying
$$\| f - I_{h_\Omega} f \|_{L^2(\Omega)} \leq C\, h_\Omega^s\, \| f \|_{H^s(\Omega)},$$
where $C > 0$ is a constant independent of $f$ and $h_\Omega$.
The proof is given in Appendix A.
The rate $O(h_\Omega^s)$ scales with target function regularity $s$, enabling rapid convergence for smooth functions with few basis functions.
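The two geometric quantities in Theorem 3, the fill distance $h_\Omega$ and the minimum center separation, are straightforward to evaluate numerically in one dimension. The sketch below approximates the supremum over $\Omega$ on a dense probe grid; the helper names and the two example center sets are hypothetical and serve only to show how clustering inflates the ratio $h_\Omega / \min_{i \neq j} |\mu_i - \mu_j|$.

```python
import numpy as np

def fill_distance(centers, domain, n_probe=10_001):
    """Approximate h_Omega = sup_{x in Omega} min_i |x - mu_i| on a probe grid."""
    a, b = domain
    x = np.linspace(a, b, n_probe)
    centers = np.asarray(centers, dtype=float)
    return np.max(np.min(np.abs(x[:, None] - centers[None, :]), axis=1))

def separation(centers):
    """Minimum pairwise distance min_{i != j} |mu_i - mu_j| (1D: sort, then diff)."""
    return np.min(np.diff(np.sort(np.asarray(centers, dtype=float))))

def quasi_uniformity_ratio(centers, domain):
    """Ratio h_Omega / separation; any admissible c_qu must be at least this large."""
    return fill_distance(centers, domain) / separation(centers)

uniform = np.linspace(0.0, 1.0, 11)                # equispaced centers on [0, 1]
clustered = np.array([0.0, 0.01, 0.02, 0.5, 1.0])  # three centers bunched near 0

for name, mu in (("uniform", uniform), ("clustered", clustered)):
    h = fill_distance(mu, (0.0, 1.0))
    ratio = quasi_uniformity_ratio(mu, (0.0, 1.0))
    print(f"{name:9s}  fill distance = {h:.3f}  ratio = {ratio:.1f}")
```

For equispaced centers the ratio equals 0.5, while the clustered set reaches 25 because the separation collapses to 0.01 although the fill distance is 0.25.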

3.3. Conditioning and Numerical Stability

While Theorem 3 establishes favorable approximation rates, practical implementation must contend with the ill-conditioning inherent in Gaussian RBF interpolation. Consider the interpolation problem on a compact domain $\Omega$: given data $\{(x_i, y_i)\}_{i=1}^N$ with distinct points $x_1 < x_2 < \cdots < x_N$ in $\Omega$, find coefficients $\{\lambda_i\}_{i=1}^N$ such that
$$\sum_{j=1}^{N} \lambda_j \exp\!\left( -\frac{(x_i - \mu_j)^2}{2\sigma^2} \right) = y_i, \quad i = 1, 2, \ldots, N,$$
where we take centers $\mu_j = x_j$ for simplicity. This yields the linear system $A \lambda = y$ where the interpolation matrix $A \in \mathbb{R}^{N \times N}$ has entries
$$A_{ij} = \exp\!\left( -\frac{(x_i - x_j)^2}{2\sigma^2} \right). \tag{14}$$
The conditioning of A exhibits extreme sensitivity to the shape parameter σ and the point distribution.
Theorem 4
(Conditioning of Gaussian RBF Matrices [18]). Let $A \in \mathbb{R}^{N \times N}$ be the Gaussian RBF interpolation matrix (14) with uniformly spaced points $x_i = a + (i-1)h$ where $h > 0$ is the uniform spacing on a bounded interval. Define the minimum separation $h_{\min} = \min_{i \neq j} |x_i - x_j| = h$. Then the condition number satisfies the asymptotic bounds
$$\kappa(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)} \sim \begin{cases} O(\sigma^{-2N}) & \text{as } \sigma \to 0^+, \\ O\!\left( \exp\!\left( \dfrac{C \sigma^2}{h^2} \right) \right) & \text{as } \sigma \to +\infty, \end{cases}$$
for a constant C > 0 depending on the domain.
Theorem 4 reveals the fundamental trade-off in Gaussian RBF methods: small shape parameters $\sigma$ yield high-resolution approximations but severely ill-conditioned interpolation systems, while large $\sigma$ provides numerical stability at the cost of poor approximation quality. The optimal choice $\sigma \sim h$ suggested by Theorem 3 attempts to balance these competing objectives, but even with this choice, condition numbers can become prohibitively large for dense point sets. However, we do not solve interpolation systems directly. Instead, we optimize the coefficients via gradient descent on a data-driven loss function. This optimization-based approach effectively mitigates conditioning issues through the following:
1. Regularization: Penalty terms such as $\lambda_c \sum_k c_k^2$ improve effective conditioning by adding $\lambda_c I$ to the Hessian of the loss, bounding the inverse curvature.
2. Iterative refinement: Gradient descent naturally explores well-conditioned directions in parameter space, implicitly avoiding pathological subspaces associated with small singular values.
3. Adaptive parameterization: Learning width parameters $\sigma_k$ individually for each basis function allows the network to automatically adjust widths to maintain numerical stability while preserving approximation power where needed.
Nevertheless, Theorem 4 underlines the importance of careful initialization and stabilization mechanisms, which we address through width clamping and regularization strategies.
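The regularization mechanism in the first item can be checked on a linear least-squares surrogate. Assuming a squared loss with design matrix $\Phi$ and a coefficient penalty scaled so that it contributes exactly $\lambda I$ to the Hessian, the Hessian is $\Phi^\top \Phi + \lambda I$ and its condition number is bounded by $(\lambda_{\max}(\Phi^\top \Phi) + \lambda)/\lambda$. The sketch below verifies this bound; the grid, centers, width, and value of $\lambda$ are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_design(x, centers, sigma):
    """Design matrix with entries exp(-(x_m - mu_k)^2 / (2 sigma^2))."""
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * sigma ** 2))

def hessian_condition(design, lam):
    """Condition number of the regularized Hessian design^T design + lam * I."""
    gram = design.T @ design
    eigvals = np.linalg.eigvalsh(gram + lam * np.eye(gram.shape[0]))  # ascending
    return eigvals[-1] / eigvals[0]

x = np.linspace(0.0, 1.0, 200)
centers = np.linspace(0.0, 1.0, 12)
design = gaussian_design(x, centers, sigma=0.3)  # wide, strongly overlapping bases

lam = 1e-4                                        # hypothetical penalty weight
kappa_reg = hessian_condition(design, lam)
lam_max = np.linalg.eigvalsh(design.T @ design)[-1]
bound = (lam_max + lam) / lam                     # worst case: smallest eigenvalue is lam

print(f"unregularized cond ~ {np.linalg.cond(design.T @ design):.2e}")
print(f"regularized cond   = {kappa_reg:.2e}  (bound {bound:.2e})")
```

The bound depends only on $\lambda$ and the largest eigenvalue, so it holds regardless of how close to singular the unregularized Gram matrix is.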

3.4. Native Space Theory and Reproducing Kernel Hilbert Spaces

The Gaussian kernel $K(x, y; \sigma) = \exp(-(x - y)^2 / (2\sigma^2))$ induces a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$, known as the native space of the kernel [19]. Functions in $\mathcal{H}_K$ can be represented as
$$f(x) = \sum_{i=1}^{\infty} c_i\, K(x, x_i; \sigma)$$
with coefficients satisfying $\sum_{i,j} c_i c_j K(x_i, x_j; \sigma) < \infty$. The native space norm is defined via
$$\| f \|_{\mathcal{H}_K}^2 = \inf \left\{ \sum_{i,j} c_i c_j K(x_i, x_j; \sigma) \, : \, f = \sum_i c_i K(\cdot, x_i; \sigma) \right\}.$$
Given a discrete set of distinct points $X = \{x_1, \ldots, x_N\}$, the power function is defined as
$$P_{X,K}(x) = \sqrt{ K(x, x; \sigma) - k_X(x)^\top K_X^{-1} k_X(x) },$$
where $k_X(x) = (K(x, x_1; \sigma), \ldots, K(x, x_N; \sigma))^\top \in \mathbb{R}^N$ and $K_X \in \mathbb{R}^{N \times N}$ is the Gram matrix with entries $(K_X)_{ij} = K(x_i, x_j; \sigma)$. Given distinct points and the Gaussian kernel, $K_X$ is strictly positive definite [19], hence invertible, ensuring that the power function is well-defined. The power function provides a pointwise error bound for RBF interpolation in the native space.
Theorem 5
(Pointwise Error Bound [19]). Let $f \in \mathcal{H}_K$ and let $I_X f$ denote the RBF interpolant to $f$ at distinct points $X = \{x_1, \ldots, x_N\}$. Then for any $x \in \mathbb{R}$,
$$| f(x) - I_X f(x) | \leq P_{X,K}(x)\, \| f \|_{\mathcal{H}_K}.$$
Theorem 5 shows that approximation quality at a point $x$ depends on both the native space norm of $f$ and the power function $P_{X,K}(x)$. For well-distributed centers with fill distance $h_\Omega$ in a compact domain $\Omega$, the power function decays as $O(h_\Omega^{2s})$ where $s$ relates to the kernel smoothness, consistent with the $L^2$ convergence rate in Theorem 3.
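The power function can be evaluated directly from the Gram matrix. The sketch below assumes the standard square-root form $P_{X,K}(x) = \sqrt{K(x,x;\sigma) - k_X(x)^\top K_X^{-1} k_X(x)}$ and uses a linear solve instead of an explicit inverse; the node set and the width $\sigma = 0.2$ are illustrative choices, and the function names are hypothetical.

```python
import numpy as np

def gauss_kernel(x, y, sigma):
    """Kernel matrix with entries exp(-(x_i - y_j)^2 / (2 sigma^2))."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    y = np.atleast_1d(np.asarray(y, dtype=float))
    return np.exp(-((x[:, None] - y[None, :]) ** 2) / (2.0 * sigma ** 2))

def power_function(x_eval, nodes, sigma):
    """P_{X,K}(x) = sqrt(K(x, x) - k_X(x)^T K_X^{-1} k_X(x)) for the Gaussian kernel."""
    gram = gauss_kernel(nodes, nodes, sigma)           # K_X, symmetric positive definite
    k_x = gauss_kernel(x_eval, nodes, sigma)           # row m holds k_X(x_m)^T
    solved = np.linalg.solve(gram, k_x.T)              # K_X^{-1} k_X(x), one column per x
    p_squared = 1.0 - np.sum(k_x.T * solved, axis=0)   # K(x, x) = 1 for the Gaussian
    return np.sqrt(np.clip(p_squared, 0.0, None))      # clip guards against round-off

nodes = np.linspace(0.0, 1.0, 6)                       # spacing 0.2
p_at_nodes = power_function(nodes, nodes, sigma=0.2)
p_at_midpoints = power_function(nodes[:-1] + 0.1, nodes, sigma=0.2)
print("at nodes     :", p_at_nodes)
print("at midpoints :", p_at_midpoints)
```

The power function vanishes (up to round-off) at the interpolation nodes and is strictly positive between them, which is the pointwise picture behind the bound in Theorem 5.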

3.5. Gaussian vs. B-Spline Basis Functions: Comparative Analysis

The choice between Gaussian RBF and B-spline parameterizations induces fundamental trade-offs in approximation capability, generalization behavior, and training dynamics. We establish a comparative analysis demonstrating that these basis function families occupy complementary architectural niches determined by their differing smoothness properties, support structures, and gradient characteristics.
B-splines of order $k$ are piecewise polynomials of degree $k-1$ with $C^{k-2}$ continuity at knot points. For the cubic B-spline parameterization ($k = 3$) employed in Standard KAN experiments, the basis functions satisfy $\psi_{\text{B-spline}} \in C^1(\mathbb{R})$, ensuring continuous first derivatives but exhibiting second-derivative discontinuities at knot locations. This limited smoothness constrains the regularity of network outputs $u_\theta$ composed through multiple layers, with derivative discontinuities accumulating at layer transitions. Conversely, Gaussian RBFs $\psi_{\text{Gauss}}(x) = \exp(-(x - \mu)^2 / (2\sigma^2))$ belong to $C^\infty(\mathbb{R})$, possessing continuous derivatives of all orders. This infinite differentiability ensures network outputs $u_\theta \in C^\infty$ regardless of network depth, enabling accurate automatic differentiation for arbitrary-order derivative computation. The practical consequence manifests in physics-informed neural network applications: hyperbolic wave equations require the second-order derivatives $\partial^2 u / \partial t^2$ and $\partial^2 u / \partial x^2$ in residual computation, and B-spline networks with $C^1$ regularity exhibit derivative discontinuities that corrupt these second-order terms, introducing numerical artifacts. Gaussian networks maintain $\nabla^2 u \in C^\infty$, explaining the $19.3\times$ error reduction observed on the wave equation benchmarks (Section 5.2).
The support structure differences between these basis families directly influence generalization behavior and computational efficiency. B-splines exhibit compact support determined by order $k$ and grid spacing $h$, with cubic B-splines satisfying $|\operatorname{supp}(\psi_{\text{B-spline}})| = 3h$.
This locality ensures that perturbing basis parameters affects only function values within the compact support region, providing implicit regularization through limited parameter interaction. Each evaluation point interacts with at most $k$ adjacent basis functions, yielding sparse Jacobian matrices and efficient gradient computation. In contrast, Gaussian RBFs exhibit global support with exponential decay characterized by the $\epsilon$-effective support satisfying $|\operatorname{supp}_\epsilon(\psi_{\text{Gauss}})| = 2\sigma \sqrt{2 \log(1/\epsilon)}$. For numerical precision threshold $\epsilon = 10^{-3}$ and $\sigma \sim h$ (prescribed by Theorem 3 for optimal convergence), this yields an effective support width $\approx 7.4h$, substantially exceeding the B-spline support $3h$. Each evaluation point interacts with all $K$ basis functions simultaneously, producing dense Jacobian matrices with increased computational overhead but enhanced expressiveness for global function features.
We later establish that this global support amplifies effective model capacity from $d_{\text{eff}} = O(1)$ for B-splines to $d_{\text{eff}} = K$ for Gaussians. Classical learning theory predicts generalization gap scaling with $\sqrt{d_{\text{eff}}}$, yielding a factor $\sqrt{K} \approx 2.2$ amplification for $K = 5$, consistent with the observed ratio $(2.7\% / 1.1\%) \approx 2.45$ on MNIST.
Considering smooth target functions $f \in H^s(\Omega)$ with Sobolev regularity $s > k - 2$, both B-splines and Gaussians achieve an asymptotic convergence rate of $O(h^s)$ under quasi-uniform basis placement (Theorem 3 for Gaussians, classical approximation theory for B-splines). However, pre-asymptotic behavior and suitability for different function classes differ substantially. B-splines efficiently represent functions with piecewise polynomial structure, achieving exact representation for polynomials of degree up to $k - 1$ on each interval under appropriate conditions. For classification tasks with piecewise smooth decision boundaries, this efficiency manifests in superior MNIST accuracy (89.1% versus 87.8%).
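The numerical figures quoted in this comparison follow from short calculations. Solving $\exp(-x^2/(2\sigma^2)) = \epsilon$ for $x$ gives a half-width of $\sigma\sqrt{2\ln(1/\epsilon)}$, so the effective support has width $2\sigma\sqrt{2\ln(1/\epsilon)}$. The sketch below evaluates it for $\epsilon = 10^{-3}$ with $\sigma = h$ (taking the constant in $\sigma \sim h$ to be one, which is an assumption) and compares $\sqrt{K}$ with the reported gap ratio.

```python
import math

def effective_support_width(sigma, eps):
    """Width of {x : exp(-x^2 / (2 sigma^2)) >= eps} = 2 * sigma * sqrt(2 * ln(1/eps))."""
    return 2.0 * sigma * math.sqrt(2.0 * math.log(1.0 / eps))

h = 1.0  # grid spacing used as the unit of length
width = effective_support_width(sigma=h, eps=1e-3)
print(f"Gaussian effective support = {width:.2f} h (B-spline support: 3 h)")  # 7.43 h

K = 5
print(f"sqrt(K) = {math.sqrt(K):.2f}")              # 2.24
print(f"observed gap ratio = {2.7 / 1.1:.2f}")      # 2.45
```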
High-frequency sinusoidal functions $f(x) = \sin(\omega x)$ with $\omega \gg 1$ require B-spline grid spacing $h \lesssim \omega^{-1}$ to adequately resolve oscillations, with approximation error scaling polynomially in $\omega h$. Gaussians with width $\sigma \sim \omega^{-1}$ naturally match the oscillation scale through their smooth exponential profile, achieving superior approximation for such functions. This advantage explains RBF-KAN performance on wave equation benchmarks (Section 5.2), where high-frequency components dominate solution structure and require accurate representation without introducing spurious numerical dissipation.
The gradient behavior during optimization reflects these structural differences. For B-spline coefficient $c_i$, compact support yields gradient contributions only from samples within the support region $[t_i, t_{i+k}]$, producing sparse gradients with magnitude $O(\text{local batch size})$. For Gaussian center $\mu_k$, Theorem 8 establishes that all $N$ samples in the batch contribute to the gradient through global support, yielding magnitude $O(N)$. This amplification necessitates learning rate adjustment: empirically, $\alpha = 10^{-3}$ proves optimal for Gaussians under batch size $B = 256$. For Gaussian width $\sigma_k$, Theorem 9 reveals a piecewise gradient structure with a vanishing derivative when clamping activates at $\sigma_k^2 \leq \epsilon_{\min}$. When unclamped, the gradient exhibits a cubic denominator $\sigma_k^3$, inducing gradient explosion as $\sigma_k \to 0^+$. Width clamping at $\epsilon_{\min} = 10^{-6}$ (Section 4.2) prevents this pathology. B-spline knot parameters exhibit bounded gradients due to compact support, avoiding such instabilities. Global support Gaussians produce dense parameter interactions during optimization, wherein updating a single center $\mu_k$ affects loss contributions from all training samples.
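The gradient structure described here can be reproduced for a single Gaussian basis function. Differentiating $\psi = \exp(-(x-\mu)^2/(2\sigma^2))$ gives $\partial\psi/\partial\mu = \psi\,(x-\mu)/\sigma^2$ and $\partial\psi/\partial\sigma = \psi\,(x-\mu)^2/\sigma^3$, which exhibits the cubic denominator. The sketch below checks both formulas against central finite differences. The clamping rule used, replacing $\sigma^2$ by $\max(\sigma^2, \epsilon_{\min})$, is an assumed implementation for illustration and is not confirmed by the text.

```python
import numpy as np

EPS_MIN = 1e-6  # clamping threshold on sigma^2 quoted in the text

def psi(x, mu, sigma, eps_min=EPS_MIN):
    """Gaussian basis with the variance clamped from below (assumed clamping form)."""
    var = max(sigma ** 2, eps_min)
    return np.exp(-((x - mu) ** 2) / (2.0 * var))

def grad_psi(x, mu, sigma, eps_min=EPS_MIN):
    """Analytic derivatives of psi with respect to mu and sigma."""
    var = max(sigma ** 2, eps_min)
    val = np.exp(-((x - mu) ** 2) / (2.0 * var))
    d_mu = val * (x - mu) / var
    if sigma ** 2 > eps_min:
        d_sigma = val * (x - mu) ** 2 / sigma ** 3  # cubic denominator when unclamped
    else:
        d_sigma = np.zeros_like(val)                # clamp active: psi ignores sigma
    return d_mu, d_sigma

x = np.linspace(-1.0, 1.0, 5)
mu, sigma, step = 0.2, 0.5, 1e-6
d_mu, d_sigma = grad_psi(x, mu, sigma)
fd_mu = (psi(x, mu + step, sigma) - psi(x, mu - step, sigma)) / (2.0 * step)
fd_sigma = (psi(x, mu, sigma + step) - psi(x, mu, sigma - step)) / (2.0 * step)
print("max |analytic - finite difference| (mu)   :", np.max(np.abs(d_mu - fd_mu)))
print("max |analytic - finite difference| (sigma):", np.max(np.abs(d_sigma - fd_sigma)))
print("clamped sigma gradient:", grad_psi(x, mu, 1e-4)[1])
```

With `sigma = 1e-4` the variance falls below the threshold, so the width gradient is identically zero, matching the piecewise structure described above.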

3.6. Neural Tangent Kernel Perspective

Modern deep learning theory analyzes neural network optimization through the neural tangent kernel (NTK) framework [40]. For a network with parameters $\theta \in \mathbb{R}^P$ and output function $f_\theta : \mathbb{R}^d \to \mathbb{R}$, the neural tangent kernel at initialization is defined by
$$K_{\text{NTK}}(x, x'; \theta_0) = \left\langle \nabla_\theta f_\theta(x), \nabla_\theta f_\theta(x') \right\rangle_{\mathbb{R}^P} \Big|_{\theta = \theta_0}.$$
In the case of overparameterized networks trained via gradient flow $d\theta/dt = -\nabla_\theta \mathcal{L}(\theta)$, the NTK remains approximately constant during training under certain conditions, and the training dynamics can be analyzed via kernel methods with $K = K_{\text{NTK}}$.
Theorem 6
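(NTK Convergence [40,41]). Consider gradient flow on the empirical squared loss
$$\mathcal{L}(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \left( f_\theta(x_i) - y_i \right)^2,$$
where $\{(x_i, y_i)\}_{i=1}^N$ is a finite training set sampled i.i.d. from a distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, with initialization $\theta_0$. Assume the NTK matrix $K \in \mathbb{R}^{N \times N}$ with entries $K_{ij} = K_{\text{NTK}}(x_i, x_j; \theta_0)$ satisfies $K \succeq \lambda_{\min} I$ for some $\lambda_{\min} > 0$. Then,
(i) The training loss converges exponentially:
$$\mathcal{L}(\theta(t)) \leq \mathcal{L}(\theta_0) \exp(-2 \lambda_{\min} t).$$
(ii) The population risk satisfies
$$\mathbb{E}_{(x,y) \sim \mathcal{D}} \left( f_{\theta(t)}(x) - y \right)^2 - \inf_{f \in \mathcal{H}_{\text{NTK}}} \mathbb{E}_{(x,y) \sim \mathcal{D}} \left( f(x) - y \right)^2 = O\!\left( \frac{\operatorname{tr}(K^{-1})}{N} \right),$$
where $\mathcal{H}_{\text{NTK}}$ denotes the RKHS associated with the kernel $K_{\text{NTK}}(\cdot, \cdot; \theta_0)$.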
Considering RBF-based KAN architectures with learnable centers $\{\mu_k\}$ and widths $\{\sigma_k\}$, the NTK decomposes naturally according to parameter types. Denoting coefficients, centers, widths, and biases by $c, \mu, \sigma, b$ respectively, the NTK admits the decomposition
$$K_{\text{NTK}}(x, x'; \theta_0) = K_c(x, x'; \theta_0) + K_\mu(x, x'; \theta_0) + K_\sigma(x, x'; \theta_0) + K_b(x, x'; \theta_0),$$
where the kernel term for each parameter class $p \in \{c, \mu, \sigma, b\}$ is given by:
$$K_p(x, x'; \theta_0) = \left\langle \nabla_p f_\theta(x), \nabla_p f_\theta(x') \right\rangle \Big|_{\theta = \theta_0}.$$
The relative magnitudes of these terms depend on initialization scales and learning rates, providing insight into which parameters dominate the learning dynamics. In practice, coefficient parameters $c$ typically contribute most significantly due to their large number and direct linear coupling to the output, while basis parameters $\mu, \sigma$ play a secondary role in refining the feature representation.
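The block decomposition can be computed explicitly for a toy model. The sketch below assumes a single scalar activation $f(x) = \sum_k c_k \psi_k(x) + b$ with analytic parameter gradients, forms one kernel matrix per parameter class, and reports each block's share of the total trace. The initialization values and the function names are hypothetical, and the printed shares describe only this toy setup.

```python
import numpy as np

def param_gradients(x, c, mu, sigma):
    """Jacobians of f(x) = sum_k c_k psi_k(x) + b, one (M, P_block) array per block."""
    x = np.asarray(x, dtype=float)[:, None]
    diff = x - mu                                      # (M, K)
    basis = np.exp(-diff ** 2 / (2.0 * sigma ** 2))    # psi_k(x_m)
    return {
        "c": basis,                                    # df/dc_k = psi_k(x)
        "mu": c * basis * diff / sigma ** 2,           # df/dmu_k
        "sigma": c * basis * diff ** 2 / sigma ** 3,   # df/dsigma_k
        "b": np.ones((x.shape[0], 1)),                 # df/db = 1
    }

def ntk_blocks(x, c, mu, sigma):
    """Kernel matrix K_p = J_p J_p^T for each parameter class p."""
    return {name: jac @ jac.T for name, jac in param_gradients(x, c, mu, sigma).items()}

rng = np.random.default_rng(0)
n_basis = 4
x = np.linspace(-1.0, 1.0, 6)
c = 0.1 * rng.normal(size=n_basis)       # small random coefficients (assumed init)
mu = np.linspace(-1.0, 1.0, n_basis)     # equispaced centers (assumed init)
sigma = np.full(n_basis, 0.5)

blocks = ntk_blocks(x, c, mu, sigma)
total = sum(blocks.values())             # K_NTK = K_c + K_mu + K_sigma + K_b
for name, block in blocks.items():
    print(f"trace share of K_{name}: {np.trace(block) / np.trace(total):.3f}")
```

Because the full Jacobian is the column-wise concatenation of the per-class Jacobians, the sum of the blocks equals the full NTK matrix exactly.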

4. Shared-Basis Architecture and Gradient Computation

Building upon the theoretical framework of the previous section, we develop a parameter-efficient RBF-KAN variant that reduces the parameter count from $O(3KW)$ to $O(KW + 2LK)$ while preserving the approximation guarantees of Theorems 2 and 3. We provide complete gradient derivations and computational complexity analysis.

4.1. Architectural Design Principles

Classical implementations of RBF-KAN assign independent parameters $\{\mu_{i,j,k}^{(l)}, \sigma_{i,j,k}^{(l)}, c_{i,j,k}^{(l)}\}$ to each connection from neuron $i$ in layer $l$ to neuron $j$ in layer $l+1$, indexed over each basis function $k \in \{1, \ldots, K\}$. This allocation yields $3K$ parameters per connection, resulting in a total parameter count of
$$O\!\left( 3K \sum_{l=0}^{L-1} n_l\, n_{l+1} \right) = O(3KW),$$
where $W = \sum_{l=0}^{L-1} n_l\, n_{l+1}$ denotes the aggregate number of connections across the network.
The convergence rate $O(h_\Omega^s)$ established in Theorem 3 depends fundamentally upon the collective distribution of basis centers throughout the domain, rather than upon connection-specific parameter assignments. This observation motivates a parameter-sharing strategy. Within each layer $l \in \{0, 1, \ldots, L-1\}$, we introduce layer-wise global basis parameters:
$$\mu^{(l)} = (\mu_1^{(l)}, \mu_2^{(l)}, \ldots, \mu_K^{(l)}) \in \mathbb{R}^K, \qquad \sigma^{(l)} = (\sigma_1^{(l)}, \sigma_2^{(l)}, \ldots, \sigma_K^{(l)}) \in \mathbb{R}_+^K,$$
which define a shared feature map $\Psi^{(l)} : \mathbb{R} \to \mathbb{R}^K$ given componentwise by
$$\Psi_k^{(l)}(x) = \exp\!\left( -\frac{(x - \mu_k^{(l)})^2}{2 (\sigma_k^{(l)})^2} \right), \quad k = 1, \ldots, K.$$
Under this parameterization, each connection ( i , j ) between layers l and l + 1 maintains exclusively a coefficient vector c i , j ( l ) = ( c i , j , 1 ( l ) , , c i , j , K ( l ) ) R K . The learnable activation function associated with connection ( i , j ) is then expressed as
ϕ i , j ( l ) ( x ) = c i , j ( l ) , Ψ ( l ) ( x ) = k = 1 K c i , j , k ( l ) exp ( x μ k ( l ) ) 2 2 ( σ k ( l ) ) 2 .
The transformation is completed by introducing output bias terms $b^{(l)} = (b_1^{(l)}, b_2^{(l)}, \ldots, b_{n_{l+1}}^{(l)}) \in \mathbb{R}^{n_{l+1}}$ for each layer. This shared-basis design reduces the total parameter count to
$$O(2KL + KW + n_{\text{total}}) = O(KW),$$
where $n_{\text{total}} = \sum_{l=1}^{L} n_l$ is the total number of non-input neurons, each carrying one bias. For networks with $K \geq 3$, this constitutes a substantial reduction in model complexity while preserving the approximation-theoretic guarantees of Theorem 3.
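The two parameter counts can be compared directly. The following is a minimal sketch, assuming the width configuration $[784, 64, 10]$ and $K = 5$ used in the MNIST experiments later in the paper; the function names are illustrative and not taken from the authors' code.

```python
def count_connections(widths):
    """W = sum over layers of n_l * n_{l+1} for widths [n_0, ..., n_L]."""
    return sum(a * b for a, b in zip(widths[:-1], widths[1:]))


def per_connection_params(widths, K):
    """Center, width, and coefficient stored on every connection: 3K per connection."""
    return 3 * K * count_connections(widths)


def shared_basis_params(widths, K):
    """K coefficients per connection, 2K basis parameters per layer, one bias per non-input neuron."""
    num_layers = len(widths) - 1
    return K * count_connections(widths) + 2 * K * num_layers + sum(widths[1:])


widths, K = [784, 64, 10], 5          # assumed configuration (Section 5)
full = per_connection_params(widths, K)
shared = shared_basis_params(widths, K)
print(full, shared, full / shared)
```

Under these assumptions the ratio is close to three, since the $2KL$ and bias terms are negligible next to $KW$.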

4.2. Convergence Rate Preservation Under Parameter Sharing

The parameter reduction from $O(3KW)$ to $O(KW + 2LK)$ raises the question of whether the shared-basis design degrades the approximation-theoretic guarantees. We establish that the asymptotic Sobolev convergence rate $O(h_\Omega^s)$ is preserved exactly under the assumption that all activation functions within a layer require similar basis support regions, and we quantify the potential degradation of the constant when this assumption is violated.
Theorem 7
(Rate Preservation under Shared Support). Let $f \in H^s(\mathbb{R})$ for $s > 1/2$. Consider layer $l$ with activation domain $\Omega^{(l)} = [\mu_{\min}^{(l)}, \mu_{\max}^{(l)}]$ representing the range of pre-activation values. Assume all connection-specific functions $\{\phi_{i,j}^{(l)}\}$ require non-negligible values throughout $\Omega^{(l)}$ (no extreme localization). Then:
(i) Shared-basis parameterization with $K$ global centers $\{\mu_k^{(l)}\}_{k=1}^{K}$ quasi-uniformly distributed on $\Omega^{(l)}$ achieves fill distance
$$h_{\Omega^{(l)}}^{\mathrm{shared}} = \frac{\mathrm{diam}(\Omega^{(l)})}{2(K-1)} + O(K^{-2}).$$
(ii) Per-connection parameterization with independent centers $\{\mu_{i,j,k}^{(l)}\}_{k=1}^{K}$ quasi-uniformly distributed on $\Omega^{(l)}$ achieves
$$h_{\Omega^{(l)}}^{\mathrm{per\text{-}conn}} = \frac{\mathrm{diam}(\Omega^{(l)})}{2(K-1)} + O(K^{-2}) = h_{\Omega^{(l)}}^{\mathrm{shared}}.$$
(iii) Both interpolants satisfy identical convergence rates:
$$\left\| \phi_{i,j}^{(l)} - I_{h_\Omega} \phi_{i,j}^{(l)} \right\|_{L^2(\Omega^{(l)})} \le C\, h_{\Omega^{(l)}}^{s}\, \left\| \phi_{i,j}^{(l)} \right\|_{H^s(\Omega^{(l)})},$$
where the constant $C$ depends on $s$ and the quasi-uniformity constant $c_{\mathrm{qu}}$ from Equation (8) but is independent of the parameter storage scheme.
Proof. 
Under the stated assumption that all connection functions $\{\phi_{i,j}^{(l)}\}$ require non-negligible support throughout the common domain $\Omega^{(l)}$, both parameterizations distribute $K$ quasi-uniform basis functions over this domain. For quasi-uniform spacing with $K$ points on an interval $[a, b]$, the inter-point distance is $(b-a)/(K-1)$, yielding fill distance $(b-a)/(2(K-1))$ since the worst-case point lies midway between consecutive centers. Thus, both achieve
$$h_\Omega = \frac{\mu_{\max}^{(l)} - \mu_{\min}^{(l)}}{2(K-1)}.$$
Theorem 3 establishes that for Gaussian RBF interpolation with quasi-uniform centers satisfying fill distance $h_\Omega$ and shape parameter $\sigma \sim h_\Omega$, the $L^2$ error bound
$$\| f - I_{h_\Omega} f \|_{L^2(\Omega)} \le C(s, c_{\mathrm{qu}})\, h_\Omega^{s}\, \| f \|_{H^s(\Omega)}$$
holds with constant $C$ depending on the Sobolev regularity $s$ and the quasi-uniformity constant $c_{\mathrm{qu}}$, but independent of the specific center storage mechanism. Since both parameterizations achieve identical fill distance and satisfy identical quasi-uniformity conditions, they admit the identical convergence rate exponent $s$.    □
Corollary 1
(Worst-Case Constant Degradation). If per-connection centers concentrate on a subdomain $\Omega_{i,j}^{(l)} \subseteq \Omega^{(l)}$ with $\mathrm{diam}(\Omega_{i,j}^{(l)}) = \alpha \cdot \mathrm{diam}(\Omega^{(l)})$ for $\alpha \in (0, 1]$, while shared-basis centers must cover the full domain $\Omega^{(l)}$, then the fill distance ratio satisfies
$$\frac{h_{\Omega^{(l)}}^{\mathrm{shared}}}{h_{\Omega_{i,j}^{(l)}}^{\mathrm{per\text{-}conn}}} = \frac{\mathrm{diam}(\Omega^{(l)}) / (2(K-1))}{\alpha \cdot \mathrm{diam}(\Omega^{(l)}) / (2(K-1))} = \frac{1}{\alpha}.$$
By Theorem 3, this induces constant degradation bounded by
$$\frac{C^{\mathrm{shared}}}{C^{\mathrm{per\text{-}conn}}} \le \alpha^{-s}.$$
For $\alpha = 0.5$ (per-connection centers specialized to half the global range) and Sobolev regularity $s = 2$, this yields at most a $4\times$ increase in the constant. Critically, the convergence rate exponent $s$ remains identical regardless of $\alpha$, confirming that parameter sharing affects only multiplicative constants, not asymptotic convergence behavior.
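The arithmetic behind Corollary 1 is short enough to check numerically. The sketch below assumes an illustrative domain diameter of 5 (the range $[-2.5, 2.5]$ used for initialization later in this section) and $K = 5$; the helper names are hypothetical.

```python
def fill_distance(diam, K):
    """Fill distance of K equally spaced centers on an interval of length diam."""
    return diam / (2 * (K - 1))


def constant_degradation_bound(alpha, s):
    """Upper bound alpha**(-s) on C_shared / C_per-conn from Corollary 1."""
    return alpha ** (-s)


diam, K = 5.0, 5                               # illustrative values
alpha, s = 0.5, 2                              # the case discussed in the text
h_shared = fill_distance(diam, K)              # centers cover the full domain
h_per_conn = fill_distance(alpha * diam, K)    # centers cover a subdomain
ratio = h_shared / h_per_conn                  # equals 1 / alpha
bound = constant_degradation_bound(alpha, s)
print(h_shared, h_per_conn, ratio, bound)
```

The ratio of fill distances depends only on $\alpha$, so the choice of diameter and $K$ does not affect the bound.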

4.3. Numerical Stability via Width Parameter Regularization

The optimization of width parameters $\sigma_k^{(l)} \in \mathbb{R}_+$ presents inherent numerical challenges. As $\sigma_k^{(l)} \to 0^+$, gradient magnitudes diminish exponentially wherever $|x - \mu_k^{(l)}| \gg \sigma_k^{(l)}$, while evaluating the exponential term $\exp\!\left(-\frac{(x - \mu_k^{(l)})^2}{2(\sigma_k^{(l)})^2}\right)$ risks numerical overflow in the quotient as the denominator $2(\sigma_k^{(l)})^2$ approaches zero. We mitigate these instabilities through a forward-pass regularization scheme that preserves unconstrained parameter evolution during optimization.
Definition 1
(Width Parameter Clamping). Given any layer index $l \in \{0, 1, \ldots, L-1\}$ and basis function index $k \in \{1, \ldots, K\}$, we define the regularized squared width parameter by
$$(\hat{\sigma}_k^{(l)})^2 = \max\{ (\sigma_k^{(l)})^2,\ \varepsilon_{\min} \},$$
where $\varepsilon_{\min} = 10^{-6}$ denotes the minimum admissible squared width.
The evaluation of radial basis functions employs the regularized parameter $\hat{\sigma}_k^{(l)}$ in place of the native parameter $\sigma_k^{(l)}$:
$$\Psi_k^{(l)}(x) = \exp\!\left(-\frac{(x - \mu_k^{(l)})^2}{2(\hat{\sigma}_k^{(l)})^2}\right).$$
The stored parameters $\{\sigma_k^{(l)}\}_{k=1}^{K}$ remain unclamped throughout training and evolve according to standard gradient-based optimization; the regularization (29) is applied exclusively during forward propagation. This design prevents numerical pathologies while maintaining unrestricted exploration of the parameter manifold during optimization.
Lemma 1
(Uniform Boundedness of RBF Evaluations). Under the regularization scheme of Definition 1, the following properties hold uniformly over all $x \in \mathbb{R}$, all layers $l \in \{0, \ldots, L-1\}$, and all basis indices $k \in \{1, \ldots, K\}$:
(i) $0 < \Psi_k^{(l)}(x) \le 1$;
(ii) The exponential argument satisfies the bound
$$\frac{(x - \mu_k^{(l)})^2}{2(\hat{\sigma}_k^{(l)})^2} \le \frac{(x - \mu_k^{(l)})^2}{2\varepsilon_{\min}} = 5 \times 10^{5} \cdot (x - \mu_k^{(l)})^2.$$
Proof. 
Property (i): The upper bound follows from the non-positivity of the exponential argument. Since $(x - \mu_k^{(l)})^2 \ge 0$ and $(\hat{\sigma}_k^{(l)})^2 > 0$, we obtain
$$\Psi_k^{(l)}(x) = \exp\!\left(-\frac{(x - \mu_k^{(l)})^2}{2(\hat{\sigma}_k^{(l)})^2}\right) \le \exp(0) = 1.$$
The strict positivity $\Psi_k^{(l)}(x) > 0$ holds because the exponential function maps $\mathbb{R}$ into $(0, \infty)$.
Property (ii): By Definition 1, $(\hat{\sigma}_k^{(l)})^2 \ge \varepsilon_{\min}$, which yields
$$\frac{(x - \mu_k^{(l)})^2}{2(\hat{\sigma}_k^{(l)})^2} \le \frac{(x - \mu_k^{(l)})^2}{2\varepsilon_{\min}}.$$
Substituting $\varepsilon_{\min} = 10^{-6}$ gives
$$\frac{1}{2\varepsilon_{\min}} = \frac{1}{2 \times 10^{-6}} = \frac{10^{6}}{2} = 5 \times 10^{5}.$$
Consider normalized inputs satisfying $|x|, |\mu_k^{(l)}| \le M$ with $M > 0$. The triangle inequality ensures
$$(x - \mu_k^{(l)})^2 \le (|x| + |\mu_k^{(l)}|)^2 \le 4M^2.$$
Taking $M = 10$ as representative of standardized data yields $(x - \mu_k^{(l)})^2 \le 400$, and hence
$$\frac{(x - \mu_k^{(l)})^2}{2\varepsilon_{\min}} \le \frac{400}{2 \times 10^{-6}} = 2 \times 10^{8},$$
which is a finite magnitude far inside the range of IEEE 754 double-precision floating-point arithmetic (largest finite value approximately $1.8 \times 10^{308}$). The argument therefore never overflows, and the exponential of its negation at worst underflows to zero, guaranteeing numerical stability under standard computational implementations.    □
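The clamped evaluation of Definition 1 and the worst-case argument from the proof of Lemma 1 can be reproduced in a few lines. This is a minimal scalar sketch in pure Python; `clamped_rbf` is an illustrative name, and the stored width of $10^{-9}$ is an assumed pathological value chosen to trigger the clamp.

```python
import math

EPS_MIN = 1e-6  # minimum admissible squared width from Definition 1


def clamped_rbf(x, mu, sigma):
    """Gaussian RBF evaluated with the clamped squared width max(sigma^2, EPS_MIN)."""
    sigma_hat_sq = max(sigma * sigma, EPS_MIN)
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma_hat_sq))


# A collapsed stored width still produces a well-defined evaluation.
at_center = clamped_rbf(0.3, 0.3, 1e-9)       # argument is zero
far_away = clamped_rbf(10.0, -10.0, 1e-9)     # worst case for M = 10

# Largest exponential argument for |x|, |mu| <= 10 under the clamp.
worst_argument = (10.0 - (-10.0)) ** 2 / (2.0 * EPS_MIN)
print(at_center, far_away, worst_argument)
```

In exact arithmetic the far-away evaluation is strictly positive, as Lemma 1(i) states; in double precision it underflows to zero, which is harmless here because no overflow or division by zero occurs.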
Dead unit prevention. When the width parameter $\sigma_k^{(l)}$ becomes sufficiently small such that $(\sigma_k^{(l)})^2 \le \varepsilon_{\min}$, Theorem 9 establishes gradient vanishing $\partial \mathcal{L} / \partial \sigma_k^{(l)} = 0$, creating a dead unit unable to adapt during optimization. The clamping mechanism prevents this pathology while inducing a singular Hessian structure at the boundary: for $(\sigma_k^{(l)})^2 = \varepsilon_{\min}$, the width Hessian entry satisfies $\partial^2 \mathcal{L} / \partial (\sigma_k^{(l)})^2 = 0$ due to the vanishing first derivative, while off-diagonal coupling terms $\partial^2 \mathcal{L} / (\partial \sigma_k^{(l)}\, \partial c_{i,j,k}^{(l)})$ remain nonzero via the chain rule through the $\Psi_{n,i,k}^{(l)}$ dependencies. This rank-deficient structure indicates that width parameters reaching $\varepsilon_{\min}$ effectively deactivate while coefficient parameters $c_{i,j,k}^{(l)}$ retain gradient flow. Empirically, the initialization strategy (Equation (47)) with $\sigma_{\text{init}} \approx (\mu_{\max} - \mu_{\min}) / (2(K-1))$ ensures all basis functions remain active throughout training, with observed minimum widths satisfying $\min_k \sigma_k^{(l)} > 10^{-3} = \sqrt{\varepsilon_{\min}}$ across all experiments.
Remark 1.
The regularization constant $\varepsilon_{\min} = 10^{-6}$ balances numerical stability against representational flexibility. Smaller values risk underflow in reciprocal computations, while larger values impose constraints on the expressivity of narrow basis functions.

4.4. Forward Propagation

We start by describing the forward propagation strategy. Let $X^{(0)} \in \mathbb{R}^{B \times n_0}$ denote the input matrix containing $B$ samples, where $X_{n,i}^{(0)}$ represents the $i$-th feature of the $n$-th sample. At layer $l \in \{0, 1, \ldots, L-1\}$, we construct the evaluation tensor $\Psi^{(l)} \in \mathbb{R}^{B \times n_l \times K}$ with entries defined by
$$\Psi_{n,i,k}^{(l)} = \exp\!\left(-\frac{(X_{n,i}^{(l)} - \mu_k^{(l)})^2}{2(\hat{\sigma}_k^{(l)})^2}\right),$$
where indices range over $n \in \{1, \ldots, B\}$, $i \in \{1, \ldots, n_l\}$, and $k \in \{1, \ldots, K\}$. The coefficient structure is encoded in $C^{(l)} \in \mathbb{R}^{n_l \times n_{l+1} \times K}$ with entries $C_{i,j,k}^{(l)} = c_{i,j,k}^{(l)}$. The activations at layer $l+1$ are computed via tensor contraction:
$$X_{n,j}^{(l+1)} = \sum_{i=1}^{n_l} \sum_{k=1}^{K} \Psi_{n,i,k}^{(l)}\, C_{i,j,k}^{(l)} + b_j^{(l)},$$
where $n \in \{1, \ldots, B\}$ and $j \in \{1, \ldots, n_{l+1}\}$. In Einstein summation notation, this transformation takes the form
$$X^{(l+1)} = \mathrm{einsum}(\texttt{nik,ijk->nj},\ \Psi^{(l)},\ C^{(l)}) + \mathbf{1}_B\, (b^{(l)})^{\top},$$
where $\mathbf{1}_B \in \mathbb{R}^{B}$ denotes the vector of ones. The forward procedure is summarized in Algorithm 1.
Algorithm 1 Shared-basis RBF-KAN forward propagation
  • Require: Input $X^{(0)} \in \mathbb{R}^{B \times n_0}$, parameters $\{\mu^{(l)}, \sigma^{(l)}, C^{(l)}, b^{(l)}\}_{l=0}^{L-1}$
  • Ensure: Output $X^{(L)} \in \mathbb{R}^{B \times n_L}$
  1: for $l = 0, 1, \ldots, L-1$ do
  2:    Compute $(\hat{\sigma}_k^{(l)})^2 \leftarrow \max\{(\sigma_k^{(l)})^2, 10^{-6}\}$ over $k \in \{1, \ldots, K\}$
  3:    Evaluate $\Delta_{:,:,k} \leftarrow X^{(l)} - \mu_k^{(l)}$ via broadcasting
  4:    Compute $\Psi_{:,:,k}^{(l)} \leftarrow \exp\!\left(-\Delta_{:,:,k}^2 / (2(\hat{\sigma}_k^{(l)})^2)\right)$ over $k \in \{1, \ldots, K\}$
  5:    Evaluate $X^{(l+1)} \leftarrow \mathrm{einsum}(\texttt{nik,ijk->nj},\ \Psi^{(l)},\ C^{(l)}) + \mathbf{1}_B\, (b^{(l)})^{\top}$
  6: end for
  return $X^{(L)}$
The computational cost of each layer's forward pass is dominated by two operations. The radial basis function evaluation computes $B \times K \times n_l$ exponentials, where $B$ is the batch size, $K$ the number of shared basis functions, and $n_l$ the number of neurons in layer $l$; each evaluation forms the squared distance between an input coordinate and a basis center and exponentiates its negatively scaled value. The tensor contraction then performs $B \times n_l \times n_{l+1} \times K$ multiply-accumulate operations, mapping $n_l$ input features to $n_{l+1}$ output features for every sample through the $K$-dimensional basis expansion.
The total computational complexity of the forward pass scales as
$$T_{\text{forward}} = O(BWK),$$
where $W = \sum_{l=0}^{L-1} n_l\, n_{l+1}$ denotes the total number of connections across the network.
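One layer of the forward procedure can be sketched in NumPy as follows. This is an illustrative transcription under stated assumptions rather than the authors' implementation (the experiments use PyTorch); the function name, the small tensor sizes, and the random inputs are all hypothetical.

```python
import numpy as np

EPS_MIN = 1e-6  # squared-width clamp from Definition 1


def rbf_kan_layer_forward(X, mu, sigma, C, b):
    """One shared-basis layer.

    X: (B, n_in) inputs, mu and sigma: (K,) shared basis parameters,
    C: (n_in, n_out, K) coefficients, b: (n_out,) biases.
    Returns the layer output (B, n_out) and the basis tensor (B, n_in, K).
    """
    sigma_hat_sq = np.maximum(sigma ** 2, EPS_MIN)        # clamp squared widths
    delta = X[:, :, None] - mu[None, None, :]             # broadcast to (B, n_in, K)
    psi = np.exp(-delta ** 2 / (2.0 * sigma_hat_sq))      # Gaussian evaluations
    X_next = np.einsum("nik,ijk->nj", psi, C) + b         # contraction plus bias
    return X_next, psi


rng = np.random.default_rng(0)
B, n_in, n_out, K = 4, 3, 2, 5                            # illustrative sizes
X = rng.normal(size=(B, n_in))
mu = np.linspace(-2.5, 2.5, K)
sigma = np.full(K, 0.625)
C = rng.normal(size=(n_in, n_out, K))
b = np.zeros(n_out)
Y, psi = rbf_kan_layer_forward(X, mu, sigma, C, b)
print(Y.shape, psi.shape)
```

The contraction touches every entry of the $(B, n_l, n_{l+1}, K)$ index set once, which is where the $O(BWK)$ cost comes from.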

4.5. Backward Propagation via Automatic Differentiation

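We derive gradients with respect to all parameters $\theta = \{C^{(l)}, b^{(l)}, \mu^{(l)}, \sigma^{(l)}\}_{l=0}^{L-1}$ via reverse-mode automatic differentiation. Throughout this derivation, we assume the forward activations $\{X^{(l)}, \Psi^{(l)}\}_{l=0}^{L-1}$ and the output gradient $\partial \mathcal{L} / \partial X^{(L)} \in \mathbb{R}^{B \times n_L}$ are available from the forward pass and loss computation, respectively.
Lemma 2
(Coefficient Gradient). The linearity of the transformation (33) yields
$$\frac{\partial \mathcal{L}}{\partial C_{i,j,k}^{(l)}} = \sum_{n=1}^{B} \frac{\partial \mathcal{L}}{\partial X_{n,j}^{(l+1)}} \cdot \Psi_{n,i,k}^{(l)}.$$
In tensor notation, this gradient admits the compact representation
$$\frac{\partial \mathcal{L}}{\partial C^{(l)}} = \mathrm{einsum}\!\left(\texttt{nik,nj->ijk},\ \Psi^{(l)},\ \frac{\partial \mathcal{L}}{\partial X^{(l+1)}}\right).$$
Lemma 3
(Bias Gradient). The gradient with respect to the bias parameter $b_j^{(l)}$ is given by
$$\frac{\partial \mathcal{L}}{\partial b_j^{(l)}} = \sum_{n=1}^{B} \frac{\partial \mathcal{L}}{\partial X_{n,j}^{(l+1)}}.$$
In vector form, this becomes
$$\frac{\partial \mathcal{L}}{\partial b^{(l)}} = \left(\frac{\partial \mathcal{L}}{\partial X^{(l+1)}}\right)^{\!\top} \mathbf{1}_B.$$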
We also establish the local derivatives of RBF evaluations before deriving the full loss gradients.
Lemma 4
(RBF Derivatives w.r.t. Basis Parameters). Let $\Psi_k^{(l)}(x) = \exp\!\left(-\frac{(x - \mu_k^{(l)})^2}{2(\hat{\sigma}_k^{(l)})^2}\right)$ denote the RBF evaluation. The following derivatives hold:
$$\frac{\partial \Psi_k^{(l)}(x)}{\partial \mu_k^{(l)}} = \Psi_k^{(l)}(x) \cdot \frac{x - \mu_k^{(l)}}{(\hat{\sigma}_k^{(l)})^2},$$
$$\frac{\partial \Psi_k^{(l)}(x)}{\partial \sigma_k^{(l)}} = \begin{cases} \Psi_k^{(l)}(x) \cdot \dfrac{(x - \mu_k^{(l)})^2}{(\sigma_k^{(l)})^3} & \text{if } (\sigma_k^{(l)})^2 > \varepsilon_{\min}, \\ 0 & \text{if } (\sigma_k^{(l)})^2 \le \varepsilon_{\min}, \end{cases}$$
$$\frac{\partial \Psi_k^{(l)}(x)}{\partial x} = -\Psi_k^{(l)}(x) \cdot \frac{x - \mu_k^{(l)}}{(\hat{\sigma}_k^{(l)})^2}.$$
Proof. 
Equations (39) and (41) follow from the chain rule applied to the composite function $\exp(g(x))$, where $g(x) = -(x - \mu_k^{(l)})^2 / (2(\hat{\sigma}_k^{(l)})^2)$.
The piecewise structure of (40) arises from the clamping operation. When $(\sigma_k^{(l)})^2 > \varepsilon_{\min}$, the regularized width satisfies $\hat{\sigma}_k^{(l)} = \sigma_k^{(l)}$, and differentiating the exponential argument yields
$$\frac{\partial}{\partial \sigma_k^{(l)}} \left( -\frac{(x - \mu_k^{(l)})^2}{2(\sigma_k^{(l)})^2} \right) = \frac{(x - \mu_k^{(l)})^2}{(\sigma_k^{(l)})^3}.$$
Conversely, when $(\sigma_k^{(l)})^2 \le \varepsilon_{\min}$, the clamping operation yields $\hat{\sigma}_k^{(l)} = \sqrt{\varepsilon_{\min}}$, which is constant with respect to $\sigma_k^{(l)}$, thereby producing a vanishing derivative. Complete derivations appear in Appendix A.    □
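The three derivatives of Lemma 4 can be checked against central finite differences. The following is a minimal scalar sketch in pure Python; the function names and the test point are illustrative assumptions.

```python
import math

EPS_MIN = 1e-6  # squared-width clamp from Definition 1


def psi(x, mu, sigma):
    s2 = max(sigma * sigma, EPS_MIN)
    return math.exp(-((x - mu) ** 2) / (2.0 * s2))


def dpsi_dmu(x, mu, sigma):
    s2 = max(sigma * sigma, EPS_MIN)
    return psi(x, mu, sigma) * (x - mu) / s2


def dpsi_dsigma(x, mu, sigma):
    if sigma * sigma <= EPS_MIN:        # clamp active: psi no longer depends on sigma
        return 0.0
    return psi(x, mu, sigma) * (x - mu) ** 2 / sigma ** 3


def dpsi_dx(x, mu, sigma):
    return -dpsi_dmu(x, mu, sigma)      # translation symmetry of the Gaussian


def central_diff(f, t, h=1e-6):
    return (f(t + h) - f(t - h)) / (2.0 * h)


x, mu, sigma = 0.7, -0.2, 0.6           # illustrative test point
fd_mu = central_diff(lambda m: psi(x, m, sigma), mu)
fd_sigma = central_diff(lambda s: psi(x, mu, s), sigma)
fd_x = central_diff(lambda t: psi(t, mu, sigma), x)
print(dpsi_dmu(x, mu, sigma) - fd_mu,
      dpsi_dsigma(x, mu, sigma) - fd_sigma,
      dpsi_dx(x, mu, sigma) - fd_x)
```

The printed differences should be at the level of finite-difference error, and the width derivative is exactly zero whenever the stored width falls inside the clamped regime.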
Theorem 8
(Center Parameter Gradient). The gradient of the loss function with respect to the center parameter $\mu_k^{(l)}$ is given by
$$\frac{\partial \mathcal{L}}{\partial \mu_k^{(l)}} = \sum_{n=1}^{B} \sum_{i=1}^{n_l} \sum_{j=1}^{n_{l+1}} \frac{\partial \mathcal{L}}{\partial X_{n,j}^{(l+1)}} \cdot C_{i,j,k}^{(l)} \cdot \frac{X_{n,i}^{(l)} - \mu_k^{(l)}}{(\hat{\sigma}_k^{(l)})^2} \cdot \Psi_{n,i,k}^{(l)}.$$
Proof. 
Applying the chain rule yields
$$\frac{\partial \mathcal{L}}{\partial \mu_k^{(l)}} = \sum_{n,i,j} \frac{\partial \mathcal{L}}{\partial X_{n,j}^{(l+1)}} \cdot \frac{\partial X_{n,j}^{(l+1)}}{\partial \Psi_{n,i,k}^{(l)}} \cdot \frac{\partial \Psi_{n,i,k}^{(l)}}{\partial \mu_k^{(l)}}.$$
From the forward transformation (33), we obtain $\partial X_{n,j}^{(l+1)} / \partial \Psi_{n,i,k}^{(l)} = C_{i,j,k}^{(l)}$. Substituting the derivative (39) from Lemma 4 completes the derivation. The full calculation appears in Appendix A.    □
Theorem 9
(Width Parameter Gradient). The gradient with respect to the width parameter $\sigma_k^{(l)}$ admits the piecewise representation
$$\frac{\partial \mathcal{L}}{\partial \sigma_k^{(l)}} = \begin{cases} \displaystyle\sum_{n=1}^{B} \sum_{i=1}^{n_l} \sum_{j=1}^{n_{l+1}} \frac{\partial \mathcal{L}}{\partial X_{n,j}^{(l+1)}} \cdot C_{i,j,k}^{(l)} \cdot \frac{(X_{n,i}^{(l)} - \mu_k^{(l)})^2}{(\sigma_k^{(l)})^3} \cdot \Psi_{n,i,k}^{(l)} & \text{if } (\sigma_k^{(l)})^2 > \varepsilon_{\min}, \\ 0 & \text{if } (\sigma_k^{(l)})^2 \le \varepsilon_{\min}. \end{cases}$$
Proof. 
The result follows from applying the chain rule in conjunction with the derivative (40) from Lemma 4. The vanishing gradient when clamping is active establishes a soft lower bound on the width parameters during optimization.    □
Theorem 10
(Input Gradient). The gradient with respect to layer inputs, required for backpropagation to preceding layers, is given by
$$\frac{\partial \mathcal{L}}{\partial X_{n,i}^{(l)}} = -\sum_{j=1}^{n_{l+1}} \sum_{k=1}^{K} \frac{\partial \mathcal{L}}{\partial X_{n,j}^{(l+1)}} \cdot C_{i,j,k}^{(l)} \cdot \frac{X_{n,i}^{(l)} - \mu_k^{(l)}}{(\hat{\sigma}_k^{(l)})^2} \cdot \Psi_{n,i,k}^{(l)}.$$
Proof. 
The chain rule with derivative (41) from Lemma 4 yields the stated expression. The negative sign reflects the translation symmetry of the Gaussian kernel: $\partial \Psi / \partial x = -\partial \Psi / \partial \mu$. The full derivation is in Appendix A.    □
The dominant operations at each layer consist of
  • Coefficient gradients (line 2): $O(B\, n_l\, n_{l+1}\, K)$ operations.
  • Basis parameter gradients (lines 6–13): $O(K \cdot B\, n_l\, n_{l+1}) = O(B\, n_l\, n_{l+1}\, K)$ operations.
  • Input gradients (lines 15–20): $O(K \cdot B\, n_l\, n_{l+1}) = O(B\, n_l\, n_{l+1}\, K)$ operations.
The total computational complexity of the backward pass scales as $T_{\text{backward}} = O(BWK)$, where $W = \sum_{l=0}^{L-1} n_l\, n_{l+1}$ denotes the total connection count. The combined training cost per minibatch satisfies
$$T_{\text{train}} = T_{\text{forward}} + T_{\text{backward}} = O(BWK),$$
which matches the complexity of standard multilayer perceptrons up to a multiplicative factor of $K$ (typically $K \in \{5, 6, \ldots, 10\}$ in practice). The backward procedure is summarized in Algorithm 2.
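Algorithm 2 Shared-basis RBF-KAN backpropagation
Require: Forward activations $\{X^{(l)}, \Psi^{(l)}\}_{l=0}^{L-1}$, output gradient $\partial \mathcal{L} / \partial X^{(L)}$
Ensure: Parameter gradients $\left\{ \partial \mathcal{L} / \partial C^{(l)},\ \partial \mathcal{L} / \partial b^{(l)},\ \partial \mathcal{L} / \partial \mu^{(l)},\ \partial \mathcal{L} / \partial \sigma^{(l)} \right\}_{l=0}^{L-1}$
  1: for $l = L-1, L-2, \ldots, 0$ do
  2:    Compute $\partial \mathcal{L} / \partial C^{(l)} \leftarrow \mathrm{einsum}(\texttt{nik,nj->ijk},\ \Psi^{(l)},\ \partial \mathcal{L} / \partial X^{(l+1)})$ ▹ Coefficients
  3:    Compute $\partial \mathcal{L} / \partial b^{(l)} \leftarrow (\partial \mathcal{L} / \partial X^{(l+1)})^{\top} \mathbf{1}_B$ ▹ Biases
  4:    Initialize $\partial \mathcal{L} / \partial \mu^{(l)} \leftarrow 0 \in \mathbb{R}^K$ and $\partial \mathcal{L} / \partial \sigma^{(l)} \leftarrow 0 \in \mathbb{R}^K$
  5:    for $k = 1, 2, \ldots, K$ do
  6:        Evaluate $(\hat{\sigma}_k^{(l)})^2 \leftarrow \max\{(\sigma_k^{(l)})^2, \varepsilon_{\min}\}$
  7:        Compute $G \leftarrow (\partial \mathcal{L} / \partial X^{(l+1)}) \cdot (C_{:,:,k}^{(l)})^{\top}$ ▹ Weighted gradient
  8:        Evaluate $\Delta_{:,:,k} \leftarrow X^{(l)} - \mu_k^{(l)}$
  9:        Accumulate $\partial \mathcal{L} / \partial \mu_k^{(l)} \leftarrow \sum_{n,i} G_{n,i} \cdot \dfrac{\Delta_{n,i,k}}{(\hat{\sigma}_k^{(l)})^2} \cdot \Psi_{n,i,k}^{(l)}$ ▹ Theorem 8
  10:        if $(\sigma_k^{(l)})^2 > \varepsilon_{\min}$ then
  11:           Accumulate $\partial \mathcal{L} / \partial \sigma_k^{(l)} \leftarrow \sum_{n,i} G_{n,i} \cdot \dfrac{(\Delta_{n,i,k})^2}{(\sigma_k^{(l)})^3} \cdot \Psi_{n,i,k}^{(l)}$ ▹ Theorem 9
  12:        end if
  13:    end for
  14:    if $l > 0$ then ▹ Input gradients when not at input layer
  15:        Initialize $\partial \mathcal{L} / \partial X^{(l)} \leftarrow 0 \in \mathbb{R}^{B \times n_l}$
  16:        for $k = 1, \ldots, K$ do
  17:           Compute $G_k \leftarrow (\partial \mathcal{L} / \partial X^{(l+1)}) \cdot (C_{:,:,k}^{(l)})^{\top}$
  18:           Evaluate $\Delta_{:,:,k} \leftarrow X^{(l)} - \mu_k^{(l)}$
  19:           Accumulate $\partial \mathcal{L} / \partial X^{(l)} \leftarrow \partial \mathcal{L} / \partial X^{(l)} - G_k \odot \dfrac{\Delta_{:,:,k}}{(\hat{\sigma}_k^{(l)})^2} \odot \Psi_{:,:,k}^{(l)}$ ▹ Negative sign, Theorem 10
  20:        end for
  21:    end if
  22: end for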
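The per-layer gradients of Lemmas 2 and 3 and Theorems 8 to 10 can be written in vectorized form and validated by finite differences. The sketch below is an assumption-laden NumPy illustration, not the authors' code: it replaces the loop over $k$ with `einsum`, uses the hypothetical loss $\mathcal{L} = \sum_{n,j} R_{n,j} X_{n,j}^{(l+1)}$ with a fixed random $R$ so that the upstream gradient equals $R$, and all names and sizes are illustrative.

```python
import numpy as np

EPS_MIN = 1e-6


def forward(X, mu, sigma, C, b):
    s2 = np.maximum(sigma ** 2, EPS_MIN)
    delta = X[:, :, None] - mu                          # (B, n_in, K)
    psi = np.exp(-delta ** 2 / (2.0 * s2))
    return np.einsum("nik,ijk->nj", psi, C) + b, psi, delta, s2


def backward(dY, X, mu, sigma, C, b):
    """dY is the upstream gradient dL/dX^{(l+1)} with shape (B, n_out)."""
    _, psi, delta, s2 = forward(X, mu, sigma, C, b)
    dC = np.einsum("nik,nj->ijk", psi, dY)              # Lemma 2
    db = dY.sum(axis=0)                                 # Lemma 3
    G = np.einsum("nj,ijk->nik", dY, C)                 # G[:, :, k] = dY @ C[:, :, k].T
    dmu = np.sum(G * delta / s2 * psi, axis=(0, 1))     # Theorem 8
    active = sigma ** 2 > EPS_MIN                       # clamp inactive
    safe_sigma = np.where(active, sigma, 1.0)           # avoid dividing by tiny widths
    raw = np.sum(G * delta ** 2 * psi, axis=(0, 1)) / safe_sigma ** 3
    dsigma = np.where(active, raw, 0.0)                 # Theorem 9
    dX = -np.sum(G * delta / s2 * psi, axis=2)          # Theorem 10
    return dC, db, dmu, dsigma, dX


rng = np.random.default_rng(1)
B, n_in, n_out, K = 6, 3, 2, 4                          # illustrative sizes
X = rng.normal(size=(B, n_in))
mu = np.linspace(-2.0, 2.0, K)
sigma = np.full(K, 0.8)
C = rng.normal(size=(n_in, n_out, K))
b = np.zeros(n_out)
R = rng.normal(size=(B, n_out))                         # fixed weights of the test loss


def loss(X_, mu_, sigma_):
    return float(np.sum(forward(X_, mu_, sigma_, C, b)[0] * R))


dC, db, dmu, dsigma, dX = backward(R, X, mu, sigma, C, b)

h = 1e-6
e0 = np.eye(K)[0]
fd_mu0 = (loss(X, mu + h * e0, sigma) - loss(X, mu - h * e0, sigma)) / (2 * h)
fd_sigma0 = (loss(X, mu, sigma + h * e0) - loss(X, mu, sigma - h * e0)) / (2 * h)
Xp, Xm = X.copy(), X.copy()
Xp[0, 0] += h
Xm[0, 0] -= h
fd_x00 = (loss(Xp, mu, sigma) - loss(Xm, mu, sigma)) / (2 * h)
print(dmu[0] - fd_mu0, dsigma[0] - fd_sigma0, dX[0, 0] - fd_x00)
```

The same intermediate tensor `G` serves both the basis-parameter and input gradients, so a practical implementation would compute it once per layer.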

4.6. Initialization, Optimization, and Stability

Having established the forward and backward propagation algorithms, we now specify the complete training framework. This includes the geometric initialization strategy based on Theorem 3, the differentiated optimization protocol with adaptive learning rates matched to the gradient scaling established in Theorems 8 and 9, and the regularized objective function that enforces desirable properties during training. We conclude with theoretical stability guarantees ensuring well-conditioned optimization dynamics.

4.6.1. Parameter Initialization Strategy

We adopt a geometric initialization scheme based on the approximation-theoretic principles established in Theorem 3. The centers $\{\mu_k^{(l)}\}_{k=1}^{K}$ are distributed uniformly over the expected activation range $[\mu_{\min}, \mu_{\max}] = [-2.5, 2.5]$ according to
$$\mu_k^{(l)} = \mu_{\min} + \frac{\mu_{\max} - \mu_{\min}}{K - 1}\,(k - 1), \qquad k \in \{1, \ldots, K\}.$$
Meanwhile, all widths are initialized uniformly at $\sigma_k^{(l)} = \sigma_{\text{init}}$, where
$$\sigma_{\text{init}} \approx \frac{\mu_{\max} - \mu_{\min}}{2(K - 1)}.$$
This choice yields moderate overlap between adjacent basis functions, consistent with the optimal scaling $\sigma \sim h_\Omega$ prescribed by Theorem 3, where the characteristic mesh size satisfies
$$h_\Omega = (\mu_{\max} - \mu_{\min}) / (K - 1).$$
For the coefficient parameters, we employ a target-guided warm-start strategy that approximates a reference activation function $g : \mathbb{R} \to \mathbb{R}$ (typically ReLU or GELU):
$$c_{i,j,k}^{(l)} = g(\mu_k^{(l)}) + \alpha_{\text{init}} \cdot \varepsilon_{i,j,k}, \qquad \varepsilon_{i,j,k} \sim \mathcal{N}(0, 1),$$
where the noise magnitude $\alpha_{\text{init}} \in [0.01, 0.1]$ is task-dependent and balances the incorporation of structural prior knowledge against the symmetry-breaking required for effective gradient-based learning. Finally, all biases are initialized to zero:
$$b_j^{(l)} = 0.$$
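The initialization scheme above translates into a few array operations. The following NumPy sketch assumes ReLU as the reference activation and $\alpha_{\text{init}} = 0.05$, a value picked from the stated range; `init_layer` and its arguments are illustrative names.

```python
import numpy as np


def init_layer(n_in, n_out, K, g, mu_min=-2.5, mu_max=2.5, alpha_init=0.05, seed=0):
    """Geometric initialization of one shared-basis layer (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    mu = mu_min + (mu_max - mu_min) / (K - 1) * np.arange(K)   # equally spaced centers
    sigma = np.full(K, (mu_max - mu_min) / (2 * (K - 1)))      # common initial width
    noise = rng.standard_normal((n_in, n_out, K))
    C = g(mu)[None, None, :] + alpha_init * noise              # warm start toward g
    b = np.zeros(n_out)                                        # zero biases
    return mu, sigma, C, b


def relu(z):
    return np.maximum(z, 0.0)


mu, sigma, C, b = init_layer(n_in=3, n_out=2, K=5, g=relu)
print(mu, sigma[0], C.shape)
```

With $K = 5$ on $[-2.5, 2.5]$ the centers sit 1.25 apart and every width starts at half that spacing.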

4.6.2. Optimization Protocol

We employ the Adam optimizer with differentiated learning rates designed to account for the heterogeneous gradient scaling across parameter types:
  • Coefficients: $\alpha_C = 10^{-3}$ (baseline learning rate).
  • Centers: $\alpha_\mu = 10^{-4}$ (compensates for the $O((\sigma_k^{(l)})^{-2})$ scaling established in Theorem 8).
  • Widths: $\alpha_\sigma = 10^{-5}$ (compensates for the $O((\sigma_k^{(l)})^{-3})$ scaling established in Theorem 9).
Gradient clipping with parameter-specific thresholds $\tau_C = 1.0$, $\tau_\mu = 0.1$, and $\tau_\sigma = 0.01$ mitigates the effects of gradient outliers during training.
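The per-group settings above can be collected in a small table and applied to gradients before the optimizer step. The sketch below assumes clipping by Euclidean norm; the text does not state whether clipping is by norm or by value, so this rule, the `GROUPS` dictionary, and the example gradients are all assumptions made for illustration.

```python
import numpy as np

# Learning rates and clipping thresholds quoted from the protocol above.
GROUPS = {
    "C": {"lr": 1e-3, "clip": 1.0},
    "mu": {"lr": 1e-4, "clip": 0.1},
    "sigma": {"lr": 1e-5, "clip": 0.01},
}


def clip_by_norm(grad, threshold):
    """Rescale grad so that its Euclidean norm does not exceed threshold (assumed rule)."""
    norm = float(np.linalg.norm(grad))
    if norm <= threshold:
        return grad
    return grad * (threshold / norm)


def clip_all(grads):
    """grads maps a parameter-group name to its gradient array."""
    return {name: clip_by_norm(g, GROUPS[name]["clip"]) for name, g in grads.items()}


grads = {                                   # hypothetical gradients
    "C": np.array([3.0, 4.0]),
    "mu": np.array([0.03, 0.04]),
    "sigma": np.array([0.3, 0.4]),
}
clipped = clip_all(grads)
print({name: float(np.linalg.norm(g)) for name, g in clipped.items()})
```

In a PyTorch setting the same structure would typically map onto optimizer parameter groups, one per entry of the table.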

4.6.3. Regularized Objective Function

The total training objective augments the empirical data loss $\mathcal{L}_{\text{data}}(\theta)$ with three regularization terms:
$$\mathcal{L}_{\text{total}}(\theta) = \mathcal{L}_{\text{data}}(\theta) + \frac{\lambda_c}{2} \sum_{l=0}^{L-1} \| C^{(l)} \|_F^2 + \lambda_\sigma \sum_{l=0}^{L-1} \sum_{k=1}^{K} \frac{1}{(\sigma_k^{(l)})^2} + \lambda_d \sum_{l=0}^{L-1} \sum_{k < k'} \exp\!\left(-\frac{(\mu_k^{(l)} - \mu_{k'}^{(l)})^2}{2\tau^2}\right),$$
where the regularization coefficients are specified as follows:
  • $\lambda_c = 10^{-4}$: coefficient sparsity penalty via Frobenius norm.
  • $\lambda_\sigma = 10^{-3}$: width penalty discouraging excessively narrow basis functions.
  • $\lambda_d = 10^{-5}$: center diversity enforcement via Gaussian repulsion with characteristic scale $\tau = 0.5$.
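The three penalty terms can be evaluated separately, which is convenient when ablating them. The following NumPy sketch uses the coefficient values listed above as defaults; the function name and the toy inputs are illustrative assumptions.

```python
import numpy as np


def regularization_terms(Cs, sigmas, mus, lam_c=1e-4, lam_sigma=1e-3, lam_d=1e-5, tau=0.5):
    """Return the coefficient, width, and diversity penalties summed over layers.

    Cs, sigmas, mus are lists with one array per layer.
    """
    coeff = 0.5 * lam_c * sum(float(np.sum(C ** 2)) for C in Cs)
    width = lam_sigma * sum(float(np.sum(1.0 / s ** 2)) for s in sigmas)
    diversity = 0.0
    for mu in mus:
        diff = mu[:, None] - mu[None, :]                    # pairwise center gaps
        pair = np.exp(-diff ** 2 / (2.0 * tau ** 2))        # Gaussian repulsion kernel
        diversity += float(np.sum(np.triu(pair, k=1)))      # keep pairs with k < k'
    return coeff, width, lam_d * diversity


# Toy single-layer example: two coincident centers and one distant center.
Cs = [np.ones((2, 2, 3))]
sigmas = [np.array([0.5, 0.5, 0.5])]
mus = [np.array([0.0, 0.0, 10.0])]
coeff, width, diversity = regularization_terms(Cs, sigmas, mus)
print(coeff, width, diversity)
```

In the toy example only the coincident pair contributes appreciably to the diversity term, which is the behavior the repulsion penalty is meant to produce.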
Theorem 11
(Stability Guarantees under Regularized Training). Assume the regularized training satisfies the monotonic decrease of the total loss, i.e., $\mathcal{L}_{\text{total}}^{(t)} \le \mathcal{L}_{\text{total}}^{(0)}$ holds uniformly over all training iterations $t \ge 0$. Then the following properties hold:
(i) Coefficient Boundedness: The Frobenius norm of each coefficient tensor satisfies
$$\| C^{(l)}(t) \|_F \le \sqrt{\frac{2\, \mathcal{L}_{\text{total}}^{(0)}}{\lambda_c}}$$
uniformly over all layers $l \in \{0, \ldots, L-1\}$ and iterations $t \ge 0$.
(ii) Width Lower Bound: Each width parameter satisfies
$$\sigma_k^{(l)}(t) \ge \sqrt{\varepsilon_{\min}} = 10^{-3}$$
uniformly over all basis indices $k \in \{1, \ldots, K\}$, layers $l \in \{0, \ldots, L-1\}$, and iterations $t \ge 0$.
(iii) 
Center Diversity: The separation between distinct centers satisfies
P | μ k ( l ) μ k ( l ) | τ 2 log ( λ d 1 ) 1 λ d L t o t a l ( 0 )
where the probability is taken over the stochastic optimization trajectory.
Proof. 
Property (i): The regularization term satisfies
$$\frac{\lambda_c}{2} \sum_{l=0}^{L-1} \| C^{(l)}(t) \|_F^2 \le \mathcal{L}_{\text{total}}^{(t)} \le \mathcal{L}_{\text{total}}^{(0)}$$
by the monotonicity assumption. Isolating the coefficient norm yields the stated bound.
Property (ii): Suppose $(\sigma_k^{(l)})^2 < \varepsilon_{\min}$ at some iteration. By Theorem 9, the gradient vanishes: $\partial \mathcal{L} / \partial \sigma_k^{(l)} = 0$. Consequently, gradient-based optimization cannot decrease $\sigma_k^{(l)}$ further, establishing a soft lower bound at $\sqrt{\varepsilon_{\min}}$.
Property (iii): The Gaussian repulsion penalty increases rapidly when centers approach one another. Configurations violating the stated separation bound contribute at least $\lambda_d \exp\!\left(-\frac{[\tau \sqrt{2 \log(\lambda_d^{-1})}]^2}{2\tau^2}\right) = \lambda_d \cdot \lambda_d = \lambda_d^2$ to the regularization term. Under the loss bound
$$\mathcal{L}_{\text{total}}^{(t)} \le \mathcal{L}_{\text{total}}^{(0)},$$
such configurations occur with probability at most λ d L total ( 0 ) .    □
Remark 2.
These theoretical guarantees ensure well-conditioned optimization dynamics: coefficients remain uniformly bounded (preventing numerical overflow), width parameters stay above the numerical stability threshold, and centers maintain sufficient diversity (avoiding basis function redundancy).
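Two of the quantities appearing in Theorem 11 and its proof can be evaluated directly: the coefficient bound obtained by isolating the Frobenius norm in property (i), and the separation threshold at which the repulsion kernel equals $\lambda_d$ in property (iii). This is a pure-Python sketch; the initial loss value 2.3 is a hypothetical placeholder, and the function names are illustrative.

```python
import math


def coefficient_norm_bound(loss0, lam_c):
    """Bound on the Frobenius norm from (lam_c / 2) * ||C||_F^2 <= loss0."""
    return math.sqrt(2.0 * loss0 / lam_c)


def separation_threshold(tau, lam_d):
    """Distance tau * sqrt(2 log(1 / lam_d)) used in property (iii)."""
    return tau * math.sqrt(2.0 * math.log(1.0 / lam_d))


def repulsion_kernel(distance, tau):
    """Gaussian repulsion exp(-d^2 / (2 tau^2)) between two centers."""
    return math.exp(-distance ** 2 / (2.0 * tau ** 2))


tau, lam_d, lam_c = 0.5, 1e-5, 1e-4       # values from Section 4.6.3
loss0 = 2.3                               # hypothetical initial total loss
c_bound = coefficient_norm_bound(loss0, lam_c)
d_star = separation_threshold(tau, lam_d)
kernel_at_threshold = repulsion_kernel(d_star, tau)
print(c_bound, d_star, kernel_at_threshold)
```

Evaluating the kernel at the threshold recovers $\lambda_d$, so the weighted penalty there is $\lambda_d^2$, matching the identity used in the proof.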

5. Experimental Results

We validate the proposed shared-basis RBF-KAN architecture on two distinct problem domains: supervised image classification and physics-informed neural network (PINN) applications for solving partial differential equations (PDEs). All experiments were conducted on NVIDIA GPU hardware using PyTorch 2.10.0 with automatic mixed precision when applicable.

5.1. Image Classification on MNIST

Experimental Setup

We evaluate KAN variants on the MNIST handwritten digit classification task:
1. Standard KAN: Original KAN implementation with B-spline basis functions and SiLU activation
2. RBF-KAN: Our proposed shared-basis architecture with learnable Gaussian RBFs
Network Architecture. All models employ a three-layer architecture with width configuration $[784, 64, 10]$, where the input dimension corresponds to flattened $28 \times 28$ grayscale images and the output dimension matches the 10 digit classes. For the standard KAN, we use grid resolution $G = 3$ and B-spline order $k = 3$. For the RBF-based variant, we employ $K = 5$ Gaussian basis functions shared across each layer, with noise scale $\epsilon_{\text{noise}} = 0.1$ for robust initialization.
Training Protocol. We train for 1000 epochs using the Adam optimizer with learning rate $\alpha = 10^{-3}$, mini-batch size $B = 256$, and cross-entropy loss:
$$\mathcal{L}_{\mathrm{CE}}(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{10} y_{n,c} \log \hat{y}_{n,c}(\theta),$$
where $y_{n,c} \in \{0, 1\}$ denotes ground-truth labels and $\hat{y}_{n,c}(\theta) = \mathrm{softmax}(f_\theta(x_n))_c$ represents predicted probabilities. Regularization is applied via an $\ell_1$-penalty on coefficients with $\lambda_1 = 10^{-4}$ and no weight decay ($\lambda_{\text{weight}} = 0$). Grid adaptation is disabled (update_grid=False) to isolate the effect of basis function choice.
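The cross-entropy objective above reduces, for one-hot labels, to the mean negative log-probability of the true class. A minimal pure-Python sketch follows; the function names and the toy logits are illustrative, and integer class indices stand in for the one-hot vectors $y_n$.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_logits, labels):
    """Mean negative log-probability of the true class; labels are class indices."""
    loss = 0.0
    for logits, y in zip(batch_logits, labels):
        loss -= math.log(softmax(logits)[y])
    return loss / len(labels)


# With uniform logits over 10 classes the loss equals log(10).
uniform_loss = cross_entropy([[0.0] * 10], [3])
confident_loss = cross_entropy([[10.0] + [0.0] * 9], [0])
print(uniform_loss, confident_loss)
```

The uniform case gives the value an untrained ten-class model would be expected to start near.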

5.2. MNIST Classification with Controlled Architecture

To address the architectural confounding identified in preliminary experiments, we conducted a controlled comparison in which Standard KAN and RBF-KAN employ the identical network width $[784, 64, 10]$ and differ solely in basis function parameterization. Standard KAN utilizes B-spline bases with grid resolution $G = 3$ and spline order $k = 3$, while RBF-KAN employs $K = 5$ shared Gaussian RBFs per layer with initialization noise $\epsilon_{\text{noise}} = 0.1$ and width clamping at $\sigma_{\min} = 10^{-6}$. Both architectures were trained for 1000 epochs using the Adam optimizer ($\alpha = 10^{-3}$, batch size $B = 256$) with cross-entropy loss and coefficient regularization $\lambda_1 = 10^{-4}$. Grid refinement was disabled for Standard KAN to ensure parameter stability throughout training. All experiments were replicated across three random seeds to quantify statistical variance.
Standard KAN achieved a test accuracy of $89.1 \pm 0.2\%$ with a generalization gap of $1.1 \pm 0.2\%$, while RBF-KAN attained $87.8 \pm 0.3\%$ with a gap of $2.7 \pm 0.3\%$, a 1.3 percentage point performance differential. The elevated generalization gap in RBF-KAN reflects the global support structure of Gaussian basis functions, which lack the implicit regularization provided by the compact local support of B-splines. Training accuracy for RBF-KAN marginally exceeded that of Standard KAN, indicating overfitting to the training data. Standard KAN required $7.6 \times 10^{5}$ parameters versus $2.5 \times 10^{5}$ for RBF-KAN, confirming the theoretical threefold reduction from the shared-basis design. Wall-clock training time decreased from $330 \pm 8$ s to $235 \pm 6$ s (29% improvement), with per-batch forward pass latency falling from $18.2 \pm 0.4$ ms to $13.1 \pm 0.3$ ms. Memory consumption decreased by 33%, from 2.4 GB to 1.6 GB, due to the elimination of per-connection basis storage.
The controlled protocol reveals a fundamental accuracy-efficiency trade-off: RBF-KAN sacrifices 1.3 percentage points of test accuracy to achieve threefold parameter reduction and 1.4 × computational speedup. This exchange proves acceptable in resource-constrained deployments or when error margins dominate classification uncertainty. The performance gap does not generalize across problem domains. Domain-specific basis selection should balance regularization requirements against derivative accuracy demands.
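The headline ratios quoted in this subsection follow from the reported means by simple arithmetic. The sketch below recomputes them from the numbers stated in the text; it uses the mean values only and ignores the reported standard deviations.

```python
# Mean values as reported in the controlled MNIST comparison.
time_std, time_rbf = 330.0, 235.0          # wall-clock training time, seconds
lat_std, lat_rbf = 18.2, 13.1              # per-batch forward latency, milliseconds
mem_std, mem_rbf = 2.4, 1.6                # memory consumption, GB
par_std, par_rbf = 7.6e5, 2.5e5            # parameter counts
acc_std, acc_rbf = 89.1, 87.8              # test accuracy, percent

speedup = time_std / time_rbf              # training-time speedup factor
time_saving = 1.0 - time_rbf / time_std    # fractional reduction in training time
latency_ratio = lat_std / lat_rbf          # forward-pass speedup factor
memory_saving = 1.0 - mem_rbf / mem_std    # fractional reduction in memory
param_ratio = par_std / par_rbf            # parameter reduction factor
accuracy_gap = acc_std - acc_rbf           # percentage points

print(speedup, time_saving, latency_ratio, memory_saving, param_ratio, accuracy_gap)
```

Rounded to the precision used in the text, these reproduce the stated speedup, the 29% time saving, the 33% memory saving, the threefold parameter reduction, and the 1.3 point accuracy difference.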

5.2.1. Learning Dynamics and Generalization Analysis

Figure 1 presents a comprehensive analysis of the learning dynamics across the evaluated architectures.
The convergence dynamics reveal several salient properties. All implementations achieve test accuracy exceeding 85% within 200 training epochs, with diminishing marginal improvements beyond epoch 400. The initial learning phase spanning epochs 0–100 exhibits elevated gradient magnitudes, as shown in Panel B, followed by a fine-tuning regime characterized by smaller parameter updates.
Generalization stability varies systematically across architectural variants. The Standard KAN maintains a generalization gap consistently below the 1% threshold (indicated by the dashed line in Panel C) throughout the training trajectory. Since grid adaptation is disabled in this protocol, this behavior can be attributed to the localized support structure of B-spline basis functions. Conversely, RBF-based variants exhibit persistent generalization gaps exceeding 2%, suggesting that the global support of Gaussian basis functions may necessitate explicit capacity regularization to achieve comparable generalization performance.
The distributional analysis of test accuracy presented in Panel D provides additional insight into optimization stability. The Standard KAN exhibits a narrower interquartile range, indicating more consistent epoch-to-epoch performance. The RBF-KAN variant displays occasional outlier epochs (represented by isolated points below the box plot), which we hypothesize arise from gradient instabilities occurring when basis function centers become misaligned with the input data distribution. This phenomenon underscores the sensitivity of center parameters to initialization and learning rate schedules, as formalized in Theorems 8 and 11.

5.2.2. Architectural Implications

The preceding analysis reveals three principal trade-offs in RBF-KAN design. The RBF-KAN achieves approximately threefold parameter reduction relative to full B-spline parameterization, yielding 1.4 × computational speedup through the shared-basis design wherein μ ( l ) and σ ( l ) are maintained globally per layer. The asymptotic complexity O ( B W K ) established in Equation (35) confirms linear scaling with basis count K.
The Standard KAN achieves 89.1% test accuracy versus 87.8% for RBF, an absolute 1.3% degradation attributable to differing basis support structures. B-splines provide implicit regularization through compact local support, while Gaussian RBFs exhibit global influence that may require explicit capacity control beyond Equation (49). Hybrid strategies combining local and global bases warrant investigation.
The initialization noise scale $\epsilon_{\text{noise}} = 0.1$ admits a narrow stable regime: values exceeding 0.2 induce early gradient instabilities (Theorems 8 and 9), while values below 0.05 cause width collapse toward $\varepsilon_{\min}$, yielding inactive basis functions. This sensitivity confirms the necessity of the width regularization term $\lambda_\sigma \sum_{k,l} (\sigma_k^{(l)})^{-2}$ in maintaining gradient signal (Theorem 11).

5.2.3. Comparison with Alternative Efficient KAN Variants

To position RBF-KAN within the parameter-efficient KAN landscape, we analyze three recent variants addressing the $O(3KW)$ complexity: WavKAN [45] (wavelets), fKAN [46] (fractional Jacobi polynomials), and rKAN [47] (rational functions). While competitive on certain benchmarks, their mathematical properties constrain applicability to physics-informed neural networks requiring high-order derivative accuracy.
Gaussian RBFs belong to the Schwartz space $\mathcal{S}(\mathbb{R})$ with $C^{\infty}(\mathbb{R})$ smoothness. The $n$-th derivative admits the closed form
$$\frac{d^n}{dx^n} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) = \frac{(-1)^n}{\sigma^n}\, \mathrm{He}_n\!\left(\frac{x - \mu}{\sigma}\right) \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right),$$
where $\mathrm{He}_n$ denotes the probabilists' Hermite polynomial. This polynomial-times-Gaussian structure enables exact computation of arbitrarily high-order derivatives without truncation error, which is essential for hyperbolic PDEs requiring precise second-order derivatives $\partial_{xx}^2 u$ and $\partial_{tt}^2 u$ governing wave propagation.
Daubechies wavelets $\psi_N$ with $N$ vanishing moments possess Hölder regularity $\alpha \approx 0.2N$ for large $N$ [48,49], with measured values $\alpha \approx 0.55$ (DB2), $\alpha \approx 0.8$ (DB4), and $\alpha \approx 1.2$ (DB6). Since $\alpha < k$ implies $\psi \notin C^k(\mathbb{R})$, the commonly used DB2 and DB4 fail to achieve $C^1$ smoothness, introducing derivative discontinuities that manifest as spurious artifacts in PDE solutions. DB6 achieves $C^1$ but not $C^2$, which is insufficient for second-order hyperbolic equations. This limited smoothness constrains accurate approximation of oscillatory solutions $u(x,t) = \sin(\pi x)\cos(c\pi t)$, whose Fourier spectrum contains infinitely many non-zero frequencies requiring smooth bases for reconstruction without artificial damping.
Fractional Jacobi functions $(1-x)^{\alpha}(1+x)^{\beta} P_n^{(\alpha,\beta)}(x)$ remain polynomial in $x$ with degree $n$ despite fractional weight exponents $\alpha, \beta > -1$. Derivatives of order $m > n$ satisfy $d^m P_n / dx^m = 0$, fundamentally limiting the representation of functions with infinitely many non-zero Taylor coefficients. For wave solutions $u(x,t) = \sin(\pi x)\cos(c\pi t)$ with infinite Taylor expansion, polynomial degree $n \ge 10$ is required to achieve error below $10^{-1}$, yielding parameter count $\Theta(10W)$ that defeats the efficiency motivation. While fractional $(\alpha, \beta)$ modulates the weight distribution, it cannot overcome the $C^n$ smoothness bound, where derivatives exceeding degree $n$ contribute zero approximation capacity.
Rational functions $R(x) = P(x)/Q(x)$ introduce singularities at $Z = \{x : Q(x) = 0\}$.
The gradient $\partial R / \partial x = [P'Q - PQ'] / Q^2$ exhibits quadratic blow-up $\|\nabla R\| \sim 1/|Q|^2$ near zeros, necessitating clamping $Q_{\text{clamp}} = \max(Q, \epsilon_{\min})$ or penalties $\lambda_Q \|Q\|_{L^2}^{-2}$ that defeat the approximation advantage of rational functions. This conflicts with the uniform smoothness requirements of PDE solutions $u \in H^s(\Omega)$ on compact domains, introducing gradient instabilities during iterative residual minimization, particularly when true solutions contain no singularities. Given the absence of controlled comparisons on identical configurations, Table 1 provides estimates combining measured values with theoretical predictions based on smoothness constraints.
MNIST estimates reflect that wavelet multi-resolution analysis provides superior inductive bias for local edge detection (approximately 1.5% advantage over RBF-KAN), while rational functions suffer conditioning challenges under stochastic training. Wave PDE estimates quantify the impact of smoothness: wavelets with $\alpha < 2$ cannot accurately represent
$$\partial_{xt}^2 u^* = -c\pi^2 \cos(\pi x)\sin(c\pi t)$$
without artificial dissipation, violating energy conservation. Fractional Jacobi polynomials truncate at degree $n$ with error $O(h^{n+1})$, requiring $n \ge 10$ for sub-$10^{-1}$ accuracy on oscillatory solutions. Rational functions exhibit gradient explosions near $Z$ during residual minimization.

5.3. Global Support and Generalization: Theoretical Analysis

The observed generalization gap disparity requires further investigation. We establish that this phenomenon arises from fundamental differences in basis function support structure and quantify mitigation strategies through systematic ablation.

5.3.1. Support Structure and Effective Capacity

For a basis function $\psi : \mathbb{R} \to \mathbb{R}$, we define the $\epsilon$-effective support as $\operatorname{supp}_\epsilon(\psi) = \{x \in \mathbb{R} : |\psi(x)| \ge \epsilon \cdot \|\psi\|_\infty\}$, where $\epsilon > 0$ quantifies the threshold for non-negligible contribution. B-splines of order $k$ with support interval $[a,b]$ exhibit compact support, $\operatorname{supp}_\epsilon(\psi_{\text{B-spline}}) \subseteq [a,b]$, with width at most $O(h_\Omega)$ regardless of the threshold $\epsilon$. Conversely, the Gaussian RBF $\psi_{\text{Gauss}}(x; \mu, \sigma) = \exp(-(x-\mu)^2/(2\sigma^2))$ satisfies $|\psi(x)| \ge \epsilon$ when $|x - \mu| \le \sigma\sqrt{2\log(1/\epsilon)}$. For the numerical precision threshold $\epsilon = 10^{-3}$, the effective support width satisfies $|\operatorname{supp}_\epsilon(\psi_{\text{Gauss}})| = 2\sigma\sqrt{2\log(10^3)} = 2\sigma\sqrt{13.815} \approx 7.4\sigma$. With $\sigma \sim h_\Omega$ as prescribed by Theorem 3 and $K = 5$ basis functions quasi-uniformly distributed, so that $h_\Omega \approx \operatorname{diam}(\Omega)/K$, each Gaussian maintains non-negligible influence across $7.4\,h_\Omega \approx 7.4\operatorname{diam}(\Omega)/K \approx 1.5 \cdot \operatorname{diam}(\Omega)$, demonstrating substantial overlap between adjacent centers. This global support structure induces capacity amplification through simultaneous activation of all $K$ basis functions throughout the domain $\Omega$. For local-support bases, only an $O(K^{-1})$ fraction of the basis functions contributes non-negligibly at any point $x$, yielding effective dimensionality $d_{\text{eff}} = O(1)$ per evaluation. Global-support Gaussians with $7.4\,h_\Omega$ spread achieve $d_{\text{eff}} = K$, since all bases contribute simultaneously.
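The effective-support arithmetic above can be checked with a few lines of Python. This is a minimal sketch rather than the authors' code: the helper names, the unit interval used as the domain, and the choice $\sigma = h_\Omega = 1/K$ with centers at cell midpoints are illustrative assumptions.

```python
import math

def effective_halfwidth(sigma, eps):
    """Largest |x - mu| at which exp(-(x - mu)^2 / (2 sigma^2)) is still >= eps."""
    return sigma * math.sqrt(2.0 * math.log(1.0 / eps))

def gaussian_rbf(x, mu, sigma):
    """Unnormalized Gaussian radial basis function."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

eps = 1e-3
# Full effective-support width, expressed in units of sigma.
width_in_sigmas = 2.0 * effective_halfwidth(1.0, eps)

# Assumed layout: K = 5 centers at cell midpoints of [0, 1], sigma = h = 1/K.
K = 5
h = 1.0 / K
centers = [(k + 0.5) * h for k in range(K)]
x_mid = 0.5
# Count bases whose value at the domain midpoint exceeds the threshold.
active = sum(1 for mu in centers if gaussian_rbf(x_mid, mu, h) >= eps)

print(round(width_in_sigmas, 2), active)  # prints: 7.43 5
```

Under these assumptions every one of the five Gaussians is above threshold at the midpoint of the domain, which is the $d_{\text{eff}} = K$ situation described in the text.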

5.3.2. Regularization Mitigation and Empirical Ablation

The tripartite regularization scheme of Equation (48) addresses capacity amplification through coefficient sparsity ($\lambda_c = 10^{-4}$), width stabilization ($\lambda_\sigma = 10^{-3}$), and center diversity enforcement ($\lambda_d = 5\times10^{-4}$). The $\ell_1$ penalty $\lambda_c \sum_{i,j,k} |c_{i,j,k}^{(l)}|$ induces sparse linear combinations, effectively reducing the number of simultaneously active basis functions per connection and directly counteracting the $d_{\text{eff}} = K$ effective dimensionality. The inverse-variance penalty $\lambda_\sigma \sum_{k,l} (\sigma_k^{(l)})^{-2}$ discourages excessively narrow Gaussians that would amplify ill-conditioning (Theorem 4) while maintaining sufficient width $\sigma \sim h_\Omega$ for approximation quality (Theorem 3). The Gaussian repulsion term $\lambda_d \sum_{k<k'} \exp(-(\mu_k - \mu_{k'})^2/(2\tau^2))$ with characteristic scale $\tau = 0.5$ prevents basis collapse, wherein multiple centers concentrate in narrow regions. We conducted systematic ablation studies, removing each component individually on MNIST with a controlled architecture $[784, 64, 10]$. Removing coefficient regularization ($\lambda_c = 0$) produces the most severe degradation, increasing the generalization gap by 59% from 2.7% to 4.3% and reducing test accuracy from 87.8% to 86.2%. Post-training coefficient analysis reveals that without the $\ell_1$ penalty, 94% of coefficients exceed the threshold $|c_{i,j,k}| > 10^{-3}$ compared with 67% for the baseline, confirming the loss of sparsity. Removing width stabilization ($\lambda_\sigma = 0$) increases the gap to 3.1% (15% relative increase), with training exhibiting occasional gradient instabilities after epoch 700 when individual widths collapse below $3\times10^{-6}$ even with clamping at $\sigma_{\min} = 10^{-6}$. Removing center diversity increases the gap to 3.4%, with post-training analysis revealing that three of five centers in the first hidden layer concentrate within a radius of 0.08 in standardized activation units, leaving tail regions under-represented. We investigated two alternative regularization strategies proposed in the literature.
Dropout with rate $p = 0.2$ applied to hidden layer activations achieves a test accuracy of 88.1% with a generalization gap of 2.4%, a modest 0.3 percentage point improvement over the baseline. However, dropout introduces a training time overhead of 34% and severely degrades physics-informed applications, with the wave equation $L^2$ error increasing from $8.0\times10^{-4}$ to $1.9\times10^{-3}$ because stochastic masking interferes with smooth derivative computation. Weight decay $\lambda_{\text{wd}} = 10^{-4}$ applied to all parameters achieves a test accuracy of 88.0% with a gap of 2.5%, but post-training inspection reveals a systematic bias wherein learned centers concentrate near zero rather than covering the full observed activation range $[-1.2, 1.5]$, leaving tail regions poorly approximated. The tripartite regularization achieves the best balance among the tested options, with a 2.7% generalization gap, no training overhead, and preserved PDE accuracy.
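The three penalty terms described in this subsection can be sketched as follows. The function name and the flat-list inputs are hypothetical simplifications (Equation (48) sums over layers and connections, which are flattened here); the default weights and $\tau$ are the values quoted above.

```python
import math

def tripartite_penalty(coeffs, sigmas, centers,
                       lam_c=1e-4, lam_sigma=1e-3, lam_d=5e-4, tau=0.5):
    """Return the (l1, inverse-variance, repulsion) penalty terms.

    coeffs, sigmas, centers are flat lists; this is an illustrative
    simplification of the per-layer sums in the text.
    """
    # l1 sparsity penalty on the per-connection coefficients.
    l1 = lam_c * sum(abs(c) for c in coeffs)
    # Inverse-variance penalty: grows rapidly as any width shrinks.
    inv_var = lam_sigma * sum(s ** -2 for s in sigmas)
    # Gaussian repulsion over unordered pairs of centers (k < k').
    repulsion = lam_d * sum(
        math.exp(-((centers[a] - centers[b]) ** 2) / (2.0 * tau ** 2))
        for a in range(len(centers))
        for b in range(a + 1, len(centers))
    )
    return l1, inv_var, repulsion

# Spread-out centers are penalized less than collapsed ones.
spread = tripartite_penalty([1.0, -2.0, 3.0], [1.0, 0.5], [-1.0, 0.0, 1.0])
collapsed = tripartite_penalty([1.0, -2.0, 3.0], [1.0, 0.5], [0.0, 0.0, 0.0])
print(spread[2] < collapsed[2])  # prints: True
```

The comparison at the end illustrates why the repulsion term discourages basis collapse: three coincident centers incur the maximal pairwise penalty of $3\lambda_d$.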

5.4. Comparison with Standard MLP Baseline

In order to contextualize RBF-KAN within classical neural architecture paradigms, we compare it against standard multilayer perceptrons (MLPs) with ReLU activations, representing the dominant fixed-activation baseline in deep learning. Both architectures employ identical network width [ 784 , 64 , 10 ] and training protocol (Adam optimizer with α = 10 3 , batch size B = 256 , 1000 epochs) to ensure a controlled comparison. The standard MLP requires only 5.1 × 10 4 parameters (50,890 total: 784 × 64 weights + 64 biases in layer 1, plus 64 × 10 weights + 10 biases in layer 2), achieving substantially higher test accuracy of 97.5 ± 0.2 % with a negligible generalization gap of 0.4 % . Training completes in 155 s, representing a 1.5× speedup over RBF-KAN (235 s) due to the absence of basis function evaluations—forward propagation reduces to matrix multiplications followed by elementwise ReLU operations max ( 0 , x ) . Memory consumption of 0.3 GB (versus 1.6 GB for RBF-KAN) reflects the minimal parameter count and absence of learnable activation storage. The superior MLP performance on MNIST demonstrates that for classification tasks with relatively simple decision boundaries and large training sets (60,000 samples), fixed ReLU activations provide sufficient expressiveness without requiring adaptive univariate functions. However, this performance hierarchy reverses for physics-informed neural network applications demanding accurate derivative computation. Section 5.2 establishes that RBF-KAN achieves 19.3× improvement over Standard KAN on hyperbolic wave equations (maximum pointwise error 0.052 versus 1.00 ), attributed to the infinite differentiability of Gaussian basis functions. 
Standard MLPs with ReLU activations fail catastrophically on such tasks: preliminary experiments on the one-dimensional wave equation $\partial^2 u/\partial t^2 = c^2\,\partial^2 u/\partial x^2$ with MLP architecture $[2, 64, 64, 1]$ yield an $L^2$ error of $5.0\times10^{-2}$, nearly two orders of magnitude worse than RBF-KAN ($8.0\times10^{-4}$). This degradation arises from derivative discontinuities at the ReLU kink point ($x = 0$), where $\mathrm{ReLU}'(x)$ exhibits a jump discontinuity and $\mathrm{ReLU}''(x)$ vanishes everywhere except at the origin. These pathologies propagate through automatic differentiation when computing PDE residuals $\mathcal{N}[u_\theta]$, which require second-order spatial and temporal derivatives for hyperbolic equations. For supervised learning on well-structured datasets with ample training data, standard MLPs achieve favorable accuracy-efficiency trade-offs through simple fixed activations and mature optimization techniques. For scientific computing applications requiring smooth function approximation and accurate high-order derivatives (including PDE solving, surrogate modeling of dynamical systems, and physics-informed inverse problems), the learnable Gaussian activations of RBF-KAN provide the required infinite differentiability at the cost of a modest classification accuracy penalty. This domain-specific performance dichotomy underscores the value of adaptive activation functions for specialized applications where derivative accuracy matters more than raw classification performance.
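The MLP parameter count and the training-time ratio quoted in this comparison follow from simple arithmetic, sketched below. The helper name is an illustrative assumption; the widths and timings are those reported in the text.

```python
def mlp_param_count(widths):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))

# 784*64 + 64 in layer 1, plus 64*10 + 10 in layer 2.
total = mlp_param_count([784, 64, 10])
# Reported training times: 235 s (RBF-KAN) versus 155 s (MLP).
speedup = 235.0 / 155.0

print(total, round(speedup, 1))  # prints: 50890 1.5
```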

5.5. Physics-Informed Neural Networks for PDE Solving

We evaluate the proposed RBF-KAN architecture on three canonical PDE problems representing distinct mathematical characteristics: elliptic (steady-state), parabolic (diffusion), and hyperbolic (wave propagation). This benchmark confirms the ability of the networks to approximate solutions satisfying both differential equations and boundary/initial conditions without explicit supervision on interior points.

5.5.1. General PINN Formulation

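Considering a PDE of the form $\mathcal{N}[u](x) = f(x)$ on a domain $\Omega$ with boundary $\partial\Omega$, the PINN loss decomposes as
$$\mathcal{L}_{\text{PINN}}(\theta) = w_{\text{PDE}} \cdot \underbrace{\mathcal{L}_{\text{PDE}}}_{\text{Interior residual}} + w_{\text{BC}} \cdot \underbrace{\mathcal{L}_{\text{BC}}}_{\text{Boundary conditions}} + w_{\text{IC}} \cdot \underbrace{\mathcal{L}_{\text{IC}}}_{\text{Initial conditions}}$$
where
$$\begin{aligned} \mathcal{L}_{\text{PDE}} &= \frac{1}{N_{\text{int}}} \sum_{i=1}^{N_{\text{int}}} \left| \mathcal{N}[u_\theta](x_i) - f(x_i) \right|^2, & x_i &\in \Omega, \\ \mathcal{L}_{\text{BC}} &= \frac{1}{N_{\text{bc}}} \sum_{j=1}^{N_{\text{bc}}} \left| u_\theta(x_j) - g(x_j) \right|^2, & x_j &\in \partial\Omega, \\ \mathcal{L}_{\text{IC}} &= \frac{1}{N_{\text{ic}}} \sum_{k=1}^{N_{\text{ic}}} \left| u_\theta(x_k, 0) - u_0(x_k) \right|^2, & x_k &\in \Omega. \end{aligned}$$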
The weights { w PDE , w BC , w IC } are adaptively scheduled during training. All PINN experiments employ the architectures presented in Table 2.
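The deeper RBF architecture compensates for reduced per-neuron expressiveness compared with adaptive B-splines. PDE residuals require automatic differentiation:
$$\frac{\partial u_\theta}{\partial x_i} = \sum_j \frac{\partial u_\theta}{\partial \theta_j} \frac{\partial \theta_j}{\partial x_i}, \qquad \frac{\partial^2 u_\theta}{\partial x_i \partial x_j} = \sum_{k,\ell} \frac{\partial^2 u_\theta}{\partial \theta_k \partial \theta_\ell} \frac{\partial \theta_k}{\partial x_i} \frac{\partial \theta_\ell}{\partial x_j}$$
which is computed with the PyTorch autograd engine, using create_graph=True so that second-order derivatives can be obtained.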
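How the three mean-squared terms combine into the weighted PINN loss of this subsection can be sketched in plain Python. This is an illustrative sketch only: the function names and argument layout are hypothetical, and the residual values would in practice come from automatic differentiation of the network rather than from hand-written lists.

```python
def mean_squared(values, targets):
    """Mean of squared differences, as in each loss component."""
    return sum((v - t) ** 2 for v, t in zip(values, targets)) / len(values)

def pinn_loss(pde_lhs, pde_rhs, u_bc, g_bc, u_ic, u0_ic,
              w_pde=1.0, w_bc=1.0, w_ic=1.0):
    """Weighted sum of interior, boundary, and initial-condition losses."""
    parts = {
        "pde": mean_squared(pde_lhs, pde_rhs),  # N[u_theta](x_i) vs f(x_i)
        "bc": mean_squared(u_bc, g_bc),         # u_theta(x_j) vs g(x_j)
        "ic": mean_squared(u_ic, u0_ic),        # u_theta(x_k, 0) vs u_0(x_k)
    }
    total = w_pde * parts["pde"] + w_bc * parts["bc"] + w_ic * parts["ic"]
    return total, parts

# Toy numbers chosen only to make the arithmetic easy to follow.
total, parts = pinn_loss([1.0, 2.0], [1.0, 4.0], [0.5], [0.0],
                         [1.0, 1.0], [1.0, 0.0],
                         w_pde=0.1, w_bc=1.0, w_ic=50.0)
print(parts)
```

Returning the individual components alongside the total makes it straightforward to apply a time-dependent weighting schedule and to log each term separately.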

5.5.2. Case 1: Elliptic PDE

We consider the Poisson equation on the square domain $\Omega = [-1, 1]^2$ with homogeneous Dirichlet boundary conditions, specifically solving
$$-\Delta u = 2\pi^2 \sin(\pi x)\sin(\pi y), \qquad (x, y) \in \Omega,$$
subject to $u(x, y) = 0$ on $\partial\Omega$. The analytical solution $u(x, y) = \sin(\pi x)\sin(\pi y)$ provides a smooth test function in $C^\infty(\bar{\Omega}) \cap H^s(\Omega)$ for all $s \ge 0$, enabling precise error quantification. The collocation strategy distributes $N_i = 441$ interior points on a uniform $21 \times 21$ grid covering $\Omega$, while $N_b = 84$ boundary points sample $\partial\Omega$ uniformly with 21 points per edge. This discretization ensures adequate resolution of the spatial structure of the solution while maintaining computational tractability. Training dynamics over 2000 iterations reveal superior convergence for the RBF-KAN architecture.
Figure 2 shows the evolution of loss components on semi-logarithmic axes. The PDE residual loss for RBF-KAN exhibits monotonic exponential decay, stabilizing at $\mathcal{L}_{\text{PDE}} \approx 2.0\times10^{-1}$ compared with $\mathcal{L}_{\text{PDE}} \approx 8.0$ for the standard KAN, a 40-fold improvement. The boundary condition loss demonstrates even more pronounced separation: RBF-KAN achieves $\mathcal{L}_{\text{BC}} \approx 4.0\times10^{-3}$ versus $\mathcal{L}_{\text{BC}} \approx 4.0\times10^{-1}$ for the standard KAN, a 100-fold reduction. Most significantly, the $L^2$ error against the analytical solution converges to $\|u_\theta - u_{\text{exact}}\|_{L^2} \approx 2.0\times10^{-3}$ for RBF-KAN, contrasting with $9.0\times10^{-2}$ for the standard KAN.
The solution quality visualization in Figure 3 shows the approximation fidelity in spatial detail. The top row displays the analytical solution alongside predictions from both architectures, showing a qualitatively accurate reproduction of the sinusoidal structure. Quantitative analysis through pointwise error distributions (bottom row) reveals substantial differences: the RBF-KAN solution maintains maximum error $\max_{(x,y)\in\Omega} |u_\theta(x,y) - u(x,y)| = 1.82\times10^{-1}$, while the standard KAN exhibits $\max |u_\theta - u| = 1.32$, a 7.25-fold accuracy advantage. The error heatmaps demonstrate that RBF-KAN produces uniformly distributed low-magnitude errors across the entire domain, whereas the standard KAN concentrates errors near domain boundaries and at the sinusoidal extrema, suggesting insufficient representational capacity for capturing the geometric features of the solution.
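The manufactured solution and the collocation counts of this elliptic case can be verified independently of any network. The sketch below is an illustrative check, not part of the reported experiments: it uses a five-point finite-difference Laplacian (with an assumed step size) in place of automatic differentiation, and it assumes the $21 \times 21$ grid includes the edges of $\Omega = [-1,1]^2$ and that corner points are counted once per edge.

```python
import math

def u_exact(x, y):
    return math.sin(math.pi * x) * math.sin(math.pi * y)

def f_rhs(x, y):
    return 2.0 * math.pi ** 2 * math.sin(math.pi * x) * math.sin(math.pi * y)

def neg_laplacian_fd(u, x, y, h=1e-3):
    """Five-point central-difference approximation of -Laplacian(u) at (x, y)."""
    return -(u(x + h, y) + u(x - h, y) + u(x, y + h) + u(x, y - h)
             - 4.0 * u(x, y)) / h ** 2

n = 21
grid = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
# 21 x 21 collocation grid over the square.
collocation = [(x, y) for x in grid for y in grid]
# 21 points on each of the four edges.
boundary = ([(x, -1.0) for x in grid] + [(x, 1.0) for x in grid]
            + [(-1.0, y) for y in grid] + [(1.0, y) for y in grid])

max_residual = max(abs(neg_laplacian_fd(u_exact, x, y) - f_rhs(x, y))
                   for x, y in collocation)
max_boundary = max(abs(u_exact(x, y)) for x, y in boundary)

print(len(collocation), len(boundary))  # prints: 441 84
```

Both `max_residual` and `max_boundary` come out close to zero, confirming that $u = \sin(\pi x)\sin(\pi y)$ satisfies the equation with the sign convention $-\Delta u = f$ and vanishes on $\partial\Omega$.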

5.5.3. Case 2: Parabolic PDE

We consider the heat equation with homogeneous Dirichlet boundary conditions, defined as
$$\frac{\partial u}{\partial t} = \alpha \frac{\partial^2 u}{\partial x^2}, \qquad (x, t) \in \Omega = [0, 1] \times [0, 1],$$
with thermal diffusivity $\alpha = 0.1$, subject to the boundary and initial conditions
$$u(0, t) = u(1, t) = 0, \quad t \in [0, 1], \qquad u(x, 0) = \sin(\pi x), \quad x \in [0, 1].$$
The analytical solution $u(x, t) = \exp(-\pi^2 \alpha t)\sin(\pi x)$ exhibits exponential temporal decay combined with sinusoidal spatial structure, challenging the network to capture coupled spatiotemporal dynamics. The increased complexity of spatiotemporal coupling motivates deeper architectures: both networks employ widths $[2, 8, 8, 1]$, with two hidden layers of eight neurons each. Collocation employs $N_i = 3600$ interior points on a $60 \times 60$ uniform grid in $(x, t)$-space, $N_b = 120$ spatial boundary points, and $N_0 = 60$ initial condition points at $t = 0$.
Parabolic problems require careful loss balancing to prevent the PDE residual from decreasing too rapidly before initial conditions are satisfied. We employ an adaptive weighting schedule:
$$w_{\text{IC}}(t) = \max\{10,\, 50(1 - t/T)\}, \qquad w_{\text{BC}} = 1, \qquad w_{\text{PDE}}(t) = \begin{cases} 0.1, & t < 100, \\ 0.5, & t \ge 100. \end{cases}$$
This approach emphasizes initial and boundary conditions during the first 100 iterations, then progressively shifts focus to the PDE residual once these constraints are sufficiently satisfied.
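The schedule above translates directly into a small function of the iteration index. In this sketch the function name is hypothetical, and $T$ is assumed to be the total number of training iterations (1000 in this experiment), which the schedule itself does not state explicitly.

```python
def parabolic_weights(t, T):
    """Loss weights (w_ic, w_bc, w_pde) at iteration t of T total iterations."""
    w_ic = max(10.0, 50.0 * (1.0 - t / T))  # decays linearly, floored at 10
    w_bc = 1.0                              # constant
    w_pde = 0.1 if t < 100 else 0.5         # step change at iteration 100
    return w_ic, w_bc, w_pde

for t in (0, 100, 1000):
    print(t, parabolic_weights(t, 1000))
```

With $T = 1000$ the initial-condition weight reaches its floor of 10 at iteration 800, so the late phase of training runs with fixed weights.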
Training convergence over 1000 iterations, displayed in Figure 4, demonstrates comparable final performance between architectures with distinct dynamical characteristics. Both methods converge to a PDE residual loss $\mathcal{L}_{\text{PDE}} \approx 2\times10^{-2}$, though RBF-KAN exhibits smoother monotonic descent compared with the oscillatory trajectory of the standard KAN. Boundary condition losses reach $\mathcal{L}_{\text{BC}} \approx 10^{-3}$ for both approaches, indicating successful constraint enforcement. The $L^2$ errors stabilize within the range $[3\times10^{-4}, 5\times10^{-4}]$, demonstrating high-fidelity approximation of the exponential decay profile.
Pointwise accuracy analysis in Figure 5 quantifies the approximation quality through maximum error metrics. Standard KAN attains
$$\max_{(x,t)\in\Omega} |u_\theta(x,t) - u(x,t)| = 2.18\times10^{-1},$$
while RBF-KAN achieves
$$\max |u_\theta - u| = 1.03\times10^{-1},$$
representing a 2.12-fold improvement. The spatial error distributions reveal that RBF-KAN maintains more uniform accuracy throughout the diffusion process, particularly in regions of rapid temporal variation where the exponential decay is most pronounced. Both networks successfully capture the separation-of-variables structure inherent in the analytical solution, though RBF-KAN better preserves the precise decay rate of the exponential envelope.

5.5.4. Case 3: Hyperbolic PDE

Consider the one-dimensional wave equation:
$$\begin{aligned} \frac{\partial^2 u}{\partial t^2} &= c^2 \frac{\partial^2 u}{\partial x^2}, & (x, t) &\in (0, 1) \times (0, 1), \\ u(0, t) &= u(1, t) = 0, & t &\in [0, 1], \\ u(x, 0) &= \sin(\pi x), & x &\in [0, 1], \\ \frac{\partial u}{\partial t}(x, 0) &= 0, & x &\in [0, 1], \end{aligned}$$
with wave speed $c = 1$ and analytic solution:
$$u^*(x, t) = \sin(\pi x)\cos(c\pi t).$$
Network architectures employ widths $[2, 8, 8, 1]$ to accommodate spatiotemporal coupling complexity. The collocation strategy distributes $N_i = 1600$ interior points on a $40 \times 40$ grid, $N_b = 80$ spatial boundary points, and $N_0 = 40$ initial condition points for both position and velocity specifications. Hyperbolic equations demand dual initial conditions on $u$ and $\partial_t u$, necessitating separate loss terms
$$\mathcal{L}_{\text{IC}_1} = N_0^{-1} \sum_{k=1}^{N_0} \left| u_\theta(x_k, 0) - \sin(\pi x_k) \right|^2 \quad \text{and} \quad \mathcal{L}_{\text{IC}_2} = N_0^{-1} \sum_{k=1}^{N_0} \left| \partial_t u_\theta(x_k, 0) \right|^2.$$
The aggressive weighting schedule $w_{\text{IC}_1} = w_{\text{IC}_2} = \max(20,\, 100(1 - t/T))$, $w_{\text{BC}} = 5$, and $w_{\text{PDE}}(t) = 0.05$ for $t < 200$, transitioning to $w_{\text{PDE}}(t) = 0.5$ for $t \ge 200$, reflects the importance of initial condition enforcement for hyperbolic stability. The dual initial conditions receive maximal weight during early training, gradually decreasing as they become satisfied, while the PDE residual weight undergoes a delayed warmup to prevent premature equation enforcement before the constraints are established.
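The problem setup for this hyperbolic case can be sanity-checked without a network. The sketch below is illustrative only: it substitutes central finite differences (with an assumed step size) for automatic differentiation, the function names are hypothetical, and the total iteration count passed to the schedule is a free parameter because it is not stated in this subsection.

```python
import math

C = 1.0  # wave speed used in the experiment

def u_star(x, t, c=C):
    """Analytic standing-wave solution."""
    return math.sin(math.pi * x) * math.cos(c * math.pi * t)

def wave_residual_fd(u, x, t, c=C, h=1e-3):
    """Central-difference approximation of u_tt - c^2 u_xx at (x, t)."""
    u_tt = (u(x, t + h) - 2.0 * u(x, t) + u(x, t - h)) / h ** 2
    u_xx = (u(x + h, t) - 2.0 * u(x, t) + u(x - h, t)) / h ** 2
    return u_tt - c ** 2 * u_xx

def velocity_fd(u, x, h=1e-3):
    """Central-difference approximation of u_t at t = 0."""
    return (u(x, h) - u(x, -h)) / (2.0 * h)

def hyperbolic_weights(it, total):
    """Loss weights (w_ic, w_bc, w_pde) at iteration `it` of `total`."""
    w_ic = max(20.0, 100.0 * (1.0 - it / total))  # shared by both IC terms
    w_bc = 5.0
    w_pde = 0.05 if it < 200 else 0.5             # delayed PDE warmup
    return w_ic, w_bc, w_pde

n = 40
pts = [(i + 0.5) / n for i in range(n)]           # 40 x 40 interior grid
max_res = max(abs(wave_residual_fd(u_star, x, t)) for x in pts for t in pts)
max_vel = max(abs(velocity_fd(u_star, x)) for x in pts)
print(len(pts) ** 2, max_res < 1e-4, max_vel < 1e-9)  # prints: 1600 True True
```

The zero initial velocity is what the second loss term $\mathcal{L}_{\text{IC}_2}$ enforces; omitting it would leave the wave problem under-determined.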
Figure 6 demonstrates the performance advantage of RBF-KAN on this problem class. The PDE residual loss converges to $\mathcal{L}_{\text{PDE}} \approx 1\times10^{-2}$ for RBF-KAN versus $4\times10^{-2}$ for the standard KAN, a four-fold improvement. Boundary condition satisfaction reaches $\mathcal{L}_{\text{BC}} \approx 1\times10^{-3}$ for RBF-KAN compared with $9\times10^{-3}$ for the standard KAN, a nine-fold reduction. Most significantly, the $L^2$ error reaches $\|u_\theta - u_{\text{exact}}\|_{L^2} \approx 8\times10^{-4}$ for RBF-KAN versus $5\times10^{-3}$ for the standard KAN, a 6.25-fold accuracy gain.
Figure 7 shows the differences in approximation fidelity. The RBF-KAN solution accurately reproduces the standing wave pattern with maximum pointwise error $\max_{(x,t)\in\Omega} |u_\theta(x,t) - u(x,t)| = 5.18\times10^{-2}$, while the standard KAN exhibits larger deviations with $\max |u_\theta - u| = 1.00$, yielding a 19.3-fold accuracy advantage for RBF-KAN. The spatial error distributions show that RBF-KAN maintains uniformly low errors throughout the spatiotemporal domain, successfully preserving the oscillatory amplitude without artificial damping. In contrast, the standard KAN introduces spurious amplitude decay and phase errors, particularly in later temporal stages, suggesting insufficient capacity to represent high-frequency oscillations with piecewise polynomial basis functions. The superior performance on hyperbolic problems can be attributed to the smooth, infinitely differentiable nature of Gaussian RBFs, which naturally accommodate the oscillatory behavior characteristic of wave phenomena without introducing the spurious numerical dissipation inherent in low-order polynomial approximations.

5.6. Convergence Dynamics and Initialization Sensitivity

Training progression curves (Figure 4 and Figure 6) reveal distinct convergence characteristics across PDE classes. For the parabolic heat equation, both architectures achieve comparable final metrics after 1000 iterations: $\mathcal{L}_{\text{PDE}} \approx 2\times10^{-2}$, $\mathcal{L}_{\text{BC}} \approx 10^{-3}$, and $L^2$ error within $[3\times10^{-4}, 5\times10^{-4}]$. RBF-KAN exhibits smooth monotonic descent, while the standard KAN follows an oscillatory trajectory, attributed to adaptive grid refinement in B-spline implementations introducing transient loss landscape perturbations. The hyperbolic wave equation demonstrates pronounced RBF-KAN advantages: $\mathcal{L}_{\text{PDE}} = 1\times10^{-2}$ versus $4\times10^{-2}$ (fourfold), $\mathcal{L}_{\text{BC}} = 1\times10^{-3}$ versus $9\times10^{-3}$ (ninefold), and $L^2$ error $8\times10^{-4}$ versus $5\times10^{-3}$ (6.25-fold improvement). Adaptive weighting schedules prove essential: parabolic problems employ $w_{\text{IC}}(t) = \max\{10,\, 50(1 - t/T)\}$ with a $w_{\text{PDE}}$ transition from 0.1 to 0.5 at iteration 100, while hyperbolic problems require the more aggressive initial weighting $w_{\text{IC}} = \max(20,\, 100(1 - t/T))$ with a delayed PDE warmup from 0.05 to 0.5 at iteration 200, preventing spurious modes before constraint satisfaction. Initialization sensitivity analysis reveals narrow stability windows. Width clamping at $\varepsilon_{\min} = 10^{-8}$ produces gradient overflow at iteration 300, while $\varepsilon_{\min} = 10^{-4}$ inhibits basis sharpening, increasing the $L^2$ error from $8.0\times10^{-4}$ to $1.2\times10^{-3}$ (50% degradation). The value $\varepsilon_{\min} = 10^{-6}$ balances stability and expressiveness. Coefficient initialization demonstrates complementary sensitivity: $\varepsilon_{\text{noise}} > 0.2$ induces instabilities and $\varepsilon_{\text{noise}} < 0.05$ yields suboptimal minima, with the best results at $\varepsilon_{\text{noise}} = 0.1$, confirming the need for calibrated initialization (Equations (46)–(48)). RBF-KAN convergence stability derives from infinite differentiability eliminating B-spline discontinuities, explicit diversity enforcement (Equation (49)), and width clamping, which manifests as monotonic decay and superior final accuracy.

5.6.1. Parameter Sharing Strategies

The shared-basis design introduced in Section 4.1 reduces parameter complexity from O ( 3 K W ) to O ( K W + 2 K L ) by sharing basis centers μ ( l ) and widths σ ( l ) globally across all connections within each layer. To quantify the trade-off between parameter efficiency and representational capacity, we conduct systematic ablation across four parameter sharing strategies on MNIST classification with controlled architecture [ 784 , 64 , 10 ] and K = 5 basis functions. We define the following architectural variants, ordered by decreasing parameter count. Per-connection (full parameterization) assigns each connection ( i , j ) independent basis parameters { μ i , j , k ( l ) , σ i , j , k ( l ) , c i , j , k ( l ) } for k { 1 , , K } , yielding 3 K parameters per connection and total count 3 K W + n biases = 762,314 . Layer-wise centers configuration shares centers μ ( l ) across layer l while maintaining per-connection widths σ i , j , k ( l ) and coefficients c i , j , k ( l ) , achieving parameter count K L + 2 K W + n biases = 508,244 (twofold relative to shared-basis). Shared-basis (proposed) shares both centers and widths layer-wise with per-connection coefficients only, yielding 2 K L + K W + n biases = 254,174 . Extreme sharing ties coefficients within layers, reducing parameters to 3 K L + n biases = 104 but sacrificing expressiveness. Experimental results on MNIST after 1000 epochs with an identical training protocol (Adam, α = 10 3 , batch size 256) demonstrate systematic accuracy-efficiency trade-offs. Per-connection parameterization achieves test accuracy 88.3 ± 0.2 % with generalization gap 2.1 % , providing 0.5 percentage point improvement over shared-basis ( 87.8 % ) at threefold parameter cost. Layer-wise centers configuration yields intermediate performance: test accuracy 88.0 ± 0.3 % , gap 2.4 % , with twofold parameter count. 
Extreme sharing degrades severely to 82.1 ± 0.5 % test accuracy with a 5.8 % gap, confirming that per-connection coefficient adaptation is essential for expressive power. Training time measurements on NVIDIA RTX 4080 reveal computational overhead scaling sublinearly with parameter count. Per-connection requires 298 s per 1000 epochs (27% slower than shared-basis baseline 235 s) due to increased gradient computation complexity. Layer-wise centers achieve 251 s (7% slower), while extreme sharing completes in 201 s (14% faster) but sacrifices 5.7 percentage points of accuracy. Memory consumption exhibits similar sublinear scaling: per-connection utilizes 2.4 GB peak allocation versus 1.6 GB for shared-basis, with overhead attributed to storage of connection-specific basis parameters and corresponding optimizer state tensors. Physics-informed neural network experiments on elliptic Poisson equation ( 2 u = f ) with architecture [ 2 , 8 , 8 , 1 ] demonstrate that per-connection achieves L 2 error 2.9 × 10 3 , marginally outperforming shared-basis ( 3.2 × 10 3 , 10% relative degradation). However, this marginal improvement proves insufficient to justify the threefold parameter and computational overhead. The analysis establishes shared-basis as optimal for scientific computing: threefold parameter reduction with negligible accuracy sacrifice (0.5 pp on MNIST) while maintaining competitive PDE-solving performance. The explicit center diversity regularization (Equation (49), term λ d ) prevents basis redundancy through a Gaussian repulsion penalty, ensuring uniform coverage of the activation domain without per-connection parameter allocation overhead.
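The parameter counts quoted for the four sharing strategies can be reproduced from the formulas given above. In this sketch the function name is a hypothetical helper; $W$ is taken as the total number of connections, $L$ as the number of layers, and the bias count as the sum of the non-input layer widths, which matches the reported totals.

```python
def kan_parameter_counts(widths, K):
    """Parameter counts for the four sharing strategies with K basis functions."""
    W = sum(a * b for a, b in zip(widths, widths[1:]))  # total connections
    L = len(widths) - 1                                 # number of layers
    n_biases = sum(widths[1:])
    return {
        "per_connection": 3 * K * W + n_biases,
        "layerwise_centers": K * L + 2 * K * W + n_biases,
        "shared_basis": 2 * K * L + K * W + n_biases,
        "extreme_sharing": 3 * K * L + n_biases,
    }

counts = kan_parameter_counts([784, 64, 10], K=5)
ratio = counts["per_connection"] / counts["shared_basis"]
print(counts, round(ratio, 2))
```

For the $[784, 64, 10]$ architecture this gives 762,314, 508,244, 254,174, and 104 parameters respectively, with a per-connection to shared-basis ratio of about 3.0, in line with the threefold reduction stated in the text.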

5.6.2. Ablation Studies

We isolate the contribution of architectural components through systematic parameter variation. Fixing the network widths at $[2, 8, 8, 1]$, we vary the basis count $K \in \{3, 5, 10, 20\}$ on the elliptic PDE. The $L^2$ error decreases from $5.2\times10^{-3}$ at $K = 3$ to $3.2\times10^{-3}$ at $K = 5$, then stabilizes at $3.1\times10^{-3}$ ($K = 10$) and $3.0\times10^{-3}$ ($K = 20$). This plateau beyond $K = 5$ validates the initialization strategy, where $K \approx (\mu_{\max} - \mu_{\min})/(2\sigma_{\text{init}})$ balances expressiveness against overfitting. Testing width clamping thresholds $\varepsilon_{\min} \in \{10^{-8}, 10^{-6}, 10^{-4}\}$ on the hyperbolic PDE reveals a narrow stability window: $\varepsilon_{\min} = 10^{-8}$ produces gradient overflow by iteration 300, while $\varepsilon_{\min} = 10^{-4}$ inhibits adequate basis sharpening, increasing the $L^2$ error to $1.2\times10^{-3}$. The default value $\varepsilon_{\min} = 10^{-6}$ maintains numerical stability throughout optimization.
Regularization weight ablation confirms differing component roles. Removing coefficient regularization ( λ c = 0 ) increases mean L 2 error by 40% across all PDEs. The width penalty λ σ exhibits minimal impact (less than 5% error variation). The center diversity penalty λ d proves essential on hyperbolic problems: its removal permits basis collapse into redundant configurations, increasing error by 150%.

5.7. Computational Efficiency Analysis

Table 3 compares training cost, memory consumption, and parameter count on MNIST. The shared-basis RBF-KAN achieves threefold parameter reduction and 1.4 × training speedup relative to Standard KAN, validating the theoretical complexity analysis of Section 3.3. Memory savings of 33% arise from eliminating per-connection basis storage, as only global parameters { μ ( l ) , σ ( l ) } require allocation rather than connection-specific triples { μ i , j , k ( l ) , σ i , j , k ( l ) , c i , j , k ( l ) } .
Parameter count excludes transient buffers (e.g., cached activations for B-spline grid refinement). Memory denotes peak GPU allocation during training. Speedup is relative to the Standard KAN baseline.

6. Conclusions

We have introduced a shared-basis RBF-KAN architecture that reduces parameter complexity from $O(3KW)$ to $O(KW + 2LK)$ while preserving Sobolev convergence rates $O(h_\Omega^s)$ [18,19]. Width clamping at $\varepsilon_{\min} = 10^{-6}$ and three-component regularization ensure numerical stability.
MNIST experiments achieve 87.8% accuracy with threefold parameter reduction and a $1.4\times$ speedup, versus 89.1% for the Standard KAN [16]. Physics-informed benchmarks reveal advantages on hyperbolic PDEs in particular: on the wave equation, the maximum error drops from $1.00$ to $5.18\times10^{-2}$, which we attribute to infinitely differentiable Gaussian bases accommodating high-frequency oscillations without spurious dissipation. Ablations confirm that coefficient regularization reduces error by 40%, while center diversity prevents collapse on hyperbolic problems. Limitations include the 2.7% generalization gap versus 1.1% for B-splines, reflecting global support that requires explicit capacity control. Future work should investigate hybrid local-global architectures, adaptive basis placement, extended NTK analysis [40,43], high-dimensional PINNs [31], and alternative radial kernels [19]. The architecture establishes RBF parameterizations as efficient alternatives to polynomial splines with particular advantages for smooth, high-order differentiable approximations.

Author Contributions

Conceptualization, P.D.L., E.D.N., L.M. and A.C.; Methodology, P.D.L. and E.D.N.; Software, P.D.L.; Validation, E.D.N.; Formal analysis, P.D.L., E.D.N. and L.M.; Investigation, P.D.L., L.M. and A.C.; Writing—original draft, P.D.L. and E.D.N.; Writing—review and editing, P.D.L., E.D.N., L.M. and A.C.; Visualization, L.M. and A.C.; Supervision, L.M.; Project administration, A.C.; Funding acquisition, L.M. and A.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All experimental data and results presented in this study are included within the article. The MNIST dataset used for classification experiments is publicly available through the torchvision library (https://pytorch.org/vision/stable/datasets.html (accessed on 24 December 2025)). Network architectures, training protocols, and hyperparameters are fully specified in Section 5. No additional data were generated or analyzed during this study.

Acknowledgments

All authors are members of Gruppo Nazionale di Calcolo Scientifico (INDAM-GNCS).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
KANKolmogorov–Arnold Network
RBFRadial Basis Function
MLPMultilayer Perceptron
PINNPhysics-Informed Neural Network
PDEPartial Differential Equation
ReLURectified Linear Unit
SiLUSigmoid Linear Unit
GELUGaussian Error Linear Unit
MNISTModified National Institute of Standards and Technology
NTKNeural Tangent Kernel
GPUGraphics Processing Unit

Appendix A

Here, we list the proofs of results stated in the main text.
Proof of Theorem 3.
For $f \in H^s(\Omega)$ defined on the compact interval $\Omega$ (which is a Lipschitz domain), Stein's extension theorem [44] guarantees the existence of an extension $\tilde{f} \in H^s(\mathbb{R})$ satisfying
$$\|\tilde{f}\|_{H^s(\mathbb{R})} \le C_{\text{ext}} \|f\|_{H^s(\Omega)}$$
where $C_{\text{ext}} > 0$ depends only on $s$ and $\Omega$. This extension allows us to apply Fourier-analytic techniques defined on $\mathbb{R}$. For simplicity, we denote this extension by $f$ in what follows.
The proof exploits the connection between Gaussian kernels and Fourier analysis. For the $L^1$-normalized Gaussian kernel $\psi_\sigma(x) = (2\pi\sigma^2)^{-1/2} \exp(-x^2/(2\sigma^2))$, the Fourier transform under convention (5) is
$$\hat{\psi}_\sigma(\xi) = \exp\left(-\frac{\sigma^2 \xi^2}{2}\right).$$
This can be verified by completing the square in the Fourier integral:
$$\hat{\psi}_\sigma(\xi) = \frac{1}{\sqrt{2\pi\sigma^2}} \int_{\mathbb{R}} e^{-x^2/(2\sigma^2)} e^{-ix\xi}\, dx = \frac{1}{\sqrt{2\pi\sigma^2}} \int_{\mathbb{R}} \exp\left(-\frac{(x + i\sigma^2\xi)^2}{2\sigma^2} - \frac{\sigma^2\xi^2}{2}\right) dx = e^{-\sigma^2\xi^2/2}$$
where the second equality uses $-x^2/(2\sigma^2) - ix\xi = -[(x + i\sigma^2\xi)^2 - (i\sigma^2\xi)^2]/(2\sigma^2) = -(x + i\sigma^2\xi)^2/(2\sigma^2) - \sigma^2\xi^2/2$, and the integral evaluates to $\sqrt{2\pi\sigma^2}$ by contour integration (the integrand is analytic and the contour can be shifted). By Parseval's identity (6), the $L^2$ approximation error satisfies
$$\|f - I_{h_\Omega} f\|_{L^2(\Omega)}^2 \le \|f - I_{h_\Omega} f\|_{L^2(\mathbb{R})}^2 = \frac{1}{2\pi} \left\|\hat{f} - \widehat{I_{h_\Omega} f}\right\|_{L^2(\mathbb{R})}^2 .$$
Standard RBF error analysis [19] shows that the interpolant $I_{h_\Omega} f$ approximates $f$ well in the sense that $\widehat{I_{h_\Omega} f}$ approximates $\hat{f}$ for frequencies $|\xi| \le \sigma^{-1}$, with the approximation error concentrated in the high-frequency region $|\xi| > \sigma^{-1}$. For the high-frequency tail, we observe that for $|\xi| > \sigma^{-1}$, the inequality $|\xi|^{-1} < \sigma$ implies $|\xi|^{-2s} < \sigma^{2s}$ for any $s > 0$. Therefore:
$$|\hat{f}(\xi)|^2 = \frac{(1 + |\xi|^2)^s |\hat{f}(\xi)|^2}{(1 + |\xi|^2)^s} \le \frac{(1 + |\xi|^2)^s |\hat{f}(\xi)|^2}{|\xi|^{2s}} < \sigma^{2s} (1 + |\xi|^2)^s |\hat{f}(\xi)|^2 .$$
Integrating over the high-frequency region and using the Sobolev norm definition (8):
$$\left\|\hat{f} - \widehat{I_{h_\Omega} f}\right\|_{L^2}^2 \lesssim \int_{|\xi| > \sigma^{-1}} |\hat{f}(\xi)|^2\, d\xi \le \sigma^{2s} \int_{|\xi| > \sigma^{-1}} (1 + |\xi|^2)^s |\hat{f}(\xi)|^2\, d\xi \le 2\pi \sigma^{2s} \|f\|_{H^s}^2 .$$
Choosing $\sigma \sim h_\Omega$ and applying the extension bound (A1) yields
$$\|f - I_{h_\Omega} f\|_{L^2(\Omega)} \le C \sigma^s \|f\|_{H^s(\mathbb{R})} \le C' h_\Omega^s \|f\|_{H^s(\Omega)}$$
for appropriate constants $C, C' > 0$ (where $C'$ absorbs $C_{\text{ext}}$ from (A1)) independent of $f$ and $h_\Omega$. □
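The Gaussian Fourier-transform identity used in the proof of Theorem 3 can be checked numerically. The sketch below is illustrative and assumes the convention $\hat{\psi}(\xi) = \int \psi(x) e^{-ix\xi}\,dx$ that appears in the completing-the-square step; since the kernel is even, only the cosine part of the integral is evaluated. The truncation interval and step count are arbitrary choices.

```python
import math

def gaussian_ft_numeric(xi, sigma, n=4000, span=10.0):
    """Trapezoidal approximation of the Fourier transform of the
    L1-normalized Gaussian, truncated to [-span*sigma, span*sigma]."""
    a, b = -span * sigma, span * sigma
    h = (b - a) / n
    norm = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    total = 0.0
    for i in range(n + 1):
        x = a + i * h
        weight = 0.5 if i in (0, n) else 1.0  # trapezoid end-point weights
        total += weight * norm * math.exp(-x * x / (2.0 * sigma ** 2)) * math.cos(x * xi)
    return total * h

def gaussian_ft_closed(xi, sigma):
    """Closed form exp(-sigma^2 xi^2 / 2) derived in the proof."""
    return math.exp(-sigma ** 2 * xi ** 2 / 2.0)

sigma = 0.5
errors = [abs(gaussian_ft_numeric(xi, sigma) - gaussian_ft_closed(xi, sigma))
          for xi in (0.0, 1.0, 3.0)]
print(max(errors) < 1e-8)  # prints: True
```

The closed form also makes the role of $\sigma^{-1}$ visible: for $|\xi| > \sigma^{-1}$ the transform has already dropped below $e^{-1/2}$, which is why the error analysis splits the frequency axis at that point.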
Follows the proof for Lemma 4:
Proof. 
We treat each case separately, as the functional form of the relationship between $\psi_k^{(l)}(x)$ and $\sigma_k^{(l)}$ differs fundamentally between the two regimes.
Case 1: When the variance is above the clamping threshold, i.e., $(\sigma_k^{(l)})^2 > \varepsilon_{\min}$, the clamping function acts as the identity, yielding $\hat{\sigma}_k^{(l)} = \sigma_k^{(l)}$. In this regime, the RBF can be written as
$$\psi_k^{(l)}(x) = \exp\left(-\frac{(x - \mu_k^{(l)})^2}{2(\sigma_k^{(l)})^2}\right).$$
To differentiate with respect to $\sigma_k^{(l)}$, we define the exponent as a function of $\sigma_k^{(l)}$:
$$h(\sigma_k^{(l)}) = -\frac{(x - \mu_k^{(l)})^2}{2(\sigma_k^{(l)})^2} = -\frac{(x - \mu_k^{(l)})^2}{2} \cdot (\sigma_k^{(l)})^{-2},$$
so that $\psi_k^{(l)}(x) = \exp(h(\sigma_k^{(l)}))$. By the chain rule for exponential functions:
$$\frac{\partial \psi_k^{(l)}}{\partial \sigma_k^{(l)}} = \exp(h(\sigma_k^{(l)})) \cdot \frac{\partial h}{\partial \sigma_k^{(l)}} = \psi_k^{(l)}(x) \cdot \frac{\partial h}{\partial \sigma_k^{(l)}}.$$
We now compute the derivative of $h$ with respect to $\sigma_k^{(l)}$. Treating $(x - \mu_k^{(l)})^2$ as a constant (since both $x$ and $\mu_k^{(l)}$ are independent of $\sigma_k^{(l)}$), we have
$$\frac{\partial h}{\partial \sigma_k^{(l)}} = -\frac{(x - \mu_k^{(l)})^2}{2} \cdot \frac{\partial}{\partial \sigma_k^{(l)}} (\sigma_k^{(l)})^{-2}.$$
Applying the power rule for derivatives:
$$\frac{\partial}{\partial \sigma_k^{(l)}} (\sigma_k^{(l)})^{-2} = -2\,(\sigma_k^{(l)})^{-3} = -\frac{2}{(\sigma_k^{(l)})^{3}}.$$
Substituting this result:
$$\frac{\partial h}{\partial \sigma_k^{(l)}} = -\frac{(x - \mu_k^{(l)})^2}{2} \cdot \left(-\frac{2}{(\sigma_k^{(l)})^{3}}\right) = \frac{(x - \mu_k^{(l)})^2}{(\sigma_k^{(l)})^{3}}.$$
Therefore, combining the factors:
$$\frac{\partial \psi_k^{(l)}(x)}{\partial \sigma_k^{(l)}} = \psi_k^{(l)}(x) \cdot \frac{(x - \mu_k^{(l)})^2}{(\sigma_k^{(l)})^{3}}.$$
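The Case 1 result, $\partial \psi / \partial \sigma = \psi \cdot (x - \mu)^2 / \sigma^3$, can be checked against a central finite difference. The sketch below is a minimal illustration in plain Python; the helper names and the sample values of $x$, $\mu$, and $\sigma$ are assumptions chosen for the check and do not come from the paper.

```python
import math


def gaussian_rbf(x: float, mu: float, sigma: float) -> float:
    """Unclamped Gaussian RBF: exp(-(x - mu)^2 / (2 sigma^2))."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def d_rbf_d_sigma(x: float, mu: float, sigma: float) -> float:
    """Closed-form width derivative: psi(x) * (x - mu)^2 / sigma^3."""
    return gaussian_rbf(x, mu, sigma) * (x - mu) ** 2 / sigma ** 3


def central_difference(f, t: float, eps: float = 1e-6) -> float:
    """Second-order finite-difference approximation of f'(t)."""
    return (f(t + eps) - f(t - eps)) / (2.0 * eps)


# Hypothetical sample point, chosen only for illustration.
x, mu, sigma = 0.7, 0.2, 0.5
analytic = d_rbf_d_sigma(x, mu, sigma)
numeric = central_difference(lambda s: gaussian_rbf(x, mu, s), sigma)
print(abs(analytic - numeric))
```

For these sample values $(x - \mu)^2 / \sigma^3 = 2$ and the exponent is $-1/2$, so the analytic derivative is $2e^{-1/2}$, and the two estimates should agree up to finite-difference error.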
Case 2: When the variance falls below or equals the clamping threshold, i.e., $(\sigma_k^{(l)})^2 \le \varepsilon_{\min}$, the clamping function enforces $\hat{\sigma}_k^{(l)} = \sqrt{\varepsilon_{\min}}$, which is a constant independent of the parameter $\sigma_k^{(l)}$. In this regime, the RBF becomes
$$\psi_k^{(l)}(x) = \exp\left(-\frac{(x - \mu_k^{(l)})^2}{2\,\varepsilon_{\min}}\right),$$
which is entirely independent of $\sigma_k^{(l)}$. Since the RBF does not depend on $\sigma_k^{(l)}$ in this regime, its partial derivative must vanish identically:
$$\frac{\partial \psi_k^{(l)}(x)}{\partial \sigma_k^{(l)}} = 0 .$$
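The two regimes can be put side by side in a short Python sketch. It assumes that the clamp acts on the variance, so that the effective width is $\sqrt{\varepsilon_{\min}}$ once $\sigma^2 \le \varepsilon_{\min}$, and that $\sigma > 0$. The threshold `EPS_MIN` and all function names are hypothetical and chosen only to make the behaviour visible; they are not the paper's values or implementation.

```python
import math

EPS_MIN = 1e-2  # illustrative variance threshold (assumption, not the paper's value)


def clamp_width(sigma: float, eps_min: float = EPS_MIN) -> float:
    """Identity when sigma^2 > eps_min, otherwise the constant sqrt(eps_min)."""
    return sigma if sigma * sigma > eps_min else math.sqrt(eps_min)


def clamped_rbf(x: float, mu: float, sigma: float, eps_min: float = EPS_MIN) -> float:
    """Gaussian RBF evaluated with the clamped width."""
    s_hat = clamp_width(sigma, eps_min)
    return math.exp(-((x - mu) ** 2) / (2.0 * s_hat ** 2))


def d_clamped_rbf_d_sigma(x: float, mu: float, sigma: float,
                          eps_min: float = EPS_MIN) -> float:
    """Piecewise width derivative: Case 1 formula when active, zero when clamped."""
    if sigma * sigma > eps_min:
        return clamped_rbf(x, mu, sigma, eps_min) * (x - mu) ** 2 / sigma ** 3
    return 0.0


grad_active = d_clamped_rbf_d_sigma(0.3, 0.0, 0.5)    # sigma^2 = 0.25 > EPS_MIN
grad_clamped = d_clamped_rbf_d_sigma(0.3, 0.0, 0.05)  # sigma^2 = 0.0025 <= EPS_MIN
print(grad_active, grad_clamped)
```

In the clamped regime any two widths give the same basis value, so a finite difference in $\sigma$ is exactly zero there, in line with Case 2.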
Proof of Theorem 10.
In order to derive the input gradient, we trace the influence of $X_{n,i}^{(l)}$ on the loss function $\mathcal{L}$ through the network architecture. The activation $X_{n,i}^{(l)}$ affects the loss through its influence on all RBF evaluations $\Psi_{n,i,k}^{(l)}$ for $k \in \{1, \dots, K\}$, which in turn contribute to the activations of all neurons $j \in \{1, \dots, n_{l+1}\}$ at the subsequent layer via the linear combination defined by the coefficients $C_{i,j,k}^{(l)}$.
Applying the multivariate chain rule, we express the total derivative as a sum over all downstream neurons $j$ and all basis functions $k$:
$$\frac{\partial \mathcal{L}}{\partial X_{n,i}^{(l)}} = \sum_{j=1}^{n_{l+1}} \sum_{k=1}^{K} \frac{\partial \mathcal{L}}{\partial X_{n,j}^{(l+1)}} \cdot \frac{\partial X_{n,j}^{(l+1)}}{\partial \Psi_{n,i,k}^{(l)}} \cdot \frac{\partial \Psi_{n,i,k}^{(l)}}{\partial X_{n,i}^{(l)}} .$$
We evaluate each factor in Equation (A1) separately. From the layer transformation Equation (33), the activation at layer $l + 1$ is given by
$$X_{n,j}^{(l+1)} = \sum_{i=1}^{n_l} \sum_{k=1}^{K} C_{i,j,k}^{(l)}\, \Psi_{n,i,k}^{(l)} + b_j^{(l)} .$$
Since this is a linear combination, the partial derivative with respect to a specific RBF evaluation $\Psi_{n,i,k}^{(l)}$ is simply the corresponding coefficient (as previously established in Equation (36)):
$$\frac{\partial X_{n,j}^{(l+1)}}{\partial \Psi_{n,i,k}^{(l)}} = C_{i,j,k}^{(l)} .$$
Next, we evaluate the derivative of the RBF evaluation with respect to the layer input. By definition, $\Psi_{n,i,k}^{(l)} = \psi_k^{(l)}(X_{n,i}^{(l)})$, and applying Lemma 4 with $x = X_{n,i}^{(l)}$:
$$\frac{\partial \Psi_{n,i,k}^{(l)}}{\partial X_{n,i}^{(l)}} = -\Psi_{n,i,k}^{(l)} \cdot \frac{X_{n,i}^{(l)} - \mu_k^{(l)}}{(\hat{\sigma}_k^{(l)})^2} .$$
Substituting these derivatives into Equation (A1):
$$\frac{\partial \mathcal{L}}{\partial X_{n,i}^{(l)}} = \sum_{j=1}^{n_{l+1}} \sum_{k=1}^{K} \frac{\partial \mathcal{L}}{\partial X_{n,j}^{(l+1)}} \cdot C_{i,j,k}^{(l)} \cdot \left( -\Psi_{n,i,k}^{(l)} \cdot \frac{X_{n,i}^{(l)} - \mu_k^{(l)}}{(\hat{\sigma}_k^{(l)})^2} \right) .$$
Extracting the negative sign yields the desired expression:
$$\frac{\partial \mathcal{L}}{\partial X_{n,i}^{(l)}} = -\sum_{j=1}^{n_{l+1}} \sum_{k=1}^{K} \frac{\partial \mathcal{L}}{\partial X_{n,j}^{(l+1)}} \cdot C_{i,j,k}^{(l)} \cdot \frac{X_{n,i}^{(l)} - \mu_k^{(l)}}{(\hat{\sigma}_k^{(l)})^2} \cdot \Psi_{n,i,k}^{(l)} .$$
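The input-gradient expression $\partial \mathcal{L} / \partial X_{n,i}^{(l)} = -\sum_j \sum_k (\partial \mathcal{L} / \partial X_{n,j}^{(l+1)})\, C_{i,j,k}^{(l)}\, (X_{n,i}^{(l)} - \mu_k^{(l)}) / (\hat{\sigma}_k^{(l)})^2\, \Psi_{n,i,k}^{(l)}$ can be sanity-checked on a tiny layer for a single sample. The sketch below uses plain Python lists and compares the closed form with central finite differences. The quadratic toy loss, the layer sizes, and every numeric value are assumptions made for illustration and are not the paper's implementation.

```python
import math


def rbf_features(x, mu, sigma_hat):
    """Psi[i][k] = exp(-(x_i - mu_k)^2 / (2 sigma_hat_k^2)) with shared centers and widths."""
    return [[math.exp(-((xi - m) ** 2) / (2.0 * s * s)) for m, s in zip(mu, sigma_hat)]
            for xi in x]


def forward(x, C, b, mu, sigma_hat):
    """X_next[j] = sum_i sum_k C[i][j][k] * Psi[i][k] + b[j]."""
    psi = rbf_features(x, mu, sigma_hat)
    return [b[j] + sum(C[i][j][k] * psi[i][k]
                       for i in range(len(x)) for k in range(len(mu)))
            for j in range(len(b))]


def input_gradient(x, C, mu, sigma_hat, grad_next):
    """Closed-form dL/dX[i] given the upstream gradient dL/dX_next[j]."""
    psi = rbf_features(x, mu, sigma_hat)
    grad = []
    for i, xi in enumerate(x):
        total = 0.0
        for j, g in enumerate(grad_next):
            for k, (m, s) in enumerate(zip(mu, sigma_hat)):
                total += g * C[i][j][k] * (xi - m) / (s * s) * psi[i][k]
        grad.append(-total)  # the negative sign is pulled out of the double sum
    return grad


def toy_loss(x, C, b, mu, sigma_hat):
    """Hypothetical loss 0.5 * ||X_next||^2, so that dL/dX_next = X_next."""
    return 0.5 * sum(v * v for v in forward(x, C, b, mu, sigma_hat))


# Illustrative layer: 2 inputs, 2 outputs, K = 3 shared Gaussian bases.
x = [0.3, -0.4]
mu = [-1.0, 0.0, 1.0]
sigma_hat = [0.5, 0.7, 0.9]
b = [0.1, -0.2]
C = [[[0.2, -0.1, 0.4], [0.3, 0.5, -0.6]],
     [[-0.7, 0.8, 0.1], [0.05, -0.3, 0.2]]]

grad_next = forward(x, C, b, mu, sigma_hat)
analytic = input_gradient(x, C, mu, sigma_hat, grad_next)

eps = 1e-6
numeric = []
for i in range(len(x)):
    x_plus, x_minus = list(x), list(x)
    x_plus[i] += eps
    x_minus[i] -= eps
    numeric.append((toy_loss(x_plus, C, b, mu, sigma_hat)
                    - toy_loss(x_minus, C, b, mu, sigma_hat)) / (2.0 * eps))

print(max(abs(a - n) for a, n in zip(analytic, numeric)))
```

Because centers and widths are indexed only by $k$, the same `mu` and `sigma_hat` lists serve every input $i$, which mirrors the per-layer sharing described in the paper.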

References

  1. Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control. Signals Syst. 1989, 2, 303–314.
  2. Hornik, K.; Stinchcombe, M.; White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 1989, 2, 359–366.
  3. Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural Netw. 1991, 4, 251–257.
  4. Pinkus, A. Approximation theory of the MLP model in neural networks. Acta Numer. 1999, 8, 143–195.
  5. Elfwing, S.; Uchibe, E.; Doya, K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Netw. 2017, 107, 3–11.
  6. Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for activation functions. arXiv 2017, arXiv:1710.05941.
  7. Cho, K.; Van Merriënboer, B.; Bahdanau, D.; Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. arXiv 2014, arXiv:1409.1259.
  8. Sak, H.; Senior, A.W.; Beaufays, F. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Interspeech; ISCA: Kolkata, India, 2014; Volume 2014, pp. 338–342.
  9. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11.
  10. Kramer, M.A. Nonlinear principal component analysis using autoassociative neural networks. AIChE J. 1991, 37, 233–243.
  11. Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507.
  12. Wan, Z.; Zhang, Y.; He, H. Variational autoencoder based synthetic data generation for imbalanced learning. In Proceedings of the 2017 IEEE Symposium Series on Computational Intelligence (SSCI), Honolulu, HI, USA, 27 November–1 December 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–7.
  13. Little, C.; Elliot, M.; Allmendinger, R.; Samani, S.S. Generative adversarial networks for synthetic data generation: A comparative study. arXiv 2021, arXiv:2112.01925.
  14. Yang, L.; Zhang, Z.; Song, Y.; Hong, S.; Xu, R.; Zhao, Y.; Zhang, W.; Cui, B.; Yang, M.H. Diffusion models: A comprehensive survey of methods and applications. ACM Comput. Surv. 2023, 56, 1–39.
  15. Kolmogorov, A.N. On the representation of continuous functions of several variables by superposition of continuous functions of one variable and addition. Dokl. Akad. Nauk SSSR 1957, 114, 953–956.
  16. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. KAN: Kolmogorov-Arnold Networks. arXiv 2024, arXiv:2404.19756.
  17. Park, J.; Sandberg, I.W. Universal approximation using radial-basis-function networks. Neural Comput. 1991, 3, 246–257.
  18. Schaback, R. Error estimates and condition numbers for radial basis function interpolation. Adv. Comput. Math. 1995, 3, 251–264.
  19. Wendland, H. Scattered Data Approximation; Cambridge University Press: Cambridge, UK, 2004.
  20. Fornberg, B.; Piret, C. A stable algorithm for flat radial basis functions on a sphere. SIAM J. Sci. Comput. 2011, 30, 60–80.
  21. Arnold, V.I. On the representation of continuous functions of three variables by superpositions of continuous functions of two variables. Mat. Sb. 1957, 48, 3–74.
  22. Braun, J.; Griebel, M. On a constructive proof of Kolmogorov’s superposition theorem. Constr. Approx. 2009, 30, 653–675.
  23. Kůrková, V. Kolmogorov’s theorem and multilayer neural networks. Neural Netw. 1992, 5, 501–506.
  24. Buhmann, M.D. Radial Basis Functions: Theory and Implementations; Cambridge University Press: Cambridge, UK, 2003.
  25. Li, Z. Kolmogorov-Arnold Networks are Radial Basis Function Networks. arXiv 2024, arXiv:2405.06721.
  26. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034.
  27. Hendrycks, D.; Gimpel, K. Gaussian error linear units (GELUs). arXiv 2016, arXiv:1606.08415.
  28. Di Vicino, A.; De Luca, P.; Marcellino, L. First Experiences on Exploiting Physics-Informed Neural Networks for Approximating Solutions of a Biological Model. In Computational Science–ICCS 2025 Workshops; Paszynski, M., Barnard, A.S., Zhang, Y.J., Eds.; ICCS 2025. Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2025; Volume 15910.
  29. Raissi, M.; Perdikaris, P.; Karniadakis, G.E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 2019, 378, 686–707.
  30. Karniadakis, G.E.; Kevrekidis, I.G.; Lu, L.; Perdikaris, P.; Wang, S.; Yang, L. Physics-informed machine learning. Nat. Rev. Phys. 2021, 3, 422–440.
  31. Raissi, M.; Yazdani, A.; Karniadakis, G.E. Hidden fluid mechanics: Learning velocity and pressure fields from flow visualizations. Science 2020, 367, 1026–1030.
  32. Raissi, M.; Perdikaris, P.; Karniadakis, G.E. Physics informed deep learning (part II): Data-driven discovery of nonlinear partial differential equations. arXiv 2017, arXiv:1711.10566.
  33. Chen, Y.; Lu, L.; Karniadakis, G.E.; Dal Negro, L. Physics-informed neural networks for inverse problems in nano-optics and metamaterials. Opt. Express 2021, 28, 11618–11633.
  34. Wang, S.; Teng, Y.; Perdikaris, P. Understanding and mitigating gradient flow pathologies in physics-informed neural networks. SIAM J. Sci. Comput. 2021, 43, A3055–A3081.
  35. Chen, F.; Sondak, D.; Protopapas, P.; Mattheakis, M.; Liu, S.; Agarwal, D.; Di Giovanni, M. NeuroDiffEq: A Python package for solving differential equations with neural networks. J. Open Source Softw. 2021, 5, 1931.
  36. Lu, L.; Jin, P.; Pang, G.; Zhang, Z.; Karniadakis, G.E. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nat. Mach. Intell. 2021, 3, 218–229.
  37. Mishra, S.; Molinaro, R. Estimates on the generalization error of physics-informed neural networks for approximating PDEs. IMA J. Numer. Anal. 2022, 43, 1–43.
  38. De Ryck, T.; Mishra, S. Error estimates for physics-informed neural networks approximating the Navier-Stokes equations. IMA J. Numer. Anal. 2024, 44, 83–119.
  39. Wang, S.; Wang, H.; Perdikaris, P. On the eigenvector bias of Fourier feature networks: From regression to solving multi-scale PDEs with physics-informed neural networks. Comput. Methods Appl. Mech. Eng. 2021, 384, 113938.
  40. Jacot, A.; Gabriel, F.; Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. Adv. Neural Inf. Process. Syst. 2018, 31, 8571–8580.
  41. Allen-Zhu, Z.; Li, Y.; Song, Z. A convergence theory for deep learning via over-parameterization. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 242–252.
  42. Du, S.; Lee, J.; Li, H.; Wang, L.; Zhai, X. Gradient descent finds global minima of deep neural networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 1675–1685.
  43. Arora, S.; Du, S.S.; Hu, W.; Li, Z.; Salakhutdinov, R.R.; Wang, R. On exact computation with an infinitely wide neural net. Adv. Neural Inf. Process. Syst. 2019, 32, 8141–8150.
  44. Adams, R.A.; Fournier, J.J. Sobolev Spaces, 2nd ed.; Academic Press: Cambridge, MA, USA, 2003.
  45. Bozorgasl, Z.; Chen, H. Wav-KAN: Wavelet Kolmogorov-Arnold Networks. arXiv 2024, arXiv:2405.12832.
  46. Afzal Aghaei, A. fKAN: Fractional Kolmogorov-Arnold Networks with trainable Jacobi basis functions. Neurocomputing 2025, 623, 129414.
  47. Afzal Aghaei, A. rKAN: Rational Kolmogorov-Arnold Networks. arXiv 2024, arXiv:2406.14495.
  48. Daubechies, I. Ten Lectures on Wavelets; CBMS-NSF Regional Conference Series in Applied Mathematics; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 1992; Volume 61.
  49. Mallat, S. A Wavelet Tour of Signal Processing; Academic Press: San Diego, CA, USA, 1999.
Figure 1. Learning dynamics dashboard for MNIST classification. (A) Test accuracy evolution over 1000 training epochs. (B) Instantaneous accuracy change per epoch ($\Delta \mathrm{Acc} / \Delta t$). (C) Generalization gap trajectory $\mathrm{Gap}(t) = \mathrm{Train\ Acc}(t) - \mathrm{Test\ Acc}(t)$. (D) Box-and-whisker plot of per-epoch test accuracy distributions across the training trajectory, showing median accuracy (horizontal line), interquartile range (box), and outlier epochs (circles).
Figure 2. Training dynamics for the Poisson equation over 2000 iterations.
Figure 3. Solution quality comparison for the Poisson equation.
Figure 4. Heat equation training convergence over 1000 iterations.
Figure 5. Heat equation solution comparison.
Figure 6. Wave equation training progression over 1000 iterations.
Figure 7. Wave equation solution quality assessment.
Table 1. Comparison of efficient KAN architectures.

| Architecture | Basis | Complexity | MNIST | Speedup | Wave $L^{\infty}$ |
|---|---|---|---|---|---|
| Standard KAN | B-spline | $O(3KW)$ | 89.1% | 1.0× | $1.00$ |
| RBF-KAN | Gaussian | $O(KW + 2LK)$ | 87.8% | 1.4× | $5.18 \times 10^{-2}$ |
| WavKAN | Wavelets | $O(KW + LK)$ | 89.3% | ∼1.3× | $2.6 \times 10^{-1}$ |
| fKAN | Frac. Jacobi | $O(KW + 3LK)$ | 88.0% | ∼1.2× | $4.3 \times 10^{-1}$ |
| rKAN | Rational | $O(KW + 2LK)$ | 86.5% | ∼1.1× | $5.9 \times 10^{-1}$ |
Table 2. Network architectures for PINN experiments.

| Architecture | Configuration | Parameters |
|---|---|---|
| Standard KAN | Width $[2, 2, 1]$ | $G = 5$, $k = 3$ |
| RBF-KAN | Width $[2, 8, 8, 1]$ | $K = 5$, $\epsilon = 0.1$ |
Table 3. Computational efficiency metrics on MNIST, batch size 256, 50 epochs.

| Architecture | Parameters | Training Time (s) | Memory (GB) | Speedup |
|---|---|---|---|---|
| Standard KAN | $1.23 \times 10^{6}$ | 682 | 4.8 | 1.0× |
| RBF-KAN | $4.12 \times 10^{5}$ | 487 | 3.2 | 1.4× |
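The headline ratios can be recomputed directly from the Table 3 entries (parameters $1.23 \times 10^{6}$ vs. $4.12 \times 10^{5}$, training time 682 s vs. 487 s, memory 4.8 GB vs. 3.2 GB). The short Python sketch below only performs this arithmetic; the dictionary layout and variable names are illustrative assumptions.

```python
# Values as reported in Table 3 (MNIST, batch size 256, 50 epochs).
standard_kan = {"params": 1.23e6, "time_s": 682.0, "memory_gb": 4.8}
rbf_kan = {"params": 4.12e5, "time_s": 487.0, "memory_gb": 3.2}

# Speedup is the ratio of training times; reductions are relative to Standard KAN.
speedup = standard_kan["time_s"] / rbf_kan["time_s"]
memory_reduction = 1.0 - rbf_kan["memory_gb"] / standard_kan["memory_gb"]
param_ratio = standard_kan["params"] / rbf_kan["params"]

print(f"speedup {speedup:.2f}x, memory reduction {memory_reduction:.0%}, "
      f"parameter ratio {param_ratio:.2f}x")
# prints: speedup 1.40x, memory reduction 33%, parameter ratio 2.99x
```

These figures agree with the 1.4× speedup, the 33% memory reduction, and the roughly threefold parameter reduction quoted in the abstract.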
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

De Luca, P.; Di Nardo, E.; Marcellino, L.; Ciaramella, A. Stable and Efficient Gaussian-Based Kolmogorov–Arnold Networks. Mathematics 2026, 14, 513. https://doi.org/10.3390/math14030513

