1. Theoretical Foundations of Kolmogorov–Arnold Networks
Symmetry, as an intrinsic property of both natural and mathematical systems, enhances model generalizability. It also serves as a crucial bridge linking data-driven approaches to mechanistic modeling. This characteristic is also evident in the recently emerged KAN field. A KAN is theoretically based on the Kolmogorov–Arnold representation theorem, which states that any continuous multivariate function on a bounded domain can be expressed as a finite composition of continuous univariate functions and the binary operation of addition [1]. The theorem provides a constructive method of function approximation, suggesting that high-dimensional mappings can be broken down into simple univariate components. This insight shapes the KAN architecture by guiding the decomposition of complex functions into compositions of univariate functions, giving the network design a clear mathematical grounding and fundamentally improving the interpretability of the model structure.
This theorem led to KANs being developed as an alternative to MLPs. Hecht-Nielsen initially proposed shallow Kolmogorov networks based on the theorem, but they proved ineffective in practice because the required univariate functions can be highly non-smooth [
2]. Later, Liu et al. deepened this framework to develop KANs, which integrate learnable activation functions to enhance performance, effectively combining the structural insights of Kolmogorov networks with the flexibility of MLPs [
Mathematically, KANs approximate functions as cascaded compositions of layers, $\mathrm{KAN}(x) = (\Phi_L \circ \Phi_{L-1} \circ \cdots \circ \Phi_1)(x)$, where each layer $\Phi_l$ maps its inputs as $x_{l+1,j} = \sum_{i} \phi_{l,j,i}(x_{l,i})$, a sum of univariate functions $\phi_{l,j,i}$, often parameterized by basis functions (e.g., B-splines, polynomials) with learnable coefficients [4].
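As a minimal sketch of this layer composition, the following uses fixed (non-learnable) univariate functions; the target function and its decomposition are illustrative choices, not taken from the cited works:

```python
import numpy as np

def kan_layer(x, phis):
    """One KAN layer: output j is the sum of univariate functions
    phis[j][i] applied coordinate-wise to the inputs x[i]."""
    return np.array([sum(phi(x[i]) for i, phi in enumerate(row)) for row in phis])

# Illustrative two-layer decomposition of f(x1, x2) = exp(sin(x1) + x2^2):
# the inner layer sums sin(x1) and x2^2; the outer layer applies exp.
inner = [[np.sin, np.square]]   # one node: sin(x1) + x2^2
outer = [[np.exp]]              # outer univariate function

x = np.array([0.5, 1.2])
y = kan_layer(kan_layer(x, inner), outer)
```

In a trainable KAN each `phi` would be a parameterized spline rather than a fixed NumPy function, but the composition pattern is the same.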
The structure of KANs differs fundamentally from that of MLPs. Built on linear transformations, traditional MLPs apply fixed nonlinear activations (e.g., Tanh, ReLU) at nodes, according to the following form [5]:
$\mathrm{MLP}(x) = (W_L \circ \sigma \circ W_{L-1} \circ \sigma \circ \cdots \circ \sigma \circ W_1)(x)$
Here, $W_l$ denotes the weight matrix between network layers, and $\sigma$ represents the fixed activation function. In KANs, by contrast, the edges (weights) carry learnable univariate activation functions, while the nodes only perform summation. As a result, the computational graph shifts to
$\mathrm{KAN}(x) = (\Phi_L \circ \Phi_{L-1} \circ \cdots \circ \Phi_1)(x)$
where $\Phi_l$ denotes a matrix of univariate functions (e.g., B-splines, orthogonal polynomials) replacing scalar weights [6,7]. This edge-based activation design eliminates the need for separate linear weight matrices, integrating transformation and nonlinearity into a unified, interpretable layer structure [
8].
Figure 1 shows the comparison between them.
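To make the contrast concrete, the two forward passes can be sketched as follows; the quadratic edge functions standing in for learnable splines are an illustrative assumption, not an implementation from the cited works:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))

def mlp_forward(x):
    # MLP: learnable linear weights on edges, fixed nonlinearity (tanh) at nodes.
    return W2 @ np.tanh(W1 @ x)

# KAN: a learnable univariate function on each edge, plain summation at nodes.
# Here edge (j, i) applies a tiny quadratic "spline surrogate" a*x + b*x^2.
A, B = rng.normal(size=(3, 2)), rng.normal(size=(3, 2))

def kan_forward(x):
    # each row j sums its own per-input univariate functions
    return np.sum(A * x + B * x**2, axis=1)

x = np.array([0.3, -0.7])
```

The key structural difference is visible in the code: the MLP learns `W1`, `W2` with `tanh` fixed, whereas the KAN learns the per-edge function parameters `A`, `B` while the node operation stays a plain sum.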
In terms of their approximation mechanism, MLPs differ functionally from KANs. A KAN, in contrast to MLPs, implements the Kolmogorov–Arnold decomposition explicitly by summing compositions of parameterized univariate functions, rather than using nested layers of linear combinations followed by fixed nonlinearities [
9]. Accordingly, KANs benefit from guaranteed constructive representation under the theorem, which complements MLPs’ universal approximation capability in a theoretically grounded manner [
10,
11]. KANs’ activation functions are highly flexible; early implementations used B-splines (e.g., vanilla PIKAN), while later variants adopt orthogonal polynomials (Chebyshev, Jacobi), radial basis functions, or fractional Jacobi polynomials to improve efficiency, stability, or adaptability [
12].
Training strategy and parameter complexity further distinguish KANs from MLPs. During training, MLPs adjust the linear weights on the edges while keeping the nonlinear activation functions on the nodes fixed; KANs, conversely, adapt the nonlinear activation functions on the edges while keeping the summation operations on the nodes fixed. Because each edge carries a spline parameterized by its grid size and polynomial order, KANs may require more parameters per connection. However, modified variants (e.g., Chebyshev-KAN) often reduce redundant parameters, enabling compact architectures with fewer nodes than MLPs [13]. By replacing linear layers with KAN-based function matrices, this structural and functional divergence establishes KANs as an interpretable, theoretically grounded alternative to MLPs. This paper makes the following contributions:
From the latest research using KANs, 80 papers were selected for analysis and review based on journal impact factors, citation counts, publication years, and journal rankings.
This paper provides an overview of KAN applications across multiple domains. Analyzing the characteristics of these models enables readers to better apply and understand these models within their respective fields.
This paper contains comparative research on KANs. It enables readers to gain a deeper understanding of the KAN model, facilitating its more judicious application.
This paper explores existing challenges, limitations, and future research directions, providing relevant insights for subsequent research and development.
2. Architecture and Variants of KAN Models
Classical KANs are rooted in the Kolmogorov–Arnold representation theorem, decomposing multivariate functions into sums of learnable univariate activations on network edges, distinct from fixed node activations in traditional MLPs [
14,
B-splines serve as the foundational activation, offering smooth, piecewise-polynomial approximations. Their mathematical formulation typically combines a residual-like term with B-spline basis functions:
$\phi(x) = w_b\, b(x) + \sum_{i} c_i B_i(x)$
with $w_b$ scaling the magnitude, $b(x)$ a basis function (e.g., SiLU), $c_i$ trainable coefficients, and $B_i(x)$ the B-spline basis functions. This activation function is well-suited to image recognition [16] and time series prediction [17].
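A minimal NumPy sketch of this activation, with the Cox–de Boor recursion computing the B-spline basis; the knot grid and coefficient values below are illustrative assumptions:

```python
import numpy as np

def bspline_basis(x, grid, k):
    """All degree-k B-spline basis values at scalar x via the
    Cox–de Boor recursion over the knot vector `grid`."""
    # degree 0: indicator functions of the grid intervals
    B = np.array([float(grid[i] <= x < grid[i + 1]) for i in range(len(grid) - 1)])
    for d in range(1, k + 1):
        B = np.array([
            (x - grid[i]) / (grid[i + d] - grid[i]) * B[i]
            + (grid[i + d + 1] - x) / (grid[i + d + 1] - grid[i + 1]) * B[i + 1]
            for i in range(len(grid) - d - 1)
        ])
    return B

def silu(x):
    return x / (1.0 + np.exp(-x))

def kan_activation(x, w_b, c, grid, k=3):
    # phi(x) = w_b * b(x) + sum_i c_i * B_i(x), with b = SiLU
    return w_b * silu(x) + c @ bspline_basis(x, grid, k)

grid = np.linspace(-2.0, 2.0, 9)   # knot vector; yields 5 cubic basis functions
c = np.ones(len(grid) - 4)         # illustrative spline coefficients
y = kan_activation(0.25, w_b=1.0, c=c, grid=grid, k=3)
```

With all coefficients set to one, the spline term reduces to the basis sum, which equals one in the interior of the grid (partition of unity), a convenient sanity check for the recursion.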
An extension of the KAN architecture that integrates sinusoidal embeddings with MLP layers is the FCN-KAN. Its univariate function is defined as
$\phi(x) = \sin(\omega x)$
In this equation, $\omega$ is a learnable coefficient representing the frequency of the sinusoidal functions and $x$ is the input value.
KAN layers are structured as matrices of such spline functions, enabling compositional multivariate mapping [
18,
19]. Computationally, B-splines support grid adaptation to enhance pattern capture [
20], but incur higher complexity due to iterative basis calculations, limiting scalability in deep networks [
21,
22]. Practical implementations optimize spline order and grid points to balance performance and cost [
23,
24]. Furthermore, KANs reduce computational complexity by limiting node operations to simple additions and streamlining the computation process [
25].
RBF-KAN is a faster variant of Spline-KAN that approximates the higher-order B-spline basis with Gaussian RBFs, described as
$\phi(r) = \exp\!\left(-\frac{r^2}{2h^2}\right)$
In this equation, $r$ is the radial distance between the input $x$ and a grid center point $c$, calculated as $r = \lVert x - c \rVert$. The parameter $h$ denotes the width or extent of the Gaussian function, specified by the grid range.
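A small sketch of this basis computation in NumPy; the grid centers and the choice of tying the width to the grid spacing are illustrative assumptions:

```python
import numpy as np

def rbf_basis(x, centers, h):
    """Gaussian RBF surrogate for the B-spline basis:
    exp(-r^2 / (2 h^2)) with r = |x - c| for each grid center c."""
    r = np.abs(x - centers)
    return np.exp(-r**2 / (2.0 * h**2))

centers = np.linspace(-2.0, 2.0, 9)   # grid center points
h = centers[1] - centers[0]           # width tied to grid spacing
phi = rbf_basis(0.0, centers, h)
# a trainable RBF-KAN edge would then output c @ phi for learnable coefficients c
```

Unlike the iterative Cox–de Boor recursion, every basis value here is a single vectorized exponential, which is the source of the speedup.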
Polynomial basis functions, including orthogonal variants, offer alternatives to splines. Traditional polynomial KANs suffer from numerical instability (e.g., Runge’s phenomenon), but orthogonal polynomials (e.g., Chebyshev, Jacobi) mitigate this with bounded outputs and improved derivatives. Chebyshev-KAN approximates functions as follows:
$\phi(x) = \sum_{k=0}^{n} \theta_k T_k(\tilde{x})$
where $\tilde{x}$ is the normalized input value of $x$ (e.g., $\tilde{x} = \tanh(x)$), $T_k$ represents the Chebyshev polynomials of the first kind, and $\theta_k$ are learnable coefficients.
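This map can be sketched with the standard three-term recurrence for Chebyshev polynomials; the tanh normalization and the coefficient values in the usage example are illustrative assumptions:

```python
import numpy as np

def cheby_kan_feature(x, theta):
    """Chebyshev-KAN univariate map: sum_k theta_k * T_k(x_tilde),
    with x_tilde = tanh(x) squashing the input into [-1, 1] and
    T_k built from the recurrence T_{k+1} = 2*x*T_k - T_{k-1}."""
    x_t = np.tanh(x)
    T = [np.ones_like(x_t), x_t]          # T_0 = 1, T_1 = x
    for _ in range(2, len(theta)):
        T.append(2.0 * x_t * T[-1] - T[-2])
    return sum(t * Tk for t, Tk in zip(theta, T))
```

For example, `cheby_kan_feature(x, [0.0, 1.0, 0.0])` reduces to T_1(tanh(x)) = tanh(x); because every T_k is bounded in [-1, 1] on the normalized input, the output cannot blow up the way an unnormalized monomial basis can.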
Instead of spline coefficients, the Naive-Fourier-KAN uses 1-D Fourier coefficients. By using the Fourier series, this layer approximates the KAN univariate functions. The definition of this layer is as follows:
$\phi(x) = \sum_{k=1}^{G} \left( a_k \cos(kx) + b_k \sin(kx) \right)$
Here $a_k$ and $b_k$ are Fourier coefficients and $x$ is the input value. The number of Fourier terms (grid size) is $G$.
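A minimal sketch of this layer's univariate map; the coefficient values are illustrative assumptions:

```python
import numpy as np

def fourier_kan_feature(x, a, b):
    """Naive-Fourier-KAN univariate map:
    phi(x) = sum_{k=1..G} a_k*cos(k*x) + b_k*sin(k*x)."""
    k = np.arange(1, len(a) + 1)
    return np.sum(a * np.cos(k * x) + b * np.sin(k * x))

# G = 3 Fourier terms with illustrative coefficients
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 0.5, 0.0])
y = fourier_kan_feature(np.pi / 2.0, a, b)
```

In a trainable layer, `a` and `b` would be learned per edge; the grid size G is simply the length of the coefficient vectors.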
Recent advancements in enhanced KAN architectures focus on adaptive and hybrid designs. These designs integrate adaptive basis functions, attention mechanisms, and combinations of established deep learning components. The CKAN is one of the most basic such structures: it combines the nonlinear modeling advantages of KAN with the spatial feature extraction capabilities of convolution operations, breaking through the performance bottlenecks of traditional CNNs in feature representation and parameter efficiency.
Figure 2 shows the structure of a basic CKAN model.
Another one, the KAN-MHA model, combines adaptive basis functions with MHA, enabling focused learning on critical flow regions (e.g., leading/trailing edges of airfoils) via parallel attention heads, significantly improving prediction accuracy and consistency in flow field modeling [
26].
Figure 3 illustrates the overall structure of the model.
Hybridization with Transformers has emerged as a key strategy, with models like TFKAN and MTF-AViTK replacing traditional MLP layers in Transformers with KAN layers. Owing to the compact representation of KAN layers, TFKAN reduces parameter counts by 78% compared to conventional Transformers while maintaining high accuracy in IoT intrusion detection, demonstrating the efficiency of combining a Transformer with KAN’s adaptive representations [
9]. MTF-AViTK integrates KAN into an adaptive ViT for tool wear recognition, using self-attention for feature extraction and KAN in the classification layer to handle complex nonlinear mappings [
27]. KANEFT further advances this by introducing Linear-KAN and Convolutional-KAN layers within a Transformer framework for hyperspectral image classification, where Convolutional-KAN enables multi-scale local feature extraction alongside spatial attention for global context integration [
28].
For sequence and temporal data, hybrid KAN models with RNNs and TCNs have been developed. In BiLSTM-KAN, the temporal dependency capture of LSTM is integrated with the nonlinear feature extraction of KAN, thus increasing the robustness of time series forecasting, while TCN-KAN incorporates the multi-scale local dependency modeling of TCN with KAN’s adaptive transformations to reduce overfitting when predicting electricity demand [
29]. A KAN model combined with BiLSTM and TCN is shown in
Figure 4.
Integration with GNNs is exemplified by MKAN-MMI, which incorporates KAN into a masked graph autoencoder for medicine–microbe interaction prediction. KAN enhances weight learnability and interpretability in the prediction layer. It outperforms current leading models in capturing complex nonlinear mappings within sparse biological datasets [
30]. The GNN encoder obtains node embeddings based on the input graph, while the decoder reconstructs the graph based on a set of known relationships between nodes.
Figure 5 illustrates the MKAN-MMI model’s architecture.
Ablation studies validate the efficacy of these designs: the KAN-MSA module improves PSNR compared to standard MSA [31]. Likewise, the KAN layers in KANEFT [28] and HKAN [32] are critical for superior performance in hyperspectral image classification and fabric defect segmentation, respectively.
7. Challenges, Limitations, and Future Research Directions
KANs face challenges in several critical areas. Robustness issues are evident, such as insufficient robustness in tool wear condition recognition [
27], susceptibility to adversarial attacks and noise with a risk of underperformance compared to MLPs in complex or noisy conditions [
89], and persistent noise resilience concerns despite partial robustness demonstrations [
11]. Ultrasound image segmentation suffers when regions of interest have low contrast with the surrounding tissues [
56].
KANs replace MLPs’ linear weights with learnable nonlinear basis functions. This offers significant theoretical advantages in function approximation accuracy and interpretability. However, realizing these benefits often relies on optimizing the basis function parameters, which increases computational costs. As a result, KANs remain ill-suited to scenarios where processing the data itself already demands substantial computational power. In practice, additional considerations such as hardware configuration and energy consumption may be required.
Optimization instability poses another major hurdle. The loss of PIKANs increases sharply after grid extensions due to the re-initialization of the optimizer state [
6]. When the KAN module is removed from models like KRN-DTI, performance may be significantly reduced [
90]. PINNs are also prone to instability due to penalty factor inflation [
38].
Scalability challenges remain: larger KANs may impede interpretability; KAN technology exhibits high computational complexity; and KAN-based topology optimization models, such as KATO, require large amounts of VRAM due to dense tensor representations, restricting high-resolution applications [37]. While variants like KAN-Mixer and convolutional KANs have been explored, the scalability results remain mixed.
Handling complex data geometries is also problematic, with KANs struggling under adversarial attacks in such settings. Ultrasound images present challenges like low contrast, fuzzy boundaries, and scale variations. Additionally, tuning hyperparameters, including grid sizes and spline orders, is critical, since these hyperparameters significantly affect optimization results.
For future research, the studies summarized above indicate that KAN training efficiency remains low, necessitating the optimization of basis functions and algorithms to improve computational efficiency. Research into lightweight KAN variants suitable for edge devices holds significant practical value, as it advances their real-time application to intelligent sensors, edge controllers, and similar equipment. Moreover, the current selection and combination of basis functions in KANs remain relatively fixed. Future research could explore dynamic adaptive mechanisms for basis functions to enhance the capture of multi-scale, complex nonlinear features. Further exploration of fusion across different model architectures may also be pursued, leveraging their respective strengths to construct high-performance frameworks tailored to specific domains.
8. Conclusions
The KAN framework is underpinned by the Kolmogorov–Arnold representation theorem as its core theoretical foundation, whereas the MLP is built upon the universal approximation theorem. The Kolmogorov–Arnold representation theorem states that any multivariate continuous function can be decomposed into a finite superposition of univariate functions; the core of the KAN architecture therefore lies in the combined learning of univariate functions. The universal approximation theorem states that a feedforward network with a single hidden layer and a nonlinear activation function can approximate any continuous function on a compact set to arbitrary precision, providing the theoretical basis for the MLP’s ability to approximate functions through fixed nonlinear activations combined with weight adjustment. Across diverse function classes, including smooth functions and those derived from physical equations, KAN demonstrates distinct advantages owing to its unique structural design. With this structural innovation, nonlinear transformation and weight parametrization are integrated into unified layers, improving functional representation capability, flexibility, and interpretability. It has been shown that KANs extended with adaptive basis functions and hybridized with Transformers, CNNs, RNNs, and graph structures improve accuracy and parameter efficiency across a variety of domains, including signal processing, physical modeling, and cybersecurity. Ablation analysis shows that LKAN can improve deep learning models while reducing parameters [
91].
The use of KANs to diagnose and monitor industrial faults exploits their interpretability and efficiency for real-time condition assessments of machinery, transformers, and power electronics. There are also numerous image processing applications, such as medical image analysis, low-light image enhancement, etc. There are, of course, applications in other fields, such as stock price forecasting [
92] and dam deformation prediction [
93]. For quantum architecture search, KANs demonstrated 2× to 5× higher success probability than MLPs in preparing Bell and GHZ states under noiseless conditions and better fidelity in noisy environments [
94]. In fraud detection, KANs’ hierarchical structure provides a more effective and interpretable solution over MLPs, with Efficient KAN achieving a 100% F1 score and 0.1 s detection time [
95]. In [
96], CoxKAN is described as a Cox proportional hazards KAN designed for high-performance, interpretable survival analysis. Based on real datasets, CoxKAN consistently outperformed the traditional Cox proportional hazards model. Multi-exit KAN is another interesting model [
97]. By including a prediction branch at every layer, the network can make accurate forecasts at multiple depths simultaneously, offering practitioners a practical route to both high performance and interpretability. As the aforementioned examples illustrate, KANs can be applied across a wide range of fields and generate valuable results.
According to comparative evaluations, KANs outperform mainstream neural architectures in many respects, including accuracy, parameter efficiency, interpretability, and training dynamics, although there is a trade-off in computational complexity and occasional optimization instability. It should be noted that some studies are based on datasets collected by the authors themselves, which may pose obstacles for researchers seeking to reproduce the results. In summary, KANs are theoretically grounded and practically versatile neural networks with demonstrated applications in scientific computing, engineering, medicine, environmental science, and cybersecurity. For KANs to fully realize their potential in complex function approximation and interpretable machine learning, continued innovation and systematic investigation of their limitations are essential.