In this study, we evaluate the effectiveness of Kolmogorov–Arnold Networks (KANs) in the healthcare domain using a methodical and reproducible approach that can serve as a basis for further exploration. To this end, we designed a comparative methodology involving both classification and regression tasks. The datasets were preprocessed using standard techniques, including normalization, imputation of missing values, feature selection through recursive feature elimination (RFE), and signal transformation methods such as wavelet decomposition. KANs were implemented based on the Kolmogorov–Arnold representation theorem, utilizing spline-based learnable functions in place of traditional activation functions. For benchmarking, conventional Artificial Neural Networks (ANNs) were trained using equivalent data splits and tuned hyperparameters. Both models were evaluated using appropriate metrics: accuracy, precision, recall, and F1 score for classification tasks, and mean squared error (MSE) and mean absolute error (MAE) for regression. Special emphasis was placed on interpretability by extracting symbolic expressions from the trained KANs, offering insight into the underlying data relationships that are critical in clinical decision-making.
2.5. Model Development and Training
2.5.1. Theoretical Foundation of KANs
Kolmogorov–Arnold Networks (KANs) are fundamentally based on the Kolmogorov–Arnold representation theorem [3], which states that any multivariate continuous function can be decomposed into a finite sum of compositions of continuous univariate functions. This can be represented by the following equation:

$$f(\mathbf{x}) = f(x_1, \ldots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right) \quad (1)$$

In Equation (1) [3], $\Phi_q$ and $\phi_{q,p}$ are continuous univariate functions, $n$ represents the dimensionality of the input vector $\mathbf{x}$, and $x_p$ refers to the $p$-th component of $\mathbf{x}$. KANs leverage this theorem by replacing the traditional linear weights in neural networks with learnable univariate functions, thereby allowing more expressive and efficient approximations of complex multivariate functions. This foundational principle allows KANs to capture complex nonlinear relationships in high-dimensional data, making them well suited for healthcare applications [11].
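To make the structure of Equation (1) concrete, the following is a minimal numeric sketch (not the trained model): it evaluates a two-level sum of univariate functions for an $n$-dimensional input, with arbitrary example choices of $\phi_{q,p}$ and $\Phi_q$; the function `ka_representation` is hypothetical.

```python
import numpy as np

def ka_representation(x, phis, Phis):
    """Evaluate f(x) = sum_q Phi_q( sum_p phi_{q,p}(x_p) ).

    x    : 1-D input vector of length n
    phis : list of 2n+1 lists, each holding n univariate inner functions
    Phis : list of 2n+1 univariate outer functions
    """
    return sum(
        Phi(sum(phi(xp) for phi, xp in zip(inner, x)))
        for Phi, inner in zip(Phis, phis)
    )

# Toy example with n = 2 (so 2n + 1 = 5 terms); the function choices
# here are arbitrary illustrations, not learned components.
n = 2
x = np.array([0.3, -1.2])
phis = [[np.tanh, np.sin] for _ in range(2 * n + 1)]
Phis = [np.cos for _ in range(2 * n + 1)]
print(ka_representation(x, phis, Phis))
```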
2.5.2. KAN Architecture
KANs take a different approach from traditional neural networks by replacing fixed linear weights with flexible, learnable functions $\phi_{i,j}$, which are modeled using splines. This means that each neuron in a KAN layer produces its output by applying these flexible one-dimensional functions to its inputs, thereby allowing the network to capture complex patterns more effectively [3]. The output of a neuron is computed using the following Equation (2):

$$y_j = \sum_{i} \phi_{i,j}(x_i) + b_j \quad (2)$$

Here, $y_j$ is the output of neuron $j$, $x_i$ is the $i$-th input to the neuron, $\phi_{i,j}$ is the learnable univariate function mapping input $i$ to neuron $j$, and $b_j$ is the bias term associated with neuron $j$. This formula allows each neuron's activation to be a nonlinear transformation of its inputs, thereby providing more flexibility [11] than traditional neural network architectures, which rely on fixed activation functions [5].
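As an illustration of Equation (2), the sketch below implements one KAN-style layer in PyTorch, with each $\phi_{i,j}$ approximated by a small learnable basis expansion rather than the full B-spline machinery of the actual codebase; `ToyKANLayer` and its parameters are hypothetical.

```python
import torch
import torch.nn as nn

class ToyKANLayer(nn.Module):
    """One KAN-style layer: y_j = sum_i phi_{i,j}(x_i) + b_j.

    Here each phi_{i,j} is a learnable combination of fixed Gaussian
    basis functions (a stand-in for the spline parameterization).
    """
    def __init__(self, in_dim, out_dim, n_basis=5):
        super().__init__()
        self.coef = nn.Parameter(torch.randn(in_dim, out_dim, n_basis) * 0.1)
        self.bias = nn.Parameter(torch.zeros(out_dim))
        self.centers = torch.linspace(-2, 2, n_basis)  # fixed basis centers

    def forward(self, x):                      # x: (batch, in_dim)
        # Basis expansion of each scalar input: (batch, in_dim, n_basis)
        basis = torch.exp(-(x.unsqueeze(-1) - self.centers) ** 2)
        # phi_{i,j}(x_i) summed over inputs i: (batch, out_dim)
        return torch.einsum("bip,iop->bo", basis, self.coef) + self.bias

layer = ToyKANLayer(in_dim=4, out_dim=3)
print(layer(torch.randn(8, 4)).shape)  # torch.Size([8, 3])
```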
The implementation of KANs involves several components to ensure efficient training and interpretability:
KANLayer class: This class constructs network layers where each connection is represented by a univariate function modeled with splines. These splines are initialized using adaptive grids that reflect the distribution of the input data, which allows the model to better capture important patterns by placing more knots in areas where the data is more concentrated [
3].
Symbolic_KANLayer class: To enhance the interpretability of the model, the Symbolic_KANLayer class translates learned univariate functions into symbolic expressions. These expressions offer a human-readable view of how the network processes information, which is valuable in clinical applications where understanding the model's decision-making is crucial [11].
Grid initialization and updates: The grids that define the spline functions are initially set to span the entire range of the input data. As training proceeds, these grids are periodically adjusted to better fit the evolving univariate functions. The parameter grid_eps controls how these adjustments are made, striking a balance between uniform spacing and adaptive knot placement based on the data distribution [5]. In our experiments, we chose one global grid value for the entire network, 3 knots for the Heart model, 5 for the Liver model, and 7 for the CGM model, so that every neuron uses the same knot count and grid sizes are not mixed within one model.
Pruning and fine-tuning: To improve efficiency and generalization of the model, neurons and connections with low activation magnitudes are pruned by removing parts of the network that contribute little to the performance. After pruning, the remaining parameters are fine-tuned to ensure that the model continues to perform well. This process helps to maintain strong predictive accuracy while reducing computational demands and minimizing the risk of overfitting [
9].
By combining these elements, the KAN architecture strikes a thoughtful balance between computational efficiency and interpretability. This makes it well-suited for handling complex challenges in healthcare data analysis, where both accuracy and transparency are essential.
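Putting these components together, constructing such a model with the pykan reference library (which provides the KANLayer and Symbolic_KANLayer classes described above) looks roughly like the sketch below. The layer widths are illustrative placeholders, and the constructor arguments reflect pykan v0.x and may differ across versions.

```python
from kan import KAN  # pykan library

# Illustrative layer widths; the grid argument sets the global knot count
# (3 for Heart, 5 for Liver, 7 for CGM in our setup), k the spline order.
heart_model = KAN(width=[13, 5, 1], grid=3, k=3, grid_eps=0.02, seed=0)

# After initial training, low-impact units can be removed and fine-tuned.
pruned_model = heart_model.prune()
```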
2.5.3. Training Procedure
The training of Kolmogorov–Arnold Networks (KANs) follows a two-stage procedure. For classification tasks, we first train the full network with LBFGS, chosen for its ability to handle large parameter spaces and converge quickly [3], for several dozen epochs. We then prune low-impact connections and switch to Adam with a small learning rate, light weight decay, and gradient clipping to fine-tune the remaining weights. For regression tasks, we simplify the procedure by using Adam for a longer initial training pass, followed by pruning and a shorter Adam-based fine-tuning run under the same hyperparameters.
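A schematic of this two-stage recipe for a classification model is shown below, using plain PyTorch optimizers rather than the KAN codebase's own training loop; the function `train_two_stage`, the epoch counts, and the hyperparameter values are placeholders, not the tuned values from Table 1.

```python
import torch

def train_two_stage(model, x, y, loss_fn):
    """Stage 1: LBFGS on the full network; Stage 2: prune, then Adam fine-tune."""
    # Stage 1: LBFGS requires a closure that re-evaluates the loss.
    lbfgs = torch.optim.LBFGS(model.parameters(), lr=1.0, max_iter=20)
    def closure():
        lbfgs.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        return loss
    for _ in range(50):          # "several dozen epochs"
        lbfgs.step(closure)

    # ... prune low-activation neurons/connections here (see Section 2.5.2) ...

    # Stage 2: Adam fine-tune with small lr, light weight decay, grad clipping.
    adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
    for _ in range(20):
        adam.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        adam.step()
```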
For optimization of KANs, different loss functions were employed based on the tasks performed:
Classification tasks: Binary Cross-Entropy (BCE) loss was utilized to optimize the separation between classes, as this loss function measures the performance of a classification model whose output is a probability between 0 and 1.
Regression tasks: Mean Squared Error (MSE) loss was employed to minimize the difference between predicted values and actual target values; MSE is sensitive to outliers and penalizes larger errors more heavily than smaller ones.
For regularization, we applied an L2 penalty, controlled by a regularization-strength parameter, to limit model complexity and prevent overfitting. This penalty discourages large weights, enhancing generalization. In addition, the spline-based architecture provides inherent regularization by enforcing smooth transformations and limiting abrupt changes in the learned functions. Thus, both the L2 penalty and the spline constraints promote robust performance while minimizing the risk of overfitting.
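In PyTorch terms, the task-dependent losses and the L2 penalty can be set up as in the following minimal sketch; the learning rate and weight-decay values are placeholders, not the tuned hyperparameters from Table 1.

```python
import torch
import torch.nn as nn

# Task-dependent losses.
bce = nn.BCELoss()   # classification: model outputs a probability in (0, 1)
mse = nn.MSELoss()   # regression: penalizes larger errors quadratically

# L2 regularization via weight decay (values are placeholders):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Example: BCE expects probabilities and float targets.
probs   = torch.tensor([0.9, 0.2, 0.7])
targets = torch.tensor([1.0, 0.0, 1.0])
print(bce(probs, targets))  # mean binary cross-entropy
```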
The hyperparameters, including learning rates, regularization strengths, batch sizes, and spline grid sizes, were tuned using a grid search approach (
Table 1). This systematic exploration of hyperparameter combinations was combined with cross-validation so that the selected parameters generalized well to unseen data, thereby improving the model’s robustness and performance.
To further prevent overfitting, an early stopping mechanism with a patience of 10 epochs was applied to monitor validation loss and halt training when no improvement was observed. In addition, spline grids were periodically updated during the early stages of training, allowing the model to adapt to the data distribution and improving its ability to capture complex patterns.
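A minimal early-stopping loop of the kind described (patience of 10 epochs on validation loss) is sketched below; the helpers `train_epoch` and `val_loss` are hypothetical stubs standing in for the actual training and evaluation code.

```python
import math

def train_epoch(model):   # hypothetical: one pass of the current stage
    pass

def val_loss(model):      # hypothetical: compute validation loss
    return 1.0

model = None              # stand-in for the KAN under training
best_val, patience, wait = math.inf, 10, 0
for epoch in range(200):  # upper bound on epochs
    train_epoch(model)
    current = val_loss(model)
    if current < best_val:
        best_val, wait = current, 0   # improvement: reset the patience counter
    else:
        wait += 1
        if wait >= patience:          # 10 epochs with no improvement: stop
            break
```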
The combination of suitable optimizers (LBFGS and Adam), carefully chosen loss functions, and strong regularization techniques ensures that KANs are trained effectively, achieving a balance between model complexity and predictive accuracy. These training strategies are essential for enabling KANs to capture complex patterns in healthcare data.
Random weight initialization was made reproducible by fixing the RNG seed to 0 for the Heart dataset and to 123 for the Liver and CGM datasets.
2.5.4. Total Training Steps
To make sure each model’s “training effort” is comparable, we count every gradient update as one training step:
For the Heart experiments, with 242 training samples and a mini-batch size of 32, this gives $\lceil 242/32 \rceil = 8$ steps per epoch. The Liver and CGM experiments use full-batch updates (1 step per epoch). As shown in
Table 2, each model is trained for an equivalent number of gradient-update steps to ensure a fair comparison.
2.5.5. Model Parameter Comparison: KAN vs. ANN
Although Kolmogorov–Arnold Networks (KANs) introduce a parameter overhead compared with size-matched Artificial Neural Networks (ANNs), they deliver substantially better predictive performance.
Table 3 summarizes the raw parameter counts for each model across the three tasks.
As shown in
Table 3, KANs use between 3.8× and 7.7× more parameters than their ANN counterparts. However, this additional capacity is put to good use: by embedding spline-based activation functions, KANs capture complex nonlinearities far more effectively, resulting in higher accuracy and F1-scores (see
Section 3.2,
Section 3.3 and
Section 3.4). In practice, the slight increase in feed-forward cost is more than offset by the performance gains, yielding a superior performance-per-parameter trade-off.
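To see where the overhead in Table 3 comes from, the following back-of-the-envelope comparison counts parameters per fully connected layer, under the assumption that each KAN edge carries roughly $G + k$ spline coefficients ($G$ grid intervals, spline order $k$), whereas an ANN edge carries a single weight; exact counts depend on the implementation's additional per-edge scale and bias terms. The helper functions and the 13-16-1 layer widths are hypothetical.

```python
def ann_layer_params(n_in, n_out):
    """Dense layer: one weight per edge plus one bias per output."""
    return n_in * n_out + n_out

def kan_layer_params(n_in, n_out, grid, k):
    """KAN layer (rough count): ~(grid + k) spline coefficients per edge."""
    return n_in * n_out * (grid + k)

# Hypothetical 13-16-1 model on the Heart task with grid = 3, k = 3:
print(ann_layer_params(13, 16) + ann_layer_params(16, 1))               # 241
print(kan_layer_params(13, 16, 3, 3) + kan_layer_params(16, 1, 3, 3))   # 1344
```

Under these rough assumptions the KAN variant carries about 5.6 times as many parameters, consistent with the 3.8 to 7.7 times range reported in Table 3.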
2.5.6. Advantages of the KAN Code Implementation
The KAN codebase incorporates several techniques and optimizations, especially for the spline implementation. One key feature is efficient spline computation, which leverages B-spline basis functions for fast evaluation. In addition, the B_batch function enhances performance by computing these basis functions in parallel across inputs.
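For intuition, a vectorized Cox-de Boor recursion of the kind B_batch implements might look like the sketch below; `bspline_basis` is a simplified NumPy version for illustration, not the codebase's actual function.

```python
import numpy as np

def bspline_basis(x, grid, k):
    """Evaluate all order-k B-spline basis functions at the points x.

    x    : (batch,) evaluation points
    grid : (m,) non-decreasing knot vector
    k    : spline order (k = 0 gives piecewise-constant boxes)
    Returns an array of shape (batch, m - k - 1).
    """
    x = x[:, None]
    # Order 0: indicator of each knot interval.
    B = ((x >= grid[None, :-1]) & (x < grid[None, 1:])).astype(float)
    # Cox-de Boor recursion raises the order one step at a time.
    for d in range(1, k + 1):
        left = (x - grid[None, :-d - 1]) / (grid[None, d:-1] - grid[None, :-d - 1])
        right = (grid[None, d + 1:] - x) / (grid[None, d + 1:] - grid[None, 1:-d])
        B = left * B[:, :-1] + right * B[:, 1:]
    return B

knots = np.linspace(-1, 1, 8)                 # uniform grid for illustration
print(bspline_basis(np.array([0.0, 0.5]), knots, k=3).shape)  # (2, 4)
```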
Another key advantage is the use of adaptive and dynamic grids, which adjust based on data distribution. This allows the model to better capture complex patterns. By continuously updating during training, the grids help to ensure the spline functions remain effective and well-aligned with the data.
Flexible activation modeling is another key strength of KANs. By representing activations as spline functions, the network can approximate a wide range of nonlinear behaviors, thereby providing greater flexibility than traditional fixed activations. Learnable spline coefficients and grids allow the model to adapt activations to the data and improve its expressive power.
Moreover, the KAN codebase includes seamless integration with PyTorch (v2.4.0) Autograd. The spline functions are fully compatible with PyTorch’s automatic differentiation, enabling smooth backpropagation through spline layers and making gradient computation and optimization more efficient.
Finally, scalability is a key advantage of KANs. Batched computations and efficient implementations enable the model to handle large datasets and complex architectures without high computational costs, thus making it well-suited for real-world, large-scale healthcare applications.
2.5.7. Symbolic Function Extraction
A key advantage of KANs is their ability to extract symbolic expressions from the learned univariate functions, improving interpretability. This is crucial in healthcare applications, where understanding the relationships between variables is essential.
To enable symbolic function extraction, we used the Symbolic_KANLayer class, which manages symbolic activation functions and their parameters, including input/output dimensions, function representations, and affine transformations. Key methods of the class include forward(x), which computes the layer's output by applying affine transformations and evaluating symbolic functions; fix_symbolic(i, j, fun_name, x = None, y = None), which assigns a specific symbolic function to an activation and fits its parameters if data is provided; and get_subset(in_id, out_id), which extracts a subset of the layer for pruning.
Step 1: For each learned activation function, a list of candidate symbolic functions is generated from a predefined library (e.g., sin, tanh, and other elementary functions). The suggest_symbolic method evaluates each candidate by fitting it to the learned spline function and computing the coefficient of determination ($R^2$).
Step 2: When a candidate symbolic function is selected, we fit its parameters to match the learned spline function. This involves optimizing the affine transformation parameters $a$, $b$, $c$, and $d$ in Expression (3):

$$c \, f(a x + b) + d \quad (3)$$

where $f$ is the symbolic function [11]. The fitting process minimizes the mean squared error between the outputs of the spline function and the symbolic function over the input data.
Step 3: Once a symbolic match is chosen, the spline activation function is replaced by the symbolic function along with its learned parameters using the fix_symbolic method in Symbolic_KANLayer.
Step 4: The model is updated to incorporate the symbolic activations, and this process is applied to all activation functions in the network. The auto_symbolic method simplifies this by automating the procedure across the network.
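In code, this four-step procedure reduces to a few calls, as in the sketch below; the method names and signatures follow the pykan-style API (v0.x) and may vary across versions, and `model` is assumed to be an already-trained KAN.

```python
# Assumes `model` is a trained KAN (pykan-style API; names may vary by version).
lib = ['x', 'x^2', 'sqrt', 'tanh', 'sin', 'abs']   # candidate symbolic library

model.auto_symbolic(lib=lib)          # Steps 1-4: suggest, fit, and fix
                                      # a symbolic function per activation
formula = model.symbolic_formula()    # closed-form expression(s) of the network
print(formula)
```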
In our experiments with the Heart Disease dataset, we applied this symbolic function extraction process to derive an interpretable model. The symbolic library included functions such as x, sqrt, tanh, sin, and abs. The extracted symbolic formula is a complex expression involving polynomial terms, hyperbolic tangent, and absolute value functions, an excerpt of which is shown in Expression (4).
Expression (4) represents the nonlinear relationship between the input features and the target variable, where each term corresponds to a transformed combination of input features, offering valuable insight into how different factors influence the risk of heart disease.
The extraction of symbolic functions provides several benefits:
Interpretability: Symbolic formulas improve the transparency in the model’s decision-making process, which is essential for healthcare applications.
Simplification: By replacing complex spline functions with simpler symbolic ones, we can reduce the model’s complexity without sacrificing performance.
Insight into feature importance: The coefficients and terms in the symbolic expressions show the most critical features and how they interact.
Several implementation details contribute to the robustness and flexibility of the symbolic function extraction:
Symbolic Library: A predefined set of functions is used for symbolic regression, which can be customized for a specific problem.
SymPy Integration: The Symbolic_KANLayer uses SymPy for symbolic manipulation, thereby helping to simplify and evaluate the symbolic expressions.
Error Handling: The implementation includes mechanisms to manage cases where fitting fails or numerical issues arise to ensure overall robustness.
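As an illustration of the SymPy integration, the snippet below simplifies and evaluates a made-up expression of the kind produced by symbolic extraction; it is a generic example, not the actual extracted Heart formula.

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
# A made-up expression of the kind produced by symbolic extraction.
expr = 0.8 * sp.tanh(1.2 * x1 - 0.3) + 0.5 * sp.Abs(x2) + 0.1 * x1 * x2

simplified = sp.simplify(expr)            # algebraic simplification
value = expr.subs({x1: 0.5, x2: -1.0})    # numeric evaluation of the formula
print(simplified, float(value))
```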
2.7. Advantages of KANs
Compared with conventional neural network architectures, Kolmogorov–Arnold Networks (KANs) offer notable improvements. By replacing fixed activation functions with learnable univariate functions, KANs improve both accuracy and interpretability, making them more effective for modeling complex nonlinear relationships in healthcare, where both predictive performance and transparency are critical.
Compared with MLPs, KANs exhibit faster neural scaling, allowing them to achieve strong performance with fewer parameters. This reduces both overfitting and computational cost. Moreover, KANs offer improved interpretability through more intuitive visualizations and interactions, which is particularly valuable for clinical decision-making.
In comparison with CNNs, which are primarily used for image-based healthcare data, KANs offer a versatile approach applicable to both structured tabular data and unstructured time-series data. This adaptability, combined with the grounding in the Kolmogorov–Arnold representation theorem, makes KANs a powerful tool for a variety of healthcare applications.