Article

EKAResNet: Enhancing ResNet with Kolmogorov–Arnold Network-Based Nonlinear Feature Mapping

Department of Physics and Electrical Engineering, Mudanjiang Normal University, Mudanjiang 157000, China
*
Author to whom correspondence should be addressed.
Computation 2025, 13(11), 248; https://doi.org/10.3390/computation13110248
Submission received: 7 September 2025 / Revised: 7 October 2025 / Accepted: 18 October 2025 / Published: 22 October 2025
(This article belongs to the Section Computational Engineering)

Abstract

Residual Networks (ResNet) address the vanishing gradient problem through skip connections and have become a fundamental architecture for computer vision tasks. However, standard convolutional layers exhibit limited capacity in modeling complex nonlinear relationships. We present EKAResNet, a residual backbone enhanced with a spline-based Kolmogorov–Arnold Network (KAN) head. Specifically, we introduce a KAN-based Feature Classification Module (KAN-FCM) that replaces a portion of the traditional fully connected classifier. This module employs piecewise polynomial (spline) approximation to achieve adaptive nonlinear mapping while maintaining a controlled parameter budget. We evaluate EKAResNet on CIFAR-10 and CIFAR-100, achieving top accuracies of 95.84% and 80.06%, respectively. Importantly, the model maintains a parameter count comparable to strong ResNet and WideResNet baselines. Ablation studies on spline configurations further confirm the contribution of the KAN head. These results demonstrate the effectiveness of integrating KAN structures into ResNet for modeling high-dimensional, complex features. Our work highlights a promising direction for designing deep learning architectures that balance accuracy and computational efficiency.

1. Introduction

Deep neural networks have achieved remarkable success in computer vision, largely attributed to increased model depth that enhances representational capacity [1]. However, deeper architectures face significant training challenges, including vanishing gradients and performance degradation. To address these issues, Residual Networks (ResNet) introduced shortcut connections that facilitate gradient flow, enabling the effective training of substantially deeper networks [2]. ResNet has achieved state-of-the-art performance across numerous computer vision tasks, establishing itself as a cornerstone of modern deep learning architectures [3].
In recent years, ResNet and its variants have made considerable progress in addressing training-related challenges, particularly vanishing and exploding gradients [4]. Despite these advances, conventional convolutional layers remain limited in their ability to model high-dimensional, complex, and nonlinear feature distributions [5]. Although architectures such as DenseNet and Inception have improved feature extraction through enhanced connectivity and multi-path designs, ResNet continues to be one of the most widely adopted backbones. This popularity stems from its simplicity, modularity, and extensibility, which make it an ideal baseline for integrating and evaluating new modules. The straightforward residual structure allows flexible incorporation of additional mechanisms with minimal architectural complexity.
Recent research has increasingly emphasized the enhancement of nonlinear feature modeling capabilities. Traditional CNN architectures primarily rely on fixed nonlinear activations such as ReLU, which limits their capacity to adaptively capture complex patterns. To overcome these limitations, various advanced methods have been explored, including novel activation functions [6], dynamic convolutional modules [7], and purely MLP-based architectures such as MLP-Mixer and PolyMLP [8]. Recently, spline-based and kernelized networks have attracted considerable attention for their ability to perform adaptive nonlinear transformations using learnable spline functions. Reinhardt et al. proposed SineKAN, a Kolmogorov–Arnold Network (KAN) variant that employs sinusoidal activation functions to achieve superior performance and improved interpretability with fewer parameters than conventional MLPs [9]. Bodner et al. introduced Conv-KAN, which extends KAN by integrating spline-based transformations directly into convolutional layers, achieving comparable accuracy in vision tasks with significantly fewer parameters [10]. Zheng et al. proposed Free-Knots KAN, which investigates the impact of spline knot placement and offers novel approaches to enhance training stability and parameter efficiency [11].
Despite these advances, many existing spline-based and nonlinear adaptive approaches face challenges, including increased computational overhead and difficulties in seamless integration into widely adopted CNN backbones like ResNet. Effectively enhancing the nonlinear modeling capability of standard architectures therefore remains an open research question.
To address these limitations, this study proposes integrating the Kolmogorov–Arnold Network (KAN) into the ResNet architecture [12]. By leveraging KAN’s adaptive nonlinear transformations, our framework aims to enhance feature representation and improve classification performance in high-dimensional visual recognition tasks. Through systematic experiments and ablation studies, we demonstrate that integrating KAN modules within the ResNet backbone improves modeling of complex nonlinear features while maintaining efficiency and scalability.
Specifically, we integrate KAN into ResNet’s fully connected layer to improve its capacity for modeling complex nonlinear features [13]. After global features are extracted by the convolutional backbone, KAN performs nonlinear transformations to augment the network’s representational power [14]. In the original ResNet architecture, the final fully connected (FC) layer serves as the classification head. In our proposed design, this component is partially replaced by a KAN-based Feature Classification Module (KAN-FCM), which provides a unified framework for both nonlinear feature transformation and classification [15]. This architectural enhancement addresses the limitations of conventional convolutional layers in capturing high-dimensional, nonlinear patterns, leading to improved feature representation and classification performance [16].
We introduce EKAResNet, a novel model that integrates ResNet with KAN, and conduct a comparative evaluation against the standard ResNet architecture. Experimental results demonstrate that our model achieves superior classification performance with reduced computational cost. The main contributions of this work are as follows:
(1)
The model integrates ResNet and KAN to leverage KAN’s strengths in nonlinear modeling. By introducing parameterizable B-spline basis functions to map input features, the model enhances ResNet’s feature representation capability for high-dimensional complex data, thereby improving classification accuracy.
(2)
The model replaces selected fully connected layers with KAN-FCM, introducing a dynamic feature adjustment mechanism through adaptive kernel functions. By employing piecewise polynomial interpolation (spline approximation) for feature mapping, this approach reduces computational complexity while maintaining effective feature representation.

2. Method

This study proposes EKAResNet, a novel model designed for image classification tasks. The overall architecture, summarized in Algorithm 1, consists of four main stages:
(1)
Input preprocessing: The input image undergoes initial convolution, normalization, and pooling operations. These steps convert raw pixel data into a structured feature representation, effectively compressing redundant information while strengthening local feature expression [6].
(2)
Feature extraction: The preprocessed features are processed through a module that employs residual connections. This mechanism directly propagates input information to subsequent layers, enhancing feature propagation efficiency and mitigating the vanishing gradient problem in deep networks [7].
(3)
Feature enhancement: The extracted high-dimensional features undergo efficient nonlinear transformation through piecewise polynomial interpolation, improving feature separability [8].
(4)
Classification: KAN-FCM replaces the traditional fully connected layer, reducing computational redundancy and improving efficiency [17].
Algorithm 1: Generic pre-activation residual CNN with spline-enhanced classifier
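For readability, the pipeline of Algorithm 1 can be summarized in the following minimal PyTorch sketch. The channel widths, the simple convolutional stand-ins for the residual stages, and the placeholder linear layers for the KAN-FCM head are illustrative assumptions, not the exact released implementation.

```python
# Structural sketch of the EKAResNet pipeline (Algorithm 1); widths/depths assumed.
import torch
import torch.nn as nn

class EKAResNetSketch(nn.Module):
    def __init__(self, num_classes: int = 100):
        super().__init__()
        # (1) Input preprocessing: conv -> BN -> ReLU (pooling omitted here for brevity)
        self.stem = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(16),
            nn.ReLU(inplace=True),
        )
        # (2) Feature extraction: three residual stages (16 -> 32 -> 64 channels)
        self.stage1 = self._make_stage(16, 16, stride=1)
        self.stage2 = self._make_stage(16, 32, stride=2)
        self.stage3 = self._make_stage(32, 64, stride=2)
        self.gap = nn.AdaptiveAvgPool2d(1)           # global average pooling
        # (3)+(4) Feature enhancement and classification: KAN-FCM head
        # (plain linear layers stand in for the spline-based layers in this sketch)
        self.kan_encoder = nn.Linear(64, 256)
        self.kan_classifier = nn.Linear(256, num_classes)

    def _make_stage(self, c_in: int, c_out: int, stride: int) -> nn.Sequential:
        # Stand-in for a stack of pre-activation residual blocks.
        return nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.stem(x)
        f = self.stage3(self.stage2(self.stage1(f)))
        z = self.gap(f).flatten(1)
        return self.kan_classifier(self.kan_encoder(z))

if __name__ == "__main__":
    logits = EKAResNetSketch()(torch.randn(2, 3, 32, 32))
    print(logits.shape)  # torch.Size([2, 100])
```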

2.1. Initial Feature Extraction

To address the vanishing gradient problem and ensure stable gradient propagation during training, this study employs a skip-connection mechanism that directly propagates input features to subsequent layers. This strategy preserves original feature information, improves gradient flow, and enables effective training of deeper neural networks [7]. The input image first passes through an initial convolutional layer for low-level feature extraction. This stage converts raw pixel data into a structured, learnable feature representation that serves as the foundation for subsequent deep feature learning [18]. A convolutional kernel extracts local spatial features, followed by batch normalization and the rectified linear unit (ReLU) activation function to promote training stability and enhance nonlinear representation capacity.
Given an input image $X \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ denote the height and width of the image and $C$ the number of channels, the initial convolutional layer is computed as
$$F^{(0)} = \sigma\!\left(\mathrm{BN}\!\left(W_s * X + b_s\right)\right)$$
where $W_s$ is the 3 × 3 convolution kernel used to extract local image features; $b_s$ is a bias term; $\mathrm{BN}(\cdot)$ denotes the batch normalization layer, which normalizes the feature distribution to improve training stability; and $\sigma(\cdot)$ denotes the ReLU activation function, which enhances the network's nonlinear representation capacity for learning complex features.
Following initial feature extraction, max pooling is employed to downsample the feature maps. This operation serves two purposes: reducing computational complexity and emphasizing salient features. By selecting the maximum value within each local receptive field, max pooling preserves the most informative activations while discarding less relevant responses [19]. This operation eliminates redundant information, directing the network’s attention toward critical features and enhancing generalization capability. Additionally, by reducing the spatial dimensions of feature maps, it lowers computational overhead and provides a more compact and discriminative feature representation for subsequent residual blocks [20].
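A minimal PyTorch sketch of this stem is given below; the 16 output channels and the 2 × 2 pooling window are assumptions for illustration.

```python
# Stem: conv -> BN -> ReLU (F^(0) = sigma(BN(W_s * X + b_s))), then max pooling.
import torch
import torch.nn as nn

stem = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # W_s * X + b_s
    nn.BatchNorm2d(16),                          # BN(.)
    nn.ReLU(inplace=True),                       # sigma(.)
    nn.MaxPool2d(kernel_size=2, stride=2),       # keep max responses, halve resolution
)

f0 = stem(torch.randn(1, 3, 32, 32))             # pooled F^(0): shape (1, 16, 16, 16)
```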

2.2. Feature Extraction

The proposed network employs a multi-stage residual architecture consisting of three sequential stages (Stage 1, Stage 2, and Stage 3), as shown in Figure 1. Each stage comprises multiple stacked residual units that facilitate hierarchical extraction of multi-scale features. Each residual unit adopts a pre-activation design: BN → ReLU → Conv 3 × 3, followed by BN → ReLU → Conv 3 × 3. A dropout layer is inserted between the two convolutional operations to enhance generalization, particularly on small datasets. This configuration follows the pre-activation residual framework [21] and provides additional regularization to mitigate overfitting.
The residual blocks incorporate skip connections to address the vanishing gradient problem in deep networks. Let $F_\ell$ denote the input feature of the $\ell$-th layer. The output of the residual block is expressed as
$$F_{\ell+1} = S\!\left(F_\ell;\, s,\, C_{\mathrm{out}}\right) + H\!\left(F_\ell;\, W_\ell,\, s\right)$$
where $S(\cdot)$ denotes the shortcut projection (identity mapping when dimensions match, or a 1 × 1 convolution otherwise), and $H(\cdot)$ represents the residual transformation.
The residual transformation is implemented through two successive convolutional operations. Each operation is followed by batch normalization and a nonlinear activation function. The intermediate feature map obtained from the first convolution is denoted as U and can be expressed as
$$U = \mathrm{Conv}_K\!\left(\mathrm{Act}\!\left(\mathrm{BN}(F_\ell)\right);\, W_1,\, \mathrm{stride}=s\right)$$
where $K$ is the kernel size and $W_1$ is the weight parameter. A dropout operation is then applied for regularization:
$$\tilde{U} = \mathrm{Dropout}_{p_{\mathrm{drop}}}(U)$$
where $p_{\mathrm{drop}}$ denotes the dropout probability. Finally, the second convolution is performed as
$$H\!\left(F_\ell;\, W_\ell,\, s\right) = \mathrm{Conv}_K\!\left(\mathrm{Act}\!\left(\mathrm{BN}(\tilde{U})\right);\, W_2,\, \mathrm{stride}=1\right)$$
where $W_2$ represents the weight parameter of the second convolutional layer. As shown in Figure 2, Stage 1 maintains the spatial dimensions of the input feature map. This stage primarily focuses on extracting low-level semantic information through multiple convolutional operations while progressively expanding the receptive field. Since no spatial downsampling is applied at this stage, the original spatial relationships within the image are preserved. This preservation provides crucial structural information for subsequent feature extraction.
Stage 2 performs spatial downsampling using a convolutional operation with a stride of 2. This operation reduces the spatial resolution while simultaneously increasing the number of channels from 16 to 32, thereby enhancing the representational capacity of the network. This stage is designed to extract mid-level discriminative features. The downsampling operation not only reduces computational cost but also enlarges the receptive field, enabling the network to capture richer contextual information. Residual connections are consistently retained across all blocks to promote stable gradient propagation during training.
Stage 3 further downsamples the feature map while increasing the number of channels to 64. At this stage, the residual blocks primarily learn high-level abstract semantic representations. Through deeper stacking and increased channel dimensionality, the model can capture complex and highly discriminative patterns. This capability improves its ability to distinguish between visually similar categories. The features obtained at this stage possess enhanced global perception and semantic representation. These features facilitate the integration of high-order contextual dependencies across spatial regions and structural elements [22].
The hierarchical feature extraction process can be summarized as
$$F^{(3)} = \mathrm{Stage3}\!\left(\mathrm{Stage2}\!\left(\mathrm{Stage1}\!\left(F^{(0)}\right)\right)\right)$$
where $F^{(0)}$ denotes the initial feature map obtained from the stem convolutional layer; $F^{(1)}$, $F^{(2)}$, and $F^{(3)}$ represent the outputs of Stage 1, Stage 2, and Stage 3, respectively; and $F^{(3)}$ serves as the final deep feature embedding extracted by the three-stage residual hierarchy.
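A single pre-activation residual unit of the kind stacked in these stages can be sketched as follows; the dropout rate and the choice of a strided 1 × 1 projection for the shortcut are assumptions for illustration.

```python
# Pre-activation residual unit: BN -> ReLU -> Conv, dropout, BN -> ReLU -> Conv, plus shortcut.
import torch
import torch.nn as nn

class PreActBlock(nn.Module):
    def __init__(self, c_in: int, c_out: int, stride: int = 1, p_drop: float = 0.3):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(c_in)
        self.conv1 = nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False)
        self.dropout = nn.Dropout(p_drop)
        self.bn2 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, stride=1, padding=1, bias=False)
        # Shortcut S(.): identity when shapes match, otherwise a strided 1x1 projection
        self.shortcut = (
            nn.Identity()
            if stride == 1 and c_in == c_out
            else nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False)
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        u = self.conv1(torch.relu(self.bn1(f)))   # first BN -> ReLU -> Conv, stride s
        u = self.dropout(u)                        # regularization between the two convolutions
        h = self.conv2(torch.relu(self.bn2(u)))   # second BN -> ReLU -> Conv, stride 1
        return self.shortcut(f) + h                # F_{l+1} = S(F_l) + H(F_l)
```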
After residual feature extraction, the network applies global average pooling (GAP) for dimensionality reduction. The operation is expressed as
$$Z_i = \frac{1}{H \times W} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{i,h,w}$$
where $x_{i,h,w}$ denotes the feature value at spatial position $(h, w)$ of the $i$-th channel, and $H$ and $W$ are the height and width of the feature map, respectively [23]. This equation computes the average response over all spatial positions within each channel of $F^{(3)}$, thereby compressing the three-dimensional feature map into a compact one-dimensional vector representation $Z \in \mathbb{R}^{C_3}$ [24].
Unlike fully connected layers, GAP is a parameter-free operation. It does not introduce additional learnable weights. Instead, it directly aggregates the spatial statistics of F ( 3 ) . This property offers several advantages: it reduces the number of trainable parameters, mitigates overfitting, and improves generalization. The resulting global descriptor Z is subsequently forwarded to the KAN-based nonlinear mapping module for final classification.
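As a minimal illustration of this parameter-free reduction, each channel of $F^{(3)}$ is simply averaged over its spatial positions (shapes below are assumed for illustration):

```python
# Global average pooling as a per-channel spatial mean, matching the formula above.
import torch
import torch.nn as nn

f3 = torch.randn(8, 64, 8, 8)          # F^(3): (batch, C_3, H, W), shapes assumed
z = f3.mean(dim=(2, 3))                # Z in R^{C_3}: shape (8, 64)
assert torch.allclose(z, nn.AdaptiveAvgPool2d(1)(f3).flatten(1))  # same as GAP layer
```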

2.3. Feature Enhancement

Traditional convolutional neural networks typically employ convolutional kernels with fixed structures during feature extraction. This design limits their ability to flexibly adapt to complex nonlinear variations in the input data. The limitation becomes particularly evident when dealing with diverse feature distributions or variations in local details. Such constraints often lead to sparse feature representations and diminished generalization performance. To enhance the model’s discriminative capability when processing high-dimensional and complex data, this study introduces a nonlinear feature enhancement module based on B-spline basis functions. This module is positioned after the feature extraction stage [25,26]. The design of this module is inspired by the Kolmogorov–Arnold representation theorem, which asserts that any multivariate continuous function can be represented as a finite composition of univariate functions [27]. The mathematical formulation is as follows:
$$f(x) = \sum_{q=1}^{2n+1} \Phi_q\!\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right)$$
This theory reveals that complex functions can be decomposed and reconstructed using a set of low-order basis functions. This decomposition provides a solid mathematical foundation for nonlinear mappings in feature space. Building upon this principle, KAN performs adaptive nonlinear transformations of input features by learning a set of parameterizable basis functions. This approach enhances the model's ability to represent complex data patterns. The core mechanism of KAN constructs dynamic feature mappings based on B-spline basis functions. This design allows the model to adaptively adjust the shape and weight of each basis function in response to the distribution of input data, thereby enabling more accurate modeling of nonlinear relationships. As illustrated in Figure 3, a B-spline curve is defined by a set of control points together with a knot vector. It is piecewise polynomial and $C^{k-1}$-continuous for a spline of order $k$.
A key property is the compact support of the basis functions: moving a single control point only affects the curve locally. This property provides stable and fine-grained shape control, improves robustness to noise, and helps prevent overfitting by avoiding global distortions. In our setting, the control points act as learnable coefficients that weight localized B-spline bases. This design enables smooth high-order nonlinear modeling with a moderate number of parameters and full differentiability.
Accordingly, KAN employs B-spline interpolation to perform nonlinear mapping on the input features. The mapping function is expressed as a linear combination of multiple B-spline basis functions evaluated on a predefined grid and their corresponding learnable weights, as formulated below:
$$S(x) = \sum_{i=0}^{k} c_i B_i(x)$$
Here, $S(x)$ represents the nonlinear mapping function, $B_i(x)$ is the $i$-th B-spline basis function, and $c_i$ is a learnable parameter. This structure endows the model with flexible nonlinear modeling capabilities. It enables the model to automatically adjust the mapping according to different distribution characteristics, thereby more effectively enhancing feature discriminability [28].
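As a concrete illustration, $S(x)$ can be evaluated by computing B-spline bases with the Cox–de Boor recursion and weighting them with learnable coefficients; the uniform knot grid, spline order, and dimensions below are assumptions for exposition rather than the exact configuration used in the model.

```python
# S(x) = sum_i c_i B_i(x): learnable coefficients weight localized B-spline bases.
import torch

def bspline_bases(x: torch.Tensor, grid: torch.Tensor, k: int) -> torch.Tensor:
    # x: (N,) inputs; grid: (G,) strictly increasing knot vector; k: spline order.
    # Returns basis values of shape (N, G - k - 1) via the Cox-de Boor recursion.
    x = x.unsqueeze(-1)                                   # (N, 1)
    bases = ((x >= grid[:-1]) & (x < grid[1:])).float()   # degree-0 indicators, (N, G-1)
    for d in range(1, k + 1):
        left = (x - grid[:-(d + 1)]) / (grid[d:-1] - grid[:-(d + 1)]) * bases[:, :-1]
        right = (grid[d + 1:] - x) / (grid[d + 1:] - grid[1:-d]) * bases[:, 1:]
        bases = left + right                              # (N, G - d - 1)
    return bases

grid = torch.linspace(-2.0, 2.0, steps=8)                 # uniform knots (an assumption)
k = 2                                                     # spline order
coeffs = torch.randn(grid.numel() - k - 1, requires_grad=True)  # learnable c_i
x = torch.linspace(-1.5, 1.5, steps=5)
S_x = bspline_bases(x, grid, k) @ coeffs                  # nonlinear mapping values, shape (5,)
```

Because each basis function has compact support, each coefficient only reshapes the mapping locally, which is the property exploited for stable, fine-grained control.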
By integrating ResNet with KAN, the nonlinear feature transformation capabilities of KAN are leveraged to address the limitations of traditional ResNet in modeling high-dimensional and complex data. This integration enables further refinement of the features extracted by ResNet prior to the classification stage. Consequently, it enhances the quality of feature representation and improves overall classification accuracy.

2.4. KAN-FCM

Traditional fully connected layers typically employ fixed linear transformation methods. These methods struggle to fully model the complex nonlinear relationships in the input features, especially in high-dimensional spaces. This limitation easily leads to the loss of fine-grained semantic information, thereby affecting the model’s discriminative ability.
To overcome this problem, this study replaces the traditional two fully connected layers at the end of the classification stage with two sequential KAN-FCM layers. Specifically, after global average pooling, the features are first mapped to an intermediate dimensional space by a KAN-Encoder layer. They are then projected to the final output classes by a KAN-Classifier layer. Both KAN-FCM layers employ adaptive nonlinear transformations to improve feature expressiveness and classification performance.
In contrast to traditional fully connected layers, the KAN-FCM module extends conventional linear mappings by incorporating nonlinear modeling capabilities. It integrates both linear and nonlinear transformation mechanisms, thereby enabling a broader functional representation space for input features. The output of the KAN-FCM can be formulated as follows:
$$F_{\mathrm{KAN}} = W_{\mathrm{base}} \cdot \sigma\!\left(F_{\mathrm{in}}\right) + W_{\mathrm{spline}} \cdot S\!\left(F_{\mathrm{in}}\right)$$
Here, $F_{\mathrm{in}}$ represents the input feature vector; $W_{\mathrm{base}}$ and $W_{\mathrm{spline}}$ denote the weight matrices of the linear and nonlinear paths, respectively; $\sigma$ is the base activation function; and $S(\cdot)$ represents the nonlinear feature mapping function.
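A minimal sketch of such a dual-path layer is shown below. The SiLU base activation, the weight initialization, and the knot placement are assumptions; the grid size of 3 and spline order of 2 follow the best-performing setting in the ablation study.

```python
# KAN-FCM layer sketch: F_KAN = W_base * sigma(F_in) + W_spline * S(F_in).
import torch
import torch.nn as nn
import torch.nn.functional as F

class KANFCMLayer(nn.Module):
    def __init__(self, in_features, out_features, grid_size=3, spline_order=2):
        super().__init__()
        self.order = spline_order
        n_bases = grid_size + spline_order                # bases per input feature
        h = 2.0 / grid_size                               # knot spacing over assumed range [-1, 1]
        knots = torch.arange(-spline_order, grid_size + spline_order + 1).float() * h - 1.0
        self.register_buffer("knots", knots)
        self.base_weight = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.spline_weight = nn.Parameter(0.01 * torch.randn(out_features, in_features, n_bases))

    def _bases(self, x):
        # x: (N, in_features) -> B-spline basis values of shape (N, in_features, n_bases)
        g = self.knots
        x = x.unsqueeze(-1)
        b = ((x >= g[:-1]) & (x < g[1:])).to(x.dtype)     # degree-0 indicators
        for d in range(1, self.order + 1):                # Cox-de Boor recursion
            b = ((x - g[:-(d + 1)]) / (g[d:-1] - g[:-(d + 1)]) * b[..., :-1]
                 + (g[d + 1:] - x) / (g[d + 1:] - g[1:-d]) * b[..., 1:])
        return b

    def forward(self, x):
        base = F.silu(x) @ self.base_weight.t()                                    # linear path
        spline = torch.einsum("nib,oib->no", self._bases(x), self.spline_weight)   # spline path
        return base + spline

# KAN-FCM head: KAN-Encoder (e.g., 64 -> 256) followed by KAN-Classifier (256 -> classes)
head = nn.Sequential(KANFCMLayer(64, 256), KANFCMLayer(256, 100))
logits = head(torch.rand(4, 64) * 2 - 1)   # inputs scaled into the assumed [-1, 1] range
```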
As illustrated in Figure 4, the proposed KAN-FCM module consists of two sequentially connected substructures: the KAN-Encoder and the KAN-Classifier, which jointly perform nonlinear feature transformation and classification.
The KAN-Encoder first receives the high-dimensional features extracted by ResNet and projects them into a 256-dimensional latent space [29]. Unlike conventional linear projections, it leverages B-spline basis functions with adaptive control points to implement a nonlinear mapping. Each control point locally alters the transformation curve while preserving global smoothness.
Functionally, this mechanism provides several advantages. It suppresses redundant inter-channel correlations, enhances inter-class separability, and adapts the mapping complexity to the input distribution. Consequently, it retains essential discriminative information and improves robustness to noise and overfitting.
As evidenced by the t-SNE visualization of KAN-Encoder embeddings in Figure 5, the learned features form multiple well-defined clusters. The moderate inter-class overlap indicates that the encoder produces discriminative representations while leaving room for further separation.
The KAN-FCM transformation enhances the accuracy of feature mapping. It maintains an optimal balance between representational capacity and computational efficiency through adaptive adjustment of the number of control points. This flexibility allows the KAN structure to maintain high performance in both computationally constrained environments and high-dimensional data scenarios.
Unlike traditional fully connected layers, which rely on fixed linear transformations, the adaptive kernel function in KAN-FCM performs nonlinear transformations that dynamically adjust based on the distribution of input data. This capability enables the model to modulate the complexity of feature mappings in response to varying data modes, ensuring robust computational efficiency and classification capability across diverse tasks and data scales.
Compared to static kernel functions with fixed parameters, the adaptive kernel function can automatically refine both local structures and global distributions within the input features. This refinement captures fine-grained variations and improves the model’s generalization ability. While reducing the number of parameters, KAN-FCM further customizes the feature transformation process based on data characteristics. This customization enables more accurate modeling of complex nonlinear relationships.
Consequently, before the features extracted by ResNet enter the final classification stage, KAN enhances their separability. Through KAN-FCM, the model achieves improved classification accuracy without incurring additional computational cost.

3. Experimental Section

3.1. Experimental Environment

The hardware and software configurations used in the experiments are as follows. The hardware environment includes an NVIDIA GeForce RTX 4080 graphics card. The software environment includes the Windows 10 64-bit operating system, Python 3.9 programming language, PyTorch 1.10 deep learning framework, CUDA 11.1, and related scientific computing and data loading libraries.

3.2. Dataset Introduction

This study utilizes the CIFAR-10 and CIFAR-100 datasets, both of which are widely recognized benchmark datasets in the field of computer vision. These datasets are commonly employed to evaluate the performance of image classification models.
The CIFAR-10 dataset comprises 10 object categories, including airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. It contains a total of 60,000 color images with a resolution of 32 × 32 pixels. Of these, 50,000 are allocated for training and 10,000 for testing. Each class is evenly represented by 6000 images. Due to its relatively small number of categories, CIFAR-10 is often used to assess model performance on simpler classification tasks.
The CIFAR-100 dataset is an extended version of CIFAR-10, consisting of 100 classes. Each class is represented by 500 training images and 100 test images, totaling 60,000 32 × 32 color images. Compared to CIFAR-10, CIFAR-100 introduces a more fine-grained classification task, posing greater challenges for feature learning and generalization. The dataset encompasses a wide range of categories, including animals, plants, vehicles, and everyday objects. This diversity makes it well-suited for evaluating a model’s capability in handling complex visual recognition tasks.
For preprocessing of the training dataset, the input images were subjected to a series of operations during the data loading phase. First, random padding and cropping were applied: each image was padded by 4 pixels on all sides and then randomly cropped back to a size of 32 × 32 pixels, simulating image translation and scaling. Second, random horizontal flipping was performed with a probability of 50% to increase data variability. Finally, normalization was applied, where pixel values were adjusted based on the dataset’s mean and standard deviation to ensure consistent input distribution. These data augmentation techniques were applied only during the training phase to improve model generalization.
For the test dataset, only normalization was performed to ensure a fair and unbiased evaluation of model performance.
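These pipelines correspond to the torchvision sketch below; the CIFAR mean and standard-deviation constants shown are the commonly used values and are assumptions, not quoted from the paper.

```python
# Training and test preprocessing as described above (pad+crop, flip, normalize).
import torchvision.transforms as T
from torchvision.datasets import CIFAR100

mean, std = (0.5071, 0.4865, 0.4409), (0.2673, 0.2564, 0.2762)  # assumed CIFAR-100 statistics

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),       # pad 4 px per side, random 32x32 crop
    T.RandomHorizontalFlip(p=0.5),     # horizontal flip with 50% probability
    T.ToTensor(),
    T.Normalize(mean, std),
])
test_transform = T.Compose([           # test set: normalization only
    T.ToTensor(),
    T.Normalize(mean, std),
])

train_set = CIFAR100(root="./data", train=True, download=True, transform=train_transform)
test_set = CIFAR100(root="./data", train=False, download=True, transform=test_transform)
```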
Due to limited computational resources, our experiments are conducted on CIFAR-10 and CIFAR-100, which are widely adopted benchmarks for evaluating image classification models. Future work will explore the generalization of our approach to larger-scale and higher-resolution datasets such as ImageNet.

3.3. Experimental Analysis

To evaluate the effectiveness of the proposed model, all experiments were conducted with the following hyperparameter settings. Stochastic gradient descent (SGD) was employed as the optimizer, with an initial learning rate of 0.1, a momentum coefficient of 0.9, and a weight decay of $5 \times 10^{-4}$. The batch size was set to 256, and training was performed for 200 epochs. The cosine annealing learning rate scheduler was adopted to adjust the learning rate dynamically during training. Label smoothing loss with a smoothing factor of 0.1 was used as the loss function to improve generalization.
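A PyTorch sketch of this optimization setup is shown below; the `model` placeholder and the omitted training pass are hypothetical stand-ins for the EKAResNet network and its training loop.

```python
# Optimization setup: SGD + cosine annealing + label smoothing, as described above.
import torch
import torch.nn as nn

model = nn.Linear(3 * 32 * 32, 100)    # placeholder for EKAResNet

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # smoothing factor 0.1

for epoch in range(200):               # 200 epochs; batches of 256 come from the DataLoader
    # ... one training pass over the training loader would go here ...
    scheduler.step()
```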
All hyperparameter choices, including the optimizer, learning rate, batch size, momentum, and weight decay, were initially based on standard settings reported in the literature for CIFAR-10 and CIFAR-100 classification tasks. These values were subsequently confirmed through limited empirical tuning on the validation set to ensure stable convergence and competitive performance. No large-scale grid or random search was performed, but multiple trials with different random seeds were conducted to confirm the robustness of the results.
Importantly, all experimental results reported in this paper are the average values obtained from multiple independent runs (at least three times) under the same settings, rather than from a single experiment. This approach ensures that the reported performance is stable and not due to randomness or outliers.
For comparative evaluation, we benchmarked the proposed EKAResNet model against several representative architectures, including ResNet101, WideResNet-28-10, ResNetKAN [30], KAT-Small [31], and ConvNeXtV2-Tiny [32].
As shown in Table 1, on the CIFAR-10 dataset, EKAResNet achieves the highest classification accuracy of 95.84%. This performance is comparable to or surpasses other state-of-the-art baselines. The result indicates that the incorporation of kernel-adaptive spline mappings enables EKAResNet to maintain strong discriminative capability even on relatively simple image classification tasks.
Furthermore, as presented in Table 2, on the more challenging CIFAR-100 dataset, EKAResNet attains an accuracy of 80.06%, outperforming all baseline models by margins ranging from 0.63 to 1.76 percentage points. This performance gain highlights the enhanced representational capacity introduced by the KAN structure, which facilitates more effective modeling of high-dimensional nonlinear dependencies and contributes to improved classification outcomes in complex data scenarios.
As summarized in Table 3, the proposed EKAResNet comprises 36.2 million parameters, a scale that is comparable to WideResNet-28-10 yet markedly smaller than ResNet101 and ResNetKAN. Relative to lighter backbones such as ConvNeXtV2-Tiny and KAT-Small, EKAResNet uses more parameters but attains higher recognition accuracy, indicating a favorable accuracy–capacity trade-off.
Overall, these results indicate that introducing the KAN-based spline head yields near-baseline parameter cost while avoiding the heavy footprint of deeper ResNet variants. Combined with the superior accuracy observed on CIFAR-100, EKAResNet achieves a favorable accuracy–capacity trade-off. This balance positions it as a strong choice for scenarios that require high performance under a mid-range parameter budget.

3.4. Ablation Experiment

To systematically evaluate the contribution of the KAN module within the proposed model, two ablation studies were conducted. The first study examines the sensitivity of the model to the spline capacity of the KAN head, namely the number of spline layers and the per-layer control-point configuration determined by the grid size and spline order.
As summarized in Table 4, we evaluate multiple configurations on CIFAR-100 by varying the number of spline layers, the grid size, and the spline order. The best accuracy is obtained when the head uses two spline layers, a grid size of three, and a spline order of two. This setting offers the most favorable trade-off between model complexity and classification accuracy.
Reducing the number of layers or removing the KAN module altogether leads to performance degradation. Conversely, excessively increasing the grid size does not yield further improvement. These findings highlight the importance of appropriately balancing the model capacity with spline resolution.
In the second study, we evaluated the model performance using the CIFAR-100 dataset. Three images were randomly selected from a single category as input. The corresponding neuron activation values were extracted and subjected to quantitative analysis to assess the influence of the KAN module on feature representation.
As shown in Figure 6 and Figure 7, the KAN module produces smoother neuron activation patterns. It reduces the standard deviation of activation values, thereby stabilizing the gradient update process. This enhancement accelerates optimizer convergence and reduces overall training time.
The underlying mechanism stems from KAN’s ability to improve response consistency across varying inputs through adaptive nonlinear transformations. This approach suppresses extreme activation values and mitigates training instability. The inherent structural stability of KAN significantly reduces computational overhead, offering a practical optimization strategy for efficient convergence in deep learning models.
These experimental results validate KAN’s effectiveness in enhancing training efficiency and improving model stability. The findings provide valuable insights for structural optimization of deep neural networks.

4. Conclusions

This work presents EKAResNet, a residual network architecture equipped with a spline-based KAN head for image classification. On CIFAR-10, the model achieves accuracy comparable to strong baselines. On CIFAR-100, it demonstrates modest but consistent performance gains under identical training conditions. The spline-based head introduces structured nonlinearity within a standard residual architecture while maintaining reasonable computational efficiency. Ablation studies on the spline configuration support these findings, indicating that the KAN head serves as a practical alternative to conventional fully connected classifiers.
Although this study demonstrates the effectiveness of EKAResNet on CIFAR-10 and CIFAR-100, further validation is needed on larger-scale datasets and more complex visual tasks, such as object detection and image segmentation. The architectural design may impose additional computational demands, necessitating future exploration of lightweight optimization strategies to enhance efficiency and inference speed. Furthermore, the adaptive mechanism of the kernel function can be refined to better accommodate variations in feature space distributions, thereby improving the model’s robustness and generalization capability.
As the KAN framework continues to evolve, the proposed approach shows promise for broader application in nonlinear computational tasks. This work offers a novel perspective for the design and optimization of deep learning architectures, particularly in integrating adaptive nonlinear transformations with established backbone networks. Empirically, EKAResNet achieves competitive accuracy with a parameter budget comparable to strong baselines, while providing a principled mechanism for adaptive nonlinear feature mapping via B-spline kernels. Nevertheless, the current method has several limitations. The model’s performance has been evaluated primarily on medium-scale datasets, which may not fully capture its scalability or robustness under large-scale and real-world conditions. In addition, while the spline-based module introduces structured nonlinearity, it also increases the number of hyperparameters, making optimization more sensitive to initialization and training dynamics. Future work will focus on extending EKAResNet to larger and more diverse datasets, exploring adaptive spline parameter tuning strategies, and developing lightweight variants to reduce computational cost and enhance deployment feasibility in resource-constrained environments.

Author Contributions

Conceptualization, W.Z.; Investigation, Z.D. and J.C.; Writing—original draft preparation, Z.D.; Writing—review and editing, W.Z., T.W., X.L. and Z.L.; Visualization, H.C.; Funding acquisition, T.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Heilongjiang Provincial Basic Scientific Research Project of Higher Education Institutions in 2022 (Project No. 1452MSYYB008) and by the 2024 Shuanggua–Shuangneng Construction Project of Mudanjiang Normal University (Project No. 2024SGSN015).

Data Availability Statement

The original data presented in the study are openly available in the CIFAR dataset repository at https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 18 October 2025).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  2. Glorot, X.; Bengio, Y. Understanding the Difficulty of Training Deep Feedforward Neural Networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010; Proceedings of Machine Learning Research: London, UK, 2010; Volume 9, pp. 249–256. [Google Scholar]
  3. Voulodimos, A.; Doulamis, N.; Doulamis, A.; Protopapadakis, E. Deep learning for computer vision: A brief review. Comput. Intell. Neurosci. 2018, 2018, 7068349. [Google Scholar] [CrossRef]
  4. Hochreiter, S. The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 1998, 6, 107–116. [Google Scholar] [CrossRef]
  5. Botalb, A.; Moinuddin, M.; Al-Saggaf, U.M.; Ali, S.S. Contrasting convolutional neural network (CNN) with multi-layer perceptron (MLP) for big data analysis. In Proceedings of the 2018 International Conference on Intelligent and Advanced System (ICIAS), Kuala Lumpur, Malaysia, 23–25 August 2018; pp. 1–5. [Google Scholar]
  6. Krichen, M. Convolutional neural networks: A survey. Computers 2023, 12, 151. [Google Scholar] [CrossRef]
  7. Borawar, L.; Kaur, R. ResNet: Solving vanishing gradient in deep networks. In Proceedings of the International Conference on Recent Trends in Computing (ICRTC 2022), Chandigarh, India, 9–10 December 2022; Springer Nature: Singapore, 2023; pp. 235–247. [Google Scholar]
  8. Cao, Y.; Li, X.; Feng, H.; Wang, T.; Zhang, J.; Liu, J. KANMixer: A Fully Linear Network with Multi-Scale Feature Awareness for Power Quality Classification Tasks. In Proceedings of the 2024 China Automation Congress (CAC), Xiamen, China, 1–3 November 2024; pp. 3341–3346. [Google Scholar]
  9. Reinhardt, E.; Ramakrishnan, D.; Gleyzer, S. Sinekan: Kolmogorov-arnold networks using sinusoidal activation functions. Front. Artif. Intell. 2025, 7, 1462952. [Google Scholar] [CrossRef] [PubMed]
  10. Bodner, A.D.; Tepsich, A.S.; Spolski, J.N.; Pourteau, S. Convolutional kolmogorov-arnold networks. arXiv 2024, arXiv:2406.13155. [Google Scholar]
  11. Zheng, L.N.; Zhang, W.E.; Yue, L.; Xu, M.; Maennel, O.; Chen, W. Free-Knots Kolmogorov-Arnold Network: On the Analysis of Spline Knots and Advancing Stability. arXiv 2025, arXiv:2501.09283v1. [Google Scholar]
  12. Ji, T.; Hou, Y.; Zhang, D. A comprehensive survey on Kolmogorov Arnold networks (KAN). arXiv 2024, arXiv:2407.11075. [Google Scholar] [CrossRef]
  13. Scabini, L.F.; Bruno, O.M. Structure and performance of fully connected neural networks: Emerging complex network properties. Phys. A Stat. Mech. Its Appl. 2023, 615, 128585. [Google Scholar] [CrossRef]
  14. Chou, Y.C.; Chen, C.C. Improving deep learning-based polyp detection using feature extraction and data augmentation. Multimed. Tools Appl. 2023, 82, 16817–16837. [Google Scholar] [CrossRef]
  15. Chen, W.; Wang, Y.; Ren, Y.; Jiang, H.; Du, G.; Zhang, J.; Li, J. An automated detection of epileptic seizures EEG using CNN classifier based on feature fusion with high accuracy. BMC Med. Inform. Decis. Mak. 2023, 23, 96. [Google Scholar] [CrossRef]
  16. Li, X.; Li, Z.; Qiu, H.; Hou, G.; Fan, P. An overview of hyperspectral image feature extraction, classification methods and the methods based on small samples. Appl. Spectrosc. Rev. 2023, 58, 367–400. [Google Scholar] [CrossRef]
  17. Sahin, V.H.; Oztel, I.; Yolcu Oztel, G. Human monkeypox classification from skin lesion images with deep pre-trained network using mobile application. J. Med. Syst. 2022, 46, 79. [Google Scholar] [CrossRef] [PubMed]
  18. Zhang, D.; Zhou, J.; Zhang, W.; Lin, Z.; Yao, J.; Polat, K.; Alhudhaif, A. ReX-Net: A reflectance-guided underwater image enhancement network for extreme scenarios. Expert Syst. Appl. 2023, 231, 120842. [Google Scholar] [CrossRef]
  19. Zhao, L.; Zhang, Z. A improved pooling method for convolutional neural networks. Sci. Rep. 2024, 14, 1589. [Google Scholar] [CrossRef]
  20. Yin, T.; Chen, H.; Yuan, Z.; Wan, J.; Liu, K.; Horng, S.J.; Li, T. A robust multilabel feature selection approach based on graph structure considering fuzzy dependency and feature interaction. IEEE Trans. Fuzzy Syst. 2023, 31, 4516–4528. [Google Scholar] [CrossRef]
  21. Zagoruyko, S.; Komodakis, N. Wide residual networks. arXiv 2016, arXiv:1605.07146. [Google Scholar]
  22. Chen, Y.; Xia, R.; Yang, K.; Zou, K. DARGS: Image inpainting algorithm via deep attention residuals group and semantics. J. King Saud Univ. Comput. Inf. Sci. 2023, 35, 101567. [Google Scholar] [CrossRef]
  23. Cao, L.; Li, J.; Chen, S. Multi-target segmentation of pancreas and pancreatic tumor based on fusion of attention mechanism. Biomed. Signal Process. Control 2023, 79, 104170. [Google Scholar] [CrossRef]
  24. Xie, C.; Shao, Z.; Zhao, N.; Du, Y.; Du, L. An efficient CNN inference accelerator based on intra-and inter-channel feature map compression. IEEE Trans. Circuits Syst. Regul. Pap. 2023, 70, 3625–3638. [Google Scholar] [CrossRef]
  25. Mou, L.; Xiao, X.; Cao, W.; Li, W.; Chen, X. Efficient and accurate capsule networks with b-spline-based activation functions. In Proceedings of the 2024 International Conference on New Trends in Computational Intelligence (NTCI), Shanghai, China, 18–20 October 2024; pp. 201–205. [Google Scholar]
  26. Ta, H.T. BSRBF-KAN: A combination of B-splines and Radial Basis Functions in Kolmogorov-Arnold Networks. arXiv 2024, arXiv:2406.11173. [Google Scholar]
  27. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Tegmark, M. KAN: Kolmogorov-arnold networks. arXiv 2024, arXiv:2404.19756. [Google Scholar]
  28. Cheon, M. Kolmogorov-arnold network for satellite image classification in remote sensing. arXiv 2024, arXiv:2406.00600. [Google Scholar] [CrossRef]
  29. Chen, K.; Chen, B.; Liu, C.; Li, W.; Zou, Z.; Shi, Z. Rsmamba: Remote sensing image classification with state space model. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  30. Yu, R.C.; Wu, S.; Gui, J. Residual Kolmogorov–Arnold Network for Enhanced Deep Learning. arXiv 2024, arXiv:2410.05500. [Google Scholar]
  31. Yang, X.; Wang, X. Kolmogorov–Arnold Transformer. arXiv 2024, arXiv:2409.10594. [Google Scholar]
  32. Woo, S.; Kim, D.; Park, J.; Lee, J.; Kweon, I.S. ConvNeXt V2: Co-Designing and Scaling ConvNets with Masked Autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 16133–16142. [Google Scholar]
Figure 1. Residual block structure.
Figure 2. Channel correlation matrices across network stages.
Figure 3. Visualization of B-spline curves and their corresponding control points. The circles and crosses are purely visual aids to improve readability and carry no additional analytic meaning.
Figure 4. Schematic diagram of the KAN-FCM module.
Figure 5. t-SNE of KAN-Encoder embeddings (CIFAR-100).
Figure 6. Comparison of feature means.
Figure 7. Comparison of feature standard deviations.
Table 1. Classification accuracy on CIFAR-10.
Network | Accuracy (%)
ResNet101 | 95.75 ± 0.32
WideResNet-28-10 | 95.65 ± 0.56
ResNetKAN | 95.20 ± 0.38
KAT-Small | 95.50 ± 0.64
ConvNeXtV2-Tiny | 94.80 ± 0.86
EKAResNet | 95.84 ± 0.26
Table 2. Classification accuracy on CIFAR-100.
Network | Accuracy (%)
ResNet101 | 79.08 ± 1.21
WideResNet-28-10 | 78.64 ± 0.58
ResNetKAN | 79.43 ± 1.33
KAT-Small | 78.30 ± 0.68
ConvNeXtV2-Tiny | 78.70 ± 2.21
EKAResNet | 80.06 ± 0.55
Table 3. Number of parameters for each model on CIFAR-100.
Network | Number of Parameters
ResNet101 | 42.70M
WideResNet-28-10 | 36.54M
ResNetKAN | 51.06M
KAT-Small | 21.75M
ConvNeXtV2-Tiny | 27.94M
EKAResNet | 36.20M
Table 4. Ablation of proposed components on the CIFAR-100 dataset. Here, “✔” indicates the component is enabled, and “×” indicates it is disabled.
Idx | Number of Layers / Grid-Size / Spline-Order | Accuracy (%)
1 | ✔✔✔✔✔✔✔ | 80.06 ± 0.55
2 | ✔✔✔✔✔ | 79.32 ± 0.52
3 | ××× | 78.85 ± 0.60
4 | ✔✔✔✔✔✔ | 79.58 ± 0.50
5 | ✔✔✔✔✔✔✔✔ | 79.85 ± 0.58
