Article

Global–Local Feature Fusion of Swin Kansformer Novel Network for Complex Scene Classification in Remote Sensing Images

1 School of Land Engineering, Chang’an University, Xi’an 710054, China
2 Key Laboratory of Subsurface Hydrology and Ecological Effect in Arid Region of Ministry of Education, Key Laboratory of Ecohydrology and Water-Security in Arid and Semi-Arid Regions of Ministry of Water Resources, School of Water and Environment, Chang’an University, Xi’an 710054, China
3 Xi’an International Science and Technology Cooperation Base for Land Science and Engineering, Chang’an University, Xi’an 710054, China
4 China Siwei Surveying and Mapping Technology Co., Ltd., Beijing 100094, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2025, 17(7), 1137; https://doi.org/10.3390/rs17071137
Submission received: 23 January 2025 / Revised: 17 March 2025 / Accepted: 21 March 2025 / Published: 22 March 2025

Abstract:
The spatial distribution characteristics of remote sensing scene imagery exhibit significant complexity, necessitating the extraction of critical semantic features and effective discrimination of feature information to improve classification accuracy. While the combination of traditional convolutional neural networks (CNNs) and Transformers has proven effective in extracting features from both local and global perspectives, the multilayer perceptron (MLP) within Transformers struggles with nonlinear problems and insufficient feature representation, leading to suboptimal performance in fused models. To address these limitations, we propose a Swin Kansformer network for remote sensing scene classification, which integrates the Kolmogorov–Arnold Network (KAN) and employs a window-based self-attention mechanism for global information extraction. By replacing the traditional MLP layer with the KAN module, the network approximates functions through the decomposition of complex multivariate functions into univariate functions, enhancing the extraction of complex features. Additionally, an asymmetric convolution group module is introduced to replace conventional convolutions, further improving local feature extraction capabilities. Experimental validation on the AID and NWPU-RESISC45 datasets demonstrates that the proposed method achieves classification accuracies of 97.78% and 94.90%, respectively, outperforming state-of-the-art models such as ViT + LCA and ViT + PA by 0.89%, 1.06%, 0.27%, and 0.66%. These results highlight the performance advantages of the Swin Kansformer, while the incorporation of the KAN offers a novel and promising approach for remote sensing scene classification tasks with broad application potential.

1. Introduction

The integration of remote sensing technology with deep learning has become a significant trend in the remote sensing field. Remote sensing scene classification (RSSC), one of the foundational pillars of remote sensing classification [1,2], detection, and segmentation, has become a central research focus [3]. The primary objective of RSSC is to assign accurate semantic labels to individual remote sensing scenes and to classify diverse scenes based on these labels. With the continuous advancement of remote sensing sensors and the rapid evolution of space satellite technologies, the spatial resolution of remotely sensed imagery has improved significantly. This progress has led to a substantial increase in the volume and complexity of remote sensing data, posing both challenges and opportunities for the development of more sophisticated classification methodologies. Alongside the surge in data volume, the complexity of the land cover and object information within remote sensing images has also increased, making the categorization of remote sensing scenes considerably more difficult. Unlike traditional natural scenes, remote sensing images contain complex land cover types at multiple scales and exhibit diverse distribution patterns across different regions of the image [4]. Relying exclusively on pixel-level information therefore makes it difficult to distinguish complex land cover types accurately, and the inevitable presence of mixed pixels, which cause spectral mixing, further complicates the identification of individual land cover categories. These limitations collectively hinder classification accuracy and underscore the need for methodologies that incorporate spatial, contextual, and semantic features. The mixed-pixel problem is particularly prominent in high-spatial-resolution imagery. In addition, land cover types in remote sensing scenes exhibit notable inter-class similarity and intra-class divergence, further complicating the classification task.
In response to these challenges, researchers have developed numerous methods to enhance the accuracy of remote sensing scene classification. Early approaches predominantly relied on traditional machine learning techniques, which depend on handcrafted features that are limited in their ability to comprehensively capture the intricate and diverse information inherent in remote sensing objects. Their performance in addressing the issue of mixed pixels was also frequently suboptimal, and the handcrafted feature extraction process requires extensive expert knowledge and significant manual effort [5,6]. With the rise of deep learning, convolutional neural networks (CNNs) [7,8] have provided an effective solution. Cheng et al. [9] first used CNNs to extract remote sensing features, followed by an SVM classifier for image classification. Zhou et al. [10] fine-tuned pre-trained CNNs on an end-to-end learnable remote sensing scene dataset. Numerous CNN variants, such as CaffeNet [11], AlexNet [12], VGGNet [13], GoogLeNet [14], and ResNet [15], have been widely employed in remote sensing classification tasks. While CNNs have achieved remarkable success in computer vision and image processing, they still have inherent limitations. CNNs employ local receptive fields and shared weights to extract feature information, enabling them to effectively capture local spatial patterns. However, owing to the constrained size of convolutional kernels, they struggle to capture the global contextual information that is crucial for understanding complex remote sensing scenes, and they are sensitive to geometric transformations such as rotations and scale variations, which are common in remote sensing data. These constraints highlight the need for complementary mechanisms or hybrid architectures to address global feature extraction and transformation invariance in remote sensing scene classification. In addition, CNNs face challenges including high computational complexity and memory demands, poor interpretability, dependence on large amounts of annotated data, vulnerability to adversarial samples, and a limited ability to capture long-range dependencies.
To address the limitations of CNNs, a new neural network architecture based entirely on the attention mechanism, the Transformer model, has emerged, initially excelling in natural language processing (NLP). The Transformer represents a significant departure from traditional recurrent neural networks (RNNs) and CNNs: it relies on self-attention mechanisms and positional encoding to model complex sequential data. This architecture offers remarkable flexibility, enabling the Transformer to capture long-range dependencies and intricate patterns without the constraints of sequential processing or localized receptive fields, which makes it well suited to the diverse and complex structures encountered in remote sensing imagery. Owing to its powerful ability to capture global contextual information, the Transformer has been introduced into image processing, where it has matched its success in NLP and quickly surpassed traditional CNNs in data processing capacity and accuracy. Unlike CNNs, which use shared weights, the Vision Transformer (ViT) [16] employs fully connected layers and multihead self-attention mechanisms, reducing computational complexity by limiting the number of model parameters and thus improving computational efficiency. The self-attention mechanism establishes dependency relationships between different positions within the input data, overcoming the limitations of the local receptive fields inherent in CNNs. By directly modeling interactions between distant regions, it facilitates the extraction of global contextual features that are critical for accurately interpreting complex spatial patterns in remote sensing imagery, thereby improving classification performance [17]. However, despite surpassing CNNs in some respects, Transformer models still face issues related to long-range dependency modeling, sequence length constraints, and contrast bias. To address these challenges, researchers have proposed various improvements. Li et al. [18] blended the strengths of CNNs and Transformers to achieve satisfactory results in both local and global feature extraction. Wang et al. [19] proposed an end-to-end relational attention network for learning multilevel features while suppressing redundant information. Liang et al. [20] utilized a dual-stream classification framework, incorporating CNNs and Graph Convolutional Networks (GCNs) to learn both global features and target visual feature expressions in remote sensing images. Guo et al. [21] introduced the Global–Local Attention Network (GLANet), which improves foreground feature extraction by allocating varying weights to different channels through global and local branch learning. Integrating the strengths of CNNs and Transformers, together with attention mechanisms, can therefore significantly enhance overall classification accuracy: CNNs excel at capturing local spatial features, Transformers are adept at modeling global contextual relationships, and attention mechanisms further refine feature representation by focusing on relevant regions.
However, the fusion of multiple models inevitably introduces module-coupling challenges that may hinder the optimal performance of individual components, as suboptimal interactions between modules can limit the overall effectiveness of the hybrid architecture. Careful design and optimization of the integration strategy are therefore essential to fully leverage the complementary strengths of each component while minimizing performance degradation. Moreover, the multilayer perceptron (MLP) [22,23,24,25,26] in the Transformer model often suffers from large parameter sizes, high computational resource consumption, and a lack of local feature extraction capability in practical applications. Simply combining the Transformer with other networks therefore cannot resolve the shortcomings of the MLP. Addressing these challenges often requires novel training methodologies whose primary objective is to reduce the number of model parameters while maintaining or even enhancing performance, for example through pruning, quantization, or more compact network architectures. However, model quantization and pruning inevitably involve a trade-off between efficiency and accuracy, and achieving an optimal balance between training accuracy and computational efficiency remains a significant challenge, as these techniques have not yet substantially alleviated the workload or computational demands of training complex models [27]. Besides adopting more efficient training methods, another potential solution is to introduce new network models, such as the Kolmogorov–Arnold Network (KAN), which offers a more effective learning framework and may provide a new pathway for addressing these challenges [28].
The principal contributions of this study are as follows:
(1) This study introduces a new network, the Kolmogorov–Arnold Network (KAN), and integrates it with the Swin Transformer’s window-based self-attention mechanism, resulting in a novel network architecture called the Swin Kansformer.
(2) The proposed Swin Kansformer replaces the traditional MLP layer with a KAN layer, which learns and represents complex features through learnable nonlinear activation functions. This substantially reduces computational resource consumption and alleviates the MLP’s limitations in modeling spatial structures, overfitting risk, local feature capture, and interpretability.
(3) To enhance the network’s capability to extract local information, an asymmetric convolution group structure is added, which better captures local features while preserving global feature extraction. This global–local feature extraction approach fully leverages both global and local feature information, improving operational efficiency, computational performance, and ultimately, classification accuracy for remote sensing scene classification tasks.

2. Method

2.1. KAN Theory

To match the universal approximation of complex functions achieved by multilayer perceptrons (MLPs) [29] and their nonlinear transformation capabilities [29,30], this study introduces a novel KAN model for feature extraction. The KAN structure is derived from the Kolmogorov–Arnold representation theorem [31,32,33], which asserts that any continuous multivariate function on a bounded domain can be represented as a finite sum of continuous univariate functions combined through two layers of nested addition [33]. In other words, any continuous multivariate function can be decomposed into a finite composition of continuous univariate functions and addition operations. This principle enables the approximation of complex multivariate relationships through simpler, more manageable univariate functions, which not only simplifies the modeling of intricate dependencies but also improves computational efficiency and interpretability, advantages that are particularly valuable for remote sensing scene classification, where complex spatial and spectral relationships must be captured accurately. This transformation of complex multivariate functions into simpler, decomposable univariate functions [34] provides the theoretical foundation for the unique structure of the KAN.
In contrast to traditional multilayer perceptrons, where fixed activation functions sit at the nodes [35,36], the KAN places learnable activation functions on the edges and dispenses with linear weight matrices [37,38]: every weight is replaced by a univariate function parameterized as a spline. A spline function is a piecewise polynomial determined by a set of control points and knots. Each input feature is transformed by a parameterized spline function, the resulting values are aggregated into intermediate representations, and these are processed through a construction function, with the final output obtained by summing all transformed values. This allows flexible and precise modeling of complex relationships within the data, making it particularly effective for capturing intricate patterns in remote sensing imagery; the smoothness and adaptability of spline functions yield enhanced feature representation and improved classification performance. Each activation mapping in the KAN consists of a base function combined with spline components. The base function is typically the Sigmoid Linear Unit (SiLU), while the spline components are represented by B-spline basis functions Bi(x) combined with coefficients ci, all of which are learned during training. The B-spline curve function is shown in Figure 1.
KAN Theoretical Expression
f(x) = \sum_{q=1}^{2n+1} \Phi_q \left( \sum_{p=1}^{n} \varphi_{q,p}(x_p) \right) \quad (1)
In Equation (1), the function f(x) represents the complex multivariate function in the KAN, where φq,p(xp) denotes the spline function and Φq represents the function transformation.
KAN Activation Function and Spline Function Expression
\varphi(x) = \omega \, b(x) + \mathrm{spline}(x) \quad (2)
In Equation (2), φ(x) represents the activation function, ω is the weight, b(x) is the basis function, and spline(x) denotes the spline function.
KAN Basis Function Expression
b(x) = \mathrm{silu}(x) = \frac{x}{1 + e^{-x}} \quad (3)
In Equation (3), b(x) represents the basis function, which is implemented as silu(x) (Sigmoid Linear Unit).
Spline Function Expression
\mathrm{spline}(x) = \sum_i c_i B_i(x) \quad (4)
In Equation (4), spline(x) represents the spline function, ci denotes the coefficients, and Bi(x) is the B-spline basis function [37]. These coefficients determine the final form of the activation function, eliminating the need for the conventional linear transformation parameters, such as the weights W and biases b of the multilayer perceptron (MLP). This is what enables the MLP to be replaced by the KAN. A comparison between the MLP and KAN structures is shown in Figure 2.
As shown in Figure 2, whereas the MLP assigns fixed activation functions to its neurons, the KAN places learnable activation functions on its edges. The KAN achieves superior accuracy and enhanced interpretability while using significantly fewer parameters than traditional architectures. This efficiency is particularly advantageous in remote sensing scene classification, where complex spatial and spectral relationships must be modeled with limited computational overhead. By leveraging its parameter-efficient design, the KAN not only improves classification performance but also provides clearer insight into the decision-making process, making it well suited to the intricate and diverse patterns inherent in remote sensing imagery [39]. Unlike black-box models [40], the KAN can be visualized and offers strong interpretability, making it possible to discover hidden laws in scientific applications such as mathematics and physics [41,42]. Additionally, the KAN helps prevent catastrophic forgetting [43], increasing the model’s resilience and flexibility.
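To make Equations (1)–(4) concrete, the following PyTorch sketch implements a single KAN linear layer with a SiLU base path and a B-spline path over a uniform knot grid. It is a minimal illustration of the formulation above under our own assumptions (names such as KANLinear, base_weight, and spline_weight are ours), not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def bspline_bases(x, grid, k=3):
    # Cox-de Boor recursion: evaluate the B-spline basis functions B_i(x) of Equation (4).
    # x: (batch, in_features); grid: (in_features, grid_size + 2k + 1) knot positions.
    # Inputs are assumed to lie roughly within the knot range.
    x = x.unsqueeze(-1)
    bases = ((x >= grid[:, :-1]) & (x < grid[:, 1:])).to(x.dtype)          # order-0 bases
    for p in range(1, k + 1):
        left = (x - grid[:, : -(p + 1)]) / (grid[:, p:-1] - grid[:, : -(p + 1)]) * bases[..., :-1]
        right = (grid[:, p + 1:] - x) / (grid[:, p + 1:] - grid[:, 1:-p]) * bases[..., 1:]
        bases = left + right
    return bases                                                            # (batch, in_features, grid_size + k)


class KANLinear(nn.Module):
    """One KAN layer: phi(x) = w * silu(x) + sum_i c_i B_i(x), applied on every input-output edge."""

    def __init__(self, in_features, out_features, grid_size=5, k=3, grid_range=(-1.0, 1.0)):
        super().__init__()
        step = (grid_range[1] - grid_range[0]) / grid_size
        knots = torch.arange(-k, grid_size + k + 1, dtype=torch.float32) * step + grid_range[0]
        self.register_buffer("grid", knots.expand(in_features, -1).contiguous())
        self.k = k
        self.base_weight = nn.Parameter(torch.empty(out_features, in_features))                   # w in Eq. (2)
        self.spline_weight = nn.Parameter(torch.empty(out_features, in_features, grid_size + k))  # c_i in Eq. (4)
        nn.init.kaiming_uniform_(self.base_weight, nonlinearity="relu")
        nn.init.kaiming_uniform_(self.spline_weight.view(out_features, -1), nonlinearity="relu")

    def forward(self, x):                                   # x: (batch, in_features)
        base = F.silu(x) @ self.base_weight.t()             # base path: w * b(x), Eq. (3)
        bases = bspline_bases(x, self.grid, self.k)         # spline path: B_i(x)
        spline = bases.flatten(1) @ self.spline_weight.flatten(1).t()
        return base + spline                                # fusion of the two paths
```

Replacing the two fully connected layers of a Transformer MLP with a stack of such layers is the substitution performed inside the Swin Kansformer block described in Section 2.2.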

2.2. Swin Kansformer Network

In remote sensing scene classification tasks, remote sensing imagery exhibits distinct characteristics compared with conventional images. These differences arise from the complex and heterogeneous distribution patterns of land cover, the multiscale nature of land cover types, and the inter-class similarities and intra-class variabilities among these types. Such attributes pose significant challenges for accurate classification, requiring methodologies that capture both fine-grained details and broader contextual information in order to distinguish diverse and often overlapping land cover categories. Consequently, a scene classification network must not only achieve high classification accuracy but also possess strong global–local feature extraction capabilities so that multiscale feature information within remote sensing imagery is captured comprehensively. By addressing the spatial and spectral characteristics present at varying scales, the model can more effectively distinguish complex land cover types, which is particularly critical in remote sensing applications, where the variability in object sizes and spatial arrangements demands a robust and adaptive methodology.
To address these challenges, this paper proposes the Swin Kansformer network model, which integrates the novel KAN. The model is built upon the Swin Transformer architecture, with the KAN layer replacing the traditional MLP module [36]. This replacement addresses the limitations of multilayer perceptrons (MLPs) by enabling the effective extraction of feature details that encompass both local and global contextual information. In addition, to further enhance the acquisition of local feature information, an asymmetric convolution group structure is integrated into the Swin Kansformer network. This architectural change strengthens the model’s ability to extract local features, improving its capacity for local feature learning and contributing to more accurate and robust remote sensing scene classification. Together, these components achieve a balanced representation of both fine-grained details and broader contextual patterns, which is essential for handling the complexity and diversity of remote sensing imagery.
Initially, a complete remote sensing image undergoes patch partitioning in the Patch Partition module, where every 4 × 4 adjacent pixels are grouped into one patch. The image is then flattened in the channel direction. Since the input remote sensing image is an RGB three-channel image, after Patch Partition, the image shape changes from [H, W, 3] to [H/4, W/4, 48]. Subsequently, the image after Patch Partition undergoes a linear transformation of each pixel’s channel data via the Linear Embedding layer, changing the shape from [H/4, W/4, 48] to [H/4, W/4, C], where C is the number of channels. Both Patch Partition and the linear transformation through the Linear Embedding layer are implemented using convolutional structures. To improve convolutional efficiency, the asymmetric convolution group is used to replace the traditional convolutional structure.
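For reference, the shape changes described above can be realized with a single strided convolution acting as both Patch Partition and Linear Embedding, as in the standard Swin pipeline; the sketch below assumes this conventional form (in the proposed network, the convolution itself is replaced by the asymmetric convolution group introduced in Section 2.2.1).

```python
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """4 x 4 patch partition plus linear embedding as one strided convolution:
    [B, 3, H, W] -> [B, (H/4)*(W/4), C], matching the [H/4, W/4, C] tokens described above."""

    def __init__(self, embed_dim=96):
        super().__init__()
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=4, stride=4)  # groups 4x4 pixels into one patch
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        x = self.proj(x)                      # [B, C, H/4, W/4]
        x = x.flatten(2).transpose(1, 2)      # [B, (H/4)*(W/4), C]
        return self.norm(x)


tokens = PatchEmbed(embed_dim=96)(torch.randn(1, 3, 224, 224))        # -> [1, 56*56, 96]
```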
Following the convolution process, the image is processed through feature maps of varying sizes across four distinct stages. Stage 1 uses a Linear Embedding layer for feature transformation, while the subsequent three stages employ Patch Merging layers to perform downsampling. This hierarchical structure enables the model to capture multiscale spatial information, which is crucial for accurately classifying complex remote sensing scenes: by progressively reducing the spatial resolution while increasing the depth of the feature representations, the network balances the extraction of fine-grained details and broader contextual patterns. The features are then processed through the stacked Swin Kansformer Blocks, followed by Layer Normalization, global pooling, and a fully connected layer that produces the final classification result. Feature extraction via the modified Swin Kansformer Block and the asymmetric convolution group maximizes the network’s capacity to capture local and global feature information, thereby enhancing classification accuracy for remote sensing scenes. Figure 3 shows the architecture of the proposed Swin Kansformer network.

2.2.1. Asymmetric Convolutional Group Module

Remote sensing images often contain extensive interfering categories and complex spatial contextual information due to the inherent complexity of scenes and the diversity of ground objects. To address the limitations of multihead self-attention mechanisms in local feature extraction, this paper introduces an asymmetric convolution group module, which enhances the extraction of local regional features. Unlike traditional symmetric convolutions (e.g., 1 × 1, 3 × 3, 5 × 5), which incur high computational costs as kernel sizes increase, asymmetric convolution employs a two-step process—first applying an n × 1 convolution followed by a 1 × n convolution. This approach maintains the same receptive field as an n × n convolution but significantly reduces computational complexity while preserving spatial feature information. By leveraging this efficiency, the proposed module aims to achieve more comprehensive feature learning in remote sensing tasks.
The asymmetric convolution group module replaces the traditional 3 × 3 convolution group with three parallel branches: 3 × 3, 3 × 1, and 1 × 3 depthwise separable convolutions, each followed by ReLU activation and batch normalization to enhance nonlinearity and accelerate convergence. This design combines the strengths of symmetric and asymmetric convolutions, balancing spatial feature preservation and computational efficiency. Although the module introduces additional branches, parameter sharing and depthwise separable convolutions reduce the total parameter count to 15C + 3C2, significantly lower than the 9C2 of traditional 3 × 3 convolutions for large channel numbers C. Experiments demonstrate that this approach achieves an optimal trade-off between computational cost and performance, particularly for high-resolution remote sensing images. Furthermore, the module improves robustness to image transformations such as flipping and rotation, enhancing classification accuracy. Studies [44] confirm that replacing traditional 3 × 3 convolutions with asymmetric convolution groups boosts computational efficiency without compromising performance, underscoring the module’s theoretical and practical value in remote sensing scene classification. The structure of the asymmetric convolution module is shown in Figure 4.
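One plausible realization of this module, under our assumptions, is sketched below: three parallel depthwise-separable branches (3 × 3, 3 × 1, and 1 × 3), each with batch normalization and ReLU, whose outputs are summed. The summation fusion is our assumption; the convolutional parameter count (15C for the depthwise kernels plus 3C² for the pointwise projections) matches the figure quoted above, excluding the small BatchNorm overhead.

```python
import torch
import torch.nn as nn


class AsymmetricConvGroup(nn.Module):
    """Three parallel depthwise-separable branches (3x3, 3x1, 1x3), each followed by
    BatchNorm and ReLU; branch outputs are fused by element-wise summation (assumed)."""

    def __init__(self, channels):
        super().__init__()

        def branch(kernel_size, padding):
            return nn.Sequential(
                # depthwise convolution: 9C / 3C / 3C parameters for the three branches
                nn.Conv2d(channels, channels, kernel_size, padding=padding,
                          groups=channels, bias=False),
                # pointwise convolution completes the depthwise-separable pair: C^2 parameters
                nn.Conv2d(channels, channels, kernel_size=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )

        self.branch_3x3 = branch((3, 3), (1, 1))
        self.branch_3x1 = branch((3, 1), (1, 0))
        self.branch_1x3 = branch((1, 3), (0, 1))

    def forward(self, x):
        return self.branch_3x3(x) + self.branch_3x1(x) + self.branch_1x3(x)
```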

2.2.2. KAN Architecture

The KAN employs a combination of multiple univariate continuous functions and addition operations to express any multivariate continuous function. Specifically, a complex multivariate function input to the KAN is decomposed into an accumulation of several univariate functions. The network architecture places hidden-layer neurons between the input and output layers, following the Kolmogorov–Arnold Network framework. These intermediate neurons transform the input data into higher-level feature representations, enabling the model to capture the complex patterns and relationships inherent in remote sensing imagery and bridging the gap between the raw input data and the final classification output. The number of hidden-layer neurons is determined by the number of input variables.
During the information propagation process, the first neuron in the hidden layer is assigned the initial univariate function variable resulting from the decomposition. Subsequently, the second neuron receives the next univariate function variable, and this sequential assignment continues until all neurons in the first hidden layer are populated with their respective values. This structured approach ensures that each neuron captures a distinct aspect of the decomposed function, enabling the network to effectively model complex relationships and dependencies within the data. Such a mechanism is particularly advantageous for remote sensing scene classification, as it allows for the precise representation of intricate spatial and spectral patterns essential for accurate classification.
For the second layer neurons, each neuron constructs a univariate function from each neuron in the first layer and performs an additive operation to compute the value for the second layer neuron. The structure of the KAN is illustrated in Figure 5.
The univariate functions decomposed within the KAN can also be intuitively expressed using linear functions. The structure of the KAN’s linear functions is shown in Figure 6.
At this point, a complex multivariate continuous function is converted into a KAN with two network layers [45,46,47,48]. This transformation achieves the conversion from function to network, enabling the KAN to approximate complex functions effectively [49,50,51,52]. It serves as the theoretical foundation for replacing the multilayer perceptron (MLP) in terms of universal approximation of complex functions and nonlinear transformation capabilities.
The feature extraction capability of a simple two-layer Kolmogorov–Arnold Network (KAN) remains limited. To enhance the network’s ability to capture richer feature information, we can draw on the architectural principles of traditional convolutional neural networks (CNNs) and recurrent neural networks (RNNs) by increasing the network depth. A deeper architecture can extract more intricate and hierarchical features, transforming the KAN into a more sophisticated and powerful model that better represents the complex spatial and spectral relationships critical for accurate remote sensing scene classification. Research by Vaca-Rubio et al. has demonstrated that the KAN outperforms traditional MLPs in satellite traffic prediction tasks, achieving higher model accuracy with fewer parameters, and their ablation studies have shown the potential of KANs in adaptive prediction models [48]. Bozorgasl [39] developed a wavelet Kolmogorov–Arnold Network based on federated learning, successfully addressing the issue of heterogeneous data distribution across clients, and validated its effectiveness in improving computational efficiency and robustness on datasets such as MNIST and CIFAR10. Huang [53] provided an in-depth analysis of the convergence properties of KANs, further validating their excellent approximation and generalization capabilities. Xing [54] applied the KAN to time-series prediction and control problems, achieving remarkable results. Liang [55] further incorporated the KAN into image classification tasks, obtaining excellent classification performance. A series of recent studies has demonstrated that Kolmogorov–Arnold Networks are highly effective in approximating high-dimensional complex functions and exhibit strong theoretical interpretability and network explainability, making them particularly valuable for applications requiring transparent and understandable models [56,57,58]. Their robust performance across diverse domains underscores their versatility, particularly in tasks such as remote sensing scene classification, where capturing intricate spatial and spectral relationships is essential [59].

2.2.3. KAN Functional Module

The KAN functional module encapsulates the functional behavior of the KAN. Its core lies in spline interpolation and a kernel-based weight adjustment mechanism, which enable the KAN to adapt better to nonlinear variations within local regions. Compared with traditional multilayer perceptrons (MLPs), the KAN captures subtle variations within complex feature information more precisely through its use of spline interpolation [60,61,62,63]. In addition, the KAN enhances its learning capability through its distinctive linear layer structure, which facilitates more effective feature representation and transformation. This combination of spline-based interpolation and linear processing allows the KAN to model intricate spatial and spectral relationships more accurately, making it well suited to tasks such as remote sensing scene classification, where fine-grained feature discrimination is critical [61].
The KAN functional module is implemented through the custom KAN Linear layer. Each KAN Linear layer includes not only standard linear transformations but also nonlinear weight generation based on B-spline interpolation. The data are initially processed through an activation function and subsequently passed through the Base Linear Layer to generate initial weights. The data flow then diverges into two distinct pathways: one pathway employs dropout regularization to selectively deactivate a portion of the neurons, thereby mitigating overfitting and reducing dependency on specific neurons [21,64]. This approach enhances the network’s generalization capability by encouraging the development of more robust and distributed feature representations. Such a design is particularly beneficial in remote sensing scene classification, where the diversity and complexity of spatial patterns necessitate models that can generalize effectively across varied and unseen data. The other path generates nonlinear weights through B-spline interpolation, merging the B-spline function linear layer results with the previous path’s output for a weighted fusion. This forms the KAN Linear layer, and by iteratively stacking KAN Linear layers, the final output is obtained. The structure of the KAN functional module is shown in Figure 7.

KAN Linear

The KAN Linear module is the core component of the entire KAN functional model. It is primarily composed of two parts: the base linear transformation (Base Weight) and spline interpolation (Spline Weight).
The base linear transformation (Base Weight) is a standard linear transformation processed using matrix multiplication, similar to the linear transformation layer in traditional neural networks, and is mainly used to perform conventional matrix multiplication operations. Afterward, a nonlinear activation function is applied to the result, producing the activated output. This structure is akin to that of a traditional MLP.
The KAN’s defining feature lies in its utilization of spline interpolation (Spline Weight), which enables nonlinear adjustments to the input data through B-spline interpolation. This process maps the input data to multiple B-spline functions via a series of predefined control points, with these functions linearly combined to interpolate and refine the input. By generating smooth nonlinear curves within localized regions of the input features, spline interpolation enhances the model’s ability to capture fine-grained variations with greater precision [65,66,67]. The curve2coeff function plays a pivotal role in this process, converting the input data and control points into spline-interpolated weight coefficients. These coefficients are then applied to the input features, producing outputs that incorporate the necessary nonlinear adjustments, thereby significantly improving the network’s capacity to model complex data patterns.

Weight Fusion and Multilayer Stacking

Following a base linear transformation and spline interpolation of the input data, the outputs from both pathways are fused at each layer through element-wise addition, yielding the final output for that layer. This additive fusion mechanism enables the base linear transformation and spline interpolation to synergistically address large-scale and fine-grained local details, thereby enhancing the extraction of global features and improving the model’s overall performance. By stacking multiple KAN Linear modules, a KAN is constructed, where the output of each layer serves as the input for the subsequent layer, facilitating hierarchical feature extraction. As the depth of the network increases, its nonlinear expressive capacity is significantly enhanced, enabling the capture of increasingly complex features. This behavior mirrors the pattern observed in traditional neural networks, where deepening the network architecture substantially improves the model’s ability to extract target attributes, underscoring the effectiveness of the proposed approach in advancing feature representation capabilities [68].
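A minimal sketch of this stacking, reusing the KANLinear layer sketched in Section 2.1 (the dropout placement between layers follows the description of the KAN functional module and is our assumption):

```python
import torch.nn as nn


class KAN(nn.Module):
    """Stack of KANLinear layers: each layer fuses its base and spline paths internally by
    element-wise addition, and its output serves as the input of the next layer."""

    def __init__(self, dims, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList(
            KANLinear(d_in, d_out) for d_in, d_out in zip(dims[:-1], dims[1:])
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        for layer in self.layers:
            x = self.dropout(layer(x))
        return x


# e.g., a two-layer KAN replacing the MLP inside one Swin Kansformer block with embedding dim 96:
# kan_ffn = KAN([96, 384, 96])
```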

KAN Optimization

The original KAN includes a regularization mechanism to control the complexity of the spline interpolation weights and prevent overfitting. The model also integrates an adaptive network update mechanism, which dynamically adjusts the positions of the B-spline control points to better align with the underlying distribution of the input data. This adaptive capability enhances the model’s flexibility and precision in capturing complex spatial and spectral patterns, making it particularly effective for remote sensing scene classification tasks. By continuously optimizing the control points, the mechanism ensures that the network remains responsive to the inherent variability and heterogeneity of remote sensing imagery, thereby improving both feature representation and classification accuracy. To further improve the classification efficiency of the model in complex remote sensing data scenarios, the original KAN layer has been optimized as follows:
First, memory efficiency is improved. In the original KAN implementation, all intermediate variables must be expanded in order to apply the different activation functions. In this work, the computation process is redesigned so that different basis functions activate the input first and are then linearly combined. This significantly reduces memory cost and improves the computational efficiency of the network model.
Second, the regularization method is modified. The original KAN implementation used L1 regularization, which requires nonlinear operations on tensors. This approach is incompatible with the method adopted in this work, where different basis functions are used to activate the input before linear combinations. Therefore, the L1 regularization is applied directly to the KAN weights, which is more consistent with common regularization methods in neural networks and is compatible with the newly designed computational process [69].
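A sketch of this weight-level L1 penalty, added directly to the task loss (the coefficient value and the use of a mean over the absolute weights are illustrative choices, not values reported by the authors):

```python
def kan_l1_penalty(model, coeff=1e-5):
    """L1 regularization applied directly to the KAN base and spline weights,
    rather than to activation outputs as in the original KAN formulation."""
    penalty = 0.0
    for module in model.modules():
        if isinstance(module, KANLinear):
            penalty = penalty + module.base_weight.abs().mean() + module.spline_weight.abs().mean()
    return coeff * penalty


# total loss during training:
# loss = criterion(logits, labels) + kan_l1_penalty(model)
```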
Finally, the activation function is fine-tuned. In the KAN architecture, the activation function is designed as a trainable B-spline curve; by adjusting the granularity of the spline grid, the KAN can better approximate the target function with fewer parameters, increasing computational efficiency. The spline curves are usually initialized with the Xavier method, which keeps the output variance of each layer consistent with its input variance, preventing the gradient vanishing or explosion common in deep networks and accelerating convergence. To better match the characteristics of remote sensing scene classification datasets, this study replaces the original initialization with the Kaiming initialization method, which is better suited to the ReLU activation function, particularly when applied after the asymmetric convolution groups. Kaiming initialization adjusts the weight distribution to match the variance of the KAN over its input and output spaces, yielding more stable training, preventing gradient issues, and speeding up network convergence.

3. Experimental Results and Analysis

3.1. Description of the Dataset

AID is a large-scale aerial image dataset constructed by Wuhan University [56]. Developed from aerial imagery sourced from Google Earth, the AID dataset comprises 30 scene types, including airports, bare land, baseball fields, beaches, bridges, city centers, and churches, with all images labeled by experts in remote sensing image interpretation. In total, it contains 10,000 images. Unlike single-source datasets such as the UC Merced dataset, AID’s images are multi-sourced, gathered from a variety of remote sensing imaging sensors used by Google Earth, which poses additional difficulties for scene classification. Moreover, the images are carefully chosen from different countries and regions worldwide, including China, the US, the UK, France, Italy, Japan, and Germany, and were captured under diverse imaging conditions, at different times, and in different seasons, which enriches the intra-class diversity of the data. Figure 8 depicts the AID dataset.
The NWPU-RESISC45 dataset, developed by Northwestern Polytechnical University and released publicly in 2017, serves as a benchmark for remote sensing image scene classification [57]. It contains 31,500 images of 256 × 256 pixels covering 45 scene types, such as airports, rivers, train stations, and lakes, with 700 images per category. Spanning more than 100 countries and regions across the globe, the dataset exhibits substantial variation in spatial resolution, solar illumination angles, and viewing angles, and it demonstrates both inter-class similarity and intra-class diversity. The NWPU-RESISC45 dataset is presented in Figure 9.

3.2. Evaluation Metrics

In this research, two assessment metrics, overall accuracy (OA) and the confusion matrix (CM), are employed to evaluate remote sensing scene classification performance. Overall accuracy, a widely used evaluation metric in remote sensing, machine learning, and image classification, is a single value summarizing the classifier’s performance across all classes. It is calculated as the ratio of the number of correctly classified samples across all classes to the total number of samples in the dataset, reflecting the classifier’s generalization capability over the entire dataset. A higher OA value indicates better classification performance, making it a straightforward and intuitive measure; its simplicity and interpretability have established OA as a widely adopted benchmark in the field. The formula for overall accuracy is as follows:
P_{\mathrm{overall}} = \frac{Z}{N} \times 100\% \quad (5)
In Equation (5), N denotes the total number of samples, and Z denotes the number of correctly classified test samples.
The confusion matrix is a critical and widely utilized visualization tool for evaluating the performance of classification models. It is extensively applied in domains such as machine learning and pattern recognition, particularly for addressing classification challenges. The confusion matrix provides a quantitative assessment of the degree of confusion between different classes, offering insights into the model’s ability to distinguish between them. Beyond measuring overall accuracy, it enables researchers to analyze class-specific performance, identify misclassification patterns, and uncover potential biases. This detailed information is invaluable for subsequent model evaluation, optimization, and refinement, making the confusion matrix an indispensable tool in the development of robust and accurate classification systems, especially in remote sensing scene classification where class imbalance and inter-class similarities are common. This allows researchers to comprehensively assess the classifier’s performance, especially in the context of class imbalance and misclassification analysis. By comparing the classifier’s predicted results with the actual outcomes, the confusion matrix displays the classification performance for each class, helping identify the model’s performance across different categories and thus enabling a comprehensive evaluation of the model. The rows of the matrix represent the true classes, while the columns represent the predicted classes. Each element Xij in the matrix represents the proportion of images from the i-th true class that were predicted as the j-th class, relative to the total number of images in that class.
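For reference, Equation (5) and the row-normalized confusion matrix described above can be computed as in the following sketch:

```python
import numpy as np


def overall_accuracy(y_true, y_pred):
    """P_overall = Z / N x 100%, with Z the number of correctly classified samples and N the total."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100.0 * (y_true == y_pred).mean()


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes; each entry X_ij is the proportion
    of class-i images that were predicted as class j (rows are normalized to sum to 1)."""
    cm = np.zeros((num_classes, num_classes), dtype=np.float64)
    for t, p in zip(np.asarray(y_true), np.asarray(y_pred)):
        cm[t, p] += 1.0
    return cm / cm.sum(axis=1, keepdims=True).clip(min=1.0)
```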

3.3. Experimental Parameter Settings

To increase the heterogeneity and complexity of the remote sensing image samples, the datasets used in this study underwent data augmentation. First, the original remote sensing images were randomly cropped and resized to 224 × 224 pixels. The cropped data were then further processed with random horizontal flipping and normalization.
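A sketch of this augmentation pipeline using torchvision transforms (the normalization statistics shown are the common ImageNet values and are an assumption, as the paper does not report them):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),                    # random cropping, resized to 224 x 224
    transforms.RandomHorizontalFlip(),                    # random horizontal flipping
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],      # assumed ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
```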
In the domain of remote sensing scene classification, to ensure parameter initialization consistency between the Swin Transformer and the KAN model, this study employs the Kaiming initialization method to initialize the basis function weights and activation functions of the KAN model, as well as the multihead attention weights of the Swin Transformer, while uniformly setting the bias terms to zero. The Kaiming initialization method effectively maintains the consistency of input and output variances across network layers by adjusting the weight distribution. This approach not only mitigates the issues of gradient vanishing and explosion but also significantly enhances the training stability and convergence efficiency of the model. For the B-spline curve parameters of the KAN model and the positional encoding of the Swin Transformer, a uniform initialization strategy is adopted, ensuring that the spline grid nodes are evenly distributed within the input range, thereby preserving the spatial uniformity of the spline functions.
Regarding the experimental setup, this study constructs and validates the proposed Swin Kansformer network model using the PyTorch 2.0.1 deep learning framework. The model is optimized with the AdamW optimizer, a batch size of 16, 100 training epochs, and a weight decay coefficient of 0.05, with the cross-entropy loss as the loss function. The experiments are conducted on a Windows 11 operating system with an NVIDIA GeForce RTX 4060 GPU and an Intel i9-13900 CPU, providing robust computational resources for model training. This systematic parameter initialization scheme and experimental configuration establish a reproducible foundation for evaluating model performance in remote sensing scene classification tasks.
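The configuration above can be summarized in the following sketch (SwinKansformer and train_dataset are placeholders for the proposed model and the augmented dataset; the learning rate of 0.0004 anticipates the hyperparameter analysis in Section 3.4):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# placeholders: 'train_dataset' uses the augmentation pipeline above; 'SwinKansformer' is the proposed model
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
model = SwinKansformer(num_classes=30).cuda()             # 30 classes for AID, 45 for NWPU-RESISC45

optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=0.05)
criterion = nn.CrossEntropyLoss()

for epoch in range(100):                                  # 100 training epochs
    for images, labels in train_loader:
        images, labels = images.cuda(), labels.cuda()
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```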

3.4. Hyperparameter Analysis

To systematically evaluate the influence mechanisms of hyperparameters on the Swin Kansformer network model in the context of remote sensing scene classification, this study focuses on investigating the impact of key hyperparameters, including the kernel function bandwidth of the KAN module, the number of attention heads, and the learning rate, on the model’s classification accuracy. The optimization strategies for these hyperparameters are also explored to enhance model performance.
The learning rate, as a critical hyperparameter in the training process of deep learning models, directly affects the convergence dynamics and the ultimate classification performance. As illustrated in Figure 10, the quantitative relationship between the learning rate and the model’s classification accuracy is presented. This analysis provides insights into the optimal learning rate configuration, which is essential for achieving stable convergence and high classification accuracy in remote sensing scene classification tasks. The findings underscore the importance of carefully tuning the learning rate to balance convergence speed and model performance, thereby ensuring robust and reliable classification outcomes in complex remote sensing scenarios.
As illustrated in Figure 10, the analysis of two representative remote sensing datasets reveals a significant correlation between the learning rate and the model’s classification accuracy, with consistent patterns observed across both datasets. Specifically, when the learning rate is within a lower range, the model exhibits slower convergence but achieves relatively higher classification accuracy. As the learning rate gradually increases, the convergence speed improves significantly. At a learning rate of 0.0004, the model maintains a rapid convergence rate while achieving stable classification performance, with overall accuracy reaching its optimal level. However, when the learning rate exceeds this critical threshold, the classification accuracy begins to decline noticeably, despite further acceleration in convergence speed. Particularly, with a continued increase in the learning rate, the training process becomes increasingly unstable, ultimately leading to convergence difficulties and a substantial drop in classification accuracy. Based on this experimental analysis, the learning rate is optimized and set to 0.0004 in this study. This value not only ensures efficient convergence of the model in remote sensing scene classification tasks but also maximizes classification accuracy, providing a reliable foundation for stable model performance in complex remote sensing scenarios.
In the task of remote sensing scene classification, the kernel function bandwidth of the KAN module serves as a critical hyperparameter, significantly influencing the model’s feature extraction capability and classification performance. The kernel function bandwidth directly determines the smoothness of the kernel function and its spatial locality characteristics: a smaller bandwidth value concentrates the kernel function on local regions, enhancing the model’s sensitivity to fine-grained features, while a larger bandwidth value increases the smoothness of the kernel function, expanding the coverage of feature extraction and facilitating the capture of global contextual information. As shown in Figure 11, the relationship between the kernel function bandwidth and the model’s classification accuracy is demonstrated, providing insights into the optimal configuration of this hyperparameter for achieving superior performance in remote sensing scene classification tasks.
As illustrated in Figure 11, experimental analysis conducted on two remote sensing datasets reveals a significant nonlinear relationship between the kernel function bandwidth of the KAN module and the model’s classification accuracy, with both datasets exhibiting a similar trend of initial increase followed by a decline. When the kernel function bandwidth is set to smaller values (e.g., 0.1, 0.2), the excessive localization of the kernel function hinders the model’s ability to effectively capture global contextual information in remote sensing scenes, thereby limiting the improvement in classification accuracy. As the kernel function bandwidth gradually increases, the enhanced smoothness of the kernel function allows the model to better balance local detail features and global semantic information, leading to a notable improvement in classification accuracy. When the kernel function bandwidth reaches 0.5, the model achieves an optimal balance between local feature resolution and global feature representation, resulting in peak classification accuracy. However, when the bandwidth exceeds this optimal value, excessive smoothness causes the loss of local detail features, and the classification accuracy begins to decline. With further increases in bandwidth, the model’s performance continues to degrade, and classification accuracy drops significantly. Based on these experimental findings, this study sets the kernel function bandwidth to an optimized value of 0.5. This configuration not only ensures the model’s sensitivity to local detail features in remote sensing scene classification tasks but also retains sufficient capability to extract global contextual information, thereby achieving optimal classification performance.
The number of attention heads in the KAN module, as a critical hyperparameter, plays a significant role in determining the model’s feature extraction capability and computational efficiency. The number of attention heads not only governs the model’s ability to capture multiscale remote sensing features but also directly influences the computational resource consumption. As depicted in Figure 12, the relationship between the number of KAN attention heads and the model’s classification accuracy is illustrated, providing valuable insights into the optimal configuration of this hyperparameter for achieving a balance between feature representation and computational efficiency in remote sensing scene classification tasks.
As illustrated in Figure 12, experimental analysis conducted on two representative remote sensing datasets reveals a significant non-monotonic relationship between the number of KAN attention heads and the model’s classification accuracy, with both datasets exhibiting a similar trend of initial gradual increase followed by a decline. When the number of attention heads is limited (e.g., 1–2), the model’s multiscale feature extraction capability is constrained, making it difficult to fully capture the complex spatial-spectral characteristics of remote sensing scenes, resulting in suboptimal classification accuracy. As the number of attention heads increases, the model’s feature representation capability progressively improves, enabling more effective learning of hierarchical features in remote sensing scenes, which in turn enhances classification accuracy. When the number of attention heads reaches 6, the model achieves an optimal balance between feature extraction capability and computational efficiency, attaining peak classification accuracy. However, when the number of attention heads exceeds this optimal value, although the feature extraction capability continues to improve, the significant increase in computational complexity leads to a tendency for the model to overfit the training data, resulting in reduced generalization performance and a subsequent decline in classification accuracy. Based on these experimental findings, this study sets the number of KAN attention heads to an optimized value of 6. This configuration not only ensures the model’s effective capture of multiscale features in remote sensing scenes but also mitigates the risk of overfitting, thereby achieving optimal classification performance.

3.5. Algorithm Performance Comparison

In order to comprehensively assess the efficacy of the proposed Swin Kansformer network model for remote sensing image scene classification on the AID and NWPU-RESISC45 datasets, this part contrasts the operational performance of the Swin Kansformer model with several advanced algorithms for remote sensing scene discrimination. The specific algorithms employed for comparative analysis are as follows: Generative Adversarial Network (GAN), CFDNN [58], SSCapsNet [59], MRFN [60], MS-GCN [61], ERST [62], EMST [63], EHT [61], a combination of convolutional neural network (CNN) and Graph Convolutional Network (GCN) [21], PVT-V2_B0, Vision Transformer (ViT) + Local Context Attention (LCA), and Vision Transformer (ViT) + Positional Attention (PA) [64]. All models were evaluated under identical experimental settings to ensure a fair and consistent comparison.
Table 1 and Table 2 compare the performance of each algorithm on the AID and NWPU-RESISC45 remote sensing datasets. During the experiments, 20% and 50% of the AID scene images were randomly selected as training sets, with the remaining 80% and 50% used for testing. Likewise, 10% and 20% of the NWPU-RESISC45 images were used for training, with the remaining 90% and 80% used for testing. All algorithms were evaluated under identical computational conditions, and each experiment was repeated five times. The classification accuracy of each model on the two datasets is summarized in Table 1 and Table 2.
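As a rough sketch of this evaluation protocol, the snippet below draws a stratified random split at the stated training ratio, repeats the run five times, and reports the mean and standard deviation of the overall accuracy, which is how the ± values in Tables 1 and 2 are typically obtained. The stratified split and the `train_and_test` helper are assumptions rather than details taken from the paper.

```python
# Sketch of the repeated random-split protocol; train_and_test() is hypothetical.
import numpy as np
from sklearn.model_selection import train_test_split

def repeated_evaluation(images, labels, train_ratio, runs=5):
    accuracies = []
    for seed in range(runs):
        train_idx, test_idx = train_test_split(
            np.arange(len(labels)),
            train_size=train_ratio,
            stratify=labels,          # assumed: keep class proportions in both splits
            random_state=seed,
        )
        accuracies.append(train_and_test(images, labels, train_idx, test_idx))  # hypothetical
    return float(np.mean(accuracies)), float(np.std(accuracies))

# e.g., AID with a 20% training ratio, as in Table 1 (aid_images/aid_labels are placeholders)
mean_oa, std_oa = repeated_evaluation(aid_images, aid_labels, train_ratio=0.2)
print(f"OA = {mean_oa:.2f} ± {std_oa:.2f}")
```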
From Table 1 and Table 2, it can be observed that traditional convolution-based algorithms such as the Generative Adversarial Network (GAN), CFDNN, SSCapsNet, and MRFN often struggle to extract global feature information effectively, owing to the inherently local receptive field of convolutional kernels. Furthermore, convolutional networks show limited robustness to image rotation, scaling, and affine transformations, and consequently achieve slightly lower classification accuracy than Vision Transformer (ViT)-based models.
For Transformer-based backbone networks, such as ERST, the Efficient Multiscale Transformer (EMST), the Efficient Hierarchical Transformer (EHT), PVT-V2_B0, ViT + Local Context Attention (LCA), and ViT + Positional Attention (PA), the Transformer's ability to capture long-range dependencies and perform global modeling allows these models to circumvent the constraints of fixed convolutional kernel windows, offering greater flexibility and scalability. These advantages make Transformer-based models superior to traditional convolutional neural networks in remote sensing scene classification tasks. On the AID dataset, the ViT + LCA model, which weights the importance of different feature channels to extract multichannel feature information, delivers the best baseline performance, with overall accuracy (OA) reaching 95.43% and 96.89% at the 20% and 50% training ratios, respectively. On the NWPU-RESISC45 dataset, the ViT + LCA model and the ViT + PA model, which considers only local structural information, both achieve outstanding results, with OA values of 92.37% and 94.63% for ViT + LCA and 92.65% and 94.24% for ViT + PA at the 10% and 20% training ratios.
In comparison, the proposed Swin Kansformer model outperforms the aforementioned models on both remote sensing datasets. For the AID dataset, when 20% of the data are used for training, the Swin Kansformer obtains an OA of 96.66%, a 1.23% improvement over the ViT + LCA model's 95.43%. When 50% of the dataset is used for training, the Swin Kansformer achieves an OA of 97.78%, a 0.89% improvement over ViT + LCA's 96.89%. For the NWPU-RESISC45 dataset, the Swin Kansformer achieves an OA of 93.50% when 10% of the dataset is used for training, a 0.85% improvement over ViT + PA's 92.65%. When the training proportion is increased to 20%, the Swin Kansformer attains an OA of 94.90%, outperforming ViT + LCA's 94.63% by 0.27%. The gains on the NWPU-RESISC45 dataset are slightly smaller than those on the AID dataset, which can be attributed to the larger number of scene categories (45), the greater intra-class variability, and the overall higher complexity of NWPU-RESISC45. Nonetheless, the Swin Kansformer still delivers outstanding classification results on this dataset.

3.6. Confusion Matrix Analysis

The confusion matrix is a widely used visualization tool for scene classification tasks, providing a clear and intuitive measure of the degree of confusion in a classification model. In this study, we evaluate the classification performance and misclassification patterns of the Swin Kansformer model on two widely used remote sensing datasets, AID and NWPU-RESISC45. The analysis provides a comprehensive assessment of the model's ability to classify diverse and complex scenes while also identifying the specific scenarios in which misclassifications occur. This evaluation highlights the strengths of the Swin Kansformer architecture and offers insights into potential areas for improvement, particularly in handling challenging cases such as inter-class similarity and intra-class variability, which are common in remote sensing scene classification. In the confusion matrix, each row corresponds to the true class label and each column to the predicted class label; the diagonal entries represent correctly classified samples, while the off-diagonal entries represent misclassifications. The confusion matrix of the Swin Kansformer model on the AID dataset (with 20% of the samples used for training) is shown in Figure 13.
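The sketch below shows one way such a confusion matrix, and the per-class accuracies read off its diagonal, can be computed; `y_true` and `y_pred` stand for the test-set labels and the model's predictions and are placeholders.

```python
# Confusion matrix with rows = true labels and columns = predicted labels.
import numpy as np
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)                 # raw counts
cm_norm = cm / cm.sum(axis=1, keepdims=True)          # row-normalised (per-class recall)
per_class_accuracy = np.diag(cm_norm)                 # diagonal entries
overall_accuracy = np.trace(cm) / cm.sum()
print(f"OA = {overall_accuracy:.4f}, lowest class accuracy = {per_class_accuracy.min():.4f}")
```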
As depicted in Figure 13, the Swin Kansformer model achieved 100% classification accuracy in nine remote sensing scene categories, including base stations, bridges, forests, grasslands, ports, and sparse residential areas. The model exhibited exceptional performance in classifying beach, desert, mountain, and railway station scenes, achieving classification accuracies exceeding 98%. However, for more complex scenes such as schools and parks, the classification accuracy was comparatively lower due to the intricate internal features and strong interrelationships among objects within these categories. Other scene types, including rivers, overpasses, and resorts, also demonstrated satisfactory performance, with accuracies surpassing 90%. These results underscore the effectiveness of the proposed Swin Kansformer model, which replaces the traditional MLP with a KAN module for feature extraction and integrates asymmetric convolution groups to enhance the extraction of remote sensing feature information. The model excels in handling highly complex and interrelated objects, significantly improving the accuracy of remote sensing scene classification and demonstrating robust performance across diverse and challenging scenarios. Figure 15 presents the confusion matrix of the Swin Kansformer model on the NWPU-RESISC45 remote sensing dataset, where only 10% of the samples are used for training.
The confusion matrix of the Swin Kansformer model on the AID dataset, utilizing 50% of the training samples, is depicted in Figure 14. This matrix provides a comprehensive visualization of the model’s classification performance, highlighting the accuracy and potential misclassifications across various remote sensing scene categories. The detailed analysis of the confusion matrix offers valuable insights into the model’s strengths and areas for improvement, contributing to a deeper understanding of its effectiveness in remote sensing scene classification tasks.
As illustrated in Figure 14, the Swin Kansformer model demonstrates exceptional performance in remote sensing scene classification tasks. Specifically, the model achieves a 100% classification accuracy for scene categories such as baseball fields, beaches, forests, grasslands, rivers, and overpasses, highlighting its precise recognition capability for typical land-cover features. Simultaneously, the model exhibits outstanding performance in complex scenes with intricate spatial structures and textural characteristics, such as airports, bare land, and bridges, achieving classification accuracies exceeding 98%, thereby validating its robustness in challenging scenarios. Furthermore, for scene categories with similar spectral features or spatial distribution patterns, such as mountains and ponds, the model also delivers satisfactory classification results. This further substantiates its broad applicability and reliability in multicategory remote sensing scene classification tasks.
As shown in Figure 15, the Swin Kansformer model also demonstrates exceptional performance with respect to the NWPU-RESISC45 remote sensing dataset. The classification accuracy reached 100% for the shrub, cloud, and parking lot scenes. Additionally, classification accuracy in the beach, port, runway, and sea ice scenes exceeded 97%, while other scene categories achieved classification accuracies above 80%. This demonstrates that, despite the presence of a large number of scene categories, high inter-class similarity, and complex internal scene structures, the Swin Kansformer model consistently delivers outstanding and stable classification performance. Its ability to effectively handle these challenges significantly enhances the model's accuracy on complex remote sensing scene datasets, highlighting its robustness and adaptability in addressing the intricacies inherent in diverse and heterogeneous scenes. This performance underscores the model's potential as a reliable solution for advanced remote sensing scene classification tasks.
Figure 16 presents the confusion matrix of the Swin Kansformer model on the NWPU-RESISC45 dataset, utilizing 20% of the training samples.
As illustrated in Figure 16, the Swin Kansformer model achieves a perfect classification accuracy of 100% for scenes such as forests, lakes, sea ice, and mobile home parks. Furthermore, the model demonstrates exceptional performance in classifying complex scenes, including airplanes, airports, chaparral, and circular farmland, with accuracy rates exceeding 96%. For the remaining scene categories, the model maintains a robust classification accuracy above 85%. These results highlight the model’s superior capability in handling diverse and challenging remote sensing scenes, showcasing its effectiveness in achieving high-precision classification across a wide range of land-cover types.

3.7. Comparison of Heat Maps

To thoroughly investigate the interpretability of the KAN, this study employs heatmap visualization techniques to quantitatively analyze the model’s attention distribution over regions of interest (ROIs). The heatmap intuitively reflects the contribution of different spatial regions in the input data to the model’s decision-making process through color intensity gradients, providing critical insights into the model’s classification mechanisms. Specifically, the heatmap clearly illustrates the model’s attention distribution over key regions during feature extraction, thereby revealing the decision logic and feature preferences of the model. As shown in Figure 17, a comparative analysis of the heatmaps from the combined models allows for the direct observation of the model’s attention focus across various remote sensing scenes. This not only enhances the interpretability of the model but also offers theoretical support for optimizing its feature extraction capabilities.
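The paper does not state which heatmap technique is used, so the following is a hedged sketch of a generic Grad-CAM-style visualization: forward and backward hooks capture the activations and gradients of a chosen layer, and their channel-weighted sum gives the spatial attention map. It assumes the hooked layer outputs feature maps of shape (batch, channels, height, width); token-based Swin stages would first need to be reshaped to that layout.

```python
# Hedged Grad-CAM-style sketch; the actual heatmap method in the paper may differ.
import torch
import torch.nn.functional as F

def cam_heatmap(model, image, target_layer, class_idx=None):
    feats, grads = {}, {}
    fwd = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    bwd = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

    logits = model(image.unsqueeze(0))                 # (1, num_classes)
    if class_idx is None:
        class_idx = int(logits.argmax(dim=1))
    model.zero_grad()
    logits[0, class_idx].backward()
    fwd.remove(); bwd.remove()

    a, g = feats["a"], grads["a"]                      # assumed shape (1, C, H, W)
    weights = g.mean(dim=(2, 3), keepdim=True)         # global-average-pooled gradients
    cam = F.relu((weights * a).sum(dim=1))             # (1, H, W)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam.squeeze(0)                              # upsample/overlay on the input image
```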
As illustrated in Figure 17, the Swin Kansformer model demonstrates exceptional region of interest (ROI) localization capabilities in typical remote sensing scenes such as baseball fields, bridges, and ponds. Compared to the traditional Swin Transformer model and its convolutional-enhanced variant, the Swin Kansformer model significantly improves sensitivity to critical geographic features, primarily due to the integration of the KAN. The KAN enhances the model’s ability to capture salient features in remote sensing images through its unique kernel function mechanism and attention mechanism, enabling more precise localization of regions of interest. The heatmap visualization clearly illustrates the model’s attention distribution across different spatial regions and their relative importance during the decision-making process. This not only validates the effectiveness of the KAN in improving feature extraction capabilities but also enhances the model’s interpretability at a mechanistic level. Through heatmap analysis, the model’s focus on key geographic features in remote sensing scenes can be intuitively observed, providing critical insights for understanding the model’s decision logic and optimizing its performance. Furthermore, this highlights the unique interpretability advantages of the KAN in remote sensing scene classification tasks.

4. Discussion

4.1. Ablation Experiment

To verify the efficacy of the proposed Swin Kansformer model in remote sensing image scene classification, ablation experiments were carried out, evaluating the KAN module under different module combinations. For the AID remote sensing dataset, 20% and 50% of the images were used for training; for the NWPU-RESISC45 remote sensing dataset, training proportions of 10% and 20% were chosen. Figure 18 depicts the ablation results of the Swin Kansformer model on the AID remote sensing dataset, while Figure 19 shows the results on the NWPU-RESISC45 remote sensing dataset.
As shown in Figure 19, when applied to the NWPU-RESISC45 remote sensing dataset, which is approximately three times the size of the AID dataset, the Swin Kansformer network still demonstrates excellent classification performance. The NWPU-RESISC45 dataset not only represents a substantial increase in scale but also encompasses 45 distinct scene categories, significantly increasing the complexity of the scene classification task. This expanded diversity poses greater challenges for feature extraction, as the model must distinguish a wider range of spatial and spectral patterns while coping with higher inter-class similarity and intra-class variability, making NWPU-RESISC45 a rigorous benchmark for evaluating the robustness and generalization of remote sensing scene classification models. Even under these more challenging conditions, the proposed Swin Kansformer achieves accuracies of 93.50% and 94.90% with 10% and 20% of the data used for training, respectively, whereas the traditional Swin Transformer achieves 90.06% and 91.33%. Adding the asymmetric convolution groups alone raises the accuracies to 91.67% and 92.75%, and adding the KAN module alone raises them to 92.77% and 94.28%, corresponding to gains of 1.61%, 1.42%, 2.71%, and 2.95% over the traditional Swin Transformer. These results indicate that the Swin Kansformer significantly outperforms the traditional Swin Transformer in classification accuracy and that the KAN module effectively replaces the MLP to enhance feature extraction in complex remote sensing scenes.
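The improvement figures quoted above follow directly from the reported accuracies; the short snippet below simply reproduces that arithmetic (the grouping of the four intermediate accuracies into the two single-module variants follows the order in which they are listed in the text).

```python
# Gains of each ablation variant over the plain Swin Transformer on NWPU-RESISC45.
baseline = {"10%": 90.06, "20%": 91.33}
variants = {
    "Swin + asymmetric conv group": {"10%": 91.67, "20%": 92.75},
    "Swin + KAN module":            {"10%": 92.77, "20%": 94.28},
    "Swin Kansformer (both)":       {"10%": 93.50, "20%": 94.90},
}
for name, acc in variants.items():
    gains = {ratio: round(acc[ratio] - baseline[ratio], 2) for ratio in acc}
    print(name, gains)   # e.g., Swin + asymmetric conv group {'10%': 1.61, '20%': 1.42}
```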
The comparative experiments demonstrate that the Swin Kansformer network model, which integrates both asymmetric convolution groups and the KAN module, consistently achieves the highest classification accuracy across all experimental settings. Ablation studies involving different module combinations on two remote sensing datasets further confirm that the KAN not only effectively replaces the multilayer perceptron (MLP) but also surpasses the MLP in classification performance. These results highlight the efficacy of the KAN in classifying complex remote sensing scenes and underscore its potential as a superior alternative to the MLP for advanced feature extraction in remote sensing tasks. The findings validate the KAN’s ability to capture intricate spatial and spectral patterns, making it a robust and versatile tool for addressing the challenges inherent in remote sensing scene classification.

4.2. Comparative Analysis of KAN Optimization

To comprehensively evaluate the performance advantages of the optimized KAN module, this study conducts a systematic comparative analysis with state-of-the-art models, including LSST, ERST, MS-GCN, SANet, HCT, DFFN, EMST, SSGN, MRFN, SSCapsNet, EHT, and ASSN. The analysis focuses on three key dimensions: memory efficiency, regularization methods, and activation functions.
In terms of memory efficiency, the traditional KAN suffers from high memory consumption due to the need to expand all intermediate variables for executing different activation functions. Similarly, models such as LSST, ERST, and MS-GCN rely on complex graph convolution operations or long short-term memory (LSTM) networks, requiring substantial storage for intermediate states and gradient information, leading to significant memory overhead. Models like SANet, HCT, and DFFN, which employ self-attention mechanisms or deep feature fusion strategies, also exhibit low memory efficiency due to the storage of extensive attention weights or feature maps during computation. In contrast, the optimized KAN module significantly reduces the storage requirements for intermediate variables through an innovative design that activates inputs using basis functions and combines them linearly, outperforming the aforementioned models in memory efficiency.
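To illustrate the reformulation described above, the sketch below implements a KAN-style layer in which every scalar input is expanded once into a small set of basis responses and a single linear map mixes them, so no per-edge activation tensors need to be materialised. Gaussian radial basis functions are used here purely as a simplifying assumption in place of B-splines, and the layer sizes are arbitrary; this is not the paper's exact module.

```python
# Minimal sketch of a memory-efficient KAN-style layer (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class EfficientKANLayer(nn.Module):
    def __init__(self, in_features, out_features, num_basis=8, grid_range=(-2.0, 2.0)):
        super().__init__()
        centers = torch.linspace(*grid_range, num_basis)
        self.register_buffer("centers", centers)                     # fixed basis centres
        self.width = (grid_range[1] - grid_range[0]) / (num_basis - 1)
        # One weight per (output, input, basis) triple, mixed by a single matmul.
        self.spline_weight = nn.Parameter(torch.empty(out_features, in_features * num_basis))
        self.base_weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.spline_weight, nonlinearity="relu")
        nn.init.kaiming_uniform_(self.base_weight, nonlinearity="relu")

    def forward(self, x):                                             # x: (batch, in_features)
        # Expand each input once into basis responses: (batch, in_features, num_basis)
        phi = torch.exp(-((x.unsqueeze(-1) - self.centers) / self.width) ** 2)
        spline_out = F.linear(phi.flatten(1), self.spline_weight)     # linear mixing of bases
        base_out = F.linear(F.silu(x), self.base_weight)              # residual base branch
        return spline_out + base_out
```

Because the basis expansion `phi` is shared by all output units, the memory cost grows with in_features times num_basis rather than with every input-output edge, which is the efficiency argument made above.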
Regarding regularization methods, the original KAN employs L1 regularization, which introduces compatibility issues with the basis function activation due to its nonlinear operations. Models such as EMST, SSGN, and MRFN primarily rely on single regularization strategies like L2 regularization or Dropout, which are insufficient for handling the complexity and variability of remote sensing scene classification tasks. Although models like SSCapsNet, EHT, and ASSN adopt capsule networks or hierarchical Transformer structures, their regularization methods are computationally expensive and overly complex. The optimized KAN module innovatively applies L1 regularization directly to the network weights, aligning with conventional neural network regularization methods. This design not only effectively prevents overfitting but also enhances computational efficiency while ensuring compatibility with the new computational process.
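A minimal sketch of this weight-level L1 penalty is shown below, assuming the KAN weights are ordinary `nn.Parameter` tensors; the coefficient and the training-loop variables (`criterion`, `logits`, `targets`) are placeholders.

```python
# L1 regularization applied directly to the network weights and added to the task loss.
def l1_penalty(model, coeff=1e-5):
    return coeff * sum(p.abs().sum() for n, p in model.named_parameters() if "weight" in n)

loss = criterion(logits, targets) + l1_penalty(model)   # hypothetical training-loop variables
loss.backward()
```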
In the context of activation functions, the original KAN uses trainable B-spline curves as activation functions, initialized with the Xavier method, which is suboptimal for remote sensing scene classification tasks. Models such as LSST, ERST, and MS-GCN typically use fixed activation functions like ReLU, which lack the flexibility to capture the complex features of remote sensing data. While models like SANet, HCT, and DFFN employ self-attention mechanisms or deep feature fusion strategies, their activation function initialization methods are not optimized for remote sensing data. The optimized KAN module addresses these limitations by adjusting the spline grid granularity and replacing Xavier initialization with the Kaiming initialization method. This improvement not only enhances compatibility with asymmetric convolutional groups and ReLU activation functions but also enables more accurate approximation of the target function, reducing the number of parameters while effectively mitigating gradient vanishing or explosion issues.
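The two adjustments can be sketched as follows, reusing the `EfficientKANLayer` example above: the spline grid granularity is simply a constructor argument, and Kaiming (He) initialization is applied in place of Xavier so that the weight scale matches ReLU-style activations. Both the grid sizes and the helper are illustrative assumptions.

```python
# Kaiming initialization in place of Xavier, plus an adjustable grid granularity.
import torch.nn as nn

def kaiming_reinit(module: nn.Module) -> nn.Module:
    for p in module.parameters():
        if p.dim() >= 2:
            nn.init.kaiming_uniform_(p, nonlinearity="relu")   # replaces Xavier init
        else:
            nn.init.zeros_(p)
    return module

coarse = kaiming_reinit(EfficientKANLayer(96, 96, num_basis=5))   # coarser spline grid
fine = kaiming_reinit(EfficientKANLayer(96, 96, num_basis=12))    # finer spline grid
```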
Through these systematic optimizations, the KAN module significantly reduces computational resource requirements while maintaining model performance, providing a more efficient solution for large-scale remote sensing scene classification tasks.

5. Conclusions

Remote sensing scene classification presents significant challenges due to the diversity of scene categories, complex backgrounds, and substantial variations in object features. To address these challenges and enhance the effective extraction of spatial features from remote sensing imagery, this paper proposes the Swin Kansformer model for remote sensing scene classification. The model improves global information extraction through advanced attention mechanisms and captures local features using an asymmetric convolution group module. Furthermore, inspired by the Kolmogorov–Arnold Network (KAN), we introduce a more effective KAN-based network to replace the traditional multilayer perceptron (MLP) for remote sensing scene classification tasks. This innovative approach enables more accurate and robust feature representation, addressing the inherent complexities of remote sensing data and improving classification performance.
To evaluate the effectiveness of the proposed model, multimodule combination experiments were conducted on the AID and NWPU-RESISC45 datasets. The experimental findings indicate that the KAN-based structure achieves excellent classification accuracy for complex scene classification tasks, outperforming the traditional MLP. Using the KAN for remote sensing scene classification is a novel approach, and the results demonstrate its capability to handle complex feature information in remote sensing scenes. The introduction of the KAN thus offers a fresh route toward higher accuracy on more complex remote sensing scene classification problems and holds great potential for future applications.
Although the KAN has demonstrated promising accuracy when integrated with the traditional Swin Transformer, it still exhibits certain limitations, such as insufficient interpretability, leaving room for improvement toward achieving a fully explainable model. As a relatively novel network architecture, the KAN requires further research and optimization to enhance its performance and broaden its applicability across diverse remote sensing tasks. Future studies should prioritize improving the interpretability of the KAN and developing more efficient variants to address the challenges posed by complex remote sensing applications. These advancements will be critical for unlocking the full potential of the KAN in real-world scenarios, where both accuracy and transparency are essential for reliable decision-making.

Author Contributions

Writing—original draft preparation, investigation, methodology, validation, S.A.; writing—original draft preparation, investigation, methodology, validation, L.Z.; conceptualization, formal analysis, supervision, writing—review and editing, X.L.; investigation, visualization, software, G.Z.; data curation, visualization, P.L.; data curation, visualization, K.Z.; validation, H.M.; software, F.H.; conceptualization, formal analysis, supervision, writing—review and editing, Z.L. X.L. and Z.L. contributed equally to this paper. All authors have read and agreed to the published version of the manuscript.

Funding

Belt and Road Innovation Talent Exchange Foreign Experts Program (DL2023171002L).

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

Author Zhiyang Lian was employed by China Siwei Surveying and Mapping Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Galleguillos, C.; Belongie, S. Context-based object categorization: A critical survey. Comput. Vis. Image Underst. 2010, 114, 712–722. [Google Scholar] [CrossRef]
  2. Jalal, A.; Ahmed, A.; Rafique, A.A.; Kim, K. Scene Semantic Recognition Based on Modified Fuzzy C-Mean and Maximum Entropy Using Object-to-Object Relations. IEEE Access 2021, 9, 27758–27772. [Google Scholar] [CrossRef]
  3. Cheng, G.; Xie, X.; Han, J.; Xia, G.S. Remote sensing image scene classification meets deep learning: Challenges, methods, benchmarks, and opportunities. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 3735–3756. [Google Scholar] [CrossRef]
  4. Egenhofer, M.J.; Franzosa, R.D. Point-set topological spatial relations. Int. J. Geogr. Inf. Syst. 1991, 5, 161–174. [Google Scholar]
  5. Zhu, Q.; Zhong, Y.; Zhao, B.; Xia, G.; Zhang, L. Bag-of-Visual-Words Scene Classifier with Local and Global Features for High Spatial Resolution Remote Sensing Imagery. IEEE Geosci. Remote Sens. Lett. 2016, 13, 747–751. [Google Scholar]
  6. Zhao, L.J.; Tang, P.; Huo, L.Z. Land-use scene classification using a concentric circle-structured multiscale bag-of-visual-words model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 4620–4631. [Google Scholar] [CrossRef]
  7. Dong, R.; Xu, D.; Jiao, L.; Zhao, J.; An, J. A Fast Deep Perception Network for Remote Sensing Scene Classification. Remote Sens. 2020, 12, 729. [Google Scholar] [CrossRef]
  8. Chen, J.; Wang, C.; Ma, Z.; Chen, J.; He, D.; Ackland, S. Remote Sensing Scene Classification Based on Convolutional Neural Networks Pre-Trained Using Attention-Guided Sparse Filters. Remote Sens. 2018, 10, 290. [Google Scholar] [CrossRef]
  9. Cheng, G.; Ma, C.; Zhou, P.; Yao, X.; Han, J. Scene Classification of High Resolution Remote Sensing Images Using Convolutional Neural Networks. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; pp. 767–770. [Google Scholar]
  10. Zhou, W.; Shao, Z.; Cheng, Q. Deep Feature Representations for High-Resolution Remote Sensing Scene Classification. In Proceedings of the 2016 4th International Workshop on Earth Observation and Remote Sensing Applications (EORSA), Guangzhou, China, 4–6 July 2016; pp. 338–342. [Google Scholar]
  11. Liu, Y.; Liu, Y.; Ding, L. Scene classification based on two-stage deep feature fusion. IEEE Geosci. Remote Sens. Lett. 2017, 15, 183–186. [Google Scholar] [CrossRef]
  12. Han, X.; Zhong, Y.; Cao, L.; Zhang, L. Pre-trained alexnet architecture with pyramid pooling and supervision for high spatial resolution remote sensing image scene classification. Remote Sens. 2017, 9, 848. [Google Scholar] [CrossRef]
  13. Muhammad, U.; Wang, W.; Chattha, S.P.; Ali, S. Pre-trained VGGNet architecture for Remote-Sensing Image Scene Classification. In Proceedings of the 2018 24th International Conference on Pattern Recognition, Beijing, China, 20–24 August 2018; pp. 1622–1627. [Google Scholar]
  14. Tang, P.; Wang, H.; Kwong, S. G-MS2F: GoogLeNet based multi-stage feature fusion of deep CNN for scene recognition. Neurocomputing 2017, 225, 188–197. [Google Scholar] [CrossRef]
  15. Wang, M.; Zhang, X.; Niu, X.; Wang, F.; Zhang, X. Scene classification of high-resolution remotely sensed image based on ResNet. J. Geovisualization Spat. Anal. 2019, 3, 16. [Google Scholar]
  16. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; Uszkoreit, J. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  17. Zheng, F.; Lin, S.; Zhou, W.; Huang, H. A Lightweight Dual-Branch Swin Transformer for Remote Sensing Scene Classification. Remote Sens. 2023, 15, 2865. [Google Scholar] [CrossRef]
  18. Li, Y.; Yao, T.; Pan, Y.; Mei, T. Contextual Transformer networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 1489–1500. [Google Scholar]
  19. Wang, X.; Duan, L.; Ning, C.; Zhou, H. Relation-attention networks for remote sensing scene classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 15, 422–439. [Google Scholar]
  20. Liang, J.; Deng, Y.; Zeng, D. A deep neural network combined CNN and GCN for remote sensing scene classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 4325–4338. [Google Scholar]
  21. Guo, Y.; Ji, J.; Lu, X.; Huo, H.; Fang, T.; Li, D. Global-local attention network for aerial scene classification. IEEE Access 2019, 7, 67200–67212. [Google Scholar]
  22. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  23. Haykin, S. Neural networks: A Comprehensive Foundation; Prentice Hall PTR: Hoboken, NJ, USA, 1994. [Google Scholar]
  24. Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control. Signals Syst. 1989, 2, 303–314. [Google Scholar] [CrossRef]
  25. Hornik, K.; Stinchcombe, M.; White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 1989, 2, 359–366. [Google Scholar]
  26. Cunningham, H.; Ewart, A.; Riggs, L.; Huben, R.; Sharkey, L. Sparse autoencoders find highly interpretable features in language models. arXiv 2023, arXiv:2309.08600. [Google Scholar]
  27. Hoefler, T.; Alistarh, D.; Ben-Nun, T.; Dryden, N.; Peste, A. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. J. Mach. Learn. Res. 2021, 22, 1–124. [Google Scholar]
  28. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljacĭc, M.; Hou, T.Y.; Tegmark, M. Kan: Kolmogorov-arnold networks. arXiv 2024, arXiv:2404.19756. [Google Scholar]
  29. Leshno, M.; Lin, V.Y.; Pinkus, A.; Schocken, S. Multilayer feedforward networks with a non-polynomial activation function can approximate any function. Neural Netw. 1993, 6, 861–867. [Google Scholar] [CrossRef]
  30. Pinkus, A. Approximation theory of the mlp model in neural networks. Acta Numer. 1999, 8, 143–195. [Google Scholar] [CrossRef]
  31. Kolmogorov, A.N. On the representation of continuous functions of several variables as superpositions of continuous functions of a smaller number of variables. Proc. USSR Acad. Sci. 1956, 108, 179–182. [Google Scholar]
  32. Kolmogorov, A.N. On the Representation of Continuous Functions of Many Variables by Superposition of Continuous Functions of One Variable and Addition. In Doklady Akademii Nauk; Russian Academy of Sciences: Saint Petersburg, Russia, 1957; Volume 114, pp. 953–956. [Google Scholar]
  33. Braun, J.; Griebel, M. On a constructive proof of kolmogorov’s superposition theorem. Constr. Approx. 2009, 30, 653–675. [Google Scholar] [CrossRef]
  34. Schmidt-Hieber, J. The Kolmogorov-Arnold representation theorem revisited. Neural Netw. 2020, 137, 119–126. [Google Scholar] [CrossRef]
  35. He, J.; Li, L.; Xu, J.; Zheng, C. Relu deep neural networks and linear finite elements. arXiv 2018, arXiv:1807.03973. [Google Scholar]
  36. He, J.; Xu, J. Deep neural networks and finite elements of any order on arbitrary dimensions. arXiv 2023, arXiv:2312.14276. [Google Scholar]
  37. Vaca-Rubio, C.J.; Blanco, L.; Pereira, R.; Caus, M. Kolmogorov-arnold networks (kans) for time series analysis. arXiv 2024, arXiv:2405.08790. [Google Scholar]
  38. Wolf, T.; Kolmogorov-Arnold Networks: The Latest Advance in Neural Networks, Simply Explained. Towards Data Science. 2024. Available online: https://towardsdatascience.com/kolmogorov-arnold-networks-the-latest-advance-in-neural-networks-simply-explained-f083cf994a85/ (accessed on 22 January 2025).
  39. Bozorgasl, Z.; Chen, H. Wav-kan: Wavelet kolmogorov-arnold networks. arXiv 2024, arXiv:2405.12832. [Google Scholar] [CrossRef]
  40. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [PubMed]
  41. Udrescu, S.-M.; Tegmark, M. Ai feynman: A physics-inspired method for symbolic regression. Sci. Adv. 2020, 6, eaay2631. [Google Scholar]
  42. Udrescu, S.-M.; Tan, A.; Feng, J.; Neto, O.; Wu, T.; Tegmark, M. Ai feynman 2.0: Pareto-optimal symbolic regression exploiting graph modularity. Adv. Neural Inf. Process. Syst. 2020, 33, 4860–4871. [Google Scholar]
  43. Kemker, R.; McClure, M.; Abitino, A.; Hayes, T.; Kanan, C. Measuring Catastrophic Forgetting in Neural Networks. In Proceedings of the AAAI conference on artificial intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  44. Ding, X.H.; Guo, Y.C.; Ding, G.G.; Han, J.G. ACNet: Strengthening the Kernel Skeletons for Powerful CNN via Asymmetric Con-volution Blocks. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1911–1920. [Google Scholar] [CrossRef]
  45. Sprecher, D.A.; Draghici, S. Space-filling curves and kolmogorov superpositionbased neural networks. Neural Netw. 2002, 15, 57. [Google Scholar]
  46. Köppen, M. On the training of a kolmogorov network. In Proceedings of the Artificial Neural Networks—ICANN 2002: International Conference Madrid, Spain, 28–30 August 2002; Proceedings 12; Springer: Berlin/Heidelberg, Germany, 2002; pp. 474–479. [Google Scholar]
  47. Lin, J.-N.; Unbehauen, R. On the realization of a kolmogorov network. Neural Comput. 1993, 5, 18. [Google Scholar] [CrossRef]
  48. Lai, M.-J.; Shen, Z. The kolmogorov superposition theorem can break the curse of dimensionality when approximating high dimensional functions. arXiv 2021, arXiv:2112.09963. [Google Scholar]
  49. Leni, P.-E.; Fougerolle, Y.D.; Truchetet, F. The kolmogorov spline network for image processing. In Image Processing: Concepts, Methodologies, Tools, and Applications; IGI Global: Hershey, PA, USA, 2013; pp. 54–78. [Google Scholar]
  50. Fakhoury, D.; Fakhoury, E.; Speleers, H. Exsplinet: An interpretable and expressive spline-based neural network. Neural Netw. 2022, 152, 332–346. [Google Scholar]
  51. Montanelli, H.; Yang, H. Error bounds for deep relu networks using the kolmogorov–arnold superposition theorem. Neural Netw. 2020, 129, 1–6. [Google Scholar]
  52. He, J. On the optimal expressive power of relu dnns and its application in approximation with kolmogorov superposition theorem. arXiv 2023, arXiv:2308.05509. [Google Scholar]
  53. Huang, G.-B.; Zhao, L.; Xing, Y. Towards theory of deep learning on graphs: Optimization landscape and train ability of Kolmogorov-Arnold representation. Neurocomputing 2017, 251, 10–21. [Google Scholar]
  54. Xing, Y.; Zhao, L.; Huang, G.-B. KolmogorovArnold Representation Based Deep Learning for Time Series Forecasting. In 2018 IEEE Symposium Series on Computational Intelligence (SSCI); IEEE: Piscataway, NJ, USA, 2018; pp. 1483–1490. [Google Scholar]
  55. Liang, X.; Zhao, L.; Huang, G.-B. Deep Kolmogorov-Arnold representation for learning dynamics. IEEE Access 2018, 6, 49436–49446. [Google Scholar]
  56. Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981. [Google Scholar] [CrossRef]
  57. Cheng, G.; Han, J.; Lu, X. Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE 2017, 105, 1865–1883. [Google Scholar] [CrossRef]
  58. Ma, A.L.; Yu, N.; Zheng, Z.; Zhong, Y.F.; Zhang, L.P. A supervised progressive growing generative adversarial network for remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–18. [Google Scholar] [CrossRef]
  59. Xu, J.; Zhang, H.; Li, Y. Spectral-Spatial Capsule Networks for Remote Sensing Scene Classification. In IEEE Geoscience and Remote Sensing Letters; IEEE: Piscataway, NJ, USA, 2025. [Google Scholar]
  60. Liu, Y.; Zhang, X.; Chen, H. Multi-Resolution Fusion Networks for Remote Sensing Scene Classification. In IEEE Transactions on Pattern Analysis and Machine Intelligence; IEEE: Piscataway, NJ, USA, 2025. [Google Scholar]
  61. Chen, L.; Yang, T.; Zhou, Y. Multi-scale graph convolutional networks for remote sensing scene classification. ISPRS J. Photogramm. Remote Sens. 2024, 62, 4402618. [Google Scholar]
  62. Wang, J.; Liu, H.; Zhang, W. Efficient Remote Sensing Transformer for Scene Classification. In IEEE International Geoscience and Remote Sensing Symposium (IGARSS); IEEE: Piscataway, NJ, USA, 2024. [Google Scholar]
  63. Zhao, X.; Wang, Y.; Liu, Z. Efficient Multi-Scale Transformers for Remote Sensing Scene Classification. In IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing; IEEE: Piscataway, NJ, USA, 2024. [Google Scholar]
  64. Lv, P.Y.; Wu, W.J.; Zhong, Y.F.; Du, F.; Zhang, L.P. SCViT: A spatial-channel feature preserving vision transformer for remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4409512. [Google Scholar] [CrossRef]
  65. Zhang, Y.; Li, X.; Wang, L. Kernel Attention Network for remote sensing image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1234–1245. [Google Scholar]
  66. Chen, J.; Liu, H.; Zhang, W. KANet: A Kernel Attention Network for semantic segmentation of remote sensing images. Remote Sens. 2023, 15, 567–580. [Google Scholar]
  67. Wang, X.; Li, Y. Kernel Attention Network for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2023, 20, 789–793. [Google Scholar]
  68. Liu, Y.; Zhang, Z. KANet: A Kernel Attention Network for change detection in remote sensing images. Int. J. Remote Sens. 2023, 44, 2345–2360. [Google Scholar]
  69. Li, H.; Wang, J. Kernel Attention Network for multi-temporal remote sensing image analysis. Remote Sens. Environ. 2023, 285, 113456. [Google Scholar]
Figure 1. Visualization of the B-spline curve and its corresponding control points. The B-spline curve (blue) is fitted through a series of control points (red), with the control points marked by green plus signs.
Figure 2. Comparison between multilayer perceptron (MLP) and Kolmogorov–Arnold Network (KAN).
Figure 3. Structure of the Swin Kansformer model.
Figure 4. Structure of the asymmetric convolutional group.
Figure 5. Intuitive representation of the KAN structure.
Figure 6. Linear function expression structure of the KAN.
Figure 7. Structure of the KAN functional module.
Figure 8. AID dataset (30 categories).
Figure 9. NWPU-RESISC45 dataset (45 categories).
Figure 10. Changes in learning rate and model accuracy.
Figure 11. Kernel function bandwidth and model accuracy variation in the KAN module.
Figure 12. The number of KAN attention heads and the change in model accuracy.
Figure 13. Confusion matrix of Swin Kansformer using the AID dataset (20% training samples).
Figure 14. The confusion matrix of the Swin Kansformer model on the AID dataset (50% training samples).
Figure 15. Confusion matrix of Swin Kansformer on the NWPU-RESISC45 dataset (10% training samples).
Figure 16. The confusion matrix of the Swin Kansformer model on the NWPU-RESISC45 dataset (20% training samples).
Figure 17. Thermal diagram of module combination.
Figure 18. Ablation experiment results of the Swin Kansformer network model on the AID remote sensing dataset.
Figure 19. Ablation experiment results of the Swin Kansformer network model on the NWPU-RESISC45 remote sensing dataset.
Table 1. Comparison of classification results of various networks on the AID dataset (%).

| Classification Method | 20% Training Ratio | 50% Training Ratio |
| --- | --- | --- |
| SSCapsNet | 92.67 ± 0.43 | 94.01 ± 0.37 |
| MRFN | 92.45 ± 0.44 | 93.89 ± 0.39 |
| MS-GCN | 92.12 ± 0.48 | 93.56 ± 0.41 |
| ERST | 92.89 ± 0.42 | 94.21 ± 0.36 |
| EMST | 92.78 ± 0.42 | 94.12 ± 0.36 |
| EHT | 92.78 ± 0.42 | 94.12 ± 0.36 |
| GAN | 94.51 ± 0.15 | 96.45 ± 0.19 |
| CFDNN | 94.56 ± 0.17 | 96.56 ± 0.24 |
| Combined CNN and GCN | 94.93 ± 0.31 | 96.89 ± 0.10 |
| PVT-V2_B0 | 93.52 ± 0.35 | 96.27 ± 0.14 |
| ViT + LCA | 95.43 ± 0.22 | 96.89 ± 0.41 |
| ViT + PA | 95.31 ± 0.11 | 96.72 ± 0.16 |
| Swin Kansformer | 96.66 ± 0.24 | 97.78 ± 0.22 |
Table 2. Comparison of classification results of various networks on the NWPU-RESISC45 dataset (%).

| Classification Method | 10% Training Ratio | 20% Training Ratio |
| --- | --- | --- |
| SSCapsNet | 90.34 ± 0.51 | 91.95 ± 0.47 |
| MRFN | 90.12 ± 0.52 | 91.78 ± 0.48 |
| MS-GCN | 89.78 ± 0.53 | 91.45 ± 0.49 |
| ERST | 90.45 ± 0.51 | 92.10 ± 0.46 |
| EMST | 90.45 ± 0.51 | 92.01 ± 0.46 |
| EHT | 90.45 ± 0.51 | 92.01 ± 0.46 |
| GAN | 91.06 ± 0.11 | 93.63 ± 0.12 |
| CFDNN | 91.17 ± 0.13 | 93.83 ± 0.09 |
| Combined CNN and GCN | 90.75 ± 0.21 | 92.87 ± 0.13 |
| PVT-V2_B0 | 89.72 ± 0.16 | 92.95 ± 0.09 |
| ViT + LCA | 92.37 ± 0.20 | 94.63 ± 0.13 |
| ViT + PA | 92.65 ± 0.20 | 94.24 ± 0.16 |
| Swin Kansformer | 93.50 ± 0.18 | 94.90 ± 0.20 |