Article

Hyperspectral Image Classification Using SIFANet: A Dual-Branch Structure Combining CNN and Transformer

1 School of Artificial Intelligence, China University of Geosciences (Beijing), Beijing 100083, China
2 Hebei Key Laboratory of Geospatial Digital Twin and Collaborative Optimization, Beijing 100083, China
3 Frontier Science Center for Deep-Time Digital Earth, China University of Geosciences (Beijing), Beijing 100083, China
4 State Key Laboratory of Geological Processes and Mineral Resources, Beijing 100083, China
5 School of Natural Resources and Geomatics, Nanning Normal University, Nanning 530100, China
6 University Engineering Research Center of “Satellite +” Space AI Intelligent Governance of Natural Resources, Nanning 530100, China
7 Guangxi Engineering Research Center for Smart Monitoring and Governance of Agricultural Land, Nanning 530100, China
8 Guangxi Zhuang Autonomous Region Institute of Geographic Information Surveying and Mapping, Liuzhou 545006, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(3), 398; https://doi.org/10.3390/rs18030398
Submission received: 24 December 2025 / Revised: 21 January 2026 / Accepted: 22 January 2026 / Published: 24 January 2026

Highlights

What are the main findings?
  • A novel dual-branch network, SIFANet, was developed, which synergistically integrates CNN and Transformers for enhanced HSI classification.
  • The architecture successfully captures both local spectral–spatial features via the CNN branch and long-range global dependencies through the Transformer branch, overcoming the limitations of single-backbone models.
What are the implications of the main findings?
  • SIFANet provides a robust solution for HSI classification in complex scenarios, significantly improving accuracy by effectively balancing local texture details and global contextual information.
  • The dual-branch fusion strategy offers a scalable framework for hybrid deep learning models, potentially advancing the processing of high-dimensional remote sensing data with limited labeled samples.

Abstract

The hyperspectral image (HSI) is rich in spectral information and has important applications in ground object classification. However, HSI data have high dimensionality and variable spatial–spectral features, which make it difficult for some models to adequately extract effective features. Recent studies have shown that fusing spatial and spectral features can significantly improve accuracy by exploiting multi-dimensional correlations. Accordingly, this article proposes a spectral integration and focused attention network (SIFANet) with a two-branch structure. SIFANet captures local spatial features and global spectral dependencies through a parallel-designed spatial feature extractor (SFE) and spectral sequence Transformer (SST), respectively. A cross-module attention fusion (CMAF) mechanism dynamically integrates features from both branches before final classification. Experiments on the Salinas and Xiong’an hyperspectral datasets show overall accuracies of 99.89% and 99.79%, respectively, higher than those of all compared models. The proposed method also achieves the lowest standard deviation of category accuracy and the best computational efficiency metrics, demonstrating robust spatial–spectral feature integration for improved classification.

1. Introduction

The hyperspectral image (HSI) is a three-dimensional data cube acquired by remote sensing platforms (such as aircraft or satellites) using push-broom or staring imaging techniques [1,2]. It contains two-dimensional spatial images corresponding to multiple spectral bands. Unlike traditional multispectral images, HSIs collect continuous spectral information from hundreds of narrow bands. These bands span the visible, near-infrared, and shortwave infrared regions (400–2500 nm) of the electromagnetic spectrum. As a result, each pixel in an HSI contains reflectance values across hundreds of spectral channels, forming a continuous spectral curve [3]. This enables HSIs to provide rich spatial–spectral information [4]. In recent years, the rapid development of computer technology has greatly advanced HSI classification tasks. HSIs have been widely applied in agricultural production [5,6], city planning [7,8], environmental science [9,10], and other fields.
HSI classification faces challenges from three aspects: data, model training, and application. At the data level, the typical 200–300 spectral bands of HSIs lead to high dimensionality, raising computational complexity [11]. Moreover, the limited availability of ground truth training samples results in data sparsity in the high-dimensional feature space, violating statistical learning assumptions. This leads to the Hughes phenomenon. Studies have shown that when the ratio of feature dimensionality to sample size exceeds a critical threshold, the generalization performance of classifiers such as Support Vector Machine (SVM) and K-Nearest Neighbor (KNN) will sharply decline due to overfitting. In addition, strong inter-band correlations will introduce substantial redundant information to models, which hinders effective learning and training [12]. At the model training level, large training sample sizes demand substantial computational resources, significantly increasing training costs [13]. In practical applications, HSI preprocessing involves complex procedures such as radiometric and atmospheric correction, geometric registration, and manual sample labeling [14]. Additionally, to further improve the quality of hyperspectral data, research on enhanced deep image prior has been applied to unsupervised hyperspectral image super-resolution [15], which effectively enhances the spatial resolution and reduces noise without the need for large-scale training datasets. Furthermore, cross-scene model transfer issues caused by various imaging angles and sensor inconsistencies present significant challenges [16]. These factors will greatly affect the HSI classification performance in practical applications. In response, researchers have proposed and refined various classification algorithms to address these challenges, including traditional machine learning methods, deep learning approaches, and multi-model fusion methods.
Traditional machine learning methods depend on manual feature engineering. The SVM uses kernel functions to map features into a high-dimensional space and performs especially well in small-sample scenarios. Random forest (RF), using bagging and random subspaces, effectively handles high-dimensional nonlinear data. The K-Nearest Neighbor (KNN) algorithm classifies samples by distance-based majority voting. Although KNN is simple, it faces limitations in high-dimensional spaces due to the “curse of dimensionality”, which makes it difficult to achieve satisfactory classification results. Additionally, KNN is sensitive to parameters and has lower computational efficiency. Champa et al. [17] proposed a hybrid technique which combined tree-based classifiers with feature dimensionality reduction. The results show that decision trees (DTs), extra trees (ETs), and RF perform well in HSI classification.
Compared to traditional methods, deep learning approaches offer better adaptability for HSI classification tasks. Convolutional neural networks (CNNs) automatically extract spatial–spectral features, mimicking biological visual mechanisms [18]. Residual neural networks (ResNet) incorporate residual structures for cross-layer information transfer. This structure effectively mitigates the vanishing gradient problem [19]. Li et al. [20] introduced the deep belief network (DBN) for feature extraction and image classification. Zhao et al. [21] proposed a feature-level fusion classification framework combining CNN with texture features, utilizing HSI and LiDAR data for high classification accuracy. As for sequential spectral data, the recurrent neural network (RNN) captures spectral sequence correlations via recursive feedback mechanisms [22]. Long short-term memory (LSTM) networks use gating units to alleviate long-term dependency issues [23]. Recently, the Transformer, which improves semantic extraction through global dependency modeling, has gained popularity in natural language processing (NLP) [24]. Inspired by this, researchers have integrated attention mechanisms into HSI classification to enhance accuracy by strengthening feature interactions. Hong et al. [25] proposed the SpectralFormer network, which uses self-attention to model long-range spectral dependencies. However, its unimodal (spectral-only) structure neglects critical spatial information of HSIs, limiting effectiveness. To address this issue, Wang et al. [26] introduced a 3D attention mechanism to capture joint spectral–spatial features. Ding et al. [27] developed a global–local Transformer network (GLT-Net) for joint classification using multi-scale feature fusion. However, inherent architectural constraints limit the performance of single-structure models. Specifically, CNNs rely on local receptive fields, which make them effective at capturing local textures but less capable of modeling long-range spatial dependencies or the continuous sequential correlation of spectral bands. Conversely, while Transformers excel at global dependency modeling via self-attention, they lack the spatial inductive bias (such as translation invariance and locality) inherent in CNNs. This often leads to difficulties in capturing fine-grained local spatial structures and results in quadratic computational complexity when dealing with high-dimensional spectral sequences. Consequently, deep learning models still face two key limitations: they require large amounts of annotated data, which is costly and time-consuming [28], and they involve high computational complexity, hindering practical deployment [29]. To alleviate this, transductive few-shot learning with enhanced spectral–spatial embedding [30] has been proposed to achieve robust classification with minimal labeled samples.
In recent years, multi-model fusion methods have shown great promise in pattern recognition and image classification by leveraging the complementary strengths of different structures. Researchers optimize feature extraction and processing by combining various models. For example, Wei et al. [31] proposed a strategy that uses CNNs for feature extraction, followed by SVM classification. This approach not only improves accuracy but also reduces overfitting and alleviates reliance on large sample datasets. Combining multi-scale feature fusion with gradient boosting decision trees (GBDTs) has also been proven effective. This ensemble approach integrates multiple weak learners to enhance classification stability and accuracy, especially in complex land cover scenarios [32]. To better exploit the complementary nature of spatial and spectral information, researchers have increasingly turned to dual-branch architectures. For example, the Double-Branch Dual-Attention (DBDA) network [33] utilizes two independent branches to capture spatial and spectral features, respectively, enhanced by dual-attention mechanisms. Similarly, the Multiscale Neighborhood Attention Transformer (MSNAT) [34] further refines this by extracting multi-scale features to adapt to varying land cover sizes. To address feature extraction from a frequency perspective, the Dual Frequency Transformer Network (DFTN) [35] was developed. It utilizes a dual-branch frequency domain feature extraction block to simultaneously capture high-frequency local details and low-frequency global variations, demonstrating the effectiveness of frequency domain information in enhancing HSI classification. Alkhatib et al. [36] proposed an attention-based dual-branch network that fuses features from a real-valued neural network (RVNN) and a complex-valued neural network (CVNN). By using the Fourier transformation to extract frequency information, this model enhances HSI classification performance. While these models have significantly improved classification accuracy, they often face challenges such as high parameter redundancy and suboptimal fusion of heterogeneous features from the two branches. Many researchers have focused on efficient spatial–spectral feature extraction. Yang et al. [3] introduced the multi-scale hybrid CNN–attention (MS-Hybrid-A) network, which uses 3D convolutions for spectral–spatial feature extraction. Additionally, it extracts supplementary spatial details with 2D convolutions and incorporates a convolutional block attention module (CBAM) to improve classification performance. Liang et al. [37], inspired by the Transformer, proposed HSI-Mixer, which utilizes a hybrid measurement-based linear projection (HMLP) module for deep spectral–spatial feature fusion. Kong et al. [38] developed a co-feature extraction framework that integrates graph embedding with deep learning. They constructed a supervised within-class/between-class hypergraph (SWBH) for spectral feature learning and introduced a random zero-masking strategy to generate augmented labeled samples. This facilitates CNN-based spatial feature extraction and mitigates overfitting in small-sample settings. Ahmad et al. [39] proposed the spatial morphological Mamba (SMM) and spatial–spectral morphological Mamba (SSMM) networks, which employ depth-wise separable convolutions to implement morphological operations such as erosion and dilation.
By leveraging State Space Models (SSMs), these Mamba-based approaches achieve linear computational complexity while effectively modeling long-range dependencies, offering a promising, efficient alternative to standard Transformers for processing high-dimensional spectral sequences. To address the challenge of cross-layer information loss, Chen et al. [40] designed a hybrid pooling attention (HPA) module and a cross-layer feature fusion (CFF) module to preserve crucial information during the propagation process. Gao et al. [41] introduced a plug-and-play adaptive feature fusion (AFF) module that processes multi-layer networks to better utilize spatial and spectral features. Guo et al. [42] proposed an adaptive score-weighting method to fuse features from spatial and spectral branches. Similarly, the concept of reference-based adaptive modulation has been explored in multi-style fusion tasks [43], providing insights into the dynamic adjustment of features across different domains.
In conclusion, traditional machine learning methods such as SVM and RF perform well on high-dimensional data when only a few labeled samples are available. However, they have limited capacity to model nonlinear relationships and thus struggle to process large-scale, high-dimensional data with complex noise. Deep learning methods, by contrast, greatly improve feature representation through hierarchical abstraction. Nevertheless, they require large quantities of high-quality labeled samples and bring drawbacks such as long training times and high computational cost. Furthermore, multi-model fusion methods can mitigate overfitting effectively and reduce dependence on labeled samples. However, they pose challenges in designing compatible modules and incur heavy computational overhead.
The existing research still lacks effective methods to dynamically fuse spatial and spectral features for HSI classification. Furthermore, the number of model parameters and computational overhead remain too high. In the broader field of remote sensing, highly efficient architectures like SFEARNet [44] have successfully combined semantic flow and edge-aware refinement for tasks such as change detection, demonstrating the importance of structural information in efficient network design. Furthermore, the hybrid CNN–Transformer architecture has been widely validated in diverse remote sensing applications, including cloud detection using Landsat/Sentinel data [45], wind speed sensing using Global Navigation Satellite System Reflectometry (GNSS-R) [46], and chlorophyll concentration inversion using SeaWiFS data [47]. These successes in handling complex, multi-modal remote sensing data provide a strong rationale for adopting a hybrid design to capture both local spatial details and global spectral dependencies in HSI classification. Building on the need for efficiency, this article proposes a dual-branch network called the spectral integration and focused attention network (SIFANet). SIFANet is built on a hybrid CNN–Transformer structure and incorporates a channel attention mechanism to emphasize important spectral bands. By optimizing feature extraction and fusion, SIFANet could improve HSI classification performance and enhance classification accuracy to a certain extent.
The main contributions of this article are as follows:
  • Efficient feature extraction structure
    This article designs a dual-branch network composed of a spatial feature extractor (SFE) and a spectral sequence Transformer (SST). The SFE is enhanced with residual blocks (RBs) to alleviate the vanishing gradient problem and accelerate convergence. Simultaneously, the SST incorporates a Conv-Former module to improve spectral feature extraction, enabling the efficient and parallel extraction of spatial–spectral features.
  • Cross-Module Attention Fusion (CMAF)
    This article introduces a channel attention-based CMAF mechanism to dynamically and adaptively fuse features from different branches, which significantly reduces information loss during the feature integration step.
  • Comprehensive HSI classification accuracy assessment indices
This article develops a novel computational accuracy parameter efficiency (CAPE) index to quantify the computational efficiency of different models. In addition, the proposed evaluation index system (EIS) also includes classification accuracy metrics, confusing-category performance, and a computational efficiency index, enabling a comprehensive, multidimensional assessment of model performance.

2. Methodology

The research workflow is illustrated in Figure 1. First, principal component analysis (PCA) was applied to reduce the dimensionality of the HSI data. To retain most of the information while significantly compressing the data, we retained the first few principal components (PCs) based on a cumulative variance contribution rate threshold of >99%. Specifically, for the Salinas dataset (originally 204 bands), the first 7 PCs were retained, preserving 99.12% of the spectral information. For the Xiong’an hyperspectral dataset (originally 256 bands), the first 6 PCs were retained, preserving 99.09% of the information. Subsequently, spatial and spectral feature tensors were generated via a masking operation and then were input into SIFANet. Within SIFANet, spatial and spectral features were captured through SFE and SST branches, respectively. CMAF was employed to achieve the dynamic weighted fusion. Finally, the fused features were fed into the classifier to produce the result. During the validation step, SIFANet was comprehensively evaluated and compared against other comparable models using the EIS.
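To make the dimensionality-reduction step concrete, the sketch below shows how the cumulative-variance threshold described above (>99%) can be applied with scikit-learn. The function name reduce_bands, the reshaping convention, and the use of the "full" SVD solver are illustrative assumptions rather than the authors' code.

```python
# Minimal sketch of the PCA band-reduction step (illustrative, not the authors' implementation).
import numpy as np
from sklearn.decomposition import PCA

def reduce_bands(cube: np.ndarray, var_threshold: float = 0.99) -> np.ndarray:
    """Reduce an HSI cube of shape (H, W, M) to the fewest PCs whose cumulative
    explained variance exceeds `var_threshold` (>99% in the paper)."""
    h, w, m = cube.shape
    flat = cube.reshape(-1, m).astype(np.float64)        # pixels as rows, bands as columns
    pca = PCA(n_components=var_threshold, svd_solver="full")  # sklearn picks the PC count
    reduced = pca.fit_transform(flat)
    print(f"kept {pca.n_components_} PCs, "
          f"{pca.explained_variance_ratio_.sum():.4f} variance retained")
    return reduced.reshape(h, w, pca.n_components_)
```

With this selection rule, the reported retained components (7 PCs for Salinas, 6 PCs for Xiong'an) follow directly from the variance threshold rather than from a manually fixed band count.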

2.1. SIFANet Structure

The SIFANet adopts a parallel dual-branch structure rather than a serial one to decouple the extraction of spatial and spectral features. While serial structures often lead to the loss of original information during sequential processing, the parallel design allows the SFE and SST to independently capture local textures and global dependencies. This simultaneous processing ensures that the distinctive characteristics of both domains are fully preserved before being dynamically integrated through the CMAF module, thereby maximizing feature complementarity.

2.1.1. Spatial Feature Extraction

The SFE branch adopts a “Conv + RB” CNN variant structure. It applies residual learning to improve gradient flow and mitigate the vanishing gradient problem. Additionally, max-pooling downsampling is incorporated to reduce computational complexity and accelerate training.
Figure 2 illustrates the detailed structure of the SFE module. It begins by applying a 3 × 3 convolutional kernel to extract preliminary spatial features from local neighborhoods. To further reduce spatial dimensionality, a max-pooling layer is subsequently applied, formulated as
X_{3d} = f_{MP}\left( f_{Conv}(X) \right)
where X is the input tensor, X_{3d} is the processing result, and f_{Conv} and f_{MP} represent the 3 × 3 convolution layer and the max-pooling layer, respectively. f_{MP} uses a 2 × 2 window and a stride of 2 for pooling.
The RB module adopts a two-branch structure to facilitate feature reuse. The main branch consists of two consecutive “Conv + BatchNorm (BN)” layers, which are used to extract deeper spatial features. BN standardizes intermediate feature distributions across mini batches to mitigate the vanishing gradient problem and accelerate model convergence. The BN operation is computed as follows:
X_{BN} = \frac{X - \bar{X}}{\sqrt{\sigma_X^2 + \epsilon}} \cdot \gamma + \beta
where X is the input tensor of the residual block, X_{BN} is the batch-normalized output, \bar{X} is the mean of the input tensor, \sigma_X^2 is the variance of the input data, ϵ is a small constant added for numerical stability, and γ and β are learnable parameters.
Nonlinearity is introduced through the ReLU activation function. The operation is expressed as follows:
X' = f_{Max}(0, X)
where X denotes the input tensor, X' is the output tensor, and f_{Max} represents the element-wise maximum function used by ReLU.
The skip connection employs cross-layer identity mapping to directly propagate the shallow features X_{3d} to deeper layers, thereby preserving important spatial information. The fused output is obtained through element-wise feature summation. Denoting the sequence of operations in the main path as f_{RB}, and using ⊕ to represent element-wise summation, the operation can be defined as follows:
X_{RB} = X_{3d} \oplus f_{RB}(X_{3d})
Finally, at the output stage, an adaptive average pooling layer is applied to adjust the spatial resolution of the feature maps to a fixed size, and the result is denoted as F C N N . This operation ensures consistent output feature dimensions across both processing branches.
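As a minimal illustration of the SFE branch described above, the PyTorch sketch below chains a 3 × 3 convolution, 2 × 2 max pooling, one residual block, and adaptive average pooling. The channel width, the number of residual blocks, and the pooled output size are assumptions not specified at this point in the paper.

```python
# Hedged sketch of the SFE branch ("Conv + RB" with max pooling and adaptive average pooling).
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.main = nn.Sequential(                       # main path f_RB: two "Conv + BN" layers
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return torch.relu(x + self.main(x))              # X_RB = X_3d ⊕ f_RB(X_3d)

class SFE(nn.Module):
    def __init__(self, in_ch: int, ch: int = 64, out_size: int = 1):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, ch, 3, padding=1),          # 3x3 convolution on the local neighborhood
            nn.MaxPool2d(kernel_size=2, stride=2),       # 2x2 max pooling with stride 2
        )
        self.rb = ResidualBlock(ch)
        self.pool = nn.AdaptiveAvgPool2d(out_size)       # fixed spatial size for branch alignment

    def forward(self, x):                                # x: (B, in_ch, P, P)
        f = self.rb(self.stem(x))
        return self.pool(f).flatten(1)                   # F_CNN, shape (B, ch * out_size^2)
```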

2.1.2. Spectral Sequence Transformer

Some land cover categories are difficult to distinguish when only relying on spatial features due to their high spatial similarity. To address this issue, this article proposes an SST branch, which is illustrated in Figure 3.
A fully connected layer processes the input X, converting the spectral vector of each pixel into a token embedding. This transformation constructs a sequence of tokens X_{token}, forming the primary input sequence for the Transformer. Concurrently, a lower branch generates a learnable positional embedding P designed to encode the position. P is added element-wise to the token sequence to obtain the position-augmented input representation X_{stacked}, i.e.,
X_{stacked} = X_{token} \oplus P
The position-augmented spectral sequence X_{stacked} is then fed into the improved Conv-Former encoder module for feature encoding, producing Z_{out}. Finally, a linear layer projects Z_{out} to match the output dimensions of the SFE branch. The final spectral feature representation is given by
F_{Trans} = W_p Z_{out} + b_p
where W_p is the weight matrix of the linear layer, b_p is the bias vector of the linear layer, and F_{Trans} is the spectral feature information output by the SST branch.
This article combined convolution with a Transformer encoder to leverage the advantages of convolution operations and self-attention mechanisms. Inspired by Lin et al. [48], the improved “Conv-Former” structure is shown in Figure 4. We have significantly adapted the module for HSI-specific characteristics. Specifically, the conventional feed-forward network is replaced with a convolutional feed-forward network based on 1 × 1 convolutions, which efficiently reorganizes and compresses high-dimensional spectral channels while preserving spatial information. Batch normalization is employed throughout the module to improve training stability under the small-sample regime commonly encountered in HSI applications. Together, these modifications allow the proposed module to better capture discriminative spatial–spectral representations and make it more suitable for HSI analysis.
The convolution-attention residual module processes features through its dual-branch structure. The main branch extracts features with a convolution-attention block (CAB), while the auxiliary branch preserves the original input features. The output of the module is computed as
X_{CA} = X \oplus f_{CAB}(X)
where X is the input tensor, X_{CA} is the module’s output, and f_{CAB} denotes the operations within the CAB.
The CAB adopts a “Conv + Self-Attention” processing method. It first applies a 3 × 3 convolution to extract local spectral features, then constructs the query (Q), key (K), and value (V) matrices for self-attention (SA) computation. A multi-head self-attention (MHSA) mechanism with h = 4 is employed to model global contextual relationships, balancing computational costs and model representational capacity. The SA process is defined as
M_h = f_{SM}\left( \frac{QK^T}{\sqrt{d_k}} \right)
SA(Q, K, V) = M_h \otimes V
where d_k denotes the input sequence dimension, M_h represents the attention weight matrix, K^T is the transpose of K, ⊗ indicates matrix multiplication, and f_{SM} is the SoftMax function.
The MHSA mechanism captures diverse dependencies and hierarchical relationships by integrating multiple self-attention heads [24]. The overall MHSA output is expressed as
MHSA(Q, K, V) = f_{CAT}(SA_1, SA_2, SA_3, \ldots, SA_h)
where SA_i (i = 1, 2, 3, …, h) represents the self-attention result of the i-th part and f_{CAT} is the Concat function.
Finally, a 1 × 1 convolution layer is applied to integrate the features obtained from global dependencies, resulting in the refined output X_{SA}.
The convolutional feed-forward network (CFFN) refines the features from the upper layer through the following operation:
X_{CFFN} = X_{CA} \oplus f_{CFFN}(X_{CA})
where X_{CFFN} is the output and f_{CFFN} refers to the transformation performed within the CFFN.
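The PyTorch sketch below assembles the pieces of the Conv-Former encoder described above: a convolution-attention block with four-head self-attention, a convolutional feed-forward network built from 1 × 1 convolutions, batch normalization throughout, and residual connections around both sub-blocks. The tensor layout, embedding dimension, and expansion factor are assumptions; this is one plausible reading of the module, not the authors' implementation.

```python
# Hedged sketch of the improved Conv-Former encoder block (CAB + CFFN with residuals).
import torch
import torch.nn as nn

class ConvFormerBlock(nn.Module):
    def __init__(self, d_model: int = 64, heads: int = 4, expansion: int = 2):
        super().__init__()
        self.local = nn.Sequential(                      # convolution for local spectral features
            nn.Conv1d(d_model, d_model, 3, padding=1),
            nn.BatchNorm1d(d_model), nn.ReLU(inplace=True))
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)  # MHSA with h = 4
        self.proj = nn.Conv1d(d_model, d_model, 1)       # 1x1 conv integrating attention output
        self.cffn = nn.Sequential(                       # convolutional feed-forward network (1x1 convs)
            nn.Conv1d(d_model, d_model * expansion, 1),
            nn.BatchNorm1d(d_model * expansion), nn.ReLU(inplace=True),
            nn.Conv1d(d_model * expansion, d_model, 1),
            nn.BatchNorm1d(d_model))

    def forward(self, x):                                # x: (B, L, d_model) token sequence
        h = self.local(x.transpose(1, 2))                # (B, d_model, L)
        a, _ = self.attn(h.transpose(1, 2), h.transpose(1, 2), h.transpose(1, 2))
        x_ca = x + self.proj(a.transpose(1, 2)).transpose(1, 2)          # X_CA = X ⊕ f_CAB(X)
        return x_ca + self.cffn(x_ca.transpose(1, 2)).transpose(1, 2)    # X_CFFN
```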

2.1.3. Cross-Module Attention Fusion

To facilitate cross-modal feature fusion, this article designs the CMAF module, as shown in Figure 5, inspired by the convolutional block attention module (CBAM) [49].
CMAF first constructs a channel attention weight matrix using a shared multi-layer perceptron (Shared MLP). Specifically, the features F_{CNN} and F_{Trans} are fed into the Shared MLP, where the input and hidden layers are fully connected to enable feature interaction. After hidden-layer processing and ReLU activation, the compressed representations s_{CNN} and s_{Trans} are passed through another fully connected layer. The two outputs are then summed and passed through a Sigmoid activation to produce the channel attention weight matrix M_c. The process is defined as follows:
s_{CNN} = f_{Relu}(W_0 F_{CNN})
s_{Trans} = f_{Relu}(W_0 F_{Trans})
M_c = \sigma\left( W_1 s_{CNN} + W_1 s_{Trans} \right)
Equations (12) and (13) denote the hidden-layer processing and ReLU activation, while Equation (14) describes the output and Sigmoid activation steps. σ is the Sigmoid function; W_0 and W_1 correspond to the weight matrices of the input layer and output layer, respectively. s_{CNN} is the spatial feature information after hidden-layer processing, s_{Trans} is the spectral feature information after hidden-layer processing, and f_{Relu} is the ReLU function.
After calculating M_c, the original features F_{CNN} and F_{Trans} are each channel-weighted with M_c through matrix multiplication, and the weighted results are stacked to obtain the fused feature F_{fused}:
F_{fused} = \left[ M_c \otimes F_{CNN};\ M_c \otimes F_{Trans} \right]
where ⊗ denotes matrix multiplication and the square brackets denote stacking along the channel dimension.
Although the proposed CMAF module is inspired by classical channel attention mechanisms such as the Squeeze-and-Excitation Network (SE-Net) [50] and CBAM, it differs fundamentally in both design objective and information flow. SE-Net and CBAM generate channel attention weights based on the statistical information of a single feature branch, focusing on self-channel recalibration. In contrast, CMAF jointly models CNN-based features and Transformer-based features through a shared MLP to generate a unified channel attention matrix.
By leveraging cross-module feature interaction, the attention weights in CMAF dynamically reflect the relative contribution of different channels from heterogeneous representations, enabling adaptive fusion of local spatial features and global contextual features. Therefore, CMAF acts as a dynamic cross-module fusion mechanism rather than a conventional self-attention-based feature enhancement module.
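A minimal sketch of the CMAF computation is given below, assuming both branch outputs are flattened to vectors of equal channel dimension; the reduction ratio of the shared MLP is an assumption not stated in the paper.

```python
# Hedged sketch of CMAF: a shared MLP over F_CNN and F_Trans, a joint Sigmoid-gated
# channel attention matrix M_c, and stacking of the two re-weighted features.
import torch
import torch.nn as nn

class CMAF(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.w0 = nn.Linear(channels, channels // reduction)   # shared input/hidden layer W_0
        self.w1 = nn.Linear(channels // reduction, channels)   # shared output layer W_1
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f_cnn, f_trans):                         # both: (B, channels)
        s_cnn = self.relu(self.w0(f_cnn))                      # s_CNN
        s_trans = self.relu(self.w0(f_trans))                  # s_Trans
        m_c = self.sigmoid(self.w1(s_cnn) + self.w1(s_trans))  # M_c, joint channel attention
        return torch.cat([m_c * f_cnn, m_c * f_trans], dim=1)  # weighted features stacked
```

Because the same W_0 and W_1 process both branches, the attention weights are conditioned on the two heterogeneous representations jointly, which is the cross-module behavior described above.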

2.2. Space–Spectral Feature Tensor Generation

SIFANet is designed for multi-modal tensor inputs: the SFE and SST branches require spatial feature tensors and spectral feature tensors, respectively.
For the SFE branch, the input tensor is constructed from multiple image blocks, with each input tensor having the shape X ∈ R^{B × P × P}, where B is the batch size and P is the image block size. As shown in Figure 6, the specific operation for obtaining image blocks is as follows:
First, the original image is padded with a margin of (P − 1)/2 pixels on all sides to ensure that edge pixels are adequately included. Then, a sliding window (also referred to as a mask) of size P × P is applied, starting from the first valid pixel in the top-left corner. The window moves across the entire image with a stride of 1 pixel, excluding padded and unclassified pixels. For each valid central pixel, a corresponding local neighborhood is extracted to form an image block. Each block is represented as X_{Patch} ∈ R^{P × P}, and multiple such blocks are batched together to construct the spatial input tensor.
For the SST branch, as shown in Figure 7, the multi-band spectral reflectance data of each pixel are directly extracted and assembled in batches to construct the spectral feature input tensor X ∈ R^{B × 1 × M}, where M is the number of bands. This retains the spatial location of pixels while highlighting the discriminative features, avoiding spatial neighborhood interference, and enhancing the sensitivity to subtle spectral differences.
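The NumPy sketch below illustrates the tensor generation described in this subsection: padding by (P − 1)/2, sliding a P × P window with stride 1, and collecting each pixel's band vector for the SST branch. Filtering of unclassified pixels and batching are omitted for brevity, and the function name and channels-first layout are illustrative assumptions.

```python
# Illustrative sketch of the spatial/spectral tensor generation (loops kept simple for clarity;
# real code would batch the work and skip unlabeled pixels).
import numpy as np

def make_tensors(img: np.ndarray, P: int = 15):
    """img: (H, W, C) PCA-reduced cube -> spatial patches (N, C, P, P) and
    spectral vectors (N, 1, C), one pair per pixel."""
    h, w, c = img.shape
    pad = (P - 1) // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    patches, spectra = [], []
    for i in range(h):
        for j in range(w):
            block = padded[i:i + P, j:j + P, :]          # P x P neighborhood of pixel (i, j)
            patches.append(block.transpose(2, 0, 1))     # channels first for the SFE branch
            spectra.append(img[i, j, :][None, :])        # (1, C) spectral vector for the SST branch
    return np.stack(patches), np.stack(spectra)
```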

2.3. Evaluation Index System

As illustrated in Figure 8, the proposed EIS incorporates three core modules. The first module utilizes classification accuracy indexes, including class accuracy (CA), overall accuracy (OA), the Kappa coefficient, and standard deviation (S.D.). These indices jointly evaluate the model’s recognition capability for individual categories and its overall classification consistency and stability. The second module specifically addresses the performance evaluation of confusing classes by integrating the precision–recall curve (PRC) and F1-Score into a unified PR–F comprehensive evaluation model. It focuses on the comparative assessment of a model’s discriminatory capacity for challenging, spectrally similar categories. The third module evaluates computational efficiency through the novel CAPE index, which provides a quantitative assessment by systematically integrating key factors including the model’s computational demands, parameter count, achieved classification accuracy, and measured stability performance.

2.3.1. Overall Accuracy

OA represents the proportion of correctly classified samples, calculated as
OA = \frac{T_c}{N}
where T_c denotes the number of correctly classified samples and N denotes the total number of samples.

2.3.2. Kappa Coefficient

The Kappa coefficient measures classification effectiveness while accounting for class imbalance. The calculation formula is as follows:
p_e = \frac{\sum_{i=1}^{m} R_i \times C_i}{N^2}
Kappa = \frac{OA - p_e}{1 - p_e}
where p_e is the chance agreement probability under random classification, m denotes the number of categories, R_i denotes the sum of row i of the confusion matrix, C_i denotes the sum of column i of the confusion matrix, and N denotes the total number of samples.
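For reference, both metrics can be computed directly from a confusion matrix, as in the short helper below (an illustrative utility following the definitions above, not part of the proposed method).

```python
# OA and Kappa from a confusion matrix with rows = reference classes, columns = predictions.
import numpy as np

def oa_and_kappa(cm: np.ndarray):
    n = cm.sum()
    oa = np.trace(cm) / n                                    # OA = T_c / N
    p_e = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n ** 2   # chance agreement p_e
    kappa = (oa - p_e) / (1 - p_e)
    return oa, kappa
```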

2.3.3. PR–F Comprehensive Evaluation Model

For easily confusable ground object types, this article applies the PR–F comprehensive evaluation model, plots the PRC, and reports the F1-Score and average precision (AP) values for auxiliary evaluation.
The F1-Score is a comprehensive metric used to evaluate a model’s performance by balancing precision and recall. Precision refers to the proportion of correctly predicted positive samples among all samples predicted as positive. Recall measures the proportion of correctly predicted positive samples among all actual positive samples, indicating the model’s ability to identify all relevant cases. These metrics are derived from the confusion matrix: true positives (TPs) represent the number of samples correctly predicted as positive; false positives (FPs) are the samples incorrectly predicted as positive; and false negatives (FNs) are the actual positive samples that the model failed to identify. Precision, recall, and F1-Score are calculated as follows:
Precision = \frac{TP}{TP + FP}
Recall = \frac{TP}{TP + FN}
F1\text{-}Score = \frac{2 \times Precision \times Recall}{Precision + Recall}
The precision–recall curve (PRC) is used to evaluate the performance of a model by illustrating how precision and recall change at different classification thresholds. The closer the model’s PRC is to the upper right of the coordinate system, the better the model’s performance. To quantify overall model performance, the average precision (AP) is calculated as the area under the PRC. The closer the AP value is to 1, the better the model balances precision and recall.
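A compact sketch of the PR–F evaluation for a single confusable class using scikit-learn is shown below; the one-versus-rest binary framing and the 0.5 decision threshold are assumptions, not details given in the paper.

```python
# PR-F evaluation for one class: precision-recall curve, F1-Score, and AP (area under the PRC).
import numpy as np
from sklearn.metrics import precision_recall_curve, f1_score, average_precision_score

def pr_f_report(y_true: np.ndarray, y_score: np.ndarray, threshold: float = 0.5):
    """y_true: binary labels for the class of interest; y_score: predicted class scores."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    f1 = f1_score(y_true, (y_score >= threshold).astype(int))   # F1 at the chosen threshold
    ap = average_precision_score(y_true, y_score)                # area under the PRC
    return precision, recall, f1, ap
```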

2.3.4. Computational Accuracy Parameter Efficiency

Traditional single metrics such as OA focus solely on performance, overlooking computational burden and offering limited insight into model applicability [51]. To address this limitation, the CAPE index integrates multiple factors: the number of parameters, the number of floating-point operations (FLOPs), the S.D., and OA. It provides a comprehensive quantitative evaluation of a model’s computational efficiency, classification accuracy, and prediction stability. A higher CAPE value indicates superior model performance—simultaneously reflecting higher accuracy, greater stability, optimized efficiency, and balanced computational design.
To unify measurement scales and introduce a penalization mechanism, the number of parameters (P) and FLOPs (F) are logarithmically transformed using y = \log_{10}(x). Penalization is implemented via the reciprocal:
N_P = \frac{1}{\log_{10}(1 + P)}
N_F = \frac{1}{\log_{10}(1 + F)}
Here, N_P and N_F represent the normalized values of the parameters and FLOPs. When either P or F is large, the corresponding normalized value decreases significantly, penalizing models with excessive computational costs.
The intermediate metric E, representing computational burden, is then calculated as the geometric mean of N_P and N_F. A lower E score indicates higher computational resource consumption:
E = \sqrt{N_P \times N_F}
The final CAPE score is computed from E, OA, and the inverse of the S.D. (1/σ):
CAPE = E \times OA \times \frac{1}{\sigma}
This multiplicative formulation highlights performance differences among models and balances high accuracy with prediction stability. Notably, even models with high OA could receive a low CAPE score if they have a high S.D.
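The CAPE computation reduces to a few lines; the sketch below follows the definitions above, with the caveat that P, F, OA, and σ must be supplied in the same units used in the paper's tables.

```python
# CAPE index following the definitions above (illustrative helper).
import math

def cape(P: float, F: float, oa: float, sigma: float) -> float:
    n_p = 1.0 / math.log10(1.0 + P)          # normalized, penalized parameter count
    n_f = 1.0 / math.log10(1.0 + F)          # normalized, penalized FLOPs
    e = math.sqrt(n_p * n_f)                 # geometric mean: computational burden term E
    return e * oa * (1.0 / sigma)            # CAPE = E * OA * (1 / sigma)
```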

3. Experimental Design and Results

3.1. Dataset

3.1.1. Salinas Dataset

The Salinas dataset (denoted as SL), acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) developed by NASA’s Jet Propulsion Laboratory (Pasadena, CA, USA), covers a portion of the Salinas Valley in California, USA. The image is 217 × 512 pixels, with a spatial resolution of 3.7 m. It originally comprised 224 spectral bands. After removing bands affected by water absorption, 204 spectral bands remain, spanning the 400–1000 nm wavelength range. This dataset primarily includes 16 categories of crops. The corresponding ground truth map is shown in Figure 9.

3.1.2. Xiong’an Hyperspectral Dataset

The Xiong’an dataset (denoted as XA) was acquired using the Full-spectrum Multi-modal Imaging Spectrometer (manufactured by Shanghai Institute of Technical Physics, Chinese Academy of Sciences, Shanghai, China) [52]. It covers an area of approximately 1320 km², including Xiong County, Anxin County, Rongcheng County, and the main water area of Baiyangdian Lake. This article uses a subset of Matiwan Village, sized 3750 × 1580 pixels at 0.5 m spatial resolution, with 256 spectral bands (400–1000 nm). It includes 20 land cover classes, mainly cash crops (Figure 10).
The key characteristics of the two datasets are summarized in Table 1 for a clear comparison.
The Salinas and Xiong’an datasets were selected for their complementary traits: the Salinas dataset, being small-scale with limited samples, tests small-sample learning and performance under scarcity. In contrast, the large-scale Xiong’an dataset assesses generalization ability and computational efficiency. The Salinas dataset, as a benchmark, enables direct comparison with prior works, while Xiong’an, with complex land cover and higher resolution, reflects real-world diversity and tests model adaptability.
To evaluate the performance and robustness of SIFANet under various data constraints, we first conducted a sensitivity analysis by varying the training sample proportions from 1% to 20%. The results, summarized in Table 2, reveal that SIFANet achieves competitive accuracy even with extremely limited training data (e.g., 1% or 5%). As the proportion increases to 20%, the classification performance tends to saturate, indicating that the model has effectively captured the discriminative spectral–spatial features.
The rationale for selecting the 20% ratio as our baseline is twofold. First, high-spectral data typically exhibit strong spatial correlation; using higher ratios (e.g., 30% or 40%) may lead to overfitting on local spatial patterns rather than learning generalized features [53]. Second, the 20% ratio strikes a balance between ensuring deep learning model convergence and simulating real-world scenarios where labeled samples are scarce and expensive to obtain. Based on this analysis, the detailed partitioning of training and testing samples for each dataset is provided in Table 3.
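For illustration, a stratified per-class split at the 20% baseline ratio could be generated as follows; the random seed, the minimum of one training sample per class, and the prior exclusion of background pixels are assumptions rather than details reported in the paper.

```python
# Illustrative stratified split at the 20% training ratio used as the baseline.
import numpy as np

def stratified_split(labels: np.ndarray, ratio: float = 0.2, seed: int = 0):
    """labels: 1-D array of class ids for the labelled pixels (background already excluded).
    Returns index arrays for the training and testing sets."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)                          # random order within the class
        k = max(1, int(round(ratio * idx.size)))  # at least one training sample per class
        train_idx.append(idx[:k])
        test_idx.append(idx[k:])
    return np.concatenate(train_idx), np.concatenate(test_idx)
```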

3.2. Experimental Design

This article employs a controlled variable experimental design, varying only the model type to reduce systematic errors and to enhance comparability. All experiments were implemented using the PyTorch (version 2.6.0) framework. For clarity, the specific structural hyperparameters of SIFANet—including kernel sizes, embedding dimensions, and fusion settings—are listed in Table 4. Specifically, the training process consisted of 200 epochs with a batch size of 256 and an initial learning rate of 0.0001. All experimental results reported are an average of 50 independent experiments.
The EIS proposed in this article is used to assess and compare the classification performance of SIFANet with that of SVM, 3D-CNN, and LSTM. Here, SVM is a typical traditional machine learning model, while 3D-CNN and LSTM are deep learning models. Specifically, 3D-CNN can capture local spatial features through convolution. LSTM excels at processing sequence data. By comparing these diverse benchmarks, this article validates the effectiveness of SIFANet for HSI classification tasks.

3.3. Key Parameter Setting Experiments

3.3.1. Image Block Size

Due to constraints in image width and computing resources, the original HSI is partitioned into multiple blocks of identical size, each with an odd number of pixels per side. Since block size significantly affects model performance [54], the optimal size was determined experimentally. Holding other parameters constant, this article tested block sizes from 7 × 7 to 25 × 25 using a 3D-CNN on both the Salinas and Xiong’an datasets. OA was used for evaluation, with results averaged over 10 runs, as shown in Figure 11.
Block size nonlinearly influences model performance. Both datasets show a consistent trend: OA first rises with increasing block size, peaks, and then declines. Specifically, smaller blocks (e.g., 7 × 7, 9 × 9) lead to insufficient spatial feature extraction due to limited receptive fields, yielding lower OA. At 15 × 15, OA reaches 98.99% (SL) and 99.72% (XA)—gains of 6.1% and 6.35% over 7 × 7. Although XA peaks at 99.79% with 17 × 17, the 0.07% difference from 15 × 15 is statistically insignificant. Beyond 17 × 17, performance drops due to boundary effects and local feature dilution, likely caused by increased background noise, reduced focus on discriminative features, and edge information loss.
Considering both marginal gains and computational cost, 15 × 15 was selected as optimal. It achieves peak accuracy on the Salinas dataset and near-peak on Xiong’an dataset, with high computational efficiency, and served as the input size for all subsequent experiments.

3.3.2. Optimizer and Loss Function

The loss function quantifies the error between the predictions and ground truth, guiding parameter updates. The optimizer’s role is to iteratively adjust parameters to minimize the loss, thereby enhancing predictive capability.
Common optimizers include stochastic gradient descent (SGD) and adaptive moment estimation (Adam). SGD uses random data subsets for gradient updates, offering higher efficiency but potentially causing oscillations or slow convergence. Adam estimates gradient moments to combine momentum and adaptive learning rates, enabling faster and more stable convergence. For loss functions, squared hinge loss (SHL) strengthens misclassification penalties via a squared term to maximize margins, while categorical cross-entropy (CCE) minimizes the Kullback–Leibler divergence between predicted and true distributions, aligning the model with inherent class distributions.
This article experimentally evaluated combinations of these optimizers and loss functions, with results averaged over 10 runs (Figure 12).
“Adam + CCE” achieved the best OA on both Salinas (99.04%) and Xiong’an (99.42%). For a fixed optimizer, CCE consistently outperformed SHL; for a fixed loss, Adam surpassed SGD. Thus, “Adam + CCE” was selected for subsequent experiments.
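The selected configuration corresponds to a standard PyTorch training setup, sketched below with the reported hyperparameters (Adam, cross-entropy loss, 200 epochs, batch size 256, initial learning rate 0.0001); the model and data-loader interfaces are placeholders rather than the authors' training script.

```python
# Hedged sketch of the "Adam + CCE" training configuration.
import torch
import torch.nn as nn

def train(model, train_loader, epochs: int = 200, lr: float = 1e-4, device: str = "cuda"):
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()                        # categorical cross-entropy (CCE)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # Adam optimizer
    for epoch in range(epochs):
        model.train()
        for patches, spectra, labels in train_loader:        # dual-branch inputs, per the paper
            patches, spectra, labels = patches.to(device), spectra.to(device), labels.to(device)
            optimizer.zero_grad()
            logits = model(patches, spectra)
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()
```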

3.4. Experiments Results

Based on the experimental setup discussed above, this article compares the classification performance of SVM, 3D-CNN, LSTM, and SIFANet on the Salinas and the Xiong’an hyperspectral datasets. The results are shown in Table 5 and Table 6. These tables present the classification accuracy in the format of “accuracy ± difference from SIFANet”. Different colors are used to distinguish comparison results: orange indicates accuracy lower than SIFANet, blue indicates accuracy better than SIFANet, and black indicates parity. Red highlights easily confused categories. Box plots of classification accuracy are also provided for enhanced visualization (Figure 13).
SIFANet demonstrates superior performance with an OA of 99.89% (SL)/99.79% (XA), outperforming the second-best 3D-CNN model by 0.84% and 0.37%, respectively. The Kappa coefficient of the proposed SIFANet reaches 99.88% (SL)/99.76% (XA), significantly exceeding the other models. The standard deviation remains low at 0.33% (SL)/0.58% (XA), indicating strong adaptability and robustness across land cover categories, whereas the other models exhibit substantial fluctuations.
Figure 13 shows pronounced performance differences among the models. Ranked by median accuracy, the order is consistently SIFANet > 3D-CNN > LSTM > SVM. SVM exhibits high inter-category variability, with interquartile ranges of 40% (SL) and 100% (XA), and numerous outliers. LSTM outperforms SVM but still shows limitations (e.g., 45% interquartile range on XA). The 3D-CNN maintains relatively compact quartile ranges and high accuracy but has minor outliers. In contrast, SIFANet shows minimal variance, with narrow quartile ranges and peaked distributions, indicating exceptional stability.
Across all land cover categories, SIFANet consistently maintains over 97% accuracy. The SVM model, constrained by linear classification mechanisms, drops to 0% in several categories, with standard deviations of 25.18% (SL)/37.09% (XA). Additionally, both LSTM and 3D-CNN fail to fully capture the spatial–spectral information, causing significant accuracy declines in specific categories such as classes 8, 14, and 15 (SL) and 11, 16, and 17 (XA), where performance substantially lags behind the proposed SIFANet. Focusing on poorly classified categories: for class 8 (Grapes_untrained) in Salinas, SVM’s accuracy is merely 64.16% and LSTM reaches only 80.80%, while SIFANet achieves 99.99%. Similarly, for class 17 of the Xiong’an hyperspectral dataset (Sparse Forest), LSTM accuracy drops to 25.00% and 3D-CNN to 93.33%, whereas SIFANet maintains 99.32% accuracy.
In summary, SIFANet not only sustains high accuracy across all categories but also exhibits minimal fluctuations and stable performance, conclusively demonstrating exceptional hyperspectral feature extraction and discrimination capabilities.

4. Discussion

4.1. Classification Performance and Accuracy Consistency Verification

Some models, although performing well in quantitative metrics such as OA, still show significant discrepancies between these metrics and the actual classification results. This article analyzes the classification result maps of all models and selects typical regions with multi-category boundaries for locally enlarged comparisons, which are shown in Figure 14 and Figure 15.
SVM and LSTM, lacking spatial contextual modeling capabilities, display diffusely distributed misclassification. Severe boundary penetration occurs between spectrally similar categories like “Grapes_untrained” and “Vinyard_untrained”, accompanied by noticeable intra-class noise and boundary blurring.
Although 3D-CNN achieves high accuracy (OA > 99.05%), salt-and-pepper noise appears in the detailed views of both datasets, manifesting as single-pixel-scale categorical mutations. This primarily stems from excessive smoothing of local details in deep networks—an irreversible spatial resolution loss during hierarchical abstraction.
In contrast, SIFANet maintains continuous, clear category boundaries and smooth homogeneous regions in the detailed views. The results validate the effectiveness of the model’s structure design. The spatial–spectral dual-branch could extract complementary features, where residual blocks in the SFE branch effectively alleviate vanishing gradients in deep networks. Concurrently, the Conv-Former module enhances spectral sequence modeling capability. Ultimately, the CMAF module achieves the dynamic adaptive fusion of dual-branch features, minimizing information loss during feature transmission.

4.2. Confusing Classes Classification Performance Comparison

Based on the experimental results in Table 5 and Table 6, this article identifies land cover categories where multiple models perform poorly (i.e., confusing categories) and assesses them using the PR–F comprehensive evaluation model. The experimental results are shown in Figure 16 and Figure 17.
The PRCs reveal significant fluctuations in SVM’s accuracy at varying recall thresholds, characterized by pronounced sawtooth patterns. On the large-scale Xiong’an hyperspectral dataset, LSTM exhibits similar abrupt curve variations. Conversely, 3D-CNN and SIFANet display nearly overlapping trajectories on the Salinas dataset. However, as data complexity increases in Xiong’an, SIFANet maintains superior accuracy through its spatial–spectral fusion mechanism, while 3D-CNN shows significant deviation and accuracy degradation—confirming SIFANet’s efficacy in high-difficulty tasks.
F1-Score and AP values validate PRC observations. SIFANet consistently achieves optimal performance in imbalanced datasets, demonstrating robust results: F1-Score > 0.99 and saturated AP values across all Salinas categories, while maintaining > 0.92 F1-Score and >0.95 AP in Xiong’an dataset. Though 3D-CNN performs second best in most scenes, its F1-Score and AP decrease markedly (20.3% and 14.5%, respectively) when processing challenging features like “Sparse Forest” compared to other confusing categories. SVM and LSTM perform notably worse in the Xiong’an dataset with the F1-Score and AP dropping below 0.5 and even reaching zero for categories like “Black locust” and “Sparse Forest”.
It is worth noting that the Salinas dataset suffers from class imbalance, where several land cover categories contain only a limited number of training samples. Such imbalance often leads to classification bias toward majority classes, particularly for models that rely on single-branch feature representations. The proposed CMAF module helps mitigate this issue by performing dynamic cross-module feature fusion between the spatial-oriented SFE branch and the spectral sequence modeling SST branch. By jointly learning channel-wise attention weights from heterogeneous features, CMAF adaptively enhances discriminative cues that are critical for minority classes, which are often characterized by subtle spectral differences or limited spatial support. This dynamic fusion mechanism contributes to more stable precision–recall behavior and the consistently high F1-Score and AP values across confusing and low-sample categories in the Salinas dataset, effectively reducing the model’s bias toward majority classes.

4.3. Evaluation of Model Calculation Efficiency

Pursuing higher accuracy and stability under the same hardware conditions has always been an inevitable trend in model development. Table 7 summarizes the number of parameters and FLOPs of each model. Inputting the model efficiency parameter data from Table 7 into the CAPE value calculation formula yields the results shown in Figure 18.
SIFANet’s CAPE values of 49.93 (SL) and 27.91 (XA) are significantly higher than those of 3D-CNN (SL: 18.06/XA: 9.29), LSTM (SL: 2.91/XA: 0.91), and SVM (SL: 1.47/XA: 0.75). The results indicate that SIFANet excels in comprehensive performance across multiple dimensions, validating its adaptability advantage in complex hyperspectral data scenarios. Furthermore, the CAPE index provides a pragmatic benchmark for model selection in real-world deployment. In resource-constrained environments such as drones or edge devices, researchers can leverage CAPE to quantify the “accuracy gain per unit of computational cost”. This facilitates the identification of models that strike an optimal balance between inference speed and diagnostic precision.

4.4. Ablation Experiment

It is necessary to conduct ablation experiments to validate the effectiveness of the three core modules—SFE, SST, and CMAF. The results are shown in Table 8. Different colors in the table are used to distinguish accuracy comparison results: orange-marked values indicate relative accuracy lower than the SIFANet benchmark. “√” indicates that this structure is enabled in the experiment, while “×” indicates that it is disabled. Experiment 1 served as the baseline control group, incorporating all SFE, SST, and CMAF modules. The experimental design followed a progressive decoupling approach to analyze module interactions.
Validation of SFE/SST independent effectiveness used Experiment 1 (SFE + SST + CMAF) as the baseline. Performance comparisons with Experiment 2 (SST + CMAF only) and Experiment 3 (SFE + CMAF only) confirmed that adding either SFE or SST significantly enhanced classification accuracy. Notably, the proposed SFE module outperformed the conventional 3D-CNN due to its multi-scale convolution kernels and residual connections that effectively mitigate deep feature degradation.
Comparing Experiments 2/3 (“single module + CMAF”) against Experiments 5/6 (“single module only”) shows higher OA and Kappa coefficients for the former, demonstrating CMAF’s ability to extract valuable information from the SFE/SST branches while reducing interference from ineffective features. Contrasting Experiment 1 (CMAF) with Experiment 4 (simple Add fusion) confirms that CMAF achieves complementary spatial–spectral information integration, improving OA by 4.77% (SL)/4.88% (XA) and Kappa by 5.29% (SL)/5.72% (XA). This superiority stems from CMAF’s attention-driven gating mechanism. Unlike simple element-wise addition, which treats all features with equal importance and may lead to “feature dilution” or noise propagation, the attention-based strategy dynamically re-weights the feature maps. It prioritizes discriminative spatial–spectral components while effectively suppressing redundant or non-informative features from individual branches. This adaptive fusion ensures a more precise alignment and integration of heterogeneous information.
In summary, each SIFANet module demonstrates a scientifically sound design with significant synergistic effects, collectively enhancing classification accuracy.

5. Conclusions

This article proposes a dual-branch collaborative network, SIFANet, for HSI classification, based on spectral integration and focused attention mechanisms. Specifically, the SFE and SST modules extract spatial and spectral features in parallel and efficiently. Subsequently, the proposed CMAF module dynamically fuses the feature information from both branches to enhance image detail representation and class discriminability. The proposed network significantly reduces feature loss and effectively alleviates insufficient feature extraction, as it learns the correlation of ground objects from multi-dimensional image feature spaces. As a result, SIFANet significantly outperforms the comparative models in accuracy assessment metrics such as CA, S.D., and CAPE.
However, the dual-branch architecture and attention mechanisms increase the computational complexity, which may limit its efficiency in real-time applications. Future work will focus on developing lightweight modules to reduce computational costs and exploring self-supervised learning to improve the model’s robustness in scenarios with limited labeled samples.

Author Contributions

Conceptualization, Y.G. and L.X.; methodology, Y.G. and L.X.; validation, Y.G.; writing—original draft preparation, Y.G.; writing—review and editing, L.X., D.M., Y.W., and M.H.; visualization, Y.G.; funding acquisition, L.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was jointly supported by the Beijing Natural Science Foundation under grant L251045, the Deep Earth Probe and Mineral Resources Exploration—National Science and Technology Major Project under grant 2024ZD1002100, and the Fundamental Research Funds in China University of Geoscience (Beijing) under grant 590125018.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.

Figure 1. Workflow of the SIFANet.
Figure 2. Spatial feature extractor.
Figure 3. Spectral sequence Transformer.
Figure 4. Improved Conv-Former encoder structure.
Figure 5. Cross-module attention fusion.
Figure 6. Spatial feature tensor generation.
Figure 7. Spectral feature tensor generation.
Figure 8. Evaluation index system.
Figure 9. Salinas dataset ground truth.
Figure 10. Xiong’an hyperspectral dataset ground truth.
Figure 11. The impact of selected image block size on OA.
Figure 12. The impact of optimizers and loss functions on OA.
Figure 13. Box plots of classification accuracy. (a) Salinas dataset. (b) Xiong’an hyperspectral dataset.
Figure 14. Classification results and details of each model on the Salinas dataset.
Figure 15. Classification results and details of each model on the Xiong’an hyperspectral dataset.
Figure 16. PR–F evaluation results on Salinas dataset. (a) Class 8 (Grapes_untrained) performance. (b) Class 14 (Lettuce_romaine_7wk) performance. (c) Class 15 (Vinyard_untrained) performance.
Figure 17. PR–F evaluation results on Xiong’an hyperspectral dataset. (a) Class 11 (Black locust) performance. (b) Class 16 (Vegetable field) performance. (c) Class 17 (Sparse forest) performance.
Figure 18. Comparison of CAPE values of each model.
Table 1. Key characteristics of the two datasets.
Dataset | Image Size (pixel) | Spatial Resolution (m) | Wavelength Range (nm) | Bands | Classes
Salinas | 217 × 512 | 3.7 | 400–1000 | 204 | 16
Xiong’an | 3750 × 1580 | 0.5 | 400–1000 | 256 | 20
Table 2. OA of SIFANet under different training sample proportions. (Unit: %).
Dataset | 1% | 5% | 10% | 15% | 20%
Salinas | 91.02 | 97.45 | 98.92 | 99.34 | 99.89
Xiong’an | 85.30 | 94.18 | 97.05 | 98.42 | 99.79
Table 3. Sample sizes for training and testing in the dataset.
No. | Salinas Class Name | Training | Test | Xiong’an Class Name | Training | Test
1 | Broccoli_green_weeds_1 | 399 | 1610 | Compound-leaved maple | 45,153 | 180,764
2 | Broccoli_green_weeds_2 | 784 | 2942 | Willow | 35,829 | 144,124
3 | Fallow | 381 | 1595 | Elm | 3067 | 12,222
4 | Fallow_rough_plow | 284 | 1110 | Rice | 90,871 | 361,574
5 | Fallow_smooth | 535 | 2143 | Pagoda tree | 94,755 | 380,562
6 | Stubble | 798 | 3161 | Fraxinus | 34,056 | 135,125
7 | Celery | 730 | 2849 | Golden rain tree | 4629 | 18,695
8 | Grapes_untrained | 2222 | 9049 | Water | 33,125 | 132,610
9 | Soil_vinyard_develop | 1226 | 4977 | Bare soil | 7800 | 30,611
10 | Corn_senesced_green_weeds | 626 | 2652 | Rice stubble | 38,763 | 155,348
11 | Lettuce_romaine_4wk | 229 | 839 | Black locust | 1093 | 4471
12 | Lettuce_romaine_5wk | 390 | 1537 | Corn | 11,675 | 47,331
13 | Lettuce_romaine_6wk | 174 | 742 | Pear tree | 205,019 | 820,983
14 | Lettuce_romaine_7wk | 230 | 840 | Soybean | 1437 | 5711
15 | Vinyard_untrained | 1463 | 5805 | Poplar | 18,290 | 73,019
16 | Vinyard_vertical_trellis | 354 | 1453 | Vegetable field | 5875 | 23,138
17 | | | | Sparse forest | 297 | 1218
18 | | | | Grassland | 84,691 | 337,850
19 | | | | Peach tree | 13,070 | 52,552
20 | | | | Building | 5926 | 23,781
Total | | 10,825 | 43,304 | | 735,421 | 2,941,689
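For reference, drawing a fixed proportion of labeled pixels per class, as reflected in the counts above, can be done with a stratified split. The snippet below is a generic sketch that assumes the labeled pixels have already been flattened into a feature matrix X and label vector y; the variable names and toy sizes are illustrative, not taken from the paper’s code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy stand-ins for flattened labeled HSI pixels: 1000 samples, 204 bands, 16 classes.
X = rng.normal(size=(1000, 204)).astype(np.float32)
y = rng.integers(0, 16, size=1000)

# A stratified 20%/80% split keeps the per-class training proportion close to 20%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=42
)

for c in np.unique(y):
    print(f"class {c}: train={np.sum(y_train == c)}, test={np.sum(y_test == c)}")
```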
Table 4. Detailed configuration of SIFANet architecture.
Module | Parameter/Component | Value/Configuration
SFE Branch | Structure | Conv + MaxPool + Residual Block
SFE Branch | Conv Kernel Size | 3 × 3
SFE Branch | Max Pooling | Window 2 × 2, Stride 2
SFE Branch | Residual Block | 2 Layers (3 × 3 Conv + BN + ReLU)
SST Branch | Structure | Conv-Former
SST Branch | Encoder Layers | 3
SST Branch | Attention Heads | 4
SST Branch | Feed-Forward Network | CFFN (1 × 1 Conv)
SST Branch | Embedding Dimension | 64
CMAF | Interaction Mechanism | Shared MLP
CMAF | Activation Function | Sigmoid
CMAF | Hidden Layer Units | 32
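Read as a blueprint, Table 4 maps naturally onto a dual-branch module skeleton. The sketch below is an illustrative PyTorch interpretation of those settings; the class names, patch size, band count, and the way the spectral sequence is formed are assumptions for the example, not the released implementation.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 Conv + BN + ReLU layers with an identity shortcut (Table 4, SFE branch)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + x)

class SpatialBranch(nn.Module):
    """Conv -> MaxPool (2x2, stride 2) -> Residual Block, loosely following the SFE row."""
    def __init__(self, in_bands, channels=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_bands, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.res = ResidualBlock(channels)

    def forward(self, x):                      # x: (B, bands, H, W)
        return self.res(self.stem(x))          # (B, channels, H//2, W//2)

class SpectralBranch(nn.Module):
    """Band-wise token embedding + Transformer encoder (3 layers, 4 heads, dim 64)."""
    def __init__(self, in_bands, dim=64, layers=3, heads=4):
        super().__init__()
        self.embed = nn.Linear(1, dim)         # each band value becomes one token
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.pool = nn.AdaptiveAvgPool1d(1)

    def forward(self, spectra):                # spectra: (B, bands)
        tokens = self.embed(spectra.unsqueeze(-1))          # (B, bands, dim)
        enc = self.encoder(tokens)                          # (B, bands, dim)
        return self.pool(enc.transpose(1, 2)).squeeze(-1)   # (B, dim)

class DualBranchClassifier(nn.Module):
    """Toy dual-branch classifier: gated fusion of pooled spatial and spectral features."""
    def __init__(self, in_bands, n_classes, dim=64, hidden=32):
        super().__init__()
        self.spatial = SpatialBranch(in_bands, dim)
        self.spectral = SpectralBranch(in_bands, dim)
        self.gate = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(inplace=True),
                                  nn.Linear(hidden, dim), nn.Sigmoid())
        self.head = nn.Linear(dim, n_classes)

    def forward(self, patch):                  # patch: (B, bands, H, W)
        f_spa = self.spatial(patch).mean(dim=(2, 3))     # pooled spatial features, (B, dim)
        f_spe = self.spectral(patch.mean(dim=(2, 3)))    # patch-averaged spectrum as the token sequence
        g = self.gate(torch.cat([f_spa, f_spe], dim=1))
        return self.head(g * f_spa + (1 - g) * f_spe)

if __name__ == "__main__":
    model = DualBranchClassifier(in_bands=204, n_classes=16)
    logits = model(torch.randn(2, 204, 13, 13))
    print(logits.shape)                        # torch.Size([2, 16])
```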
Table 5. Results of Salinas dataset. (Unit: %).
No. | SVM | Δ | 3D-CNN | Δ | LSTM | Δ | SIFANet
1 | 96.56 | −3.44 | 100.00 | 0.00 | 100.00 | 0.00 | 100.00
2 | 91.74 | −8.16 | 100.00 | 0.10 | 99.50 | −0.40 | 99.90
3 | 88.84 | −10.97 | 99.07 | −0.74 | 97.12 | −2.69 | 99.81
4 | 98.18 | −0.56 | 98.93 | 0.19 | 99.04 | 0.30 | 98.74
5 | 83.31 | −16.60 | 98.82 | −1.09 | 99.02 | −0.89 | 99.91
6 | 99.97 | −0.03 | 99.87 | −0.13 | 100.00 | 0.00 | 100.00
7 | 92.92 | −6.94 | 100.00 | 0.14 | 99.93 | 0.07 | 99.86
8 | 64.16 | −35.83 | 99.04 | −0.95 | 80.80 | −19.19 | 99.99
9 | 86.58 | −13.42 | 99.88 | −0.12 | 99.13 | −0.87 | 100.00
10 | 69.74 | −30.18 | 99.62 | −0.30 | 96.39 | −3.53 | 99.92
11 | 0.00 | −99.41 | 100.00 | 0.59 | 98.25 | −1.16 | 99.41
12 | 66.84 | −32.70 | 98.93 | −0.61 | 97.25 | −2.29 | 99.54
13 | 91.30 | −8.56 | 100.00 | 0.14 | 98.30 | −1.56 | 99.86
14 | 94.24 | −5.41 | 97.89 | −1.76 | 96.40 | −3.25 | 99.65
15 | 58.87 | −41.13 | 96.35 | −3.65 | 78.02 | −21.98 | 100.00
16 | 99.01 | −0.85 | 100.00 | 0.14 | 99.86 | 0.00 | 99.86
OA | 79.18 | −20.71 | 99.05 | −0.84 | 92.36 | −7.53 | 99.89
Kappa | 76.55 | −23.33 | 98.95 | −0.93 | 91.48 | −8.40 | 99.88
S.D. | 25.18 | −24.85 | 1.00 | −0.67 | 6.68 | −6.35 | 0.33
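As a reference for how the summary rows in Tables 5 and 6 can be computed, the snippet below derives OA, the Kappa coefficient, per-class accuracy (CA), and the standard deviation of the per-class accuracies from predicted and true labels using scikit-learn. It is a generic metrics sketch with toy labels, not the evaluation script used for the paper.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score

# Toy predictions for a 4-class problem; replace with real test labels/predictions.
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 3, 3, 3])
y_pred = np.array([0, 1, 1, 1, 1, 2, 2, 3, 3, 2])

cm = confusion_matrix(y_true, y_pred)

# Overall accuracy (OA): fraction of correctly classified samples.
oa = np.trace(cm) / cm.sum()

# Per-class accuracy (CA): diagonal divided by the true count of each class.
ca = np.diag(cm) / cm.sum(axis=1)

# Kappa coefficient: agreement corrected for chance.
kappa = cohen_kappa_score(y_true, y_pred)

print(f"OA         = {100 * oa:.2f}%")
print(f"Kappa      = {100 * kappa:.2f}%")
print(f"CA         = {np.round(100 * ca, 2)}")
print(f"S.D. of CA = {100 * ca.std(ddof=0):.2f}%")
```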
Table 6. Results of Xiong’an hyperspectral dataset. (Unit: %).
No. | SVM | Δ | 3D-CNN | Δ | LSTM | Δ | SIFANet
1 | 33.20 | −66.52 | 99.53 | −0.19 | 78.48 | −21.24 | 99.72
2 | 51.81 | −48.10 | 99.84 | −0.07 | 77.48 | −22.43 | 99.91
3 | 0.00 | −99.92 | 99.61 | −0.31 | 78.27 | −21.65 | 99.92
4 | 93.80 | −6.19 | 99.97 | −0.02 | 98.11 | −1.88 | 99.99
5 | 40.49 | −59.40 | 99.31 | −0.58 | 72.22 | −27.67 | 99.89
6 | 52.00 | −47.94 | 99.72 | −0.22 | 77.28 | −22.66 | 99.94
7 | 85.99 | −13.99 | 99.87 | −0.11 | 88.03 | −11.95 | 99.98
8 | 89.17 | −10.82 | 99.93 | −0.06 | 97.06 | −2.93 | 99.99
9 | 75.15 | −24.84 | 100.00 | 0.01 | 98.61 | −1.38 | 99.99
10 | 86.08 | −13.92 | 100.00 | 0.00 | 98.34 | −1.66 | 100.00
11 | 0.00 | −99.62 | 97.05 | −2.57 | 54.95 | −44.67 | 99.62
12 | 1.19 | −98.39 | 98.46 | −1.12 | 69.14 | −30.44 | 99.58
13 | 53.59 | −46.07 | 99.05 | −0.61 | 74.35 | −25.31 | 99.66
14 | 0.00 | −99.25 | 99.37 | 0.12 | 73.71 | −25.54 | 99.25
15 | 58.06 | −41.50 | 99.03 | −0.53 | 72.60 | −26.96 | 99.56
16 | 82.67 | −14.93 | 94.82 | −2.78 | 70.46 | −27.14 | 97.60
17 | 0.00 | −99.32 | 93.33 | −5.99 | 25.00 | −74.32 | 99.32
18 | 45.36 | −54.45 | 99.54 | −0.27 | 74.37 | −25.44 | 99.81
19 | 0.00 | −99.78 | 99.46 | −0.32 | 74.68 | −25.10 | 99.78
20 | 97.84 | −1.60 | 99.60 | 0.16 | 95.87 | −3.57 | 99.44
OA | 59.11 | 40.68 | 99.42 | −0.37 | 80.19 | −19.60 | 99.79
Kappa | 50.80 | −48.96 | 99.33 | −0.43 | 76.83 | −22.93 | 99.76
S.D. | 37.09 | 36.51 | 1.95 | 1.37 | 18.54 | 17.96 | 0.58
Table 7. Efficiency parameters of each model.
Index | SVM | 3D-CNN | LSTM | SIFANet
Parameters | 160 | 58,992 | 56,080 | 269,873
FLOPs (×10³) | 0.12 | 2005.248 | 54.016 | 7979.392
OA, SL (%) | 79.18 | 99.05 | 92.36 | 99.89
OA, XA (%) | 59.11 | 99.42 | 80.19 | 99.79
S.D., SL (%) | 25.18 | 1.00 | 6.68 | 0.33
S.D., XA (%) | 37.09 | 1.95 | 18.54 | 0.58
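Parameter and operation counts of the kind reported in Table 7 can be reproduced for any PyTorch model with a few lines. The sketch below counts trainable parameters directly and, assuming the optional thop package is available in the environment, estimates multiply-accumulate operations for a dummy input; the model and input shape are placeholders, and different profilers may report FLOPs or MACs under slightly different conventions.

```python
import torch
import torch.nn as nn

model = nn.Sequential(            # placeholder model; substitute the network under test
    nn.Conv2d(204, 64, 3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 16),
)

# Trainable parameter count, as reported in the "Parameters" row.
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Parameters: {n_params:,}")

# Optional operation count with thop (if installed); thop reports MACs, which
# are often converted to FLOPs by multiplying by two.
try:
    from thop import profile
    dummy = torch.randn(1, 204, 13, 13)
    macs, params = profile(model, inputs=(dummy,), verbose=False)
    print(f"MACs: {macs / 1e3:.3f} x 10^3, params: {params:,.0f}")
except ImportError:
    print("thop not installed; skipping operation-count estimation")
```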
Table 8. Ablation experiment results (values in parentheses: difference from Experiment 1). (Unit: %).
No. | Modules | OA (SL) | Kappa (SL) | OA (XA) | Kappa (XA)
1 | SFE + SST + CMAF | 99.89 | 99.88 | 99.79 | 99.76
2 | Single branch + CMAF | 94.33 (−5.56) | 93.12 (−6.76) | 90.61 (−9.18) | 88.23 (−11.53)
3 | Single branch + CMAF | 99.26 (−0.63) | 99.33 (−0.55) | 99.54 (−0.25) | 99.41 (−0.35)
4 | SFE + SST (Add fusion) | 95.12 (−4.77) | 94.59 (−5.29) | 94.91 (−4.88) | 94.04 (−5.72)
5 | Single branch only | 99.15 (−0.74) | 99.05 (−0.83) | 99.51 (−0.28) | 99.41 (−0.35)
6 | Single branch only | 91.74 (−8.15) | 90.01 (−9.87) | 89.84 (−9.95) | 88.26 (−11.50)