1. Introduction
The hyperspectral image (HSI) is a three-dimensional data cube acquired by remote sensing platforms (such as aircraft or satellites) using push-broom or staring imaging techniques [1,2]. It contains two-dimensional spatial images corresponding to multiple spectral bands. Unlike traditional multispectral images, HSIs collect continuous spectral information from hundreds of narrow bands. These bands span the visible, near-infrared, and shortwave infrared regions (400–2500 nm) of the electromagnetic spectrum. As a result, each pixel in the HSI contains reflectance values across hundreds of spectral channels, forming a continuous spectral curve [3]. This enables HSIs to provide rich spatial–spectral information [4]. In recent years, the rapid development of computer technology has greatly advanced HSI classification tasks. HSIs have been widely applied in agricultural production [5,6], urban planning [7,8], environmental science [9,10], and other fields.
HSI classification faces challenges from three aspects: data, model training, and application. At the data level, the typical 200–300 spectral bands of HSIs lead to high dimensionality, raising computational complexity [11]. Moreover, the limited availability of ground-truth training samples results in data sparsity in the high-dimensional feature space, violating statistical learning assumptions and giving rise to the Hughes phenomenon. Studies have shown that when the ratio of feature dimensionality to sample size exceeds a critical threshold, the generalization performance of classifiers such as the Support Vector Machine (SVM) and K-Nearest Neighbor (KNN) declines sharply due to overfitting. In addition, strong inter-band correlations introduce substantial redundant information into models, which hinders effective learning and training [12]. At the model training level, large training sample sizes demand substantial computational resources, significantly increasing training costs [13]. In practical applications, HSI preprocessing involves complex procedures such as radiometric and atmospheric correction, geometric registration, and manual sample labeling [14]. To further improve the quality of hyperspectral data, enhanced deep image priors have been applied to unsupervised hyperspectral image super-resolution [15], effectively enhancing spatial resolution and reducing noise without the need for large-scale training datasets. Furthermore, cross-scene model transfer issues caused by varying imaging angles and sensor inconsistencies present significant challenges [16]. These factors greatly affect HSI classification performance in practical applications. In response, researchers have proposed and refined various classification algorithms to address these challenges, including traditional machine learning methods, deep learning approaches, and multi-model fusion methods.
Traditional machine learning methods depend on manual feature engineering. The SVM uses kernel functions to map features into a high-dimensional space and performs especially well in small-sample scenarios. Random forest (RF), using bagging and random subspaces, effectively handles high-dimensional nonlinear data. The KNN algorithm classifies samples by distance-based majority voting. Although KNN is simple, it faces limitations in high-dimensional spaces due to the “curse of dimensionality”, which makes it difficult to achieve satisfactory classification results; it is also sensitive to parameters and computationally inefficient. Champa et al. [17] proposed a hybrid technique that combines tree-based classifiers with feature dimensionality reduction. The results show that decision trees (DTs), extra trees (ETs), and RF perform well in HSI classification.
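As a point of reference for how these pixel-wise baselines operate, the following sketch trains SVM, RF, and KNN classifiers on individual spectral vectors using scikit-learn. The cube dimensions, labels, and hyperparameters are hypothetical placeholders rather than settings from any of the cited studies.

```python
# Illustrative only: pixel-wise HSI baselines with SVM, RF, and KNN on a synthetic cube.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

H, W, B, n_classes = 64, 64, 200, 9                      # hypothetical cube size and class count
cube = np.random.rand(H, W, B)                           # stand-in for an HSI data cube
labels = np.random.randint(0, n_classes, size=(H, W))    # stand-in ground-truth map

X = cube.reshape(-1, B)                                  # each pixel is one spectral vector
y = labels.ravel()
X = StandardScaler().fit_transform(X)                    # band-wise standardization
# Small training fraction to mimic the limited-label setting discussed above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.9, stratify=y, random_state=0)

models = {
    "SVM (RBF kernel)": SVC(kernel="rbf", C=10, gamma="scale"),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, "OA:", accuracy_score(y_te, model.predict(X_te)))
```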
Compared to traditional methods, deep learning approaches offer better adaptability for HSI classification tasks. Convolutional neural networks (CNNs) automatically extract spatial–spectral features, mimicking biological visual mechanisms [18]. Residual neural networks (ResNet) incorporate residual structures for cross-layer information transfer, which effectively mitigates the vanishing gradient problem [19]. Li et al. [20] introduced the deep belief network (DBN) for feature extraction and image classification. Zhao et al. [21] proposed a feature-level fusion classification framework combining CNN with texture features, utilizing HSI and LiDAR data to achieve high classification accuracy. For sequential spectral data, the recurrent neural network (RNN) captures spectral sequence correlations via recursive feedback mechanisms [22], and long short-term memory (LSTM) networks use gating units to alleviate long-term dependency issues [23]. Recently, the Transformer has gained popularity in natural language processing (NLP), improving semantic extraction through global dependency modeling [24]. Inspired by this, researchers have integrated attention mechanisms into HSI classification to enhance accuracy by strengthening feature interactions. Hong et al. [25] proposed the SpectralFormer network, which uses self-attention to model long-range spectral dependencies. However, its unimodal (spectral-only) structure neglects the critical spatial information of HSIs, limiting its effectiveness. To address this issue, Wang et al. [26] introduced a 3D attention mechanism to capture joint spectral–spatial features. Ding et al. [27] developed a global–local Transformer network (GLT-Net) for joint classification using multi-scale feature fusion. However, inherent architectural constraints limit the performance of single-structure models. Specifically, CNNs rely on local receptive fields, which make them effective at capturing local textures but less capable of modeling long-range spatial dependencies or the continuous sequential correlation of spectral bands. Conversely, while Transformers excel at global dependency modeling via self-attention, they lack the spatial inductive biases (such as translation invariance and locality) inherent in CNNs. This often leads to difficulties in capturing fine-grained local spatial structures and results in quadratic computational complexity when dealing with high-dimensional spectral sequences. Consequently, deep learning models still face two key limitations: they require large amounts of annotated data, which is costly and time-consuming to obtain [28], and they involve high computational complexity, hindering practical deployment [29]. To alleviate this, transductive few-shot learning with enhanced spectral–spatial embedding [30] has been proposed to achieve robust classification with minimal labeled samples.
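To make the contrast between local convolution and global attention concrete, the sketch below illustrates in PyTorch how self-attention can model long-range interactions across spectral band groups. It is a conceptual toy in the spirit of Transformer-based spectral modeling, not an implementation of SpectralFormer or any other cited network, and all layer sizes are assumed.

```python
# Conceptual sketch (not SpectralFormer's implementation): self-attention over
# spectral band groups, so distant bands can interact directly.
import torch
import torch.nn as nn

class SpectralAttentionSketch(nn.Module):
    def __init__(self, bands=200, group=10, dim=64, heads=4, n_classes=9):
        super().__init__()
        assert bands % group == 0
        self.group = group
        self.embed = nn.Linear(group, dim)                      # band group -> token embedding
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                   dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, spectra):                                 # spectra: (batch, bands)
        tokens = spectra.view(spectra.size(0), -1, self.group)  # (batch, n_groups, group)
        tokens = self.encoder(self.embed(tokens))               # global band-to-band attention
        return self.head(tokens.mean(dim=1))                    # pooled class logits

logits = SpectralAttentionSketch()(torch.randn(8, 200))
print(logits.shape)  # torch.Size([8, 9])
```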
In recent years, multi-model fusion methods have shown great promise in pattern recognition and image classification by leveraging the complementary strengths of different structures. Researchers optimize feature extraction and processing by combining various models. For example, Wei et al. [31] proposed a strategy that uses CNNs for feature extraction, followed by SVM classification. This approach not only improves accuracy but also reduces overfitting and alleviates reliance on large sample datasets. Combining multi-scale feature fusion with gradient boosting decision trees (GBDTs) has also proven effective: this ensemble approach integrates multiple weak learners to enhance classification stability and accuracy, especially in complex land cover scenarios [32]. To better exploit the complementary nature of spatial and spectral information, researchers have increasingly turned to dual-branch architectures. For example, the Double-Branch Dual-Attention (DBDA) network [33] utilizes two independent branches to capture spatial and spectral features, respectively, enhanced by dual-attention mechanisms. Similarly, the Multiscale Neighborhood Attention Transformer (MSNAT) [34] further refines this design by extracting multi-scale features to adapt to varying land cover sizes. To address feature extraction from a frequency perspective, the Dual Frequency Transformer Network (DFTN) [35] was developed. It utilizes a dual-branch frequency domain feature extraction block to simultaneously capture high-frequency local details and low-frequency global variations, demonstrating the effectiveness of frequency domain information in enhancing HSI classification. Alkhatib et al. [36] proposed an attention-based dual-branch network that fuses features from a real-valued neural network (RVNN) and a complex-valued neural network (CVNN); by using the Fourier transform to extract frequency information, this model enhances HSI classification performance. While these models have significantly improved classification accuracy, they often face challenges such as high parameter redundancy and suboptimal fusion of heterogeneous features from the two branches. Many researchers have focused on efficient spatial–spectral feature extraction. Yang et al. [3] introduced the multi-scale hybrid CNN–attention (MS-Hybrid-A) network, which uses 3D convolutions for spectral–spatial feature extraction, supplements them with spatial details extracted by 2D convolutions, and incorporates a convolutional block attention module (CBAM) to improve classification performance. Liang et al. [37], inspired by the Transformer, proposed HSI-Mixer, which utilizes a hybrid measurement-based linear projection (HMLP) module for deep spectral–spatial feature fusion. Kong et al. [38] developed a co-feature extraction framework that integrates graph embedding with deep learning. They constructed a supervised within-class/between-class hypergraph (SWBH) for spectral feature learning and introduced a random zero-masking strategy to generate augmented labeled samples, which facilitates CNN-based spatial feature extraction and mitigates overfitting in small-sample settings. Ahmad et al. [39] proposed the spatial morphological Mamba (SMM) and spatial–spectral morphological Mamba (SSMM) networks, which employ depth-wise separable convolutions to implement morphological operations such as erosion and dilation. By leveraging State Space Models (SSMs), these Mamba-based approaches achieve linear computational complexity while effectively modeling long-range dependencies, offering a promising, efficient alternative to standard Transformers for processing high-dimensional spectral sequences. To address the challenge of cross-layer information loss, Chen et al. [40] designed a hybrid pooling attention (HPA) module and a cross-layer feature fusion (CFF) module to preserve crucial information during propagation. Gao et al. [41] introduced a plug-and-play adaptive feature fusion (AFF) module that processes multi-layer networks to better utilize spatial and spectral features. Guo et al. [42] proposed an adaptive score-weighting method to fuse features from spatial and spectral branches. Similarly, the concept of reference-based adaptive modulation has been explored in multi-style fusion tasks [43], providing insights into the dynamic adjustment of features across different domains.
In conclusion, traditional machine learning methods such as SVM and RF perform well on high-dimensional data when only a few labeled samples are available. However, they have limited capacity to model nonlinear relationships and thus struggle to process large-scale, high-dimensional data with complex noise. Deep learning methods, by contrast, greatly improve feature representation through hierarchical abstraction. Nevertheless, they require large quantities of high-quality labeled samples and bring drawbacks such as long training times and high computational cost. Furthermore, multi-model fusion methods can mitigate overfitting effectively and reduce dependence on labeled samples. However, they pose challenges in designing compatible modules and incur heavy computational overhead.
The existing research still lacks effective methods to dynamically fuse spatial and spectral features for HSI classification, and the number of model parameters and the computational overhead remain too high. In the broader field of remote sensing, highly efficient architectures like SFEARNet [44] have successfully combined semantic flow and edge-aware refinement for tasks such as change detection, demonstrating the importance of structural information in efficient network design. Furthermore, the hybrid CNN–Transformer architecture has been widely validated in diverse remote sensing applications, including cloud detection using Landsat/Sentinel data [45], wind speed sensing using Global Navigation Satellite System Reflectometry (GNSS-R) [46], and chlorophyll concentration inversion using SeaWiFS data [47]. These successes in handling complex, multi-modal remote sensing data provide a strong rationale for adopting a hybrid design to capture both local spatial details and global spectral dependencies in HSI classification. Building on the need for efficiency, this article proposes a dual-branch network called the spectral integration and focused attention network (SIFANet). SIFANet is built on a hybrid CNN–Transformer structure and incorporates a channel attention mechanism to emphasize important spectral bands. By optimizing feature extraction and fusion, SIFANet aims to improve HSI classification accuracy while maintaining low parameter counts and computational overhead.
The main contributions of this article are as follows:
Efficient feature extraction structure
This article designs a dual-branch network composed of a spatial feature extractor (SFE) and a spectral sequence Transformer (SST). The SFE is enhanced with residual blocks (RBs) to alleviate the vanishing gradient problem and accelerate convergence. Simultaneously, the SST incorporates a Conv-Former module to improve spectral feature extraction, enabling the efficient and parallel extraction of spatial–spectral features.
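The following PyTorch sketch illustrates the general dual-branch layout described above, with a spatial branch and a spectral-sequence branch running in parallel on the same input patch. The layers shown are placeholders: the actual SFE residual blocks and SST Conv-Former module are defined in the methodology of this article, and the dimensions here are assumed for illustration only.

```python
# Minimal dual-branch sketch (placeholder layers, not the paper's exact SFE/SST design):
# one branch extracts spatial features from an HSI patch, the other models the
# spectral sequence of the center pixel; both run in parallel on the same input.
import torch
import torch.nn as nn

class DualBranchSketch(nn.Module):
    def __init__(self, bands=200, n_classes=9, dim=64):
        super().__init__()
        # Spatial branch (stand-in for SFE): 2D convolutions over the input patch.
        self.spatial = nn.Sequential(
            nn.Conv2d(bands, dim, kernel_size=3, padding=1), nn.BatchNorm2d(dim), nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Spectral branch (stand-in for SST): 1D convolution plus a Transformer encoder
        # layer over the band sequence, loosely mirroring a convolution-plus-attention design.
        self.spectral_conv = nn.Conv1d(1, dim, kernel_size=7, padding=3)
        self.spectral_attn = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, dim_feedforward=128, batch_first=True)
        self.classifier = nn.Linear(2 * dim, n_classes)

    def forward(self, patch):                                     # patch: (batch, bands, h, w)
        f_spa = self.spatial(patch)                               # (batch, dim)
        center = patch[:, :, patch.size(2) // 2, patch.size(3) // 2]   # center-pixel spectrum
        seq = torch.relu(self.spectral_conv(center.unsqueeze(1)))      # (batch, dim, bands)
        seq = self.spectral_attn(seq.transpose(1, 2))             # (batch, bands, dim)
        f_spe = seq.mean(dim=1)                                   # (batch, dim)
        return self.classifier(torch.cat([f_spa, f_spe], dim=1))

print(DualBranchSketch()(torch.randn(4, 200, 9, 9)).shape)        # torch.Size([4, 9])
```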
Cross-Module Attention Fusion (CMAF)
This article introduces a channel attention-based CMAF mechanism to dynamically and adaptively fuse features from different branches, which significantly reduces information loss during the feature integration step.
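As an illustration of the fusion idea (not the exact CMAF design, which is specified in the methodology), the sketch below re-weights the concatenated branch features with a learned channel-attention gate instead of treating them equally, as plain element-wise addition would; all dimensions are assumed.

```python
# Illustrative channel-attention fusion of two branch features (not the exact CMAF module):
# a squeeze-and-excitation-style gate re-weights the concatenated channels before fusion.
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    def __init__(self, dim=64, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, 2 * dim // reduction), nn.ReLU(),
            nn.Linear(2 * dim // reduction, 2 * dim), nn.Sigmoid())

    def forward(self, f_spa, f_spe):               # each: (batch, dim)
        fused = torch.cat([f_spa, f_spe], dim=1)   # (batch, 2*dim)
        weights = self.gate(fused)                 # per-channel attention weights in (0, 1)
        return fused * weights                     # re-weighted joint feature

f = ChannelAttentionFusion()(torch.randn(4, 64), torch.randn(4, 64))
print(f.shape)  # torch.Size([4, 128])
```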
Comprehensive HSI classification accuracy assessment indices
This article develops a novel computation accuracy parameter efficiency (CAPE) index to quantify the computational efficiency of different models. In addition, the proposed evaluation index system (EIS) also includes classification accuracy metrics, confusing-category performance, and a computational efficiency index, enabling a comprehensive, multidimensional assessment of model performance.
4. Discussion
4.1. Classification Performance and Accuracy Consistency Verification
Some models, although performing well in quantitative metrics such as OA, still show significant discrepancies between these metrics and their actual classification results. This article analyzes the classification result maps of all models and selects typical regions with multi-category boundaries for locally enlarged comparisons, which are shown in Figure 14 and Figure 15.
SVM and LSTM, lacking spatial contextual modeling capabilities, display diffusely distributed misclassification. Severe boundary penetration occurs between spectrally similar categories like “Grapes_untrained” and “Vinyard_untrained”, accompanied by noticeable intra-class noise and boundary blurring.
Although 3D-CNN achieves high accuracy (OA > 99.05%), salt-and-pepper noise appears in the detailed views of both datasets, manifesting as isolated single-pixel category flips. This primarily stems from excessive smoothing of local details in deep networks, an irreversible loss of spatial resolution during hierarchical abstraction.
In contrast, SIFANet maintains continuous, clear category boundaries and smooth homogeneous regions in the detailed views. The results validate the effectiveness of the model’s structure design. The spatial–spectral dual-branch could extract complementary features, where residual blocks in the SFE branch effectively alleviate vanishing gradients in deep networks. Concurrently, the Conv-Former module enhances spectral sequence modeling capability. Ultimately, the CMAF module achieves the dynamic adaptive fusion of dual-branch features, minimizing information loss during feature transmission.
4.2. Confusing Classes Classification Performance Comparison
Based on the experimental results in Table 5 and Table 6, this article identifies land cover categories where multiple models perform poorly (i.e., confusing categories) and assesses them using the PR–F comprehensive evaluation model. The experimental results are shown in Figure 16 and Figure 17.
The precision–recall curves (PRCs) reveal significant fluctuations in SVM’s precision across recall thresholds, characterized by pronounced sawtooth patterns. On the large-scale Xiong’an hyperspectral dataset, LSTM exhibits similarly abrupt curve variations. Conversely, 3D-CNN and SIFANet display nearly overlapping trajectories on the Salinas dataset. However, as data complexity increases in Xiong’an, SIFANet maintains superior precision through its spatial–spectral fusion mechanism, while 3D-CNN shows significant deviation and accuracy degradation, confirming SIFANet’s efficacy in high-difficulty tasks.
The F1-Score and AP values corroborate the PRC observations. SIFANet consistently achieves optimal performance on imbalanced datasets, with an F1-Score above 0.99 and saturated AP values across all Salinas categories, while maintaining an F1-Score above 0.92 and an AP above 0.95 on the Xiong’an dataset. Although 3D-CNN performs second best in most scenes, its F1-Score and AP decrease markedly (by 20.3% and 14.5%, respectively) when processing challenging categories like “Sparse Forest” compared with other confusing categories. SVM and LSTM perform notably worse on the Xiong’an dataset, with F1-Score and AP dropping below 0.5 and even reaching zero for categories such as “Black locust” and “Sparse Forest”.
It is worth noting that the Salinas dataset suffers from class imbalance, where several land cover categories contain only a limited number of training samples. Such imbalance often leads to classification bias toward majority classes, particularly for models that rely on single-branch feature representations. The proposed CMAF module helps mitigate this issue by performing dynamic cross-module feature fusion between the spatial-oriented SFE branch and the spectral sequence modeling SST branch. By jointly learning channel-wise attention weights from heterogeneous features, CMAF adaptively enhances discriminative cues that are critical for minority classes, which are often characterized by subtle spectral differences or limited spatial support. This dynamic fusion mechanism contributes to more stable precision–recall behavior and the consistently high F1-Score and AP values across confusing and low-sample categories in the Salinas dataset, effectively reducing the model’s bias toward majority classes.
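For reference on the metric types reported above (not the actual evaluation code or data used for Figures 16 and 17), the following sketch shows how per-class PR curves, AP, and F1-Score can be computed from predicted class probabilities with scikit-learn; the arrays are random placeholders.

```python
# Sketch of per-class PR/F1/AP computation from predicted class probabilities
# (hypothetical data; not the evaluation code behind the reported figures).
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score, f1_score

rng = np.random.default_rng(0)
n_samples, n_classes, target = 1000, 9, 3              # "target" = one confusing class
y_true = rng.integers(0, n_classes, n_samples)         # placeholder ground-truth labels
proba = rng.random((n_samples, n_classes))
proba /= proba.sum(axis=1, keepdims=True)              # placeholder class probabilities
y_pred = proba.argmax(axis=1)

# One-vs-rest view of the confusing class.
y_bin = (y_true == target).astype(int)
precision, recall, _ = precision_recall_curve(y_bin, proba[:, target])  # PR curve points
ap = average_precision_score(y_bin, proba[:, target])                   # area-based AP
f1 = f1_score(y_true, y_pred, labels=[target], average="macro")         # per-class F1

print(f"class {target}: AP={ap:.3f}, F1={f1:.3f}, PR points={len(recall)}")
```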
4.3. Evaluation of Model Calculation Efficiency
Pursuing higher accuracy and stability under the same hardware constraints has always been a central goal of model development.
Table 7 summarizes the number of parameters and FLOPs of each model. Substituting the model efficiency data from Table 7 into the CAPE formula yields the results shown in Figure 18.
SIFANet’s CAPE values of 49.93 (SL) and 27.91 (XA) are significantly higher than those of 3D-CNN (SL: 18.06/XA: 9.29), LSTM (SL: 2.91/XA: 0.91), and SVM (SL: 1.47/XA: 0.75). The results indicate that SIFANet excels in comprehensive performance across multiple dimensions, validating its adaptability advantage in complex hyperspectral data scenarios. Furthermore, the CAPE index provides a pragmatic benchmark for model selection in real-world deployment. In resource-constrained environments such as drones or edge devices, researchers can use CAPE to quantify the “accuracy gain per unit of computational cost”, which facilitates the identification of models that strike an optimal balance between inference speed and classification accuracy.
4.4. Ablation Experiment
It is necessary to conduct ablation experiments to validate the effectiveness of the three core modules, namely SFE, SST, and CMAF. The results are shown in Table 8. Different colors in the table distinguish the accuracy comparisons: orange-marked values indicate accuracy lower than the SIFANet benchmark. “√” indicates that a structure is enabled in the experiment, while “×” indicates that it is disabled. Experiment 1 serves as the baseline, incorporating all three modules (SFE, SST, and CMAF). The experimental design follows a progressive decoupling approach to analyze module interactions.
To validate the independent effectiveness of SFE and SST, Experiment 1 (SFE + SST + CMAF) serves as the baseline. Performance comparisons with Experiment 2 (SST + CMAF only) and Experiment 3 (SFE + CMAF only) confirm that adding either SFE or SST significantly enhances classification accuracy. Notably, the proposed SFE module outperforms the conventional 3D-CNN owing to its multi-scale convolution kernels and residual connections, which effectively mitigate deep feature degradation.
Comparing Experiments 2/3 (“single module + CMAF”) against Experiments 5/6 (“single module only”) shows higher OA and Kappa coefficients for the former, demonstrating CMAF’s ability to extract valuable information from the SFE/SST branches while reducing interference from ineffective features. Contrasting Experiment 1 (CMAF fusion) with Experiment 4 (simple Add fusion) confirms that CMAF achieves complementary spatial–spectral integration, improving OA by 4.77% (SL)/4.88% (XA) and Kappa by 5.29% (SL)/5.72% (XA). This superiority stems from CMAF’s attention-driven gating mechanism. Unlike simple element-wise addition, which treats all features as equally important and may lead to “feature dilution” or noise propagation, the attention-based strategy dynamically re-weights the feature maps. It prioritizes discriminative spatial–spectral components while effectively suppressing redundant or non-informative features from individual branches. This adaptive fusion ensures a more precise alignment and integration of heterogeneous information.
In summary, each SIFANet module demonstrates a scientifically sound design, and the modules exhibit significant synergistic effects that collectively enhance classification accuracy.