A Dual-Branch CNN with Depthwise Separable Fusion for Hyperspectral Image Classification

Li, Teng; Cao, Yunhua; Guo, Xing; Zhang, Shikun; Yan, Lining

doi:10.3390/rs18111685

by Teng Li ¹, Yunhua Cao ¹,*, Xing Guo ², Shikun Zhang ¹ and Lining Yan ¹

¹ School of Physics, Xidian University, Xi’an 710071, China

² School of Electronic Engineering, Xidian University, Xi’an 710071, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(11), 1685; https://doi.org/10.3390/rs18111685

Submission received: 8 April 2026 / Revised: 15 May 2026 / Accepted: 18 May 2026 / Published: 22 May 2026

Highlights

What are the main findings?

DSFA-CNN introduces a dual-branch framework to jointly preserve spatial–spectral coupling and learn complementary spectral and spatial features.
CBAM-enhanced feature extraction and depthwise separable fusion improve classification performance while reducing feature redundancy.

What are the implications of the main findings?

The proposed method achieves a favorable balance between classification accuracy, interpretability, and computational efficiency.
The framework provides an effective and well-balanced design for hyperspectral image classification in complex scenes.

Abstract

Hyperspectral image classification remains challenging because robust recognition requires preserving spatial–spectral coupling, extracting complementary spectral and spatial cues, and fusing heterogeneous features without excessive redundancy. To address this issue, a dual-branch convolutional neural network (CNN) with depthwise separable fusion, termed DSFA-CNN, is developed. The network combines a 3D convolution branch for coupled spatial–spectral representation learning with a 1D+2D branch for efficient spectral and spatial modeling. A convolutional block attention module (CBAM) is introduced in the decomposed branch to emphasize informative spectral responses and salient spatial regions, and a depthwise separable fusion module is used to improve cross-branch integration while limiting fusion-stage redundancy and the risk of overfitting. Experiments on Indian Pines, University of Pavia, Salinas, and Houston2013 yield overall accuracies of 95.62 ± 0.13%, 99.25 ± 0.13%, 99.89 ± 0.11%, and 97.62 ± 0.23%, respectively. The gains are most evident on the more challenging Indian Pines and Houston2013 scenes. Ablation results show that the dual-branch design provides complementary information, whereas CBAM and the fusion module further improve representation selectivity and feature integration. Computational cost analysis further indicates that DSFA-CNN achieves a more favorable trade-off between classification accuracy and computational efficiency than several recent competitive baselines. These results demonstrate the effectiveness of parallel coupled–decomposed modeling with efficient feature fusion for robust hyperspectral image classification.

Keywords:

hyperspectral image classification; spectral–spatial feature learning; dual-branch convolutional neural network; depthwise separable fusion; convolutional block attention module; land-cover classification

1. Introduction

Hyperspectral imaging (HSI) has become an important modality for Earth observation because it records rich and continuous spectral responses of ground objects across hundreds of narrow bands [1,2,3]. Unlike conventional multispectral imagery, HSI provides a much finer description of material composition and subtle class differences, and has therefore been widely used in precision agriculture, geological and mineral exploration, urban remote sensing, mining inspection, and medical diagnosis [4,5,6,7,8,9,10]. Yet accurate HSI classification remains difficult. The high dimensionality of hyperspectral data, the Hughes phenomenon, strong inter-band redundancy, mixed pixels, high spectral similarity among classes, and the scarcity of labeled samples jointly make robust recognition challenging [11,12,13,14]. As a result, how to exploit spectral and spatial information more effectively without sacrificing robustness has remained a central problem in HSI classification [15,16,17].

Early HSI classification methods mainly relied on traditional machine learning models such as support vector machines, k-nearest neighbors, and random forests [18,19,20]. These methods can be useful when labeled data are limited, but they usually depend on hand-crafted features and often struggle to represent complex spectral variation together with local spatial structure. Deep learning, especially convolutional neural networks (CNNs), has substantially advanced the field by enabling data-driven feature extraction [21,22,23]. Existing CNN-based approaches generally fall into three categories. A spectral-only convolutional model is efficient and well suited to modeling spectral dependencies, but it makes limited use of spatial context [22]. A 2D-CNN is effective for texture and neighborhood modeling, yet it does not naturally preserve continuous spectral correlations. A 3D-CNN can directly encode spatial–spectral coupling, but this advantage often comes with higher computational cost and a greater risk of overfitting in small-sample settings [23,24,25]. These trade-offs suggest that no single convolutional path is fully satisfactory for complex HSI scenes.

To address the limitations of single-path CNNs, hybrid and multi-branch architectures have become an important direction in HSI classification. Hybrid 3D–2D CNNs, such as HybridSN-like models, combine 3D convolutions for local spatial–spectral interaction modeling with 2D convolutions for more efficient spatial abstraction. Multi-branch and dual-branch CNNs further improve feature diversity by learning spectral, spatial, or spatial–spectral representations through different paths, while attention-based and dual-attention methods enhance informative spectral responses, discriminative regions, or spectral–spatial responses. Methods such as the online spectral information compensation network (OSICN) further attempt to strengthen spectral–spatial interaction through online spectral information compensation, showing the importance of coupling spectral cues with spatial feature extraction [26,27,28,29,30,31]. These methods are closely related to our work because they all aim to improve spectral–spatial representation under limited labeled samples.

Nevertheless, several issues remain. In many hybrid CNNs, 3D and 2D operations are arranged sequentially, so the native spatial–spectral coupling may be gradually weakened during feature abstraction. In many multi-branch architectures, different branches can produce complementary representations, but the fusion stage often relies on direct concatenation or simple addition, which may introduce redundant responses, feature mismatch, and unnecessary parameter burden. Attention mechanisms can improve feature selectivity, but they do not by themselves solve the problem of compact and effective fusion between heterogeneous branch features. Recent works have further addressed hyperspectral learning from different perspectives, including weakly supervised image-to-pixel representation, incomplete-supervision full-model classification, cross-domain few-shot adaptation, and related Transformer-based hyperspectral representation or target-detection frameworks [32,33,34,35]. In particular, Transformer-based models such as GSCViT improve global contextual modeling and long-range dependency learning through attention mechanisms [36,37,38,39,40,41,42], while state-space or Mamba-based models such as HSIRMamba provide an efficient alternative for modeling long sequential dependencies in hyperspectral data [43,44,45,46,47]. These methods enhance global representation ability, but their main focus differs from the compact fusion of heterogeneous local spectral–spatial representations. These observations suggest that the remaining challenge is not simply to extract spectral and spatial features separately, but to simultaneously preserve native spatial–spectral coupling, learn complementary decomposed spectral/spatial cues, and fuse heterogeneous representations in a compact and selective manner.

To address this challenge, we propose DSFA-CNN, a dual-branch convolutional neural network (CNN) with depthwise separable fusion for hyperspectral image classification. The proposed dual-branch CNN performs collaborative feature learning by modeling coupled spatial–spectral information and decomposed spectral/spatial cues in parallel rather than forcing them into a single feature extraction path. One branch uses 3D convolutions to retain the original spatial–spectral coupling of HSI data. The other branch adopts a 1D+2D strategy to separately model spectral dependencies and spatial textures with lower computational burden. To further strengthen discriminative responses, a convolutional block attention module (CBAM) is embedded into the decomposed branch so that informative spectral responses and critical spatial regions can be emphasized adaptively. At the fusion stage, a depthwise separable fusion module is introduced to integrate the two branches more compactly and selectively than direct concatenation-based fusion.

We evaluate DSFA-CNN on four public benchmark datasets, namely Indian Pines, University of Pavia, Salinas, and Houston2013, and compare it with support vector machine (SVM), k-nearest neighbors (KNN), 3D-CNN, HybridSN, ResNet-50 [48], GSCViT, and HSIRMamba. These datasets cover agricultural and urban scenes with different spatial resolutions, class compositions, and difficulty levels, providing a comprehensive test bed for model assessment. Experimental results show that the proposed network achieves accurate and stable performance across all four datasets. It is particularly effective in challenging scenarios such as Indian Pines and Houston2013, where mixed-pixel interference, high inter-class spectral similarity, and complex backgrounds place greater demands on feature modeling. Ablation studies further confirm that the dual-branch design, attention enhancement, and depthwise separable fusion all contribute to the final performance. Interpretability analysis also shows that the model learns meaningful spectral and spatial responses, supporting both its effectiveness and its physical plausibility.

The main contributions of this work are threefold. First, we develop a dual-branch CNN framework that performs parallel coupled–decomposed feature learning. The 3D branch preserves native spatial–spectral coupling, whereas the 1D+2D branch learns complementary spectral and spatial cues with lower computational burden. Second, we design a depthwise separable fusion module (DSF) to improve cross-branch consistency while reducing fusion-stage redundancy and parameter burden relative to direct concatenation-based fusion. Third, we introduce an attention-enhanced feature extraction strategy that strengthens informative spectral responses and spatial regions, leading to a favorable balance between classification accuracy, interpretability, and computational efficiency.

2. Materials and Methods

2.1. Overall Architecture of DSFA-CNN

The proposed DSFA-CNN mainly consists of four components: a principal component analysis (PCA)-based dimensionality reduction module, a dual-branch collaborative feature extraction structure, a CBAM attention mechanism, and a depthwise separable fusion operation. The overall workflow of DSFA-CNN is summarized as follows:

(1) PCA is first used to preprocess the HSI data. From the original hundreds of spectral bands, 30 principal components are retained as the basis for subsequent feature extraction and classification, thereby alleviating abrupt dimensional variation and spectral redundancy.

(2) A dual-branch collaborative feature extraction structure is then employed. The 3D branch uses 3D convolutions to directly extract spatial–spectral features, whereas the 1D+2D branch adopts 1D and 2D convolutions to jointly extract spectral and spatial information, thereby strengthening the feature learning capability of the network.

(3) A CBAM attention mechanism is embedded into the 1D+2D branch, i.e., the CBAM-enhanced 1D+2D branch, to adaptively enhance spectral and spatial responses and further amplify the distinction between targets and background, thus improving classification performance.

(4) Finally, a depthwise separable fusion operation is used to integrate the spatial–spectral features extracted by the two branches. The features are adaptively weighted and combined along the feature/channel dimension, refined by convolution, and then concatenated. This operation enhances feature consistency, limits fusion-stage redundancy compared with direct high-dimensional concatenation, and helps alleviate overfitting.

The overall architecture of DSFA-CNN when the input size is (1, 30, 13, 13) is shown in Figure 1. Table 1 presents the network configuration using an example with the number of classes C = 7. For different datasets, the output dimensions of the class-dependent layers are adjusted according to the actual number of categories.
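As a rough illustration of how an input of size (1, 30, 13, 13) can be obtained, the sketch below cuts a 13 × 13 neighborhood around one pixel of a PCA-reduced cube. The helper name `extract_patch`, the (H, W, K) array layout, and the use of reflect padding at image borders are assumptions made for this example; the paper does not specify how border pixels are handled.

```python
import numpy as np

def extract_patch(cube, row, col, patch_size=13):
    """Return a (1, K, patch_size, patch_size) patch centred on (row, col).

    cube has shape (H, W, K): height, width, retained components.
    Reflect padding at the border is an assumption, not taken from the paper.
    """
    half = patch_size // 2
    padded = np.pad(cube, ((half, half), (half, half), (0, 0)), mode="reflect")
    # After padding, pixel (row, col) sits at (row + half, col + half),
    # so the window that starts at (row, col) is centred on it.
    window = padded[row:row + patch_size, col:col + patch_size, :]
    # Reorder to (K, h, w) and add a leading singleton channel axis.
    return window.transpose(2, 0, 1)[np.newaxis, ...]

rng = np.random.default_rng(0)
cube = rng.normal(size=(20, 25, 30))         # toy PCA-reduced scene
patch = extract_patch(cube, row=0, col=24)   # a corner pixel
print(patch.shape)  # (1, 30, 13, 13)
```

The leading axis of size 1 plays the role of the single input channel expected by a 3D convolution, with the 30 retained components forming the depth axis.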

2.2. Dimensionality Reduction by PCA

HSIs contain abundant spectral and spatial information, but their high dimensionality easily leads to data redundancy and abrupt dimensional variation, thereby increasing both computational cost and memory consumption. Principal component analysis (PCA) is a simple and efficient dimensionality reduction method that maps the original high-dimensional data into a low-dimensional space through a linear transformation, so that each new dimension is a linear combination of the original features. These new dimensions are the principal components: they are mutually orthogonal and ordered by decreasing variance, so the leading components capture as much of the data variance as possible.

The PCA procedure includes the following steps: (1) compute the mean so that the data are centered at the origin; (2) compute the covariance matrix; (3) obtain eigenvalues and eigenvectors from the covariance matrix; (4) select the eigenvectors corresponding to the largest eigenvalues as the principal components; and (5) project the original data onto the low-dimensional feature space. The workflow is illustrated in Figure 2.

In the proposed DSFA-CNN, PCA is used to reduce the spectral dimensionality of the HSI data. Let the initial input be X, where the width, height, and number of spectral bands are denoted by W, H, and B, respectively. After PCA, the reduced data can be expressed as $X_{k}$, where k denotes the retained spectral dimensionality. In this study, k is set to 30.
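The five PCA steps listed above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `pca_reduce`, the (H, W, B) array layout, and the random toy cube are all hypothetical.

```python
import numpy as np

def pca_reduce(cube, k=30):
    """Project an (H, W, B) cube onto its first k principal components."""
    h, w, b = cube.shape
    pixels = cube.reshape(-1, b).astype(np.float64)   # one row per pixel
    centered = pixels - pixels.mean(axis=0)            # step (1): centre the data
    cov = np.cov(centered, rowvar=False)               # step (2): (B, B) covariance
    eigvals, eigvecs = np.linalg.eigh(cov)             # step (3): ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:k]                # step (4): k largest
    projected = centered @ eigvecs[:, top]             # step (5): projection
    return projected.reshape(h, w, k)

rng = np.random.default_rng(0)
cube = rng.normal(size=(16, 16, 100))   # toy stand-in for an H x W x B scene
reduced = pca_reduce(cube, k=30)
print(reduced.shape)  # (16, 16, 30)
```

`np.linalg.eigh` is used because the covariance matrix is symmetric; it returns eigenvalues in ascending order, hence the explicit reordering in step (4).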

2.3. Dual-Branch Collaborative Feature Extraction

The key to HSI classification is to simultaneously characterize fine-grained spectral differences and local spatial structures. However, existing convolutional strategies often cannot preserve spatial–spectral coupling and maintain high extraction efficiency at the same time. A 1D convolution can effectively model local dependencies among contiguous PCA-retained spectral components with relatively few parameters and high efficiency, but it lacks the ability to model spatial context. A 2D convolution captures textures, edges, and neighborhood structures more effectively, but makes limited use of spectral correlation. Although a 3D convolution can directly model joint spatial–spectral features and preserve the inherent integrated nature of HSIs, it usually brings a larger parameter burden and higher computational cost, and is more prone to overfitting when only limited training samples are available.

To address this issue, we construct a dual-branch collaborative feature extraction structure to unify the preservation of spatial–spectral coupled representations with decomposed feature modeling. Specifically, the coupled 3D branch uses 3D convolutions to jointly model the input and retain the spatial–spectral coupling relationships in the original HSI data as much as possible. In contrast, the 1D+2D branch adopts a cascaded 1D+2D convolutional design to separately learn spectral dependencies and spatial textures, thereby obtaining discriminative decomposed representations at lower computational cost. The two branches are not simply redundant parallel paths; rather, they serve different purposes, namely preserving coupled representations and extracting efficient abstract features, and together form a complementary information expression mechanism.

In DSFA-CNN, the PCA-reduced HSI data are simultaneously fed into the two branches. In the 1D+2D branch, a 1D convolution first captures key inter-component dependencies, and a 2D convolution then extracts local spatial structures and aggregates features. In the coupled 3D branch, a 3D convolution directly learns the original spatial–spectral joint responses. Because the 3D branch is advantageous for preserving spatial–spectral coupling, whereas the 1D+2D branch is more computationally efficient, their collaboration provides a favorable balance between representational power and model complexity. This is also consistent with the ablation and time–cost analysis reported later, where removing either branch degrades performance, while the complete dual-branch structure maintains high classification accuracy with controlled computational overhead.

The specific computations in the CBAM-enhanced 1D+2D branch are given as follows:

$$F_{unite} = \mathrm{Flatten}\left(p_{(1D)_{i,j}}^{x}\right) + \mathrm{Flatten}\left(p_{(2D)_{i,j}}^{x,y}\right) \tag{1}$$

$$p_{(1D)_{i,j}}^{x} = f\left(\sum_{m} \sum_{h=0}^{H_{x}-1} k_{i,j,m}^{h}\, p_{(l-1),m}^{(x+h)} + b_{i,j}\right) \tag{2}$$

$$p_{(2D)_{l,j}}^{x,y} = f\left(\sum_{m} \sum_{h=0}^{H_{l}-1} \sum_{w=0}^{W_{l}-1} k_{i,j,m}^{h,w}\, p_{(l-1),m}^{(x+h),(y+w)} + b_{l,j}\right) \tag{3}$$

where $F_{unite}$ denotes the output of the CBAM-enhanced 1D+2D branch, and $p_{(1D)_{i,j}}^{x}$, $p_{(2D)_{i,j}}^{x,y}$, and $p_{(2D)_{l,j}}^{x,y}$ are the feature-map values produced by one- and two-dimensional convolution. $\mathrm{Flatten}(\cdot)$ denotes the operation that reshapes feature maps into one-dimensional vectors. $f(\cdot)$ denotes the activation function, implemented as ReLU in this work. l denotes the layer index, and i is used as an auxiliary layer/block index in the 1D+2D branch for notational consistency. The symbol j denotes the index of the output feature map or output channel, and m denotes the index of the input feature map or input channel. x and y denote spatial positions, whereas h and w denote spatial offsets of the convolution kernel. In general, H and W denote the height and width of the convolution kernel; specifically, $H_{x}$ and $H_{l}$ denote the one-dimensional kernel length in Equation (2) and the height of the two-dimensional convolution kernel in layer l, respectively, and $W_{l}$ denotes the width of the two-dimensional convolution kernel in layer l. $k_{i,j,m}^{h}$ and $k_{i,j,m}^{h,w}$ represent convolution kernel weights at positions h and $(h, w)$, respectively. $p_{(l-1),m}^{(x+h)}$ and $p_{(l-1),m}^{(x+h),(y+w)}$ denote the feature-map values of the mth input feature map in the $(l-1)$th layer. $b_{i,j}$ and $b_{l,j}$ denote bias terms.

The specific computations in the coupled 3D branch are given as follows:

$$F_{direct} = \mathrm{Linear}\left(p_{(3D)_{l,j}}^{x,y,z}\right) \tag{4}$$

$$p_{(3D)_{l,j}}^{x,y,z} = f\left(\sum_{m} \sum_{b=0}^{B_{l}-1} \sum_{h=0}^{H_{l}-1} \sum_{w=0}^{W_{l}-1} k_{l,j,m}^{h,w,b}\, p_{(l-1),m}^{(x+h),(y+w),(z+b)} + b_{l,j}\right) \tag{5}$$

where $F_{direct}$ denotes the output of the coupled 3D branch, and $p_{(3D)_{l,j}}^{x,y,z}$ denotes the feature-map value after the three-dimensional convolution layer. $\mathrm{Linear}(\cdot)$ denotes the fully connected mapping. $f(\cdot)$ denotes the activation function, implemented as ReLU in this work. l denotes the layer index, j denotes the index of the output feature map or output channel, and m denotes the index of the input feature map or input channel. x and y denote spatial positions, z denotes the spectral position, h and w denote spatial offsets of the convolution kernel, and b denotes the spectral offset of the 3D convolution kernel. $H_{l}$ and $W_{l}$ denote the height and width of the convolution kernel in layer l, respectively, and $B_{l}$ denotes the spectral depth of the 3D convolution kernel in layer l. $k_{l,j,m}^{h,w,b}$ denotes the convolution kernel weight at position $(h, w, b)$. $p_{(l-1),m}^{(x+h),(y+w),(z+b)}$ denotes the feature-map value of the mth input feature map in the $(l-1)$th layer. $b_{l,j}$ denotes the bias term.
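To make the index pattern of Equation (5) concrete, the following sketch evaluates it directly with NumPy. It is a deliberately naive illustration, assuming stride 1 and no padding (neither is stated alongside the equation); the function name `conv3d_valid`, the (M, X, Y, Z) layout that follows the index order of the equation, and the kernel sizes in the example are hypothetical and are not taken from Table 1.

```python
import numpy as np

def conv3d_valid(p_prev, kernels, bias):
    """Eq. (5) with stride 1 and no padding (both are assumptions).

    p_prev : (M, X, Y, Z)    input feature maps, Z is the spectral axis
    kernels: (J, M, H, W, B) weights k[j, m, h, w, b]
    bias   : (J,)            one bias per output feature map
    """
    m_in, x_in, y_in, z_in = p_prev.shape
    j_out, m_k, kh, kw, kb = kernels.shape
    assert m_in == m_k, "input-channel counts must match"
    out = np.zeros((j_out, x_in - kh + 1, y_in - kw + 1, z_in - kb + 1))
    for x in range(out.shape[1]):
        for y in range(out.shape[2]):
            for z in range(out.shape[3]):
                window = p_prev[:, x:x + kh, y:y + kw, z:z + kb]
                # Sum over m, h, w, b for every output map j at once.
                out[:, x, y, z] = np.tensordot(kernels, window, axes=4) + bias
    return np.maximum(out, 0.0)  # f = ReLU

rng = np.random.default_rng(0)
p_prev = rng.normal(size=(1, 13, 13, 30))    # one 13 x 13 patch, 30 components
kernels = rng.normal(size=(8, 1, 3, 3, 7))   # illustrative kernel sizes
bias = np.zeros(8)
out = conv3d_valid(p_prev, kernels, bias)
print(out.shape)  # (8, 11, 11, 24)
```

Equations (2) and (3) follow the same pattern with fewer summation axes: dropping the spectral offset gives the 2D case, and keeping a single offset gives the 1D case.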

Finally, the outputs of the two branches are fused by the depthwise separable fusion module to obtain the final prediction, which can be expressed as follows:

$$F_{output} = \mathrm{DSFusion}\left(F_{unite}, F_{direct}\right) \tag{6}$$

where $F_{output}$ represents the final output of the DSFA-CNN network. $\mathrm{DSFusion}(\cdot)$ denotes the proposed depthwise separable fusion operation, which takes $F_{unite}$ and $F_{direct}$ as its inputs.

2.4. Attention Mechanism

The attention mechanism is inspired by the human visual and cognitive system and aims to selectively allocate limited computational resources to more informative and critical regions while suppressing relatively unimportant information. In HSI classification, attention can refine spatial, spectral, and texture features and is therefore frequently embedded into network architectures to improve feature learning and classification performance.

To improve the classification capability of the proposed network, we embed a convolutional block attention module (CBAM) into the 1D+2D branch of DSFA-CNN. CBAM is a simple yet effective feed-forward attention module designed to compensate for the limitations of conventional CNNs when handling information of different scales, shapes, and directions. Specifically, CBAM introduces channel attention and spatial attention. Channel attention enhances feature representation across different feature channels, whereas spatial attention highlights informative positions in the spatial domain. The structure of the CBAM module is illustrated in Figure 3.

In the 1D convolutional layers of the CBAM-enhanced 1D+2D branch, the channel attention component of CBAM is introduced to enhance spectral responses while suppressing less relevant spatial interference. In the 2D convolutional layers, the spatial attention component is introduced to emphasize important spatial regions while suppressing irrelevant information. After attention reweighting, the features are fed into the corresponding 1D and 2D convolutional layers for training. This design further enlarges the distinction between targets and background and improves the network’s classification effectiveness. The specific formulations of the channel attention and spatial attention modules are given as follows:

$$M_{channel}(x) = \sigma\Big(\mathrm{Conv2d}\big(\mathrm{ReLU}(\mathrm{Conv2d}(\mathrm{Avg}(x)))\big) + \mathrm{Conv2d}\big(\mathrm{ReLU}(\mathrm{Conv2d}(\mathrm{Max}(x)))\big)\Big) \tag{7}$$

$$F_{\mathrm{channel\_output}} = x \odot M_{channel}(x) \tag{8}$$

$$F_{\mathrm{spatial\_output}} = x \odot \sigma\big(\mathrm{Conv2d}([\mathrm{Avg}(x); \mathrm{Max}(x)])\big) \tag{9}$$

where x denotes the input feature map of the CBAM module. $\mathrm{Avg}(\cdot)$ and $\mathrm{Max}(\cdot)$ denote the average pooling and max pooling operations, respectively. ReLU denotes the rectified linear unit activation function, $\mathrm{Conv2d}(\cdot)$ denotes the two-dimensional convolution operation, and $\sigma$ represents the sigmoid activation function. $[\cdot]$ denotes the concatenation operation, and $\odot$ denotes element-wise multiplication. $M_{channel}(x)$ denotes the channel attention map. $F_{\mathrm{channel\_output}}$ and $F_{\mathrm{spatial\_output}}$ denote the outputs of the channel attention and spatial attention modules, respectively.
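Equations (7)–(9) can be illustrated with a small NumPy sketch. Several details below are assumptions borrowed from the standard CBAM design rather than from this paper: the two Conv2d layers in Equation (7) are treated as 1 × 1 convolutions shared by the average- and max-pooled paths, the reduction ratio is 4, the spatial kernel is 7 × 7 with zero padding, and biases are omitted. The function names are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def channel_attention(x, w1, w2):
    """Eqs. (7)-(8). x: (C, H, W); w1: (C // r, C); w2: (C, C // r)."""
    avg = x.mean(axis=(1, 2))   # Avg(x): global average pooling, shape (C,)
    mx = x.max(axis=(1, 2))     # Max(x): global max pooling, shape (C,)

    def mlp(v):
        # On a 1 x 1 map, a 1 x 1 Conv2d is a matrix product over channels.
        return w2 @ np.maximum(w1 @ v, 0.0)

    m_channel = sigmoid(mlp(avg) + mlp(mx))   # Eq. (7)
    return x * m_channel[:, None, None]       # Eq. (8)

def spatial_attention(x, kernel):
    """Eq. (9). x: (C, H, W); kernel: (2, k, k) with odd k."""
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])   # [Avg(x); Max(x)]
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    h, w = x.shape[1:]
    logits = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            logits[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return x * sigmoid(logits)[None, :, :]

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 13, 13))
w1 = rng.normal(size=(4, 16))        # reduction ratio r = 4 (assumed)
w2 = rng.normal(size=(16, 4))
kernel = rng.normal(size=(2, 7, 7))  # 7 x 7 spatial kernel (assumed)
y = spatial_attention(channel_attention(x, w1, w2), kernel)
print(y.shape)  # (16, 13, 13)
```

The example chains the two modules as in the original CBAM; in DSFA-CNN the channel component is attached to the 1D layers and the spatial component to the 2D layers, so the two functions would be applied at different points of the branch.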

2.5. Depthwise Separable Fusion

Within the dual-branch collaborative extraction framework, effectively fusing heterogeneous features is one of the key factors determining model performance. The CBAM-enhanced 1D+2D branch outputs high-level semantic features obtained from decomposed modeling and thus has strong abstraction capability, whereas the coupled 3D branch outputs local detailed representations that preserve the original spatial–spectral coupling and therefore better retain raw information. Because these two types of features differ substantially in generation mechanism, semantic level, and representation form, direct concatenation, simple summation, or ordinary fully connected mapping can easily introduce redundancy, semantic mismatch, and information interference, weakening the complementary advantages of the dual-branch design. Therefore, the contribution of this work lies not only in building a dual-branch architecture, but also in designing an efficient fusion mechanism specifically for heterogeneous spatial–spectral features.

To this end, a depthwise separable fusion module (DSF) is introduced at the feature fusion stage, as shown in Figure 4. The module collaboratively fuses the two branch features from both channel and spatial perspectives. In the implementation, the branch outputs are first projected into comparable feature representations before DSF performs channel-wise weighting and convolution-based refinement. Along the channel dimension, weighted summation is used to adaptively adjust the contribution of each branch so as to highlight critical channel responses while suppressing redundant information. Along the spatial dimension, convolution is used to further model local spatial relationships across branches, thereby enhancing the spatial consistency and discriminative stability of the fused representation. The two fusion outputs are then concatenated to preserve both channel-selective aggregation and spatial structure consistency. Compared with straightforward high-dimensional concatenation followed by a fully connected layer, DSF reduces fusion-stage parameter redundancy and the risk of overfitting while better preserving the original spatial–spectral representation of the 3D branch and the efficient semantic abstraction of the 1D+2D branch. This interpretation is also supported by the later ablation results, where removing DSF leads to a clear performance drop, indicating that the gain of the full model comes not only from the dual-branch structure itself but also from the tailored heterogeneous feature fusion strategy.

Although OSICN also aims to strengthen spectral–spatial interaction in HSI classification, its design principle is different from the proposed DSF module. OSICN introduces online spectral information compensation, where spectral information is progressively supplemented to guide spatial feature extraction during network learning [31]. In contrast, the proposed DSF module is not designed as a spectral compensation mechanism. Instead, it focuses on the fusion stage of a dual-branch CNN, where the coupled 3D spatial–spectral representation and the decomposed 1D+2D spatial–spectral representation need to be integrated. By using depthwise separable operations, DSF provides a more compact alternative to direct high-dimensional concatenation, reducing fusion-stage redundancy and parameter burden while preserving cross-branch complementarity.

The core advantage of DSF is that it adaptively adjusts the contribution weights of the two branches through channel-wise weighted summation, thereby suppressing redundant channels and enhancing key features. At the spatial level, convolution is used to mine cross-branch spatial context and strengthen spatial consistency. Compared with naive fully connected fusion of concatenated branch features, the proposed design is intended to reduce fusion-stage parameter redundancy and the risk of overfitting while preserving the complementarity of the two branches, namely the spatial–spectral representation ability of the 3D branch and the efficient semantic abstraction capability of the 1D+2D branch.
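The parameter saving that motivates depthwise separable operations can be checked with simple arithmetic. The sketch below compares the weight count of a standard k × k convolution with that of a depthwise k × k convolution followed by a pointwise 1 × 1 convolution. The channel and kernel sizes are illustrative only and are not the DSF configuration reported in the paper.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k (one filter per input channel) plus pointwise 1 x 1."""
    return k * k * c_in + c_in * c_out

c_in, c_out, k = 128, 128, 3   # illustrative sizes, not from the paper
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
print(std, sep, round(sep / std, 3))  # 147456 17536 0.119
```

The ratio equals 1/c_out + 1/k², so for 3 × 3 kernels and wide layers the separable form needs roughly one ninth of the weights.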

3. Results and Discussion

3.1. Benchmark Datasets

To comprehensively evaluate the HSI classification performance of the proposed DSFA-CNN in different scenarios, experiments are conducted on four public benchmark datasets: Indian Pines (IP), University of Pavia (PU), Salinas (SA), and Houston2013. These datasets cover agricultural, urban, and complex land-cover scenes, with different spatial resolutions, class compositions, and classification difficulty levels, and therefore provide a comprehensive test bed for assessing spatial–spectral feature extraction, feature fusion, and complex-background suppression.

3.1.1. Indian Pines (IP)

The Indian Pines dataset was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) in June 1992 over northwestern Indiana, USA. The size of the original image is 145 × 145 pixels and it contains 220 spectral bands covering the wavelength range from 0.4 to 2.5 μm. Bands 104–108, 150–163, and 220, which are strongly affected by water absorption and noise, are usually removed, leaving 200 valid bands for classification experiments. The spatial resolution is approximately 20 m, and mixed pixels are common, making the dataset relatively challenging. IP contains 16 land-cover classes and 10,249 labeled samples in total. The false-color image and ground-truth map are shown in Figure 5, and the class categories and sample sizes are listed in Table 2.
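The band bookkeeping for Indian Pines can be verified with a few lines of Python. The sketch assumes that the quoted band numbers are 1-based, which is the usual convention but is not stated explicitly in the text.

```python
# 1-based band numbers quoted in the text (assumed 1-based).
removed = set(range(104, 109)) | set(range(150, 164)) | {220}
kept = [b for b in range(1, 221) if b not in removed]
print(len(removed), len(kept))  # 20 200

# 0-based indices along the band axis of a (145, 145, 220) array.
kept_idx = [b - 1 for b in kept]
```

With a NumPy cube, `cube[:, :, kept_idx]` would then select the 200 retained bands.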

3.1.2. University of Pavia (PU)

The University of Pavia dataset was captured by the Reflective Optics System Imaging Spectrometer (ROSIS) in 2003 over the urban area of the University of Pavia, Italy. The original image size is 610 × 340 pixels with 115 spectral bands spanning 0.43–0.86 μm. After removing 12 noisy bands, 103 valid bands are retained for the experiments. The spatial resolution is 1.3 m, and the dataset contains rich spatial textures and urban structural information. PU includes nine land-cover classes and 42,776 labeled samples. The false-color image and ground-truth map are shown in Figure 6, and the corresponding class categories and sample sizes are given in Table 3.

3.1.3. Salinas (SA)

The Salinas dataset was also acquired by AVIRIS over Salinas Valley, California, USA. The image size is 512 × 217 pixels, and the raw data contain 224 spectral bands in the wavelength range from 0.4 to 2.5 μm. After removing water absorption bands, 204 valid bands are retained for the experiments. This dataset has a relatively high spatial resolution of 3.7 m and contains well-structured land-cover distributions, making it suitable for evaluating classification performance in high-accuracy scenarios. SA includes 16 land-cover classes and 54,129 labeled samples. The false-color image and ground-truth map are shown in Figure 7, and the class categories and sample sizes are listed in Table 4.

3.1.4. Houston2013

The Houston2013 dataset was released for the IEEE GRSS Data Fusion Contest 2013 and covers the University of Houston campus and its surrounding urban area in Texas, USA. It contains multiple typical urban land-cover types, including roads, buildings, vegetation, and water, and features complex scene structure and fine-grained class discrimination, making it suitable for evaluating model performance in complicated urban HSI classification scenarios. Houston2013 includes 15 land-cover classes and 16,372 labeled samples in total. The false-color image and ground-truth map are shown in Figure 8, and the corresponding class categories and sample sizes are listed in Table 5.

3.2. Evaluation Metrics

To objectively evaluate the classification performance of DSFA-CNN, four widely used metrics are adopted: overall accuracy (OA), average accuracy (AA), the Kappa coefficient, and macro-averaged F1-score (Macro-F1). These metrics assess the model from the perspectives of overall classification performance, mean recognition ability across classes, consistency between the predicted labels and the ground truth, and balanced class-wise recognition under class imbalance.

Overall accuracy (OA) denotes the ratio of correctly classified samples to the total number of samples and is used to measure the overall classification performance of the model. Its formulation is given in Equation (10).

\mathrm{OA} = \frac{\sum_{i = 1}^{N} C_{i, i}}{\sum_{i = 1}^{N} \sum_{j = 1}^{N} C_{i, j}}

(10)

where N is the total number of classes, $C_{i,i}$ represents the number of samples belonging to class i that are correctly classified as class i, and $C_{i,j}$ (with j ≠ i) represents the number of samples belonging to class i but misclassified as class j.

Average accuracy (AA) denotes the mean of the per-class accuracies and thus more comprehensively reflects the recognition capability of the model across different land-cover categories. Its formulation is given in Equation (11).

\mathrm{AA} = \frac{1}{N} \sum_{i = 1}^{N} \frac{C_{i, i}}{\sum_{j = 1}^{N} C_{i, j}}

(11)

where N is the total number of classes, and $\frac{C_{i,i}}{\sum_{j=1}^{N} C_{i,j}}$ denotes the classification accuracy of class i.
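As an illustrative sketch rather than the authors' implementation, Equations (10) and (11) can be evaluated directly from a confusion matrix. The function names and the 3-class matrix below are hypothetical. Rows are assumed to index ground-truth classes and columns predicted classes, and every class is assumed to have at least one sample.

```python
def overall_accuracy(cm):
    """OA (Eq. 10): sum of the diagonal divided by the total sample count."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def average_accuracy(cm):
    """AA (Eq. 11): unweighted mean of the per-class accuracies (row-wise)."""
    per_class = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]
    return sum(per_class) / len(cm)


# Hypothetical 3-class confusion matrix (rows: ground truth, columns: prediction).
# The third class is deliberately small to show how OA and AA can diverge.
cm = [
    [90, 5, 5],
    [10, 80, 10],
    [2, 2, 6],
]

oa = overall_accuracy(cm)
aa = average_accuracy(cm)
print(f"OA = {100 * oa:.2f}%, AA = {100 * aa:.2f}%")
```

In this toy matrix the small third class lowers AA much more than OA, which is why the two metrics are reported together.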

The Kappa coefficient measures the agreement between the predicted classification results and the ground-truth labels. Compared with overall accuracy, it additionally considers the agreement occurring by chance, thereby enabling a more objective assessment of model performance. Its value lies in the range of $[-1, 1]$, and a value closer to 1 indicates a higher degree of agreement between the predicted results and the ground-truth labels. The Kappa coefficient can be calculated as

κ = \frac{T \sum_{i = 1}^{N} C_{i, i} - \sum_{i = 1}^{N} (\sum_{j = 1}^{N} C_{i, j}) (\sum_{j = 1}^{N} C_{j, i})}{T^{2} - \sum_{i = 1}^{N} (\sum_{j = 1}^{N} C_{i, j}) (\sum_{j = 1}^{N} C_{j, i})}

(12)

where N denotes the number of land-cover classes, T denotes the total number of testing samples, and $C_{i,j}$ represents the number of samples whose ground-truth label is class i and whose predicted label is class j. The term $\sum_{j=1}^{N} C_{i,j}$ represents the number of ground-truth samples in class i, whereas $\sum_{j=1}^{N} C_{j,i}$ represents the number of samples predicted as class i. Therefore, the Kappa coefficient evaluates the agreement between the predicted labels and the ground-truth labels while accounting for chance agreement. For consistency with OA and AA, the Kappa coefficient is reported as $100 \times \kappa$ in all tables.
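A minimal sketch of Equation (12), assuming the same row/column convention (rows: ground truth, columns: prediction). The function name and the example matrix are hypothetical and are not taken from the paper's tables.

```python
def kappa_coefficient(cm):
    """Kappa (Eq. 12) from a square confusion matrix given as nested lists."""
    n = len(cm)
    total = sum(sum(row) for row in cm)                      # T
    diagonal = sum(cm[i][i] for i in range(n))               # sum of C_ii
    row_sums = [sum(cm[i]) for i in range(n)]                # ground-truth counts
    col_sums = [sum(cm[j][i] for j in range(n)) for i in range(n)]  # predicted counts
    chance = sum(r * c for r, c in zip(row_sums, col_sums))  # chance-agreement term
    return (total * diagonal - chance) / (total ** 2 - chance)


# Hypothetical 3-class confusion matrix.
cm = [
    [90, 5, 5],
    [10, 80, 10],
    [2, 2, 6],
]

kappa = kappa_coefficient(cm)
print(f"Kappa (reported as 100 x kappa) = {100 * kappa:.2f}")
```

For this matrix OA is about 83.8% while 100 × κ is about 71.4, illustrating how the chance-agreement correction lowers the score when one class distribution dominates.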

In addition to OA, AA, and Kappa, the macro-averaged F1-score is introduced to further evaluate classification performance under class-imbalanced conditions. For the confusion matrix C, $C_{i,j}$ denotes the number of samples whose ground-truth label is class i and whose predicted label is class j. Therefore, the recall of class i, also referred to as the per-class accuracy or producer’s accuracy, is defined as

R_{i} = \frac{C_{i, i}}{\sum_{j = 1}^{N} C_{i, j}}

(13)

where $R_{i}$ measures the proportion of samples from class i that are correctly recognized. The per-class recall values are reported in the class-wise rows of Table 6, Table 7, Table 8 and Table 9.

The precision of class i is defined as

P_{i} = \frac{C_{i, i}}{\sum_{j = 1}^{N} C_{j, i}}

(14)

which measures the proportion of samples predicted as class i that truly belong to class i. Based on precision and recall, the F1-score of class i is calculated as

\mathrm{F1}_{i} = \frac{2 P_{i} R_{i}}{P_{i} + R_{i}}

(15)

Finally, the macro-averaged F1-score is obtained by averaging the F1-scores over all N classes:

\mathrm{Macro\text{-}F1} = \frac{1}{N} \sum_{i = 1}^{N} \mathrm{F1}_{i}

(16)

Unlike OA, which can be dominated by majority classes, Macro-F1 assigns equal weight to each class and is therefore more sensitive to minority-class recognition. Accordingly, Macro-F1 is reported as an additional summary metric in Table 6, Table 7, Table 8 and Table 9 to complement OA, AA, Kappa, and the class-wise recall results. Together, the per-class recall rows and Macro-F1 provide a more direct assessment of recognition ability for minority or difficult land-cover classes.
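The following sketch evaluates Equations (13) to (16) on a deliberately imbalanced two-class example to show how Macro-F1 can differ from OA. The function name, the example matrix, and the convention of returning 0 when a denominator is zero are assumptions made for illustration.

```python
def macro_f1(cm):
    """Macro-F1 (Eqs. 13-16) from a square confusion matrix (rows: ground truth)."""
    n = len(cm)
    f1_scores = []
    for i in range(n):
        true_positive = cm[i][i]
        actual = sum(cm[i])                          # ground-truth samples of class i
        predicted = sum(cm[j][i] for j in range(n))  # samples predicted as class i
        recall = true_positive / actual if actual else 0.0           # Eq. (13)
        precision = true_positive / predicted if predicted else 0.0  # Eq. (14)
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)  # Eq. (15)
    return sum(f1_scores) / n                        # Eq. (16)


# Hypothetical imbalanced case: 1000 majority samples, 10 minority samples.
cm = [
    [990, 10],
    [8, 2],
]

oa = (cm[0][0] + cm[1][1]) / sum(sum(row) for row in cm)
mf1 = macro_f1(cm)
print(f"OA = {100 * oa:.2f}%, Macro-F1 = {100 * mf1:.2f}%")
```

Here OA stays above 98% because the majority class is recognized well, whereas Macro-F1 drops below 60% because the minority class has low precision and recall.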

3.3. Comparative Evaluation

To comprehensively evaluate the performance of DSFA-CNN, comparative experiments are conducted on the four public HSI datasets, i.e., Indian Pines (IP), University of Pavia (PU), Salinas (SA), and Houston2013. The experiments are implemented on a Windows platform equipped with an NVIDIA RTX 3070 GPU and an Intel Core i7-12700 CPU at 2.10 GHz, using Python 3.9.16 and PyTorch 2.0.1. During training, the number of epochs is set to 150, the learning rate is 0.001, the number of retained principal components after PCA is 30, the batch size is 128, the loss function is cross-entropy loss, and Adagrad is used for parameter optimization. The input patch size is 13 × 13. In each experiment, the labeled samples are randomly split into training and testing sets at a ratio of 1:9.

To ensure fair and reproducible comparisons, no separate validation set was used and early stopping was not applied. All deep learning models were trained for 150 epochs with fixed hyperparameters across all datasets and independent runs, including a learning rate of 0.001, batch size of 128, Adagrad optimizer, 30 retained PCA components, and a 13 × 13 input patch. No dataset-specific hyperparameter tuning was performed, except that the output dimension of the final classifier was adjusted according to the number of classes in each dataset. The settings of 30 PCA components and 13 × 13 patch size are further supported by the sensitivity analysis in Section 3.4, which shows a favorable accuracy–inference time trade-off. The same ten predefined random seeds were used for all deep learning methods to control both train/test splitting and network initialization.
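One way such a seeded 1:9 split could be organized is sketched below. This is an assumption-laden illustration: the text does not state whether the split is stratified by class, so a plain random split over all labeled samples is shown, and the seed values and function name are hypothetical.

```python
import random


def split_indices(num_samples, train_fraction=0.1, seed=0):
    """Seeded random split of sample indices into train and test sets."""
    rng = random.Random(seed)          # local generator, so the split is reproducible
    indices = list(range(num_samples))
    rng.shuffle(indices)
    n_train = round(train_fraction * num_samples)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


# Hypothetical seed list; the ten predefined seeds are not given in the text.
SEEDS = list(range(10))

# 16,372 is the Houston2013 labeled-sample count stated in Section 3.1.4.
train_idx, test_idx = split_indices(16372, train_fraction=0.1, seed=SEEDS[0])
print(len(train_idx), len(test_idx))

# In a PyTorch pipeline, the same seed would typically also be passed to
# torch.manual_seed(seed) so that network initialization is controlled as well.
```

Reusing one seed list for every compared model means that each method sees the same ten train/test partitions, which is what makes the reported means and standard deviations comparable.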

To verify the effectiveness of the proposed model, seven representative methods were selected for comparison, including two traditional machine learning methods, SVM and KNN, and five deep learning methods, 3D-CNN, HybridSN, ResNet-50, GSCViT, and HSIRMamba. Together with DSFA-CNN, eight models are compared in total. For fairness, all compared methods were evaluated under the same preprocessing and train/test split protocol. For the deep learning methods, the same training settings described above were adopted. OA, AA, Kappa, and Macro-F1 scores of the deep learning methods are reported as mean ± standard deviation over the ten independent runs, where the mean reflects average classification performance and the standard deviation reflects stability; smaller standard deviations indicate stronger robustness. Note that the per-class accuracies reported in the tables are taken from the first independent run of each model and are provided for reference only. The classification maps shown for each dataset correspond to the run with the highest OA among the ten independent experiments.
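The mean ± standard deviation summary can be sketched as follows. The ten values are invented placeholders, not results from the paper, and the use of the sample standard deviation (rather than the population form) is an assumption, since the text does not specify which is reported.

```python
import statistics


def summarize_runs(values):
    """Format repeated-run scores as 'mean ± std' with two decimals."""
    mean = statistics.mean(values)
    # Sample standard deviation (n - 1 denominator); statistics.pstdev would
    # give the population form instead.
    std = statistics.stdev(values)
    return f"{mean:.2f} ± {std:.2f}"


# Placeholder OA values (%) for ten hypothetical independent runs.
oa_runs = [90.0, 91.0, 92.0, 91.0, 90.0, 92.0, 91.0, 91.0, 90.0, 92.0]
summary = summarize_runs(oa_runs)
print(summary)
```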

3.3.1. Results on the IP Dataset

The quantitative results on the IP dataset are listed in Table 6, and the corresponding classification maps are shown in Figure 9. DSFA-CNN achieves the best OA, AA, and Kappa performance, with values of 95.62 ± 0.13%, 94.65 ± 0.18%, and 95.01 ± 0.10%, respectively. Its advantage is also reflected in a more balanced class-wise performance, especially on difficult categories such as Grass-trees, Grass-pasture-mowed, Oats, Buildings-Grass-Trees-Drives, and Stone-Steel-Towers. This result is consistent with the model design: the 3D branch preserves native spatial–spectral coupling, the 1D+2D branch captures complementary spectral and spatial cues, and CBAM together with DSF improves feature selectivity and fusion quality.

A notable result is the relatively weak performance of ResNet-50 on IP. This is reasonable given the characteristics of the dataset. IP has relatively low spatial resolution, frequent mixed pixels, and high spectral similarity among crop classes, while several categories contain very few labeled samples. Under these conditions, a deep residual network originally developed for natural-image recognition is less effective at preserving fine spectral continuity and is more susceptible to overfitting, which explains why its OA and AA remain below those of 3D-CNN, HybridSN, GSCViT, and HSIRMamba. Compared with the recent HSIRMamba baseline, DSFA-CNN also achieves higher OA and AA on IP, indicating that broad contextual modeling alone is still insufficient to match the benefit of the proposed dual-branch feature extraction in such a mixed agricultural scene. By contrast, 3D-CNN and HybridSN remain competitive because direct spatial–spectral modeling is better suited to the boundary ambiguity and class overlap of IP. Overall, the main advantage of DSFA-CNN lies not in isolated gains on a few easy classes, but in its stronger robustness across a highly heterogeneous agricultural dataset.

Macro-F1 provides an additional class-balanced precision–recall view for the imbalanced IP dataset. Although DSFA-CNN does not obtain the highest Macro-F1 on IP, its value of 90.93 ± 0.34% outperforms SVM, KNN, ResNet-50, and 3D-CNN, while showing a relatively small standard deviation. Combined with the class-wise recall results, DSFA-CNN maintains strong recognition on several minority or difficult classes such as Grass-pasture-mowed, Oats, and Stone-Steel-Towers, although extremely small classes such as Alfalfa remain challenging.

3.3.2. Results on the PU Dataset

The quantitative results on the PU dataset are listed in Table 7, and the classification maps are shown in Figure 10. The results on the PU dataset show that DSFA-CNN achieves the best OA, AA, and Kappa performance, with values of 99.25 ± 0.13%, 98.71 ± 0.21%, and 99.01 ± 0.11%, respectively. It also maintains a strong class-wise profile, reaching the best or tied-best accuracy on most categories, including asphalt, meadows, gravel, trees, painted metal sheets, bare soil, and bitumen. This result is consistent with the characteristics of PU. As a high-resolution urban dataset, PU contains clear boundaries and rich local textures, so accurate classification depends on both spectral discrimination and spatial structure modeling.

Notably, the gap among the stronger deep models is relatively small in terms of OA, but larger in AA. In particular, GSCViT and HSIRMamba achieve OA values close to that of DSFA-CNN, whereas their AA values are lower, indicating less balanced performance across categories. This is reasonable for PU, where several classes are easy to recognize because of their distinct structure, while others, such as gravel, bitumen, and self-blocking bricks, remain more confusable due to material similarity. By contrast, 3D-CNN and HybridSN remain competitive because explicit spatial–spectral modeling is better suited to the fine boundaries and structural heterogeneity of PU. Overall, the main advantage of DSFA-CNN on this dataset lies in its ability to preserve very high overall accuracy while maintaining more balanced discrimination across urban categories.

For the PU dataset, Macro-F1 further complements the OA-, AA-, and Kappa-based evaluation. DSFA-CNN obtains a Macro-F1 score of 98.47 ± 0.34%, which is slightly lower than several strong deep learning baselines but still indicates competitive class-balanced performance. Together with its best OA, AA, and Kappa values, the high recall on relatively small categories such as Painted metal sheets, Bitumen, and Shadows suggests that the proposed model preserves balanced discrimination across urban land-cover classes.

3.3.3. Results on the SA Dataset

The quantitative results on the SA dataset are listed in Table 8, and the classification maps are shown in Figure 11. DSFA-CNN achieves the best performance across the reported summary metrics, with OA, AA, Kappa, and Macro-F1 values of 99.89 ± 0.11%, 99.87 ± 0.12%, 99.83 ± 0.11%, and 99.75 ± 0.11%, respectively, indicating that the model can still provide measurable gains even when the baseline accuracies are already high.

The highest Macro-F1 score is consistent with the best OA, AA, and Kappa values on this dataset. This indicates that the proposed model not only improves overall accuracy but also maintains strong class-balanced precision–recall performance across different agricultural categories.

The SA dataset has relatively high spatial resolution and regular land-cover distributions, but subtle spectral differences still exist among vegetables and vineyards at different growth stages. For example, lettuce categories at different growth stages differ in chlorophyll content, water content, and canopy structure, and these differences are reflected in subtle spectral responses. DSFA-CNN can simultaneously capture such fine-grained spectral variations and regular spatial structures, which explains its higher accuracy on this dataset. As shown in Figure 11, its classification map is closer to the ground truth, with clearer region boundaries.

3.3.4. Results on the Houston2013 Dataset

The quantitative results on the Houston2013 dataset are listed in Table 9, and the classification maps are shown in Figure 12. DSFA-CNN again achieves the best OA, AA, and Kappa performance, with values of 97.62 ± 0.23%, 97.65 ± 0.18%, and 97.01 ± 0.30%, respectively, indicating a stable advantage in complex urban scenes.

Houston2013 contains various complex urban categories such as roads, railways, parking lots, buildings, and vegetation. Different classes may exhibit both material similarity and structural differences, which imposes a higher requirement on discriminative capability. Through collaborative dual-branch extraction and attention enhancement, DSFA-CNN can better distinguish urban targets with similar reflectance characteristics but different spatial organizations. As shown in Figure 12, the model yields more complete road networks, more continuous building regions, and less local noise in vegetation areas.

On the Houston2013 dataset, Macro-F1 provides a complementary view of class-balanced recognition in complex urban scenes. DSFA-CNN obtains a Macro-F1 score of 97.93 ± 0.45%, outperforming SVM, KNN, ResNet-50, and 3D-CNN, and remaining competitive with recent deep learning baselines, although it is not the highest value in Table 9. Together with the best OA, AA, and Kappa values, the high recall values for classes such as Water, Synthetic_grass, Parking_Lot_2, Tennis_Court, and Running_Track further support its effectiveness on relatively small or structurally distinctive classes.

Considering the results on all four public datasets, DSFA-CNN consistently achieves the best OA, AA, and Kappa values, while also showing competitive or the best Macro-F1 performance depending on the dataset. The OA values reach 95.62 ± 0.13%, 99.25 ± 0.13%, 99.89 ± 0.11%, and 97.62 ± 0.23%, respectively; the AA values are 94.65 ± 0.18%, 98.71 ± 0.21%, 99.87 ± 0.12%, and 97.65 ± 0.18%; and the Kappa coefficients are 95.01 ± 0.10%, 99.01 ± 0.11%, 99.83 ± 0.11%, and 97.01 ± 0.30%. Compared with traditional machine learning methods such as SVM and KNN, DSFA-CNN exhibits clear accuracy advantages on all four datasets, suggesting that conventional classifiers have difficulty simultaneously capturing complex spectral variation and spatial structures in HSIs. Compared with deep learning baselines such as 3D-CNN, HybridSN, ResNet-50, GSCViT, and HSIRMamba, DSFA-CNN also maintains higher or more stable classification accuracy, demonstrating stronger overall ability in joint spatial–spectral modeling.

In particular, DSFA-CNN shows stable advantages over recent advanced baselines such as GSCViT and HSIRMamba. The OA values of GSCViT on IP, PU, SA, and Houston2013 are 95.18 ± 0.11%, 99.13 ± 0.14%, 98.84 ± 0.28%, and 96.43 ± 0.58%, respectively; those of HSIRMamba are 94.93 ± 1.45%, 98.87 ± 0.32%, 99.17 ± 0.52%, and 95.29 ± 1.05%; and those of DSFA-CNN are 95.62 ± 0.13%, 99.25 ± 0.13%, 99.89 ± 0.11%, and 97.62 ± 0.23%. The improvement is especially evident on the two more challenging datasets, IP and Houston2013. Even when compared with more recent architectures, DSFA-CNN therefore retains stronger discriminative capability and stability under repeated random splits. This holds in the presence of severe mixed-pixel interference, high spectral similarity among classes, and complicated urban backgrounds, which further supports the effectiveness and competitiveness of the proposed method.

Overall, the added Macro-F1 results complement OA, AA, and Kappa by providing a more class-balanced evaluation under imbalanced class distributions. DSFA-CNN achieves the highest Macro-F1 on SA and competitive Macro-F1 scores on IP, PU, and Houston2013. When interpreted together with the class-wise recall values in Table 6, Table 7, Table 8 and Table 9, these results support the model’s ability to recognize several minority or difficult classes while also showing that class imbalance remains a challenging factor. Therefore, the proposed method is not claimed to completely solve the class-imbalance problem.

3.4. Parameter Sensitivity Analysis

The number of retained PCA components and the spatial patch size are two key hyperparameters in DSFA-CNN. The former determines how much spectral information is preserved after dimensionality reduction, whereas the latter controls the amount of local spatial context used for center-pixel classification. Since the main experiments adopt 30 PCA components and a 13 × 13 input patch, we further conduct sensitivity analysis to justify these settings. For the PCA experiment, the number of retained components is varied within {10, 20, 30, 40, 50}, while the patch size is fixed at 13 × 13. For the patch-size experiment, the patch size is varied within {9 × 9, 11 × 11, 13 × 13, 15 × 15}, while the number of PCA components is fixed at 30. Other experimental settings remain the same as those used in the comparative experiments.
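The one-factor-at-a-time design described above can be enumerated with a short sketch. The dictionary keys and function name are hypothetical; only the value sets and the default configuration come from the text.

```python
DEFAULT = {"pca_components": 30, "patch_size": 13}

SWEEPS = {
    "pca_components": [10, 20, 30, 40, 50],
    "patch_size": [9, 11, 13, 15],   # side length of the square input patch
}


def one_at_a_time(default, sweeps):
    """Vary one hyperparameter at a time while holding the others at default."""
    configs = []
    for name, values in sweeps.items():
        for value in values:
            config = dict(default)   # copy so the default stays untouched
            config[name] = value
            configs.append((name, config))
    return configs


configs = one_at_a_time(DEFAULT, SWEEPS)
for swept, config in configs:
    print(f"sweep={swept:15s} config={config}")
```

This yields nine settings, with the default configuration appearing once in each sweep, rather than the twenty settings of a full 5 × 4 grid.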

3.4.1. Sensitivity to the Number of PCA Components

Figure 13 presents the sensitivity results with respect to the number of PCA components. As shown in Figure 13a, increasing the number of components from 10 to 30 consistently improves OA on the four datasets. The improvement is especially evident on IP, indicating that too few principal components may discard useful spectral information in scenes with severe mixed pixels and high inter-class spectral similarity. When the number of components further increases from 30 to 40 or 50, the OA gain becomes marginal on most datasets, suggesting that the first 30 components already retain most discriminative spectral information required by DSFA-CNN.

Meanwhile, Figure 13b shows that the total inference time increases as more components are retained. The mean OA–time trade-off in Figure 13c further confirms this tendency: 30 components achieve a mean OA of 96.84% with a mean inference time of 2.09 s, while 50 components only improve the mean OA to 97.17% but increase the inference time to 2.70 s. Therefore, retaining 30 PCA components provides a favorable balance between accuracy and efficiency.

3.4.2. Sensitivity to the Spatial Patch Size

Figure 14 shows the sensitivity results with respect to the spatial patch size. As shown in Figure 14a, increasing the patch size from 9 × 9 to 13 × 13 improves the OA on all datasets, demonstrating the importance of local spatial context for hyperspectral image classification.

However, further increasing the patch size to 15 × 15 does not yield consistent improvement and even causes a slight accuracy decrease on IP and Houston2013. This is because overly large patches may introduce heterogeneous neighboring pixels, background interference, or samples from adjacent classes, especially in mixed agricultural scenes and complex urban environments. In terms of inference time, Figure 14b shows that larger patches generally require higher computational cost. According to the mean OA–time trade-off in Figure 14c, the 13 × 13 patch achieves the highest mean OA of 98.10% with a mean inference time of 4.06 s, whereas the 15 × 15 patch increases the inference time to 4.32 s but reduces the mean OA to 97.71%. Therefore, 13 × 13 is selected as the default input patch size.

Overall, the sensitivity analysis demonstrates that the adopted configuration, namely 30 PCA components and a 13 × 13 spatial patch, achieves a robust accuracy–efficiency trade-off across the four benchmark datasets. These results support the parameter settings used in the main experiments and further indicate that the selected configuration is computationally practical.
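The trade-offs quoted in Sections 3.4.1 and 3.4.2 can be made explicit with a small calculation. The helper name is hypothetical; the numeric inputs are the mean OA and mean inference time values stated in the text.

```python
def marginal_tradeoff(base, alternative):
    """Return (OA change in percentage points, relative time change in %).

    Each argument is a (mean OA in %, mean inference time in s) pair.
    """
    oa_change = alternative[0] - base[0]
    time_change = 100.0 * (alternative[1] - base[1]) / base[1]
    return oa_change, time_change


# Values quoted in the text for the PCA and patch-size experiments.
pca_30, pca_50 = (96.84, 2.09), (97.17, 2.70)
patch_13, patch_15 = (98.10, 4.06), (97.71, 4.32)

pca_oa, pca_time = marginal_tradeoff(pca_30, pca_50)
patch_oa, patch_time = marginal_tradeoff(patch_13, patch_15)

print(f"30 -> 50 components: {pca_oa:+.2f} pp OA, {pca_time:+.1f}% time")
print(f"13x13 -> 15x15 patch: {patch_oa:+.2f} pp OA, {patch_time:+.1f}% time")
```

Moving from 30 to 50 components buys about 0.33 percentage points of mean OA for roughly 29% more inference time, while moving from 13 × 13 to 15 × 15 costs about 6% more time and lowers mean OA by 0.39 percentage points.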

3.5. Ablation Study

To evaluate the contribution of each component in DSFA-CNN, we conduct ablation experiments on the four datasets. Two groups of experiments are considered. In the first group, the main modules of DSFA-CNN are removed to verify their individual contributions. In the second group, the proposed depthwise separable fusion module is replaced with conventional fusion strategies to further examine whether DSF provides a better accuracy–efficiency trade-off.

3.5.1. Component-Removal Ablation

DSFA-CNN mainly consists of the dual-branch feature extraction structure, the CBAM attention module, and the depthwise separable fusion module (DSF). Therefore, four variants are constructed by removing the 3D branch, the 1D+2D branch, CBAM, and DSF, respectively. The results are shown in Table 10.

As shown in Table 10, removing any component generally leads to a performance drop on different datasets, indicating that every part of DSFA-CNN contributes positively to the final classification results. The largest degradation occurs when either branch is removed. On IP, the OA decreases from 95.62% to 91.99% and 91.59% after removing the 3D branch and the 1D+2D branch, respectively. On Houston2013, the OA decreases from 97.62% to 96.32% and 95.93%. These results confirm that the two branches are complementary: the 3D branch preserves coupled spatial–spectral information, whereas the 1D+2D branch provides an explicit description of spectral dependencies and spatial textures. Their combination is therefore more effective for scenes with mixed pixels, high inter-class similarity, or complex local structure.

CBAM and DSF further improve the model, although their gains are smaller than those of the dual-branch design. Removing CBAM reduces the OA on IP from 95.62% to 93.09% and slightly reduces the OA on PU from 99.25% to 99.16%, showing that attention-based refinement is helpful for suppressing irrelevant responses and highlighting informative regions. A similar trend is observed for DSF. Although the improvement is modest, the full model still performs best, indicating that effective fusion of heterogeneous branch features remains necessary. These results suggest that DSF mainly contributes by improving fusion selectivity and reducing feature interference rather than by making the whole network the smallest in parameter count.

3.5.2. Fusion Strategy Replacement Ablation

To further evaluate the effectiveness of the proposed DSF, we replace it with three alternative fusion strategies while maintaining the same dual-branch structure, CBAM module, input size, and training settings. The compared variants include w/o DSF, where the DSF module is removed; Concat + FC, where the two branch features are concatenated and then mapped by a fully connected layer; Concat + 1 × 1 Conv, where concatenated features are fused by a pointwise convolution; and the proposed DSF.

Trainable parameters (Params), floating-point operations (FLOPs), and AA are used to evaluate fusion-stage compactness, computational cost, and classification effectiveness, respectively. Params reflect the learnable parameter burden of different fusion strategies and are used to assess parameter redundancy, while FLOPs indicate whether the accuracy gain is obtained with additional computation. AA measures the mean class-wise recognition ability, which is important for imbalanced HSI datasets. Therefore, achieving higher or comparable AA with fewer Params and FLOPs suggests that the fusion strategy can reduce redundancy while preserving effective spatial–spectral representations. The results are reported in Table 11.

As shown in Table 11, the proposed DSF achieves the best or comparable AA with consistently lower Params and FLOPs among the compared fusion strategies, indicating a more favorable balance between fusion effectiveness and computational cost. On IP, DSF achieves an AA of 94.65%, outperforming w/o DSF, Concat + FC, and Concat + 1 × 1 Conv by 0.92, 2.01, and 0.86 percentage points, respectively. On SA and Houston2013, DSF also achieves the highest AA, reaching 99.87% and 97.65%, respectively. On PU, DSF obtains an AA of 98.71%, which is very close to the best alternative result of 98.73%, but with fewer parameters and lower FLOPs.

These results show that DSF improves the fusion process without increasing fusion-stage complexity. Compared with Concat + FC, DSF avoids directly projecting high-dimensional concatenated features, thereby reducing parameter redundancy. Compared with Concat + 1 × 1 Conv, DSF further combines channel-wise adaptive weighting and spatial refinement, which helps maintain the complementarity between the coupled 3D spatial–spectral representation and the decomposed 1D+2D representation. Overall, DSF provides a parameter-efficient fusion design that offers a more favorable balance between classification accuracy, parameter burden, and computational cost than simple concatenation-based fusion strategies.
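For intuition about why depthwise separable operations are parameter-efficient, the sketch below compares the parameter count of a standard 2D convolution with that of a generic depthwise separable convolution of the same kernel size. It does not reproduce the exact layer layout of the DSF module or the values in Table 11; the channel sizes and function names are hypothetical, and bias terms are assumed in every layer.

```python
def standard_conv_params(c_in, c_out, k):
    """Parameters of a standard k x k 2D convolution with bias."""
    return c_in * c_out * k * k + c_out


def depthwise_separable_params(c_in, c_out, k):
    """Parameters of a depthwise k x k conv followed by a pointwise 1 x 1 conv."""
    depthwise = c_in * k * k + c_in    # one k x k filter per input channel, plus bias
    pointwise = c_in * c_out + c_out   # 1 x 1 channel mixing, plus bias
    return depthwise + pointwise


# Hypothetical sizes: 128 concatenated channels fused down to 64 with 3 x 3 kernels.
c_in, c_out, k = 128, 64, 3
standard = standard_conv_params(c_in, c_out, k)
separable = depthwise_separable_params(c_in, c_out, k)

print(f"standard conv:        {standard} parameters")
print(f"depthwise separable:  {separable} parameters")
print(f"reduction factor:     {standard / separable:.2f}x")
```

Under these assumed sizes, factorizing spatial filtering and channel mixing reduces the parameter count by a factor of roughly 7.7.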

3.6. Interpretability Analysis

To further reveal the internal discriminative mechanism of DSFA-CNN, interpretability analysis is conducted on the IP dataset from four aspects: spectral attention, spatial attention, class-level attention distribution, and feature-space distribution, as shown in Figure 15, Figure 16, Figure 17 and Figure 18. Unlike Section 3.5, which verifies whether each module is effective from a quantitative perspective, this section explains why the modules are effective. The IP dataset mainly consists of agricultural scenes, has a spatial resolution of about 20 m, exhibits evident mixed-pixel phenomena, and involves high spectral similarity among crop categories. Therefore, it is well suited for analyzing how the model focuses on PCA-retained spectral components, critical spatial regions, and class-discriminative features. Because DSFA-CNN adopts dual-branch collaborative extraction, embeds CBAM in the 1D+2D branch to strengthen both spectral and spatial responses, and uses DSF at the end to enhance feature consistency, the following analysis is centered around these design components.

3.6.1. Spectral Attention Analysis

Figure 15 shows the spectral attention responses learned by DSFA-CNN on the IP dataset. The horizontal axis denotes the PCA-retained spectral components, and the vertical axis denotes the corresponding attention response intensity. The non-uniform distribution indicates that the model does not treat all spectral components equally. Instead, components with higher responses are regarded as more informative for classification, whereas components with lower responses are relatively less discriminative or redundant. Therefore, this figure reflects the spectral selectivity introduced by the CBAM-based channel attention mechanism and helps explain why the attention module improves the classification performance in the ablation study.

3.6.2. Spatial Attention Analysis

Figure 16 visualizes the spatial attention when DSFA-CNN identifies the Soybean-mintill class. Figure 16a shows the label map of this class, whereas Figure 16b presents the corresponding spatial attention heatmap. The visualization indicates that within an input patch, different pixels contribute unequally to the final decision: the model assigns larger weights to the regions relevant to the target class and lower responses to background regions and interfering pixels.

This result is consistent with the scene characteristics of the IP dataset. Because the spatial resolution is relatively low, mixed pixels frequently occur near field boundaries, and the contributions of the central target and the surrounding neighborhood are therefore different for classification. Figure 16b shows that DSFA-CNN focuses more strongly on the key spatial regions related to the target class, which helps suppress background interference and enhance target saliency. This observation also agrees with the accuracy drop caused by removing CBAM in Section 3.5, indicating that spatial attention is beneficial for recognizing complex agricultural scenes.

3.6.3. Class-Level Attention Distribution Analysis

Figure 17 shows that the learned attention distributions over PCA-retained spectral components vary across the 16 land-cover categories, indicating that DSFA-CNN does not rely on a single, fixed discrimination template. This observation helps connect the interpretability analysis to the dual-branch ablation results. If the evidence required for all classes were essentially identical, a single-path representation would be more likely to suffice. Instead, the model exhibits category-dependent response patterns, suggesting that different classes depend on different combinations of spectral contrast, local texture, boundary structure, and neighborhood context. This behavior is consistent with the rationale of the proposed architecture: the 3D branch preserves coupled spatial–spectral structure, whereas the 1D+2D branch captures decomposed but efficient discriminative cues. Their collaboration enables the network to adapt its attention strategy to the characteristics of each category, which also explains why removing either branch causes a noticeable loss in accuracy.

3.6.4. Feature-Space Distribution Analysis

Figure 18 presents the t-SNE projection of the features learned by DSFA-CNN on the IP dataset. Most samples from the same class form compact clusters, while different classes remain relatively well separated. This result suggests that the proposed network maps the original HSI data into a more discriminative feature space. For the IP scene, where crop categories are spectrally similar and often affected by soil background and mixed pixels, such a distribution indicates improved intra-class compactness and inter-class separability after dual-branch feature extraction, attention enhancement, and feature fusion.

Overall, Figure 15, Figure 16, Figure 17 and Figure 18 provide qualitative evidence that the decision process of DSFA-CNN on the IP dataset is reasonably interpretable, as reflected by selective spectral emphasis, localized spatial focus, class-dependent attention patterns, and a feature space with improved intra-class compactness and inter-class separability. Combined with the ablation results in Section 3.5, these observations suggest that the performance gain of DSFA-CNN arises not only from the network design itself, but also from its effective modeling of informative spectral cues, local spatial structure, and class-discriminative representations in HSIs.

3.7. Time–Cost and Complexity Analysis

In addition to classification accuracy, practical hyperspectral image classification requires controlled model complexity and efficient inference. Therefore, training time alone is insufficient to fully characterize computational efficiency. To provide a more comprehensive evaluation of computational efficiency, we compare the number of trainable parameters, FLOPs, training time, and inference time of different deep learning models on the Houston2013 dataset. Houston2013 is used for this analysis because it contains complex urban structures and fine-grained land-cover categories, making it a representative scenario for assessing both recognition performance and deployment cost. All models are evaluated under the same preprocessing protocol, input patch size, batch size, hardware/software environment, and train/test split as described in the experimental setting. FLOPs are computed for a single forward pass with the PCA-reduced input patch. Inference time denotes the total elapsed prediction time on the testing set with gradient computation disabled.

As shown in Table 12, DSFA-CNN achieves the best classification performance on Houston2013, with an OA of 97.62%, an AA of 97.65%, and a Kappa coefficient of 97.01%. It requires 1.522M parameters, 1.094G FLOPs, 214.34 s of training time, and 0.8833 s of total inference time, corresponding to 0.0599 ms per testing pixel. Although DSFA-CNN does not attain the minimum value in every complexity or time metric, it provides a favorable overall trade-off among classification accuracy, computational cost, inference latency, and model size. Compared with ResNet-50, DSFA-CNN substantially reduces parameters, FLOPs, and inference time while improving OA by 2.14 percentage points. Compared with HSIRMamba, it requires far fewer FLOPs and a shorter inference time while achieving higher accuracy. Compared with GSCViT and 3D-CNN, DSFA-CNN has a moderate model size but obtains a clearer gain in classification performance, indicating that its computational cost is effectively converted into discriminative capability, even though it is not the minimum-complexity model. The normalized latency further shows that DSFA-CNN provides efficient pixel-level prediction while maintaining the highest OA, AA, and Kappa among the compared deep learning models.
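The per-pixel latency quoted above follows from dividing the total inference time by the number of testing pixels. The sketch below runs that arithmetic in reverse. The testing-set size is not stated in this section, so the value it recovers is an inference from two rounded figures rather than a reported number, and the function and variable names are illustrative.

```python
# Figures quoted in the text for DSFA-CNN on Houston2013.
total_inference_s = 0.8833      # total prediction time on the testing set (s)
latency_ms_per_pixel = 0.0599   # reported normalized latency (ms per pixel)
labeled_samples = 16_372        # Houston2013 total labeled samples (Table 5)


def implied_test_pixels(total_s: float, ms_per_pixel: float) -> float:
    """Number of pixels implied by total time divided by per-pixel latency."""
    return total_s / (ms_per_pixel / 1000.0)


n_test = implied_test_pixels(total_inference_s, latency_ms_per_pixel)
test_fraction = n_test / labeled_samples

print(f"implied testing pixels: {n_test:.0f}")
print(f"implied share of labeled samples: {test_fraction:.1%}")
```

Because both inputs are rounded, the recovered count is approximate; it corresponds to roughly nine tenths of the labeled Houston2013 samples.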

Figure 19 further visualizes the accuracy–efficiency relationship among the compared models. In this profile, the x-axis represents FLOPs, the y-axis represents OA, the bubble size denotes the number of trainable parameters, and the color indicates inference time. A desirable deployment-oriented model should be located closer to the upper-left region, indicating higher accuracy and lower computational cost, while having a relatively small and light-colored bubble. ResNet-50 has the largest bubble and the darkest color, reflecting its high parameter size and long inference time. HSIRMamba also shows relatively high computational cost due to its larger FLOPs and longer inference time. By comparison, DSFA-CNN is located in a favorable region with the highest OA, moderate FLOPs, and relatively short inference time, indicating a practical accuracy–efficiency trade-off rather than optimization for a single complexity metric.

The favorable accuracy–efficiency trade-off of DSFA-CNN is closely related to its architectural design. The 3D branch preserves local spatial–spectral coupling, which is important for distinguishing complex urban land-cover categories, whereas the 1D+2D branch extracts complementary spectral and spatial cues with lower computational burden. Moreover, the depthwise separable fusion (DSF) module avoids direct high-dimensional concatenation and reduces redundant feature interactions during cross-branch integration. As a result, DSFA-CNN improves classification accuracy without incurring the excessive FLOPs and inference latency observed in several recent competitive baselines. Overall, DSFA-CNN should be regarded as computationally practical for hyperspectral image classification, with a favorable accuracy–efficiency trade-off rather than a design optimized solely for parameter count or inference speed.
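The parameter saving behind depthwise separable operations can be illustrated with a generic count. This is a minimal sketch comparing a plain 1D convolution with its depthwise-plus-pointwise factorization. It is not the authors' DSF module, whose layout is listed in Table 1, and the channel and kernel sizes below are arbitrary illustrative values.

```python
def standard_conv_params(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Parameters of a standard 1D convolution: every output channel
    has a k-tap filter over every input channel."""
    return c_in * c_out * k + (c_out if bias else 0)


def depthwise_separable_params(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Parameters of a depthwise convolution (one k-tap filter per input
    channel) followed by a 1x1 pointwise convolution that mixes channels."""
    depthwise = c_in * k + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Illustrative sizes only (not taken from the paper).
c_in, c_out, k = 64, 64, 3
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
print(f"standard: {std}, separable: {sep}, ratio: {sep / std:.3f}")
```

Ignoring biases, the ratio of the two counts is 1/c_out + 1/k, so the saving grows with the kernel size and the number of output channels.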

4. Conclusions

This work presents DSFA-CNN, a dual-branch hyperspectral image classification framework designed to preserve native spatial–spectral coupling while learning complementary discriminative cues. The proposed architecture combines a 3D convolution branch for coupled representation learning, a 1D+2D branch for decomposed spectral and spatial modeling, CBAM-based attention enhancement, and a depthwise separable fusion module for selective integration of heterogeneous features. In this way, DSFA-CNN addresses a central difficulty in HSI classification: improving representation quality without incurring unnecessary redundancy or excessive computational burden.

Across the Indian Pines, University of Pavia, Salinas, and Houston2013 datasets, DSFA-CNN delivers consistently competitive performance and achieves the strongest results on the most challenging scenes, where mixed pixels, spectral similarity, and complex backgrounds place greater demands on feature modeling. The ablation results further show that the gain of the full model is structural rather than incidental: the two branches contribute complementary information, CBAM improves the selection of informative spectral and spatial responses, and the fusion module strengthens cross-branch consistency. Together with the interpretability and computational cost analyses, these findings show that the proposed network improves classification accuracy, representation selectivity, and feature organization while maintaining moderate parameter size, FLOPs, and inference time. It should be noted that DSFA-CNN is not positioned as a minimum-complexity network across all metrics. Rather, it aims to improve discriminative representation while keeping the computational overhead controlled, thereby providing a favorable balance among classification accuracy, interpretability, and computational practicality.

Several limitations should be noted. First, DSFA-CNN adopts PCA as a preprocessing step, which is efficient for reducing spectral dimensionality but may discard some nonlinear spectral information. Second, although Macro-F1 and class-wise recall indicate that the proposed method maintains competitive performance on imbalanced datasets, extremely small classes, such as Alfalfa in the IP dataset, remain challenging. In addition, although the experiments use repeated random splits on four benchmark datasets, further validation under cross-scene or cross-sensor transfer settings is still needed to assess the generalization ability of DSFA-CNN in more diverse deployment scenarios. Future work will explore adaptive spectral dimensionality reduction and class-imbalance-aware learning strategies to further improve the robustness of hyperspectral image classification.

Taken together, the results establish DSFA-CNN as an effective and well-balanced solution for HSI classification across both agricultural and urban scenes: it achieves strong classification performance while maintaining moderate parameter size, FLOPs, and inference time. The DSF module further contributes to this balance by reducing fusion-stage parameter redundancy compared with standard concatenation-based fusion. More broadly, this study shows that robust hyperspectral recognition benefits from parallel modeling of coupled and decomposed information, followed by efficient and selective fusion.

Author Contributions

Conceptualization, T.L. and Y.C.; methodology, T.L.; software, T.L.; validation, T.L., X.G., S.Z. and L.Y.; formal analysis, T.L.; investigation, T.L.; resources, Y.C.; data curation, T.L.; writing—original draft preparation, T.L.; writing—review and editing, Y.C., X.G., S.Z. and L.Y.; visualization, T.L.; supervision, Y.C.; project administration, Y.C.; funding acquisition, Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Fundamental Research Funds for the Central Universities, grant numbers YJSJ26006 and XJSJ24004, and by the Postgraduate Innovation Fund of Xidian University.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available benchmark hyperspectral datasets, including Indian Pines, University of Pavia, Salinas, and Houston2013. Further information regarding the processed data and experimental code is available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to thank the providers of the public benchmark hyperspectral datasets used in this work.

Conflicts of Interest

The authors declare no conflicts of interest.


Figure 1. Overall workflow of the proposed dual-branch convolutional neural network with depthwise separable fusion (DSFA-CNN). The principal component analysis (PCA)-reduced hyperspectral patch is processed by a 3D convolutional branch for coupled spatial–spectral feature extraction and a convolutional block attention module (CBAM)-enhanced 1D+2D convolutional branch for decomposed spectral and spatial modeling. The two branch features are fused by the depthwise separable fusion module for final classification.

Figure 2. Workflow of PCA.

Figure 3. Structure of the CBAM attention mechanism.

Figure 4. Structure of the depthwise separable fusion module.

Figure 5. Indian Pines dataset: (a) false-color image; (b) ground-truth map.

Figure 6. University of Pavia dataset: (a) false-color image; (b) ground-truth map.

Figure 7. Salinas dataset: (a) false-color image; (b) ground-truth map.

Figure 8. Houston2013 dataset: (a) false-color image; (b) ground-truth map.

Figure 9. Classification maps for the IP dataset: (a) grayscale image; (b) ground truth; (c) SVM; (d) KNN; (e) 3D-CNN; (f) HybridSN; (g) ResNet-50; (h) GSCViT; (i) HSIRMamba; (j) DSFA-CNN.

Figure 10. Classification maps for the PU dataset: (a) grayscale image; (b) ground truth; (c) SVM; (d) KNN; (e) 3D-CNN; (f) HybridSN; (g) ResNet-50; (h) GSCViT; (i) HSIRMamba; (j) DSFA-CNN.

Figure 11. Classification maps for the SA dataset: (a) grayscale image; (b) ground truth; (c) SVM; (d) KNN; (e) 3D-CNN; (f) HybridSN; (g) ResNet-50; (h) GSCViT; (i) HSIRMamba; (j) DSFA-CNN.

Figure 12. Classification maps for the Houston2013 dataset: (a) grayscale image; (b) ground truth; (c) SVM; (d) KNN; (e) 3D-CNN; (f) HybridSN; (g) ResNet-50; (h) GSCViT; (i) HSIRMamba; (j) DSFA-CNN.

Figure 13. Sensitivity analysis with respect to the number of PCA components: (a) overall accuracy on four datasets; (b) inference time on four datasets; (c) mean OA–inference time trade-off over the four datasets. The red star indicates the selected setting.

Figure 14. Sensitivity analysis with respect to the input patch size: (a) overall accuracy on four datasets; (b) inference time on four datasets; (c) mean OA–inference time trade-off over the four datasets. The red star indicates the selected setting.

Figure 15. Spectral attention response distribution of DSFA-CNN on the IP dataset. The figure reflects the importance assigned by the model to different PCA-retained spectral components.

Figure 16. Spatial attention visualization for the Soybean-mintill class in the IP dataset.

Figure 17. Class-level attention distributions of DSFA-CNN over PCA-retained spectral components for the 16 land-cover categories in the IP dataset.

Figure 18. t-SNE visualization of DSFA-CNN features on the IP dataset.

Figure 19. Accuracy–efficiency profile of different deep learning models on the Houston2013 dataset. The x-axis denotes FLOPs, the y-axis denotes OA, the bubble size denotes the number of trainable parameters, and the color denotes inference time.

Table 1. Network architecture of the proposed dual-branch convolutional neural network with depthwise separable fusion (DSFA-CNN), including the convolutional block attention module (CBAM)-enhanced branch.

Branch	Layer	Kernel Size	Stride	Padding	Output Shape
CBAM-enhanced 1D+2D branch	SpectralAttention	—	—	—	[30,13,13]
	Squeeze	—	—	—	[1,30]
	Conv1d	3	2	1	[16,15]
	ReLU	—	—	—	[16,15]
	Conv1d	3	2	1	[32,8]
	ReLU	—	—	—	[32,8]
	Conv1d	3	2	1	[64,4]
	ReLU	—	—	—	[64,4]
	Conv1d	3	2	1	[128,2]
	ReLU	—	—	—	[128,2]
	Conv1d	Spectral kernel	1	0	[C,1]
	SpatialAttention	—	—	—	[30,13,13]
	Conv2d	(3,3)	(1,1)	0	[16,11,11]
	ReLU	—	—	—	[16,11,11]
	Conv2d	(3,3)	(1,1)	0	[32,9,9]
	ReLU	—	—	—	[32,9,9]
	Conv2d	(3,3)	(1,1)	0	[64,7,7]
	ReLU	—	—	—	[64,7,7]
	Conv2d	(3,3)	(1,1)	0	[128,5,5]
	ReLU	—	—	—	[128,5,5]
	Conv2d	Spatial kernel	(1,1)	0	[C,1,1]
Coupled 3D branch	Conv3d	(7,3,3)	(1,2,2)	(0,1,1)	[8,24,7,7]
	ReLU	—	—	—	[8,24,7,7]
	Conv3d	(5,3,3)	(1,2,2)	(0,1,1)	[16,20,4,4]
	ReLU	—	—	—	[16,20,4,4]
	Conv3d	(3,3,3)	(1,2,2)	(0,1,1)	[32,18,2,2]
	ReLU	—	—	—	[32,18,2,2]
	Flatten	—	—	—	[2304]
	Linear	—	—	—	[512]
	Linear	—	—	—	[256]
	Linear	—	—	—	[C]
Depthwise separable fusion module (DSF)	Linear	—	—	—	[C]
	Linear	—	—	—	[C]
	Conv1d	1	0	0	[1,C]
	Linear	—	—	—	[C]
	Total Param (example with C = 7)	—	—	—	1,491,887
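The output shapes in Table 1 can be cross-checked with the standard convolution size formula, floor((n + 2p − k)/s) + 1 along each axis. The sketch below assumes a 30-component, 13 × 13 input patch, as implied by the first rows of the table, and no dilation. The helper name `conv_out` is illustrative.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length along one axis of a convolution without dilation."""
    return (size + 2 * padding - kernel) // stride + 1


# 1D spectral path: four Conv1d layers with kernel 3, stride 2, padding 1.
length = 30
spectral_lengths = []
for _ in range(4):
    length = conv_out(length, kernel=3, stride=2, padding=1)
    spectral_lengths.append(length)

# 2D spatial path: four Conv2d layers with 3x3 kernels, stride 1, no padding.
side = 13
spatial_sides = []
for _ in range(4):
    side = conv_out(side, kernel=3)
    spatial_sides.append(side)

# 3D branch: spectral kernels 7, 5, 3 (stride 1, no padding) combined with
# 3x3 spatial kernels (stride 2, padding 1).
depth, side3d = 30, 13
branch3d = []
for spectral_kernel in (7, 5, 3):
    depth = conv_out(depth, kernel=spectral_kernel)
    side3d = conv_out(side3d, kernel=3, stride=2, padding=1)
    branch3d.append((depth, side3d, side3d))

# The last Conv3d has 32 output channels, so the flattened size is:
flattened = 32 * depth * side3d * side3d

print(spectral_lengths)
print(spatial_sides)
print(branch3d)
print(flattened)
```

The computed sizes can be compared row by row with the Output Shape column, including the input size of the first Linear layer after Flatten.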

Table 2. Class categories and numbers of samples in the IP dataset.

Dataset	Category	Class	Samples	Total Samples
IP	C1	Alfalfa	46	10,249
	C2	Corn-notill	1428
	C3	Corn-mintill	830
	C4	Corn	237
	C5	Grass-pasture	483
	C6	Grass-trees	730
	C7	Grass-pasture-mowed	28
	C8	Hay-windrowed	478
	C9	Oats	20
	C10	Soybean-notill	972
	C11	Soybean-mintill	2455
	C12	Soybean-clean	593
	C13	Wheat	205
	C14	Woods	1265
	C15	Buildings-Grass-Trees-Drives	386
	C16	Stone-Steel-Towers	93

Total samples denotes the total number of labeled samples.
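As a quick consistency check, the per-class counts in Table 2 can be summed and compared with the stated total, and the spread between the largest and smallest classes gives a rough sense of the class imbalance in this scene. This is a small sketch using only the numbers in the table; the variable names are illustrative.

```python
# Labeled sample counts for Indian Pines, in class order C1..C16 (Table 2).
ip_counts = dict(zip(
    [f"C{i}" for i in range(1, 17)],
    [46, 1428, 830, 237, 483, 730, 28, 478,
     20, 972, 2455, 593, 205, 1265, 386, 93],
))

total = sum(ip_counts.values())
largest = max(ip_counts, key=ip_counts.get)
smallest = min(ip_counts, key=ip_counts.get)
imbalance_ratio = ip_counts[largest] / ip_counts[smallest]

print(f"total labeled samples: {total}")
print(f"largest class: {largest} ({ip_counts[largest]})")
print(f"smallest class: {smallest} ({ip_counts[smallest]})")
print(f"largest/smallest ratio: {imbalance_ratio:.2f}")
```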

Table 3. Class categories and numbers of samples in the PU dataset.

Dataset	Category	Class	Samples	Total Samples
PU	C1	Asphalt	6631	42,776
	C2	Meadows	18,649
	C3	Gravel	2099
	C4	Trees	3064
	C5	Painted metal sheets	1345
	C6	Bare Soil	5029
	C7	Bitumen	1330
	C8	Self-Blocking Bricks	3682
	C9	Shadows	947

Total samples denotes the total number of labeled samples.

Table 4. Class categories and numbers of samples in the SA dataset.

Dataset	Category	Class	Samples	Total Samples
SA	C1	Brocoli_green_weeds_1	2009	54,129
	C2	Brocoli_green_weeds_2	3726
	C3	Fallow	1976
	C4	Fallow_rough_plow	1394
	C5	Fallow_smooth	2678
	C6	Stubble	3959
	C7	Celery	3579
	C8	Grapes_untrained	11,271
	C9	Soil_vinyard_develop	6203
	C10	Corn_senesced_green_weeds	3278
	C11	Lettuce_romaine_4wk	1068
	C12	Lettuce_romaine_5wk	1927
	C13	Lettuce_romaine_6wk	916
	C14	Lettuce_romaine_7wk	1070
	C15	Vinyard_untrained	7268
	C16	Vinyard_vertical_trellis	1807

Total samples denotes the total number of labeled samples.

Table 5. Class categories and numbers of samples in the Houston2013 dataset.

Dataset	Category	Class	Samples	Total Samples
Houston2013	C1	Healthy_grass	1362	16,372
	C2	Stressed_grass	1366
	C3	Synthetic_grass	760
	C4	Trees	1355
	C5	Soil	1353
	C6	Water	354
	C7	Residential	1382
	C8	Commercial	1355
	C9	Road	1364
	C10	Highway	1337
	C11	Railway	1345
	C12	Parking_Lot_1	1343
	C13	Parking_Lot_2	511
	C14	Tennis_Court	466
	C15	Running_Track	719

Total samples denotes the total number of labeled samples.

Table 6. Classification results of different models on the IP dataset.

Class Names	SVM	KNN	3D-CNN	HybridSN	ResNet-50	GSCViT	HSIRMamba	DSFA-CNN
Alfalfa	84.00	0.00	91.00	74.00	47.00	67.00	95.00	86.00
Corn-notill	77.00	71.00	91.00	96.00	84.00	93.00	97.00	96.00
Corn-mintill	74.00	69.00	86.00	92.00	69.00	97.00	93.00	91.00
Corn	70.00	91.00	88.00	73.00	80.00	92.00	100.00	93.00
Grass-pasture	92.00	84.00	94.00	97.00	93.00	96.00	90.00	97.00
Grass-trees	95.00	81.00	99.00	99.00	96.00	98.00	98.00	100.00
Grass-pasture-mowed	100.00	0.00	96.00	95.00	12.00	80.00	93.00	100.00
Hay-windrowed	96.00	89.00	100.00	98.00	92.00	100.00	94.00	99.00
Oats	100.00	0.00	100.00	95.00	24.00	63.00	93.00	100.00
Soybean-notill	79.00	71.00	90.00	94.00	82.00	92.00	91.00	93.00
Soybean-mintill	86.00	68.00	93.00	94.00	80.00	97.00	94.00	96.00
Soybean-clean	82.00	68.00	85.00	94.00	79.00	90.00	94.00	89.00
Wheat	99.00	86.00	99.00	95.00	95.00	100.00	98.00	99.00
Woods	94.00	87.00	99.00	97.00	94.00	99.00	99.00	99.00
Buildings-Grass-Trees-Drives	79.00	82.00	93.00	89.00	89.00	85.00	93.00	93.00
Stone-Steel-Towers	90.00	100.00	93.00	95.00	74.00	93.00	92.00	96.00
OA (%)	84.98	74.87	92.97 ± 0.35	93.48 ± 0.19	83.98 ± 2.05	95.18 ± 0.11	94.93 ± 1.45	95.62 ± 0.13
AA (%)	87.31	65.44	93.56 ± 1.24	92.56 ± 0.21	74.38 ± 2.76	90.13 ± 0.32	94.63 ± 1.76	94.65 ± 0.18
Kappa (%)	82.90	70.89	91.96 ± 0.42	92.89 ± 0.18	81.56 ± 2.21	94.48 ± 0.12	94.20 ± 0.84	95.01 ± 0.10
Macro-F1 (%)	50.76	24.00	90.08 ± 1.23	93.87 ± 1.34	73.84 ± 3.24	94.35 ± 0.67	94.71 ± 0.78	90.93 ± 0.34
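For readers less familiar with the four summary rows, the sketch below computes OA, AA, Cohen's kappa, and Macro-F1 from a confusion matrix using their standard definitions. The authors' evaluation code is not shown in the paper, so this is an assumption about how such values are typically obtained. The function name and the toy matrix are hypothetical and are not data from the experiments.

```python
def classification_metrics(cm):
    """Return OA, AA, Cohen's kappa, and macro-F1 (as fractions) from a square
    confusion matrix where cm[i][j] counts true class i predicted as class j."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(row) for row in cm]  # true samples per class
    col_sums = [sum(cm[i][j] for i in range(n_classes)) for j in range(n_classes)]
    diag = [cm[i][i] for i in range(n_classes)]

    # Overall accuracy: share of all samples on the diagonal.
    oa = sum(diag) / total

    # Average accuracy: unweighted mean of per-class recall.
    recalls = [diag[i] / row_sums[i] if row_sums[i] else 0.0 for i in range(n_classes)]
    aa = sum(recalls) / n_classes

    # Cohen's kappa: agreement corrected for chance agreement from the marginals.
    pe = sum(row_sums[i] * col_sums[i] for i in range(n_classes)) / (total * total)
    kappa = (oa - pe) / (1 - pe)

    # Macro-F1: unweighted mean of per-class F1 = 2TP / (2TP + FN + FP).
    f1s = []
    for i in range(n_classes):
        denom = row_sums[i] + col_sums[i]
        f1s.append(2 * diag[i] / denom if denom else 0.0)
    macro_f1 = sum(f1s) / n_classes
    return oa, aa, kappa, macro_f1


# Toy 3-class confusion matrix, made up purely for illustration.
toy_cm = [
    [50, 2, 3],
    [4, 40, 6],
    [1, 2, 12],
]
oa, aa, kappa, macro_f1 = classification_metrics(toy_cm)
print(f"OA={oa:.4f} AA={aa:.4f} Kappa={kappa:.4f} Macro-F1={macro_f1:.4f}")
```

The function returns fractions; multiplying by 100 gives the percentage form used in the table. Because AA and Macro-F1 weight every class equally, they are more sensitive than OA to errors on very small classes.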