Double-Attention Context Interactive Network for Hyperspectral Image Classification

Hu, Nannan; Wang, Zhongao; Wang, Minghao; Zhao, Yuefeng

doi:10.3390/rs18071059

¹ School of Communication and Electronic Engineering, Shandong Normal University, Jinan 250358, China

² Shandong Provincial Engineering and Technical Center of Light Manipulations & Shandong Provincial Key Laboratory of Optics and Photonic Device, School of Physics and Electronics, Shandong Normal University, Jinan 250358, China

³ Shandong Key Laboratory of Medical Physics and Image Processing, School of Communication and Electronic Engineering, Shandong Normal University, Jinan 250358, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Remote Sens. 2026, 18(7), 1059; https://doi.org/10.3390/rs18071059

Submission received: 30 January 2026 / Revised: 27 March 2026 / Accepted: 30 March 2026 / Published: 2 April 2026

(This article belongs to the Special Issue Recent Advances in Hyperspectral Remote Sensing: Theories, Technologies and Applications)


Highlights

What are the main findings?

We propose a Double-Attention Context Interactive Network (DACINet) that strengthens long-range contextual interaction and 3D spectral–spatial feature learning for hyperspectral image classification, via a Context Interaction Fusion Module (CIFM) and Channel–Spatial Double-Attention (CSDA).
Extensive experiments on Indian Pines, Pavia University, and Salinas datasets demonstrate that the DACINet achieves superior classification accuracy and robustness compared to state-of-the-art methods.

What are the implications of the main findings?

By explicitly modeling long-range context interaction across distant spectral bands and introducing Channel–Spatial Double-Attention, the framework alleviates the limitations of local receptive fields in conventional CNNs, yielding more discriminative and robust hyperspectral representations under limited training samples and class imbalance.
By further coupling context-interactive feature fusion with a 2D–3D hybrid convolutional design, the proposed CNN-based solution enhances spectral–spatial integration and boundary/detail sensitivity, improving overall classification reliability in practical hyperspectral image classification settings.

Abstract

Convolution remains the dominant approach to hyperspectral image classification because it accounts for both spatial and spectral characteristics. However, convolution relies on local computation and ignores the discriminative value of contextual associations. In this paper, we propose a Double-Attention Context Interactive Network (DACINet) for hyperspectral image classification. Specifically, a Context Interaction Fusion Module (CIFM) is designed to enhance long-range contextual dependencies. By stacking multiple 3D convolutional layers, the module progressively enlarges its receptive field, while cross-layer residual connections facilitate the integration of features from different contextual scales, thereby strengthening the model’s ability to capture complex relationships within the hyperspectral data. Then, a 3D Channel–Spatial Double-Attention (CSDA) mechanism is proposed to enhance the two-dimensional spatial features and the one-dimensional spectral features separately and to fuse the enhanced features. Furthermore, we construct a hybrid convolutional layer that combines 2D and 3D convolution to further enhance spectral bands on the basis of three-dimensional feature extraction. Extensive experiments on the widely used IP, UP, SA and HU datasets show that the proposed DACINet achieves superior classification accuracy, reaching Overall Accuracies of 96.78%, 97.77%, 99.53% and 86.67%, respectively, outperforming other state-of-the-art models.

Keywords:

hyperspectral image classification; context interactive; double-attention; hybrid convolutional layer

1. Introduction

Hyperspectral images (HSIs) are common 3D remote sensing data, which contain both spectral information and spatial information of surface objects. Hyperspectral image classification (HSIC) aims to assign a unique semantic label to each HSI pixel based on the given land cover category [1,2]. It is widely used in the fields of resource investigations, environmental monitoring and agricultural production [3]. In the early days, machine learning methods such as SVMs were commonly adopted to extract effective discriminant features from HSIs, and different classifiers were then used to classify each pixel [4].

Benefiting from advances in deep learning [5,6,7], hyperspectral image processing has achieved remarkable progress. CNN-based methods were the first to be introduced into hyperspectral image classification. Hu et al. [8] proposed a five-layer 1D-CNN, which better extracts the spectral features of hyperspectral images. Fang et al. [9] employed discriminative band selection for pipeline hyperspectral images to better extract spatial features using 2D convolution. Subsequently, many new frameworks achieved remarkable results in hyperspectral classification. He et al. [10] employed a hybrid Mamba-Transformer framework to explore the multiscale properties of hyperspectral data effectively. Liang et al. [11] achieved efficient global dependency modeling using a double-branch Mamba-like linear attention mechanism. Zhang et al. [12] proposed the Center-Scan Mamba network to model spatial–spectral long-range dependency with linear complexity. Although new architectures such as Transformers and Mamba demonstrate great potential in hyperspectral image classification, the CNN and its variants remain the most widely used and mature solution due to their strong feature extraction capability and low computational complexity.

Recently, CNN variants have once again become the research focus in hyperspectral image classification. Ahmad et al. [13] proposed a 3D CNN to leverage spectral–spatial feature maps to enhance HSIC performance. Ghaderizadeh et al. [14] proposed a hybrid 3D–2D CNN, which employs a 3D fast learning block followed by a 2D CNN to extract spectral–spatial features. Alkhatib et al. [15] classified HSIs using a multiscale 3D CNN and three-branch feature fusion. Gündüz et al. [16] introduced a dual attention mechanism into both spectral and spatial modules to enhance the model’s discriminative ability. These variants achieve excellent classification results. However, 1D networks focus on spectral information, 2D networks focus on spatial information, and 3D networks are limited to local perception and ignore contextual correlations. This separation between spectral and spatial processing leads to insufficient global perception, making it difficult for the model to fully exploit the inherent unity of image and spectrum in hyperspectral data. Recent efforts such as the Online Spectral Information Compensation Network (OSICN) [17] address this by extracting multiscale spatial features through a multi-branch network and progressively compensating spectral information. However, the OSICN primarily focuses on spatial scale diversity rather than explicitly modeling contextual interactions across spectral bands.

To alleviate the above problems, we propose a novel Double-Attention Context Interactive Network (DACINet) for modeling the contextual interaction of spectral and spatial features simultaneously. The DACINet mainly consists of a Context Interaction Fusion Module (CIFM), Channel–Spatial Double-Attention (CSDA), and a 2D–3D hybrid convolutional layer. CIFM can capture long-range correlations between spectral bands by incorporating cross-layer residual connections into the 3D convolutional network. Then, the CSDA enhances two-dimensional spatial features and one-dimensional spectral features, respectively, and fuses the enhanced features. In addition, a hybrid convolution layer that combines 2D and 3D convolution is introduced to further enhance spectral bands based on 3D feature extraction. To validate the effectiveness and robustness of our proposed model, we conduct extensive experiments on four challenging benchmark datasets: Indian Pines (IP), Pavia University (UP), Salinas (SA), and the 2013 University of Houston (HU) dataset.

The main contributions of this paper are as follows:

A novel DACINet is proposed to improve HSIC performance by enhancing contextual interaction and leveraging 3D spectral–spatial features.
A CIFM is proposed to capture contextual interaction features, while CSDA is designed to suppress irrelevant information and to enhance spectral bands and spatial information.
A hybrid convolution layer combining 2D and 3D convolutions is proposed to further strengthen the representation of spectral–spatial features.

2. Related Work

In this section, we review two topics most relevant to our work: deep learning-based hyperspectral image classification and attention-based hyperspectral image classification.

2.1. Deep Learning-Based Hyperspectral Image Classification

Convolutional Neural Networks (CNNs) [18], especially 3D CNNs [19,20], have become the main approach for hyperspectral image classification, owing to their robust feature extraction capabilities. Ghaderizadeh et al. [14] extracted spectral–spatial features using 3D convolutions and further optimized them through 2D convolutions. To address the limitation of fixed receptive fields in capturing long-range contextual dependencies, Roy et al. [21] introduced an adaptive receptive field 3D residual module, dynamically adjusting convolution kernels to enhance feature representation. To overcome the Euclidean space constraints of CNNs and enhance the modeling of non-Euclidean geometric relationships, Graph Neural Networks (GNNs) [22] have been introduced into hyperspectral classification. SSGRAM [23] proposed a 3D spectral–spatial feature network combined with a Graph Attention Feature Processor. NESSGGCN [24] constructed a Gated GCN-CNN Non-Euclidean Spectral–Spatial Feature Mining Network to simultaneously extract features from non-Euclidean and Euclidean spaces. Recently, various novel deep learning architectures have been proposed for HSI classification. Yang et al. [25] proposed an Enhanced Multiscale Feature Fusion Network (EMFFN) that extracts multiscale features through a three-stage parallel multi-path architecture. For cross-scene classification, Liu et al. [26] proposed a Dual Classification Head Self-Training Network (DHSNet) that alleviates domain discrepancies through dual-head consistency learning. Wang et al. presented HyperSIGMA [27], the first billion-parameter foundation model for HSIs, which unifies multiple interpretation tasks via a sparse sampling attention mechanism.

Unsupervised clustering is a fundamental task for HSI analysis when labeled samples are unavailable. Zhang et al. [28] proposed Elastic Graph Fusion Subspace Clustering (EGFSC) for large-scale HSI clustering via superpixel-level learning and dual-graph fusion. Huang et al. [29] introduced a structural prior-guided subspace clustering method incorporating local/non-local spatial priors and cluster structure priors. Jiang et al. [30] proposed Structured Anchor Projected Clustering with linear time complexity through anchor generation and dual-graph learning. These unsupervised methods provide powerful tools for mining the intrinsic structures of HSIs. The structural information captured by these methods, such as superpixel-level organization and graph-based relationships, offers valuable insights that can inform the design of feature extraction modules in supervised classification frameworks like our DACINet.

2.2. Attention-Based Hyperspectral Image Classification

Attention mechanisms [31] were initially proposed in the field of natural language processing (NLP). Following the success of the Vision Transformer (ViT) in image classification, Chen et al. [32] pioneered its application to HSI classification, demonstrating its potential for capturing global context. To better harness the complementary strengths of different architectures, hybrid attention designs have since emerged. Arshad et al. [33] introduced HAT to integrate the local representational capabilities of 3D and 2D CNNs within a Transformer framework. Zhao et al. [34] combined group-wise separable convolution with spectral calibration attention, effectively reorganizing and weighting features in the spectral dimension. Similarly, Jing et al. [35] introduced dynamic convolution and spatial–spectral attention mechanisms to dynamically extract and integrate multi-level semantic features. Although the aforementioned methods have made significant progress in the design and application of attention mechanisms, most approaches treat spectral and spatial attention modules as separate units, whether sequential or parallel, thus failing to establish a unified dual-path interactive framework. For instance, the DBDA network [36] applies channel-wise and spatial-wise attention in two independent branches. Although effective, this design inherently decouples spectral and spatial attention, limiting their interactive fusion. In contrast, the CSDA module proposed in this paper operates within a unified 3D feature space, enabling parallel enhancement and interactive fusion of spectral and spatial features. This interactive design distinguishes CSDA from existing decoupled attention mechanisms and forms the core innovation of the proposed DACINet.

3. Proposed Method

In this section, we present the proposed Double-Attention Context Interactive Network (DACINet) in detail.

The diagram of the DACINet is illustrated in Figure 1. A hyperspectral image captures tens to hundreds of continuous spectral bands of a target region simultaneously, with each band containing a large amount of pixel information. In the DACINet, we first apply PCA to project the high-dimensional data into a low-dimensional space, which reduces dimensionality and eliminates redundant information. The reduced-dimensionality features are then input into the CIFM to model contextual interactions between spectral and spatial features simultaneously. Next, CSDA enhances spatial and spectral context interaction features, respectively, to retain more discriminative information. A hybrid convolutional layer combining 2D and 3D convolutions further enhances the discriminative power of spectral features. Finally, the resulting feature maps are classified by a fully connected layer.

3.1. PCA

To address the redundancy of hyperspectral image data, we employ principal component analysis (PCA) to reduce the dimensionality of the hyperspectral data. The mapping process of PCA is as follows:

$$
\begin{bmatrix} y_{1}^{i} \\ y_{2}^{i} \\ \vdots \\ y_{n}^{i} \end{bmatrix} = \begin{bmatrix} u_{1}^{T} \cdot (x_{1}^{i}, x_{2}^{i}, \dots, x_{n}^{i}) \\ u_{2}^{T} \cdot (x_{1}^{i}, x_{2}^{i}, \dots, x_{n}^{i}) \\ \vdots \\ u_{n}^{T} \cdot (x_{1}^{i}, x_{2}^{i}, \dots, x_{n}^{i}) \end{bmatrix}, \tag{1}
$$

where $X^{i} = (x_{1}^{i}, x_{2}^{i}, \dots, x_{n}^{i})^{T} \in \mathbb{R}^{H \times W \times B}$ is the hyperspectral image and $u_{i}^{T}$ is the corresponding eigenvector. The feature after dimensionality reduction is represented as $Y^{i} = (y_{1}^{i}, y_{2}^{i}, \dots, y_{n}^{i})^{T} \in \mathbb{R}^{H \times W \times B'}$. PCA maps high-dimensional data to a low-dimensional space through a linear projection, maximizing the retention of effective discriminative information in the projected dimensions.
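As an illustration of Equation (1), the following minimal NumPy sketch projects a hyperspectral cube onto its leading principal components. The function name `pca_reduce`, the random test cube, and the choice of five components are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def pca_reduce(cube, n_components):
    """Project an (H, W, B) cube onto its top n_components principal axes."""
    h, w, b = cube.shape
    pixels = cube.reshape(-1, b).astype(np.float64)    # one spectrum per row
    pixels -= pixels.mean(axis=0)                       # center every band
    cov = np.cov(pixels, rowvar=False)                  # (B, B) band covariance
    eigvals, eigvecs = np.linalg.eigh(cov)              # eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:n_components]    # largest variance first
    u = eigvecs[:, order]                               # columns play the role of u_k
    return (pixels @ u).reshape(h, w, n_components)     # y_k = u_k^T x per pixel

# Hypothetical data: an 8 x 8 scene with 20 bands reduced to 5 components.
rng = np.random.default_rng(0)
cube = rng.normal(size=(8, 8, 20))
reduced = pca_reduce(cube, n_components=5)
```

The spatial layout is untouched; only the band axis shrinks from B to B′, and the retained components are ordered by decreasing variance.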

3.2. The Context Interaction Fusion Module (CIFM)

In hyperspectral images, homogeneous regions with different semantic and geometric properties but extremely similar appearance pose a significant challenge to the classification process. The long-distance contextual relationship between the image bands can provide effective discriminative constraints for such regions. To capture this contextual information, we design a Context Interaction Fusion Module (CIFM) consisting of stacked 3D convolutional residual blocks with cross-layer connections, as shown in Figure 2. While each individual block follows the standard residual 3D CNN design, the CIFM as a whole serves as a dedicated context encoder strategically positioned before the CSDA module to provide richly contextualized features for subsequent dual attention enhancement.

In the CIFM, long-distance correlations between hyperspectral image bands are modeled through cross-layer residual connections. The input feature $Y \in \mathbb{R}^{H \times W \times B'}$ is processed by 3D convolution, and the resulting feature is:

$$
Y_{j}^{l} = \mathrm{3DCov}(Y) = W_{l+1}^{7 \times 3 \times 3} * Y + b^{l+1}, \tag{2}
$$

$$
Y' = R\left(\sum_{j=1}^{x^{l}} F_{bn}(Y_{j}^{l}) * W_{i}^{l+1} + b_{i}^{l+1}\right), \tag{3}
$$

$$
F_{bn}(Y_{j}^{l}) = \frac{Y_{j}^{l} - E(Y_{j}^{l})}{\sqrt{\mathrm{Var}^{2}(Y_{j}^{l})} + \epsilon} \times \gamma + \beta, \tag{4}
$$

where $Y_{j}^{l}$ is the output feature of the convolution layer, $W_{l+1}^{7 \times 3 \times 3}$ is the learnable weight with kernel size $(7 \times 3 \times 3)$, and $b^{l+1}$ is the bias. $R(\cdot)$ represents the non-linear activation function ReLU. $F_{bn}(Y_{j}^{l})$ denotes batch normalization, $E(Y_{j}^{l})$ and $\mathrm{Var}^{2}(Y_{j}^{l})$ denote the batch mean and variance of the input features, and $\epsilon$, $\gamma$, and $\beta$ represent the stability constant, scaling factor, and offset, respectively. The CIFM uses a double-layer 3D convolution: the features are processed again according to Equations (2)–(4), the output after the two weighted layers is denoted $Y_{j}^{l+1}$, and two additional rounds of weight processing are then performed:

$$
Y_{j}^{l} = W_{l+1}^{1 \times 1 \times 1} * Y' + b^{l+1}, \tag{5}
$$

$$
Y_{j}^{l+1} = \sum_{j=1}^{x^{l}} \left(F_{bn}(Y_{j}^{l}) * W_{i}^{l+1} + b_{i}^{l+1}\right), \tag{6}
$$

where $Y_{j}^{l+1}$ is the output feature after the two consecutive weight layers. It is then combined with the input through a cross-layer residual connection:

$$
Y_{1} = Y + Y_{j}^{l+1}, \tag{7}
$$

where $Y_{1}$ is the output of the first 3D residual block. The final output feature $Y_{f}$ of the CIFM is obtained after passing through $f$ such stacked residual blocks. These cross-layer connections capture information associations over long distances, enhancing the context modeling capabilities of 3D convolution.
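The residual structure of Equations (2)–(7) can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it processes a single sample, so the normalization statistics are taken per channel over spectral–spatial positions rather than over a batch; it places the stability constant inside the square root, as is common in deep learning libraries; and the helper names (`conv3d_same`, `batch_norm`, `residual_block`) and tensor sizes are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3d_same(x, weight, bias):
    """Zero-padded 3D convolution. x: (C, B, H, W); weight: (O, C, kb, kh, kw)."""
    kb, kh, kw = weight.shape[2:]
    pad = [(0, 0), (kb // 2, kb // 2), (kh // 2, kh // 2), (kw // 2, kw // 2)]
    windows = sliding_window_view(np.pad(x, pad), (kb, kh, kw), axis=(1, 2, 3))
    out = np.einsum("cbhwijk,ocijk->obhw", windows, weight)
    return out + bias[:, None, None, None]

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each channel (assumption: statistics over one sample only)."""
    mean = x.mean(axis=(1, 2, 3), keepdims=True)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

def residual_block(y, w1, b1, w2, b2):
    """7x3x3 conv, BN, ReLU, then 1x1x1 conv, BN, then the skip of Eq. (7)."""
    hidden = np.maximum(batch_norm(conv3d_same(y, w1, b1)), 0.0)
    branch = batch_norm(conv3d_same(hidden, w2, b2))
    return y + branch

# Hypothetical sizes: 2 feature channels, 9 bands, 5 x 5 spatial patch.
rng = np.random.default_rng(0)
c = 2
y = rng.normal(size=(c, 9, 5, 5))
w1 = rng.normal(size=(c, c, 7, 3, 3)) * 0.1
w2 = rng.normal(size=(c, c, 1, 1, 1)) * 0.1
out = residual_block(y, w1, np.zeros(c), w2, np.zeros(c))
```

Because the padding keeps the tensor shape unchanged, such blocks can be stacked f times, and the identity path lets the input of every block reach its output unaltered.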

3.3. The Channel–Spatial Double-Attention (CSDA)

Hyperspectral images contain both spatial information and spectral band information. Effective integration of these two modalities can significantly improve classification performance. Considering the 3D spectral–spatial characteristics of hyperspectral data, we propose a Channel–Spatial Double-Attention (CSDA) mechanism based on pooling and feature fusion, as shown in Figure 3.

CSDA consists of two parallel branches: a channel attention mechanism and a spatial attention mechanism. Different from the original CBAM, which processes these two attentions sequentially, our CSDA processes them independently and in parallel.

Given the input feature map $Y_{f}$ from the CIFM, the channel and spatial attention branches are computed as follows:

$$
f_{1}^{max} = F_{max}(Y_{f}), \quad f_{1}^{avg} = F_{avg}(Y_{f}), \tag{8}
$$

$$
f_{2}^{max} = \mathrm{Matr}(\mathrm{Max}(Y_{f})), \quad f_{2}^{avg} = \mathrm{Matr}(\mathrm{Mean}(Y_{f})), \tag{9}
$$

where $f_{1}^{max}$, $f_{1}^{avg}$, $f_{2}^{max}$, and $f_{2}^{avg}$ denote the pooled feature maps from each operation. $F_{max}(\cdot)$ and $F_{avg}(\cdot)$ are the maximum pooling and global average pooling operations. $\mathrm{Matr}$ denotes a matrix transformation operation.

Then, $f_{1}^{max}$ and $f_{1}^{avg}$ are passed through an MLP for feature fusion:

$$
\begin{aligned} Z_{1}^{max} &= W_{1 \times 1}^{max} * f_{1}^{max} + b^{l+1} \\ Z_{1}^{avg} &= W_{1 \times 1}^{avg} * f_{1}^{avg} + b^{l+1} \\ Z &= Z_{1}^{max} + Z_{1}^{avg}. \end{aligned} \tag{10}
$$

The key to explicitly modeling spectral inter-band correlations lies within this shared MLP. Since its layers are fully connected, the MLP learns complex, non-linear interdependencies between all spectral bands. By considering the context provided by all bands simultaneously, it determines the relative importance of each band and produces an attention map Z that encodes these cross-band relationships.

For the spectral (channel) branch, the output features are obtained by element-wise multiplication of the attention map with the input:

$$
F^{Cam} = Z \otimes Y_{f}. \tag{11}
$$

For the spatial branch, $f_{2}^{max}$ and $f_{2}^{avg}$ are concatenated as:

$$
U_{1} = \mathrm{Cat}(f_{2}^{max}, f_{2}^{avg}), \tag{12}
$$

which is then passed through a convolution layer to obtain the weight matrix:

$$
U = W_{(7 \times 7)}^{l+1} * U_{1} + b^{l+1}. \tag{13}
$$

The output features are likewise obtained by element-wise multiplication of the weight matrix with the input:

$$
F^{Sam} = U \otimes Y_{f}. \tag{14}
$$

The outputs of the two branches are added to obtain the final output of CSDA:

$$
F = F^{Cam} + F^{Sam}. \tag{15}
$$

By processing channel and spatial attention in parallel, CSDA ensures that both dimensions are enhanced independently before fusion, preserving the unique characteristics of spectral and spatial information. The dual-branch design effectively captures regional correlations in both domains, assigning higher weights to informative regions while suppressing irrelevant ones, thereby improving the discriminative power of the learned features for HSI classification.
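The parallel two-branch computation of Equations (8)–(15) can be sketched in NumPy as below. This is a minimal illustration under several assumptions that are not stated in the paper: the feature map is treated as a 3D array of shape (C, H, W) with C standing for the spectral axis, both attention maps are squashed with a sigmoid as in CBAM, the shared MLP has one hidden layer with a hypothetical reduction to two units, and biases are omitted.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_branch(y, w1, w2):
    """Eqs. (8), (10), (11): pooled descriptors, shared MLP, channel reweighting."""
    def shared_mlp(f):
        return w2 @ np.maximum(w1 @ f, 0.0)
    f_max = y.max(axis=(1, 2))                      # global max pooling, shape (C,)
    f_avg = y.mean(axis=(1, 2))                     # global average pooling, shape (C,)
    z = sigmoid(shared_mlp(f_max) + shared_mlp(f_avg))
    return z[:, None, None] * y

def spatial_branch(y, kernel):
    """Eqs. (9), (12)-(14): pool over channels, 7 x 7 conv, spatial reweighting."""
    u1 = np.stack([y.max(axis=0), y.mean(axis=0)])  # concatenated maps, (2, H, W)
    k = kernel.shape[-1]
    padded = np.pad(u1, [(0, 0), (k // 2, k // 2), (k // 2, k // 2)])
    windows = sliding_window_view(padded, (k, k), axis=(1, 2))
    u = sigmoid(np.einsum("chwij,cij->hw", windows, kernel))
    return u[None] * y

def csda(y, w1, w2, kernel):
    """Eq. (15): both branches see the same input and their outputs are summed."""
    return channel_branch(y, w1, w2) + spatial_branch(y, kernel)

# Hypothetical sizes: 4 channels, 6 x 6 patch, MLP hidden width 2.
rng = np.random.default_rng(0)
y = rng.normal(size=(4, 6, 6))
w1 = rng.normal(size=(2, 4))
w2 = rng.normal(size=(4, 2))
kernel = rng.normal(size=(2, 7, 7)) * 0.1
attended = csda(y, w1, w2, kernel)
```

In contrast to a sequential arrangement, neither branch here sees the other's output; each reweights the original input, and the results are fused only by the final addition.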

3.4. The Hybrid Convolutional Layer

Multilayer 3D convolution is usually used to process hyperspectral images. In practice, the 1D spectrum and the 2D space are characterized independently. The use of 3D convolution increases the complexity of the model and prolongs the processing time. In the proposed DACINet, we also design a hybrid convolutional layer based on 2D and 3D convolutions. The layer consists of two 2D convolution layers and two 3D convolution layers. The size of the 3D convolutional kernel is $(7, 3, 3)$, and the size of the 2D convolutional kernel is $(3, 3)$. The hybrid layer begins with 3D convolutions because they are effective in capturing joint spectral–spatial features directly from the hyperspectral data, building a solid foundation for later processing. Given the feature map F output by the CSDA module, it is first processed by two consecutive 3D convolutional layers:

$$
F' = \mathrm{3DCov}(\mathrm{3DCov}(F)). \tag{16}
$$

The output F′ is a 4D tensor, with dimensions corresponding to feature channels, spectral bands, height, and width. This 4D tensor is then reshaped into a 3D tensor by merging the channel and spectral dimensions, resulting in a feature map of the shape (channels × bands) × height × width. This reshaped tensor is then fed into two consecutive 2D convolutional layers:

$$
F'' = \mathrm{2DCov}(\mathrm{2DCov}(F')). \tag{17}
$$

Through this reshaping, each channel in the resulting feature map encodes a specific combination of feature type and spectral band, allowing the 2D convolutions to learn cross-band interactions by combining these channels. This lets the model refine spatial patterns while implicitly learning how different spectral bands interact, without the extra cost of additional 3D layers. After feature extraction, the output is passed through batch normalization and ReLU activation, then flattened and fed into a fully connected layer for classification. In this design, 3D convolutions extract features simultaneously in spectral and spatial directions, while 2D convolutions focus more on spatial feature refinement. When combined, they enable multi-level feature learning and improve the model’s generalization ability and classification accuracy.
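The reshaping step between the 3D and 2D convolutions can be illustrated with a short NumPy sketch. The function name `merge_channel_and_band` and the tensor sizes are assumptions chosen for the example.

```python
import numpy as np

def merge_channel_and_band(features):
    """Reshape (channels, bands, H, W) into (channels * bands, H, W)."""
    c, b, h, w = features.shape
    return features.reshape(c * b, h, w)

# Hypothetical 3D-conv output: 2 feature channels, 3 bands, 4 x 4 spatial patch.
features = np.arange(2 * 3 * 4 * 4).reshape(2, 3, 4, 4)
merged = merge_channel_and_band(features)
```

With this row-major layout, merged channel `c * bands + b` holds exactly the slice for feature channel `c` and band `b`, so a 2D convolution that mixes its input channels is mixing (feature, band) pairs.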

3.5. Classification

Finally, a fully connected softmax classifier is used for HSI classification. The class probability of each pixel is obtained through the softmax layer, and then the cross-entropy loss L is calculated:

$$
L = -\frac{1}{M} \sum_{m=1}^{M} \sum_{k=1}^{K} y_{k}^{m} \log(\hat{y}_{k}^{m}), \tag{18}
$$

where M is the number of samples in a mini-batch and K is the total number of categories. $y_{k}^{m}$ and $\hat{y}_{k}^{m}$ represent the actual label and the predicted probability of sample m for class k, respectively. This loss function tends to converge quickly on many kinds of problems.
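For reference, the standard softmax cross-entropy for one-hot labels can be computed as in the following pure-Python sketch. The function names and the toy logits are assumptions for illustration; with one-hot labels the inner sum over classes reduces to the log-probability of the true class.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of class scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(batch_logits, labels):
    """Mean negative log-probability of the true class over M samples."""
    loss = 0.0
    for logits, label in zip(batch_logits, labels):
        loss -= math.log(softmax(logits)[label])
    return loss / len(labels)

# Hypothetical two-class examples.
uncertain = cross_entropy([[0.0, 0.0]], [0])   # equal scores for both classes
confident = cross_entropy([[5.0, 0.0]], [0])   # strong score for the true class
```

An undecided prediction over two classes costs ln 2, and the loss shrinks toward zero as the probability assigned to the true class approaches one.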

4. Experiments

4.1. Datasets

In the experiments, we adopted a stratified random sampling strategy, randomly selecting 5% of labeled samples per class for the IP dataset and 1% for the UP, SA, and Houston2013 datasets as training samples. The remaining samples were used for testing.

Indian Pines (IP): The Indian Pines dataset is one of the earliest test datasets for HSIC, acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) in 1992 over the Indian Pines test site in Indiana, USA. Its spatial dimension is 145 × 145 pixels, the spectral dimension is 200 spectral bands, and it consists of 16 target categories. Table 1 presents the category names and sample counts of the Indian Pines dataset, as well as the corresponding color annotations for each category in the visualization of classification results.

Pavia University (UP): The Pavia University dataset was acquired in 2001 over the University of Pavia campus, Italy, using the Reflective Optics System Imaging Spectrometer (ROSIS). The image comprises 610 × 340 pixels, with 115 spectral bands covering the wavelength range of 0.43–0.86 µm. After the removal of 12 noisy bands, the remaining 103 bands are commonly used for analysis. Table 2 presents the category names and sample counts of the Pavia University dataset, as well as the corresponding color annotations for each category in the visualization of classification results.

Salinas (SA): The Salinas hyperspectral dataset was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) over the Salinas Valley, CA, USA. The image has dimensions of 512 × 217 pixels (totaling 111,104 pixels) and originally comprises 220 spectral bands covering a wavelength range of approximately 0.4–2.5 µm. The scene is annotated into 16 distinct land cover classes. Table 3 presents the category names and sample counts of the Salinas dataset, as well as the corresponding color annotations for each category in the visualization of classification results.

University of Houston 13 (HU): The HU dataset consists of 349 × 1905 pixels with 144 spectral channels ranging from 364 to 1046 nm and a spatial resolution of 2.5 m/pixel. In addition, the ground truth reference was subdivided into spatially disjoint subsets for training and testing, including 15 mutually exclusive urban land cover classes with 15,029 labeled pixels. Table 4 presents the category names and sample counts of the Houston 13 dataset, as well as the corresponding color annotations for each category in the visualization of classification results.
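The stratified random sampling described at the start of this subsection can be sketched in pure Python as follows. The function name `stratified_split`, the fixed seed, the rounding rule, and the guarantee of at least one training sample per class are assumptions for this example; the paper does not specify how very small classes are handled.

```python
import random
from collections import defaultdict

def stratified_split(labels, fraction, seed=0):
    """Draw the same fraction of indices from every class for training."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Assumption: round to the nearest count, but keep at least one sample.
        n_train = max(1, round(fraction * len(indices)))
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return train, test

# Hypothetical label vector: a large class and a small class, 5% for training.
labels = [0] * 100 + [1] * 20
train_idx, test_idx = stratified_split(labels, fraction=0.05)
```

Sampling per class rather than over the whole label vector keeps rare classes represented in the training set, which matters when only 1% to 5% of the labeled pixels are used.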

4.2. Evaluation Metrics

To quantitatively compare the classification performance of different methods and modules from various aspects, we adopt four commonly used evaluation metrics in the following experiments: Overall Accuracy (OA), Average Accuracy (AA), Kappa Coefficient (Kap), and Accuracy of Each Class (AEC).
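These metrics follow directly from a confusion matrix, as the pure-Python sketch below shows using their standard definitions. The function name and the two-class confusion matrix are assumptions for illustration, and every class is assumed to have at least one sample.

```python
def classification_metrics(confusion):
    """confusion[i][j] counts samples of true class i predicted as class j."""
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    correct = [confusion[i][i] for i in range(k)]
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    per_class = [correct[i] / row_totals[i] for i in range(k)]   # AEC
    oa = sum(correct) / n                                        # OA
    aa = sum(per_class) / k                                      # AA
    # Agreement expected by chance from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    kappa = (oa - expected) / (1.0 - expected)                   # Kappa
    return oa, aa, kappa, per_class

# Hypothetical two-class result with 20 test samples.
oa, aa, kappa, per_class = classification_metrics([[8, 2], [1, 9]])
```

For this example OA and AA are both 0.85 while Kappa is 0.70, because half of the agreement would already be expected by chance given the class totals.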

4.3. Implementation Details

All experiments are conducted on an NVIDIA T400 GPU (NVIDIA Corporation, Santa Clara, CA, USA) with Windows 11 64-bit, Python 3.8.16 and PyTorch 2.2.1. For model training, the adaptive moment estimation (Adam) optimizer is employed, and the network is trained for 100 epochs with an initial learning rate of 0.001. In the result tables, the best results are highlighted in bold.

4.4. Comparison with State-of-the-Art Methods

The proposed model is compared with representative baselines spanning different methodological paradigms in HSI classification. These include: traditional machine learning methods RF [37] and SVM [38] to establish fundamental benchmarks; early deep learning models like Context [39] to trace the evolution from CNNs to advanced architectures; hybrid 2D/3D CNNs, such as HybridN [40] and CVSSN [41], as direct competitors sharing a similar hybrid convolutional design; attention-based networks RSSAN [42], SSTN [43], and SSAtt [44] to evaluate our CSDA module against prominent attention mechanisms; and graph convolutional networks like F-GCN [45] to assess performance across different paradigms for modeling non-Euclidean data relationships.

Table 5 shows the experimental results of different methods on the IP dataset. The proposed method achieves the best classification accuracy in nine of the 16 categories. Its overall performance is also the best, with OA reaching 96.78%, AA reaching 90.60%, and Kappa reaching 96.32%. Compared with the second-best model, HybridN, our DACINet improves OA, AA and Kappa by 1.54%, 0.04% and 4.02%, respectively. Beyond the mean accuracy, the DACINet often exhibits a lower standard deviation across multiple runs than other methods (as shown in Tables 5–8). This indicates more stable and reliable classification performance, which is a significant advantage in practical applications.

To quantify the impact of limited training samples on classification performance, we analyze the relationship between per-class sample size and accuracy. As shown in Table 1, classes 1 (Alfalfa), 7 (Grass/pasture-mowed), and 9 (Oats) have the fewest labeled samples in the IP dataset, with only 46, 28, and 20 samples, respectively. Correspondingly, Table 5 reveals that these three classes consistently achieve the lowest accuracies across all compared methods. Even with our proposed DACINet, the accuracies on these minority classes are only 70.75%, 74.42%, and 73.33%, respectively. In contrast, classes with abundant labeled samples, such as class 2 (Corn-notill, 1428 samples) and class 11 (Soybean-mintill, 2455 samples), achieve accuracies exceeding 95% with the DACINet. This clear positive correlation between sample size and classification accuracy indicates that limited sample availability is the primary factor contributing to the suboptimal performance on these minority classes.

Table 6 and Table 7 present the experimental results of different methods on the UP and SA datasets. For the UP dataset, Table 6 shows that our proposed method achieves the best classification accuracy in six of the nine land cover categories. Owing to the relatively small number of land cover types in the UP dataset, the accuracy of the DACINet exceeds 90% in eight categories and exceeds 99% in the second and fifth categories. Our OA reaches 97.77% and our AA reaches 96.72%. Kappa measures the agreement between the classification result and the ground truth while accounting for the agreement expected by chance; here it reaches 97.04%. Compared with HybridN, which ranks second in overall performance, our method improves OA by three percentage points, AA by 6.08%, and Kappa by 4.01%. The proposed model thus achieves the best overall accuracy, average accuracy, and consistency on this dataset.

Table 7 shows that, on the SA dataset, the OA of the proposed DACINet reaches 99.53%, which is 1.58% higher than the 97.95% of the second-best HybridN network, whereas the OA of the other methods generally fluctuates around 90%. AA is the mean of the per-class accuracies, so it weights every category equally regardless of its size. According to the results in the table, the AA of every model on the SA dataset is above 90%. The AA of the DACINet reaches 99.58%, a significant improvement over SSTN and SSAtt and 2.12% higher than the HybridN network. Kappa is also an important indicator for evaluating the effectiveness of a classification model. For the proposed method, Kappa reaches 99.48%, and the proposed method achieves the best classification accuracy in 10 of the 16 land cover categories. This demonstrates that the DACINet can complete hyperspectral image classification more accurately and stably.
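
As a concrete reference for the three indicators reported in Tables 5 to 8, the following minimal sketch computes OA, AA and Kappa from a confusion matrix using their standard definitions. The function name and the toy matrix are illustrative assumptions and are not taken from the authors' code.

```python
def classification_metrics(confusion):
    """Return (OA, AA, kappa) from a square confusion matrix.

    confusion[i][j] is the number of samples of true class i predicted as
    class j. Every class is assumed to have at least one sample.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))

    # Overall accuracy: fraction of all samples that are labeled correctly.
    oa = correct / total

    # Average accuracy: mean of the per-class accuracies (recalls),
    # so a 20-sample class counts as much as a 2000-sample class.
    per_class = [confusion[i][i] / sum(confusion[i]) for i in range(n_classes)]
    aa = sum(per_class) / n_classes

    # Cohen's kappa: observed agreement corrected for chance agreement.
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n_classes))
                for j in range(n_classes)]
    p_e = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    kappa = (oa - p_e) / (1 - p_e)
    return oa, aa, kappa


# Toy, made-up confusion matrix: one large class and one small class.
toy = [[90, 10],
       [5, 5]]
oa, aa, kappa = classification_metrics(toy)
```

In this toy case the small class is recognized only half of the time, so AA (0.70) falls well below OA (about 0.86), and Kappa (about 0.33) is lower still because most of the agreement is expected by chance. This is why AA reacts more strongly than OA to errors on minority classes.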

Table 8 presents the experimental results of different methods on the HU dataset. From the experimental results, it can be observed that among the 15 land cover categories, our proposed DACINet achieves the best classification performance in nine categories, demonstrating its powerful feature discrimination capability. Particularly noteworthy is that on categories with complex textural characteristics, such as class 5 (Grass) and class 14 (Tennis Court), our method achieves accuracies of 97.55% and 96.32%, respectively, significantly outperforming other comparative methods. In terms of comprehensive evaluation metrics, our method achieves the best results across all three key indicators: OA, AA, and Kappa, reaching 86.67%, 89.20%, and 87.12%, respectively. Compared with the suboptimal A2S2K model, our DACINet improves by 1.72%, 3.85%, and 4.03% in OA, AA, and Kappa, respectively. It is worth noting that even on categories with lower sample discriminability, such as class 10 (Coastal) and class 13 (Parking Lot), our method maintains relatively stable classification performance, benefiting from the effective spectral–spatial feature capability of the CSDA module. Overall, on this challenging HU dataset, the DACINet demonstrates excellent classification performance and good generalization ability, further verifying the effectiveness and robustness of the proposed framework.

4.5. Ablation Studies

Ablation of different modules. In this section, ablation experiments are conducted to verify the functions of the different modules in the DACINet. The CIFM, CSDA and hybrid convolution layer are added to the backbone network step by step. The results are shown in Table 9, where √ indicates that a module is used and × indicates that it is not. The configuration that uses only the CIFM gives the worst results, with an OA of 85.20% and a Kappa of 92.90 on the IP dataset, and an OA of 90.40% and a Kappa of 87.14 on the UP dataset. This is because the CIFM focuses on fusion and has no mechanism for judging the validity of the fused features. When the CIFM is combined with the hybrid convolution layer, the overall classification accuracy and stability improve on all three datasets, which indicates that feature extraction after the context interaction fusion is effective. The DACINet with all modules obtains the best OA and Kappa on the three datasets, reaching an OA of 99.53% and a Kappa of 99.48 on the SA dataset. This indicates that every module plays an effective role and that together they improve hyperspectral classification.

Effectiveness of CSDA. Table 10 compares the proposed CSDA with CBAM, where √ indicates that a module is used and × indicates that it is not. Compared with the baseline without any attention mechanism, CBAM does not effectively improve the performance on the UP and SA datasets. On the IP dataset, however, it increases OA by 1.85% and Kappa by 2.11, a significant improvement. This is because the sample distribution of the IP dataset is unbalanced, and introducing an attention mechanism helps to screen the discriminative information. The proposed CSDA improves performance on all three datasets, and its gains are higher than those of CBAM. This indicates that screening spectral and spatial features through two parallel channels can effectively alleviate sample imbalance while also improving classification performance in common scenarios.

Validity of Robustness. To verify the robustness of the proposed model, we conduct experiments on three datasets with different proportions of training samples. The results on the IP dataset are shown in Figure 4. Both OA and AA increase as the percentage of training samples increases. The deep learning models significantly outperform the classical machine learning models (RF, SVM) at every training proportion. HybridN has lower OA and AA scores on the IP dataset when few samples are available, especially when the training proportion is below 5%. Overall, the proposed DACINet obtains the best OA and AA scores at every training proportion, which indicates that it is more robust than the other methods. To further examine the impact of sample size on classification, we calculated the confusion matrix for the nine categories of the UP dataset, as shown in Figure 5. The classification accuracy of a category with a smaller sample size (such as Shadows) is significantly lower than that of a category with a larger sample size (such as Meadows). This also confirms the influence of sample size on classification accuracy, and classification with small samples still requires further study.
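
The experiment above varies the percentage of labeled samples drawn from each class for training. A minimal sketch of such a per-class (stratified) split is given below. The function name, the rounding rule, and the "at least one sample per class" safeguard are assumptions made for illustration; the paper does not state how its splits were generated.

```python
import math
import random


def stratified_split(labels, train_fraction, seed=0):
    """Split sample indices class by class; returns (train_idx, test_idx)."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for label, indices in sorted(by_class.items()):
        rng.shuffle(indices)
        # Assumed safeguard: keep at least one training sample per class
        # so that very small classes are never empty.
        n_train = max(1, math.ceil(train_fraction * len(indices)))
        train_idx.extend(indices[:n_train])
        test_idx.extend(indices[n_train:])
    return train_idx, test_idx


# Label vector mimicking two IP class sizes from Table 1:
# Oats (class 9, 20 samples) and Corn-notill (class 2, 1428 samples).
labels = [9] * 20 + [2] * 1428
train_idx, test_idx = stratified_split(labels, train_fraction=0.05)
```

With a 5% fraction, the 20-sample class contributes only one or two training pixels while the 1428-sample class contributes 72, which illustrates how quickly minority classes run out of training data at small proportions.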

Analysis of Band Dimension. The number of bands retained after dimensionality reduction affects the final classification performance of the model. We therefore conducted experiments on three datasets in which only the number of retained bands was varied and all other conditions were kept unchanged. The experimental results are shown in Figure 6. The results show that the three datasets respond differently to the band dimension. For the IP dataset, OA, AA, and Kappa are all highest when the number of retained bands is 36, and the performance is worst when it is 38. Because the sample distribution of the IP dataset is unbalanced, OA and Kappa change little, whereas AA fluctuates significantly. For the UP dataset, the performance is best when the number of retained bands is 13. As the number of bands increases, the performance gradually decreases; the decline is greatest at 19, where the classification accuracy is the worst. For the SA dataset, the three evaluation metrics first increase and then decrease as the number of bands increases, and the optimum is reached at 17. Based on these results, the numbers of retained bands for the IP, UP, and SA datasets are set to 36, 13, and 17, respectively.
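
The band dimension studied here is the number of components kept by the PCA step mentioned in the caption of Figure 1. The sketch below shows one plain NumPy way to perform such a reduction on an (H, W, B) cube. The function name, the random stand-in cube, and its band count are assumptions for illustration only; the authors' preprocessing code may differ (for example, in whitening or scaling).

```python
import numpy as np


def reduce_bands(cube, n_components):
    """Project an (H, W, B) cube onto its first n_components principal components."""
    h, w, b = cube.shape
    pixels = cube.reshape(-1, b).astype(np.float64)   # one row per pixel
    pixels -= pixels.mean(axis=0)                      # center every band
    cov = np.cov(pixels, rowvar=False)                 # (B, B) band covariance
    eigvals, eigvecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]   # indices of the largest ones
    reduced = pixels @ eigvecs[:, order]
    return reduced.reshape(h, w, n_components)


# Random stand-in data; the spatial size and the 60 bands are arbitrary.
rng = np.random.default_rng(0)
cube = rng.normal(size=(12, 12, 60))
reduced = reduce_bands(cube, n_components=13)   # 13 is the UP setting reported above
```

Because the components are ordered by explained variance, increasing the band dimension only appends lower-variance components, which is consistent with the accuracy no longer improving beyond a dataset-specific number of bands.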

Selection of the Convolution Kernel. The size of the convolution kernel directly determines how much information about adjacent pixels is aggregated. To explore the requirements of the different datasets, this paper also conducts comparative experiments with different convolution kernels. A small 3 × 3 kernel in the spatial dimension is sufficient to extract effective spatial features, and it has a relatively small number of parameters. In the spectral dimension, experiments are conducted on the UP, IP, SA, and HU datasets with kernel sizes from 1 to 13 in steps of two. OA and AA are used as the evaluation metrics for the four datasets. In the experiment, when the length of the original spectral bands is less than the length of the vector that needs to be mapped and filled, the triangular principle is adopted, that is, it is filled in two steps. The experimental results are shown in Figure 7. On all four datasets, the accuracy first increases with the spectral kernel size and then levels off. When the spectral size is less than seven, the performance gradually improves; beyond seven, the classification accuracy stabilizes. Considering the overall computational cost, the convolution kernel is set to 7 × 3 × 3 in all subsequent experiments.
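
To make the cost side of this choice concrete, the following sketch counts the learnable parameters of a k × 3 × 3 convolution for the tested spectral sizes and computes the spectral length left after an unpadded convolution. The helper names and the channel count of 8 are hypothetical; they are not taken from the DACINet configuration.

```python
def conv3d_weights(k_spectral, k_h=3, k_w=3, in_ch=1, out_ch=1, bias=True):
    """Number of learnable parameters in one 3D convolution layer."""
    weights = k_spectral * k_h * k_w * in_ch * out_ch
    return weights + (out_ch if bias else 0)


def valid_output_length(n_bands, k_spectral, stride=1):
    """Spectral length after an unpadded convolution (0 if the kernel does not fit)."""
    if k_spectral > n_bands:
        return 0
    return (n_bands - k_spectral) // stride + 1


# Spectral kernel sizes 1, 3, ..., 13 as in the experiment; out_ch=8 is made up.
param_table = {k: conv3d_weights(k, out_ch=8) for k in range(1, 14, 2)}

# Remaining spectral length for the reported band settings with a size-7 kernel.
length_up = valid_output_length(13, 7)   # UP: 13 bands
length_ip = valid_output_length(36, 7)   # IP: 36 bands
```

The parameter count grows linearly with the spectral kernel size, while an unpadded size-7 kernel already consumes almost half of the 13 UP bands, so larger spectral kernels add cost and require padding once accuracy has levelled off.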

Analysis of the Complexity. Table 11 compares the complexity of the classification networks that use 3D convolution kernels on the IP and UP datasets, focusing on the number of network parameters and the number of floating-point operations (FLOPs). The results show that the network using only a 3D CNN has the fewest parameters but requires more FLOPs. Since the main body of the A2S2K network adopts the residual network, its parameter count is likewise among the lowest, but its computational load is relatively large. The proposed method ranks second in both the number of parameters and FLOPs, giving the best overall trade-off. The proposed hybrid convolution layer integrates spatial and spectral information: the 3D convolution simultaneously captures feature information in both the spectral and spatial dimensions, and a 2D convolution is introduced to enhance spatial feature extraction. Compared with using only 3D convolution, this design significantly reduces the total number of parameters and FLOPs while preserving the model's performance.
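
The observation that a network can have few parameters yet many FLOPs follows from how the two quantities scale: parameters depend only on the kernel and channel sizes, whereas operations also scale with the number of output positions. The sketch below counts both for a single bias-free layer. The channel counts (8 and 16) and the assumption of size-preserving padding are illustrative; the 7 × 3 × 3 kernel, 36 bands and 17 × 17 patch follow the IP settings reported in the text, and the numbers are not those of Table 11.

```python
def conv3d_cost(in_ch, out_ch, kernel, out_shape):
    """Return (parameters, multiply-accumulates) of one bias-free 3D convolution.

    kernel = (kd, kh, kw); out_shape = (D_out, H_out, W_out).
    """
    kd, kh, kw = kernel
    d, h, w = out_shape
    params = in_ch * out_ch * kd * kh * kw
    macs = params * d * h * w      # every weight is applied once per output position
    return params, macs


def conv2d_cost(in_ch, out_ch, kernel, out_shape):
    """Return (parameters, multiply-accumulates) of one bias-free 2D convolution."""
    kh, kw = kernel
    h, w = out_shape
    params = in_ch * out_ch * kh * kw
    macs = params * h * w
    return params, macs


# Same (assumed) channel counts for both layers, so only the kernel
# depth and the spectral extent of the output differ.
p3, m3 = conv3d_cost(8, 16, (7, 3, 3), (36, 17, 17))
p2, m2 = conv2d_cost(8, 16, (3, 3), (17, 17))
```

Under these assumptions the 3D layer has 7 times as many weights but 252 times as many multiply-accumulates as the 2D layer, because it slides along the spectral axis as well. In a real 3D-to-2D design the 2D layer receives more input channels after the spectral axis is folded into the channel dimension, so the actual saving depends on the layer shapes.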

Analysis of Input Spatial Size. The spatial size of the input patch also affects the final classification performance of the model. On the three datasets, with the optimal band parameters fixed and all other conditions kept consistent, we varied the input spatial size to find the most suitable value for each dataset. The experimental results are shown in Figure 8. The results show that the classification performance on the three datasets varies with the input spatial size. For the IP and UP datasets, the three evaluation metrics (OA, AA, and Kappa) fluctuate consistently: as the input size increases, they first decrease, then increase, and then decrease again. Both datasets achieve the best classification accuracy at an input size of 17 × 17. For the SA dataset, however, the three metrics show an overall trend of first increasing and then decreasing as the input size increases, and the best accuracy is achieved at an input size of 27 × 27. Therefore, we set the input size to 17 × 17 for the IP and UP datasets and to 27 × 27 for the SA dataset.
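
The input spatial size is the side length of the patch cut around each labeled pixel. A minimal sketch of such a patch extraction is shown below. The function name and the use of zero padding at image borders are assumptions; the paper does not specify its border handling.

```python
import numpy as np


def extract_patch(cube, row, col, size):
    """Return the size x size spatial neighborhood centered on (row, col).

    cube has shape (H, W, B). Borders are zero padded (an assumption of
    this sketch), so every pixel yields a full-size patch.
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so that the patch has a center pixel")
    half = size // 2
    padded = np.pad(cube, ((half, half), (half, half), (0, 0)), mode="constant")
    # After padding, original pixel (row, col) sits at (row + half, col + half),
    # so the patch starts at (row, col) in padded coordinates.
    return padded[row:row + size, col:col + size, :]


# Small synthetic cube to check the indexing at an image corner.
cube = np.arange(5 * 5 * 2, dtype=np.float64).reshape(5, 5, 2)
corner_patch = extract_patch(cube, row=0, col=0, size=3)

# A 17 x 17 patch with 36 bands, matching the IP settings chosen above.
ip_like_patch = extract_patch(np.zeros((20, 20, 36)), row=10, col=10, size=17)
```

A larger patch gives the network more spatial context but also pulls in more pixels from neighboring classes and more padding at the borders, which is one plausible reason why accuracy does not increase monotonically with the input size.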

Convergence Analysis of the DACINet. To verify that the number of training epochs is set reasonably, this section visualizes the training convergence process on the Indian Pines and Salinas datasets. The experimental results are shown in Figure 9. On the IP dataset the model converges relatively slowly, reaching full convergence at around epoch 30, although its loss and accuracy curves are relatively stable. On the SA dataset the model converges faster and reaches a stable state by around epoch 10; its loss curve fluctuates significantly, while its accuracy curve is stable. This indicates that the chosen epoch setting is reasonable and that the proposed model has good stability and generalization ability.

4.6. Visualization Analysis

To show the effectiveness of the DACINet more intuitively, the classification results of the proposed DACINet and other representative methods on the IP dataset are visualized in Figure 10, where the colors follow the scheme described in Table 1. A small area in the figure corresponds to a land cover type with few ground objects, for which fewer samples are selected in the classification process. As observed, the SSTN and SSAtt methods make more errors on ground objects with fewer samples. The F-GCN method improves on the other methods, but it tends to produce wrong results at the boundary between two categories. The DACINet has the best overall classification performance, and its boundary accuracy is significantly improved. The classification maps for the UP, SA and HU datasets are presented in Figure 11, Figure 12 and Figure 13, where the colors follow the schemes described in Table 2, Table 3 and Table 4, respectively.

The classification performance of the RF and SVM machine learning models on the four datasets is highly susceptible to other factors. For the IP dataset, SSTN and SSAtt make more errors on land features with few samples. The HybridN method improves on the other methods, but it is prone to generating incorrect results at the boundary between two categories.

For the UP dataset, SSTN, SSAtt and HybridN have large errors in the classification of wheat stubble represented by sky blue. Since the UP dataset is an image captured in a university area, the land features are often distributed in narrow and elongated strips, which are prone to errors at the edges. The DACINet proposed in this paper effectively alleviates these classification errors on the UP dataset.

For the SA dataset, the classification maps of methods such as RF, SSAtt and HybridN contain obvious errors. The DACINet, while maintaining the advantages of other methods, significantly improves the classification accuracy of each category.

For the HU dataset, the classification task is more challenging due to the complex urban scenes with diverse land cover categories. As shown in the classification visualization results, the traditional machine learning methods RF and SVM produce substantial misclassifications, particularly in categories with similar spectral characteristics such as Grass-healthy, Grass-stressed, and Grass-synth. The Context and RSSAN methods show some improvement but still struggle with categories like Parking-lot1 and Parking-lot2, which have irregular shapes and scattered distributions. SSTN and SSAtt exhibit better performance in homogeneous regions like Water and Tree, yet they generate noticeable errors in complex categories such as Residential and Commercial, where mixed pixels are prevalent. A2S2K, as one of the advanced methods, achieves relatively good results but still fails to accurately classify challenging categories with limited samples, like Tennis-court and Running-track. In contrast, the proposed DACINet significantly reduces misclassifications across all categories, achieving the most accurate and complete classification maps.

The experimental results show that the DACINet can better complete the classification task on the four datasets. By assigning greater weight to spectral–spatial features via the dual attention channel, the context interaction facilitates better feature fusion. Combined with the hybrid network, this approach can alleviate classification errors at object boundaries and edge blurring, thereby improving the overall classification performance of the model.

5. Discussion

The DACINet was designed to address a central challenge in HSI classification: jointly modeling spectral and spatial information while capturing long-range contextual dependencies. Standard CNNs, constrained by local receptive fields, struggle to aggregate information from distant pixels or spectral bands. Our framework tackles this by integrating contextual modeling, dual attention enhancement, and a hybrid convolutional strategy. The result is a model that learns features that are both spectrally discriminative and spatially coherent, leading to its consistent performance across diverse datasets.

The CIFM extends standard 3D CNNs by stacking layers with cross-layer residual connections. This design progressively expands the effective receptive field, allowing the model to draw information from larger spectral–spatial neighborhoods. The residual connections then integrate features across these different scales, a capability crucial for distinguishing classes that are spectrally similar but differ in their spatial context. The innovation of the CSDA module lies in its parallel and interactive computation. Unlike sequential mechanisms like CBAM, where channel attention biases spatial processing, or decoupled approaches like DBDA that lack cross-branch interaction, CSDA computes both attention maps from the same 3D feature map and fuses them via addition. This enables complementary learning between spectral and spatial features, aligning with the intrinsic structure of HSI data. The hybrid convolutional layer balances performance and efficiency through its sequential 3D-to-2D design. Initial 3D layers build a rich spectral–spatial representation, while the subsequent reshape encodes spectral information into the channel dimension, allowing lighter 2D convolutions to refine spatial patterns. This explains the high accuracy and low complexity of the DACINet, which achieves far fewer FLOPs than other attention-based methods while maintaining strong performance.
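
The difference between parallel and sequential attention described above can be sketched with a parameter-free NumPy stand-in. This is not the CSDA or CBAM implementation: both real modules use learned layers, which are replaced here by simple average pooling followed by a sigmoid, and the function names are hypothetical. The sketch only shows the data flow, namely that the parallel variant derives both maps from the same input and fuses the branches by addition.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def parallel_attention(x):
    """Parallel spectral/spatial attention on a (D, H, W) feature map.

    D is the spectral depth. Learned layers are omitted (assumption).
    """
    # Spectral branch: one weight per band from global spatial average pooling.
    spectral_w = sigmoid(x.mean(axis=(1, 2), keepdims=True))   # shape (D, 1, 1)
    # Spatial branch: one weight per pixel from pooling over the spectral axis.
    spatial_w = sigmoid(x.mean(axis=0, keepdims=True))         # shape (1, H, W)
    # Both maps are computed from the same input and fused by addition.
    return x * spectral_w + x * spatial_w


def sequential_attention(x):
    """CBAM-style ordering: the spatial map sees features already
    reweighted by the spectral (channel) branch."""
    x_c = x * sigmoid(x.mean(axis=(1, 2), keepdims=True))
    return x_c * sigmoid(x_c.mean(axis=0, keepdims=True))


ones = np.ones((4, 3, 3))
out_parallel = parallel_attention(ones)
out_sequential = sequential_attention(ones)
```

In the sequential variant the spatial weights are a function of the channel-weighted features, so any bias introduced by the first stage propagates to the second; in the parallel variant the two maps are independent views of the same input.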

Despite its robust overall performance, the accuracy of the DACINet remains constrained in categories with extremely few training samples. This limitation arises because, like most deep learning frameworks, the model struggles to learn sufficiently discriminative representations from only a handful of examples—especially in datasets with severe class imbalance. A natural direction for future work is therefore to integrate few-shot learning strategies into the framework. One promising approach involves embedding meta-learning into the DACINet, where the model is trained across many episodes to learn a more generalizable metric space. This would enable the network to compare new samples against a small support set and make predictions based on feature similarity, rather than relying solely on class statistics learned from abundant data.
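
The idea of classifying a query by its similarity to a small support set can be illustrated with a nearest-prototype rule in the style of prototypical networks. This is a swapped-in, generic technique used only to make the future direction concrete; it is not part of the DACINet, and the function name and toy features are assumptions. In a real few-shot pipeline the feature vectors would come from a trained embedding network.

```python
import numpy as np


def prototype_predict(support_x, support_y, query_x):
    """Assign each query to the class whose mean support feature is closest.

    support_x: (N, F) features, support_y: (N,) labels, query_x: (Q, F).
    """
    classes = np.unique(support_y)
    # One prototype per class: the mean of its support features.
    prototypes = np.stack([support_x[support_y == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance between every query and every prototype: (Q, C).
    d2 = ((query_x[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return classes[d2.argmin(axis=1)]


# Toy 2-D features with two support samples per class.
support_x = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
support_y = np.array([0, 0, 1, 1])
query_x = np.array([[1.0, 0.0], [9.0, 10.0]])
pred = prototype_predict(support_x, support_y, query_x)
```

Because the decision depends only on distances to class means, a class with two labeled samples is handled in the same way as a class with thousands, which is the property that makes such metric-based rules attractive for categories like Oats or Alfalfa.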

6. Conclusions

This paper proposes a Double-Attention Context Interactive Network (DACINet) for hyperspectral image classification. The DACINet is mainly composed of a CIFM, a CSDA mechanism and a hybrid convolutional layer. The CIFM captures correlations between long-distance image bands through cross-layer residual connections to enhance contextual interaction. The CSDA is a new dual-channel attention mechanism for 3D features, which enhances 2D spatial and 1D spectral features, respectively, and fuses them to strengthen 3D associations. The hybrid convolutional layer combines 2D and 3D convolution to further enhance the discriminability of spectral information. Experiments are carried out on the datasets IP, UP, SA and HU to verify the performance of the proposed DACINet. The results show that the proposed DACINet is superior to other state-of-the-art methods. Ablation studies validate the effectiveness of each core component, while visualization analysis reveals the model’s capability for the modeling of spectral–spatial features. In the future, more attention will be paid to few-shot hyperspectral image classification.

Author Contributions

Investigation, M.W.; Resources, Y.Z.; Writing—original draft, N.H.; Writing—review & editing, Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by grants from the National Natural Science Foundation of China (42271093), the Natural Science Foundation of Shandong Province (ZR2024QF060, ZR2025QC1571), and the Science and Technology Support Plan for Youth Innovation of Colleges and Universities of Shandong Province of China (2025KJH134).

Data Availability Statement

The data presented in this study are available in public hyperspectral remote sensing scene repositories. These data were derived from the following resources available in the public domain: 1. Indian Pines (IP) dataset: https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes#Indian_Pines (accessed on 30 July 2025) (alternative domestic mirror: https://opendatalab.org.cn/OpenDataLab/Indian_Pines (accessed on 30 July 2025)); 2. Pavia University (UP) dataset: https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes#Pavia_University (accessed on 30 July 2025) (alternative domestic mirror: https://hf-mirror.com/datasets/danaroth/pavia (accessed on 30 July 2025)); 3. Salinas (SA) dataset: https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes#Salinas (accessed on 28 July 2025) (alternative domestic mirror: https://opendatalab.org.cn/OpenDataLab/Salinas (accessed on 28 July 2025)); and 4. Houston 2013 (HU) dataset: https://www.grss-ieee.org/community/technical-committees/2013-ieee-grss-data-fusion-contest/ (accessed on 10 February 2026) (alternative domestic mirror: https://drive.uc.cn/s/3fe4f55a213f4?public=1 (accessed on 10 February 2026)). All datasets used in the experiments are publicly accessible without restrictions. The original data files, including hyperspectral images and ground truth labels, can be downloaded directly from the provided official URLs for research reproduction.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. The structure of the proposed DACINet. The process begins with PCA for dimensionality reduction of the input HSI cube. The resulting patches are fed into the CIFM, which employs stacked 3D convolutions with cross-layer residual connections to capture long-range contextual information. The feature maps then enter the CSDA, which enhances channel (spectral) and spatial features in parallel before fusing them. Subsequently, a hybrid convolutional layer, comprising sequential 3D and 2D convolutions, further refines the spectral–spatial representation. Finally, a fully connected layer produces the pixel-wise classification results.

Figure 2. The structure of the CIFM.

Figure 3. The structure of CSDA.

Figure 4. Robustness verification of the proposed model on the IP, UP, and SA datasets. The x-axis represents the percentage of labeled samples (from each class) used for training. The y-axis represents the classification accuracy (Overall Accuracy, OA, or Average Accuracy, AA).
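
The per-class sampling rate on the x-axis can be made concrete with a small sketch. It draws a fixed fraction of the labeled pixels of every class for training and keeps the rest for testing. The helper name `stratified_split`, the rounding rule (round up, at least one sample per class), and the fixed seed are assumptions, not details taken from the paper.

```python
import math
import random


def stratified_split(labels, fraction, seed=0):
    """Split sample indices into train/test, sampling `fraction` of each class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        # Assumed rounding rule: round up and keep at least one sample.
        n_train = max(1, math.ceil(fraction * len(indices)))
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return sorted(train), sorted(test)


# Toy label vector: 40 samples of class 1, 20 of class 2, 3 of class 3.
toy_labels = [1] * 40 + [2] * 20 + [3] * 3
train_idx, test_idx = stratified_split(toy_labels, fraction=0.05)
```

With very small classes the rounding rule matters: at a 5% rate, a class with only a few labeled pixels contributes a single training sample under this assumption.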

Figure 5. The confusion matrix for 9 categories in the UP dataset.
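
The three scores reported throughout (OA, AA, and Kappa) can all be derived from a confusion matrix such as the one in Figure 5. The sketch below uses the standard definitions; the row-is-true-class convention, the function name, and the toy 3-class matrix are assumptions for illustration and are not values from the paper.

```python
def classification_scores(confusion):
    """Compute OA, AA, and Cohen's kappa from a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j
    (assumed convention). Every class must have at least one sample.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n_classes))
                for j in range(n_classes)]

    overall_accuracy = correct / total
    # AA: mean of per-class recalls, so small classes weigh as much as large ones.
    average_accuracy = sum(confusion[i][i] / row_sums[i]
                           for i in range(n_classes)) / n_classes
    # Kappa: observed agreement corrected for agreement expected by chance.
    expected = sum(row_sums[i] * col_sums[i]
                   for i in range(n_classes)) / total ** 2
    kappa = (overall_accuracy - expected) / (1 - expected)
    return overall_accuracy, average_accuracy, kappa


toy_confusion = [
    [50, 0, 0],
    [5, 40, 5],
    [0, 2, 8],
]
oa, aa, kappa = classification_scores(toy_confusion)
```

The function returns fractions in [0, 1]; the tables report the same quantities multiplied by 100. In the toy matrix AA falls below OA because the smallest class is classified less reliably than the largest one.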

Figure 6. The OA, AA, and Kappa values under different numbers of bands on the IP, UP, and SA datasets.

Figure 7. Results with different kernel sizes on the IP, UP, SA, and HU datasets.

Figure 8. The OA, AA, and Kappa values under different input spatial sizes on the IP, UP, and SA datasets. The x-axis values represent the spatial dimension of input patches.

Figure 9. Convergence analysis of the proposed DACINet. The training loss and accuracy curves on (a) the IP dataset and (b) the SA dataset.

Figure 10. Visual comparison of different models on the IP dataset.

Figure 11. Visual comparison of different models on the UP dataset.

Figure 12. Visual comparison of different models on the SA dataset.

Figure 13. Visual comparison of different models on the HU dataset.

Table 1. Category information of the Indian Pines dataset.

Number	Category	Color	Total Samples
1	Alfalfa		46
2	Corn-notill		1428
3	Corn-mintill		830
4	Corn		237
5	Grass/pasture		483
6	Grass/trees		730
7	Grass/pasture-mowed		28
8	Hay-windrowed		478
9	Oats		20
10	Soybean-notill		972
11	Soybean-mintill		2455
12	Soybean-clean		593
13	Wheat		205
14	Woods		1265
15	Bldg-grass-tree-drives		386
16	Stone-steel-towers		93
Total	/	/	10,249

Table 2. Category information of the Pavia University dataset.

Number	Category	Color	Total Samples
1	Asphalt		6631
2	Meadows		18,649
3	Gravel		2099
4	Trees		3064
5	Painted Metal Sheets		1345
6	Bare Soil		5029
7	Bitumen		1330
8	Self-Blocking Bricks		3682
9	Shadows		947
Total	/	/	42,776

Table 3. Category information of the Salinas dataset.

Number	Category	Color	Total Samples
1	Broccoli-green-weeds_1		2009
2	Broccoli-green-weeds_2		3726
3	Fallow		1976
4	Fallow-rough-plow		1394
5	Fallow-smooth		2678
6	Stubble		3959
7	Celery		3579
8	Grapes-untrained		11,271
9	Soil-vinyard-develop		6203
10	Corn-senesced-green-weeds		3278
11	Lettuce-romaine-4wk		1068
12	Lettuce-romaine-5wk		1927
13	Lettuce-romaine-6wk		916
14	Lettuce-romaine-7wk		1070
15	Vinyard-untrained		7268
16	Vinyard-vertical-trellis		1807
Total	/	/	54,129

Table 4. Category information of the Houston 13 dataset.

Number	Category	Color	Total Samples
1	Grass-healthy		1238
2	Grass-stressed		1241
3	Grass-synth		690
4	Tree		1231
5	Soil		1229
6	Water		321
7	Residential		1255
8	Commercial		1231
9	Road		1239
10	Highway		1214
11	Railway		1222
12	Parking-lot1		1220
13	Parking-lot2		464
14	Tennis-court		423
15	Running-track		653
Total	/	/	14,871

Table 5. Classification results (%) of various HSIC methods on the IP dataset with 5% labeled samples per class. The best result for each class/metric is highlighted in bold.

Class	RF	SVM	Context	RSSAN	SSTN	SSAtt	HybridSN	CVSSN	F-GCN	Ours
1	87.06 ± 5.63	42.86 ± 4.78	57.89 ± 5.27	88.89 ± 2.87	77.77 ± 3.98	90.93 ± 4.13	79.86 ± 6.24	64.71 ± 1.96	93.83 ± 3.24	70.75 ± 3.65
2	66.45 ± 6.12	66.93 ± 5.36	63.08 ± 5.32	62.27 ± 3.45	85.35 ± 4.53	83.72 ± 7.34	74.16 ± 2.19	93.33 ± 3.42	78.95 ± 6.25	85.66 ± 4.32
3	70.92 ± 3.42	75.45 ± 3.54	47.28 ± 2.65	60.00 ± 5.43	93.40 ± 4.65	81.05 ± 3.25	93.60 ± 1.78	96.24 ± 2.53	78.63 ± 3.32	96.41 ± 3.21
4	43.38 ± 2.65	48.56 ± 4.23	46.06 ± 1.68	82.72 ± 3.24	98.95 ± 2.43	88.94 ± 2.67	89.37 ± 1.32	99.98 ± 0.45	96.62 ± 2.13	96.07 ± 3.21
5	72.57 ± 6.78	85.49 ± 5.34	81.93 ± 5.67	79.41 ± 7.48	84.07 ± 4.38	87.57 ± 4.98	89.31 ± 3.78	92.22 ± 3.95	76.84 ± 8.12	97.55 ± 2.45
6	81.03 ± 3.25	84.48 ± 2.23	76.09 ± 1.98	91.33 ± 3.14	96.13 ± 3.21	99.22 ± 1.13	99.67 ± 1.02	98.56 ± 2.32	95.79 ± 2.11	99.14 ± 1.10
7	58.33 ± 10.11	80.00 ± 11.23	99.28 ± 3.54	99.52 ± 2.21	35.71 ± 12.23	72.00 ± 11.43	71.11 ± 9.87	94.73 ± 7.32	96.43 ± 3.21	74.42 ± 5.43
8	86.24 ± 3.12	88.82 ± 2.28	88.13 ± 5.45	93.64 ± 3.45	94.93 ± 4.52	94.57 ± 4.87	93.32 ± 5.43	92.26 ± 3.78	100.00 ± 2.31	97.09 ± 3.54
9	99.19 ± 2.33	60.00 ± 6.49	84.62 ± 5.46	55.56 ± 8.98	42.86 ± 7.56	93.33 ± 4.65	72.22 ± 3.21	80.00 ± 6.65	99.03 ± 2.31	73.33 ± 5.63
10	64.48 ± 4.32	76.44 ± 6.56	66.22 ± 7.66	59.84 ± 5.46	90.92 ± 5.34	85.23 ± 3.12	85.85 ± 7.65	93.46 ± 3.65	78.91 ± 5.43	93.85 ± 4.23
11	63.18 ± 3.44	69.17 ± 5.43	73.27 ± 4.53	72.88 ± 5.74	96.04 ± 2.35	90.25 ± 4.34	96.46 ± 1.29	97.84 ± 5.43	78.84 ± 6.12	96.12 ± 3.67
12	57.14 ± 9.49	67.51 ± 5.46	58.06 ± 8.87	55.46 ± 8.56	78.65 ± 5.69	67.91 ± 7.88	89.64 ± 4.57	85.85 ± 6.12	85.50 ± 5.32	95.92 ± 4.17
13	88.83 ± 5.34	92.67 ± 4.35	71.62 ± 6.54	86.11 ± 5.87	97.30 ± 2.89	97.00 ± 5.12	98.72 ± 4.30	95.42 ± 3.46	100.00 ± 2.17	98.73 ± 4.12
14	83.85 ± 1.65	84.88 ± 4.32	90.99 ± 3.19	88.77 ± 5.43	96.10 ± 3.10	96.54 ± 4.32	96.87 ± 2.90	97.62 ± 4.87	94.86 ± 5.29	97.92 ± 3.82
15	66.67 ± 6.54	80.15 ± 4.21	63.66 ± 4.20	72.49 ± 1.89	90.02 ± 4.12	88.83 ± 4.76	78.63 ± 3.21	95.86 ± 2.69	95.34 ± 4.01	93.31 ± 3.54
16	98.41 ± 2.10	98.25 ± 2.98	97.78 ± 4.12	60.53 ± 3.02	98.82 ± 3.61	93.33 ± 4.08	42.74 ± 5.12	81.90 ± 2.89	93.24 ± 3.06	94.12 ± 4.32
OA	70.42 ± 3.04	75.28 ± 4.23	71.18 ± 4.57	73.18 ± 4.34	91.32 ± 5.41	88.35 ± 3.21	93.95 ± 2.87	95.22 ± 5.34	95.24 ± 3.87	96.78 ± 3.24
AA	75.19 ± 2.65	75.21 ± 4.32	72.92 ± 3.12	75.62 ± 5.01	84.81 ± 5.65	88.72 ± 4.69	84.35 ± 3.20	90.50 ± 2.19	90.56 ± 2.97	90.60 ± 3.05
Kappa	65.71 ± 3.02	71.45 ± 3.21	66.98 ± 4.32	69.23 ± 3.76	90.13 ± 2.65	86.71 ± 1.78	93.09 ± 4.03	94.55 ± 3.78	92.30 ± 3.21	96.32 ± 3.58
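
Each cell in Table 5 is reported as mean ± standard deviation. A minimal sketch of producing such an entry from repeated runs follows; the run accuracies are invented toy numbers, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the table does not state which estimator or how many runs were used.

```python
import statistics


def summarize_runs(accuracies, digits=2):
    """Format repeated-run accuracies (in %) as 'mean ± std'."""
    mean = statistics.mean(accuracies)
    # Sample standard deviation (n - 1 denominator); an assumed choice.
    std = statistics.stdev(accuracies)
    return f"{mean:.{digits}f} ± {std:.{digits}f}"


# Hypothetical OA values (%) from five independent runs.
toy_runs = [95.0, 96.0, 97.0, 98.0, 99.0]
summary = summarize_runs(toy_runs)
```

Swapping `statistics.stdev` for `statistics.pstdev` would give the population estimator instead, which yields slightly smaller spreads for the same runs.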

Table 6. Classification results (%) of various HSIC methods on the UP dataset with 1% labeled samples per class. The best result for each class/metric is highlighted in bold.

Class	RF	SVM	Context	RSSAN	SSTN	SSAtt	HybridSN	CVSSN	Ours
1	79.58 ± 3.87	78.28 ± 2.56	89.65 ± 2.84	86.19 ± 4.23	89.66 ± 3.08	87.96 ± 2.19	93.16 ± 3.18	95.45 ± 2.76	96.84 ± 2.85
2	84.04 ± 1.98	84.47 ± 2.10	94.30 ± 1.28	96.92 ± 2.09	97.46 ± 3.21	98.55 ± 3.98	98.72 ± 2.76	98.71 ± 4.01	99.45 ± 1.65
3	55.06 ± 4.70	76.82 ± 3.09	67.47 ± 5.31	59.67 ± 5.42	74.78 ± 6.23	74.32 ± 2.65	84.95 ± 3.67	86.35 ± 2.37	88.69 ± 3.20
4	90.87 ± 4.32	91.73 ± 3.41	91.75 ± 4.79	99.09 ± 2.12	91.09 ± 3.56	98.39 ± 4.28	93.11 ± 1.67	96.77 ± 3.23	97.33 ± 4.11
5	95.00 ± 3.11	93.97 ± 3.33	99.98 ± 2.04	90.14 ± 3.97	98.59 ± 4.22	98.08 ± 3.65	95.63 ± 2.17	96.41 ± 5.04	99.77 ± 0.78
6	75.77 ± 1.67	94.05 ± 4.31	80.84 ± 3.44	86.04 ± 2.17	96.82 ± 2.04	96.41 ± 4.13	97.06 ± 3.22	96.36 ± 4.28	98.26 ± 2.65
7	73.85 ± 3.33	69.68 ± 5.67	81.13 ± 6.53	71.05 ± 3.76	99.75 ± 1.54	94.87 ± 3.47	92.42 ± 2.38	90.39 ± 3.42	98.13 ± 3.63
8	71.54 ± 3.24	72.87 ± 3.56	80.41 ± 4.44	76.00 ± 5.05	89.06 ± 3.27	86.13 ± 4.12	85.86 ± 6.55	89.04 ± 4.29	90.80 ± 5.08
9	99.46 ± 2.22	99.89 ± 1.87	86.31 ± 2.35	80.86 ± 5.68	91.65 ± 3.18	95.81 ± 5.23	77.95 ± 4.56	97.37 ± 2.33	97.25 ± 4.55
OA	81.71 ± 4.33	83.43 ± 1.21	88.86 ± 3.24	89.09 ± 4.01	93.64 ± 2.87	94.08 ± 1.66	94.77 ± 2.21	95.93 ± 3.06	97.77 ± 2.70
AA	80.58 ± 2.21	84.64 ± 3.24	85.72 ± 4.10	82.88 ± 3.17	92.10 ± 3.29	92.28 ± 4.19	90.64 ± 3.95	94.09 ± 2.79	96.72 ± 2.64
Kappa	75.02 ± 2.98	77.24 ± 4.19	85.18 ± 2.39	85.49 ± 1.89	91.56 ± 3.25	92.13 ± 2.13	93.03 ± 1.65	94.60 ± 3.10	97.04 ± 1.23

Table 7. Classification results (%) of various HSIC methods on the SA dataset with 1% labeled samples per class. The best result for each class/metric is highlighted in bold.

Class	RF	SVM	Context	RSSAN	SSTN	SSAtt	HybridSN	CVSSN	Ours
1	99.68 ± 4.09	99.78 ± 2.19	97.02 ± 3.21	99.95 ± 3.67	95.48 ± 2.79	99.98 ± 1.79	96.39 ± 5.04	96.27 ± 3.78	100.00 ± 1.64
2	98.75 ± 4.32	98.86 ± 3.08	98.15 ± 5.34	99.86 ± 2.79	97.55 ± 4.08	99.08 ± 2.18	99.76 ± 4.37	99.98 ± 3.56	99.94 ± 2.19
3	84.90 ± 7.64	87.95 ± 7.65	92.67 ± 3.89	90.73 ± 5.43	93.43 ± 5.22	93.22 ± 6.34	99.31 ± 3.87	95.46 ± 4.28	99.23 ± 2.86
4	97.16 ± 5.20	97.36 ± 4.39	96.86 ± 8.02	91.26 ± 7.04	96.27 ± 4.67	98.20 ± 2.59	97.34 ± 5.33	95.83 ± 3.27	96.18 ± 5.33
5	93.95 ± 6.33	95.49 ± 5.44	97.87 ± 4.75	99.05 ± 0.87	99.59 ± 0.42	96.92 ± 3.44	98.11 ± 2.12	99.17 ± 0.65	97.25 ± 2.19
6	99.59 ± 0.46	99.91 ± 0.10	98.44 ± 2.10	99.64 ± 1.01	99.95 ± 0.87	100.00 ± 0.42	99.01 ± 1.45	99.99 ± 1.01	99.92 ± 0.45
7	97.77 ± 2.11	97.75 ± 2.01	97.68 ± 3.21	97.19 ± 3.24	99.91 ± 1.02	98.22 ± 2.54	99.96 ± 0.78	98.44 ± 2.13	100.00 ± 0.29
8	72.04 ± 3.45	72.34 ± 5.65	85.63 ± 5.43	84.12 ± 2.98	87.10 ± 5.44	85.34 ± 4.66	97.19 ± 3.77	92.81 ± 5.67	99.75 ± 1.32
9	95.85 ± 4.35	98.47 ± 3.22	99.00 ± 2.10	98.59 ± 3.21	99.79 ± 1.02	99.95 ± 0.12	99.46 ± 0.79	99.46 ± 1.22	99.80 ± 1.22
10	82.79 ± 6.54	89.33 ± 4.58	91.71 ± 5.47	91.68 ± 7.12	95.41 ± 3.40	94.51 ± 3.29	98.44 ± 4.39	97.85 ± 3.21	99.52 ± 1.23
11	94.63 ± 2.30	90.14 ± 5.43	93.35 ± 3.89	96.93 ± 3.98	80.22 ± 5.43	94.66 ± 4.10	93.48 ± 4.29	99.41 ± 1.67	99.72 ± 2.10
12	95.08 ± 3.09	95.68 ± 5.67	98.10 ± 3.08	94.10 ± 5.30	97.21 ± 3.12	97.39 ± 4.33	98.03 ± 4.02	99.86 ± 1.02	98.92 ± 2.06
13	92.13 ± 6.22	92.84 ± 8.11	98.96 ± 4.52	99.65 ± 2.10	99.89 ± 0.23	97.88 ± 3.44	95.87 ± 4.32	99.88 ± 1.23	98.67 ± 3.22
14	91.86 ± 8.33	95.68 ± 5.54	98.10 ± 4.55	94.10 ± 6.43	97.21 ± 2.35	97.39 ± 5.41	98.03 ± 4.29	99.86 ± 2.45	98.92 ± 4.22
15	68.15 ± 5.66	74.12 ± 7.44	81.53 ± 3.49	79.95 ± 8.98	83.53 ± 4.30	80.47 ± 4.87	95.98 ± 5.66	89.69 ± 2.33	99.39 ± 1.22
16	94.38 ± 3.22	98.71 ± 4.23	97.59 ± 4.33	99.56 ± 1.23	98.18 ± 3.22	99.76 ± 1.29	98.89 ± 4.32	99.12 ± 1.27	100.00 ± 1.02
OA	86.51 ± 3.21	88.30 ± 3.20	92.58 ± 4.32	92.28 ± 5.34	93.46 ± 2.37	93.15 ± 3.24	97.95 ± 4.22	96.24 ± 4.55	99.53 ± 2.87
AA	91.17 ± 3.22	92.77 ± 5.44	95.18 ± 2.66	95.12 ± 3.42	95.17 ± 3.41	95.79 ± 3.12	97.64 ± 3.11	97.48 ± 3.09	99.58 ± 1.03
Kappa	84.95 ± 6.54	86.93 ± 3.90	91.74 ± 5.34	91.40 ± 4.26	92.71 ± 5.43	92.37 ± 3.89	97.72 ± 4.23	95.81 ± 2.98	99.48 ± 2.10

Table 8. Classification results (%) of various HSIC methods on the HU dataset with 1% labeled samples per class. The best result for each class/metric is highlighted in bold.

Class	RF	SVM	Context	RSSAN	SSTN	SSAtt	A2S2K	CVSSN	Ours
1	89.06 ± 4.63	82.86 ± 3.78	77.89 ± 3.27	90.89 ± 5.87	77.98 ± 4.98	91.33 ± 3.13	86.56 ± 2.24	78.59 ± 4.16	92.95 ± 2.24
2	85.76 ± 5.23	84.89 ± 6.16	65.28 ± 6.23	74.17 ± 2.87	85.55 ± 3.23	88.25 ± 8.49	89.16 ± 5.10	90.13 ± 3.22	91.95 ± 4.15
3	90.02 ± 8.03	92.45 ± 3.04	78.18 ± 9.61	82.02 ± 10.13	84.40 ± 5.50	82.15 ± 3.15	87.27 ± 7.18	96.14 ± 3.50	88.83 ± 5.23
4	90.43 ± 4.65	92.36 ± 6.53	84.26 ± 5.83	88.78 ± 7.14	86.85 ± 5.23	90.14 ± 4.70	93.01 ± 6.87	92.18 ± 4.65	95.22 ± 4.23
5	72.57 ± 6.78	85.49 ± 5.34	81.93 ± 5.67	79.41 ± 7.48	84.07 ± 4.38	87.57 ± 4.98	89.31 ± 3.78	92.22 ± 3.95	97.55 ± 2.45
6	95.03 ± 6.85	94.48 ± 7.13	76.87 ± 8.91	83.32 ± 7.24	85.43 ± 4.76	89.22 ± 5.13	91.02 ± 4.22	94.16 ± 3.23	93.23 ± 6.80
7	67.43 ± 8.91	67.40 ± 11.43	78.87 ± 6.54	79.34 ± 7.16	74.87 ± 9.23	78.13 ± 7.93	76.11 ± 6.37	80.13 ± 8.32	79.42 ± 7.43
8	73.74 ± 11.12	74.08 ± 8.28	65.13 ± 13.45	64.32 ± 11.45	76.83 ± 12.52	83.47 ± 8.87	83.32 ± 6.39	84.26 ± 8.78	89.68 ± 9.41
9	67.19 ± 7.93	65.38 ± 9.49	72.62 ± 10.46	73.56 ± 9.98	81.86 ± 7.56	81.03 ± 7.65	85.22 ± 8.21	87.10 ± 6.65	89.13 ± 4.31
10	59.48 ± 13.32	57.44 ± 12.56	54.01 ± 14.66	57.84 ± 14.46	59.42 ± 9.34	69.93 ± 8.92	85.15 ± 9.65	78.46 ± 10.65	78.65 ± 11.23
11	53.18 ± 8.44	63.17 ± 7.43	68.27 ± 8.53	65.88 ± 6.64	56.44 ± 12.35	71.25 ± 6.34	76.46 ± 6.29	80.84 ± 5.43	85.32 ± 7.67
12	52.64 ± 9.59	57.51 ± 8.46	55.56 ± 8.67	58.36 ± 9.56	64.15 ± 7.69	73.91 ± 9.88	74.14 ± 6.57	80.15 ± 9.12	79.42 ± 8.17
13	48.83 ± 13.34	50.67 ± 9.35	80.12 ± 7.54	79.41 ± 6.87	83.30 ± 6.89	90.10 ± 7.12	92.72 ± 5.30	91.42 ± 6.46	94.12 ± 7.17
14	80.85 ± 5.65	84.74 ± 7.32	79.59 ± 6.19	87.13 ± 8.43	82.45 ± 5.10	78.54 ± 9.32	88.17 ± 5.78	85.62 ± 4.37	96.32 ± 6.82
15	96.27 ± 4.54	99.85 ± 0.71	84.26 ± 6.20	78.59 ± 6.89	93.22 ± 4.12	89.83 ± 4.76	97.63 ± 3.21	92.16 ± 3.69	94.34 ± 2.01
OA	71.82 ± 3.04	75.78 ± 2.23	69.48 ± 4.17	70.18 ± 4.24	74.32 ± 4.41	80.35 ± 3.21	84.95 ± 4.17	83.24 ± 3.32	86.67 ± 4.24
AA	75.79 ± 2.55	78.11 ± 3.42	70.12 ± 4.24	75.62 ± 4.21	79.81 ± 5.65	81.72 ± 4.29	85.35 ± 3.23	84.98 ± 4.37	89.20 ± 4.15
Kappa	69.11 ± 5.12	73.41 ± 5.21	67.28 ± 7.32	68.54 ± 5.76	74.13 ± 5.65	78.91 ± 4.78	83.09 ± 3.67	85.30 ± 3.34	87.12 ± 5.18

Table 9. Ablation results of different modules. Bold values indicate the best results among all compared methods.

CIFM	CSDA	HCL	Metrics	IP	UP	SA
×	×	√	OA (%)	93.17	97.05	97.47
×	×	√	Kappa	93.63	96.07	98.41
√	×	×	OA (%)	85.20	90.40	97.71
√	×	×	Kappa	82.90	87.14	97.58
√	×	√	OA (%)	94.76	97.68	98.56
√	×	√	Kappa	94.02	96.93	98.51
√	√	√	OA (%)	96.78	97.77	99.53
√	√	√	Kappa	96.32	97.04	99.48
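
The contribution of each module can be read off Table 9 by differencing rows. The short sketch below does this arithmetic for the OA rows, using the values from the table; the dictionary layout, the configuration labels, and the helper name `oa_gain` are illustrative choices.

```python
# OA (%) rows of Table 9, keyed by which modules are enabled.
ablation_oa = {
    "HCL only": {"IP": 93.17, "UP": 97.05, "SA": 97.47},
    "CIFM + HCL": {"IP": 94.76, "UP": 97.68, "SA": 98.56},
    "CIFM + CSDA + HCL": {"IP": 96.78, "UP": 97.77, "SA": 99.53},
}


def oa_gain(base, extended):
    """OA gain in percentage points when moving from `base` to `extended`."""
    return {
        dataset: round(ablation_oa[extended][dataset] - ablation_oa[base][dataset], 2)
        for dataset in ablation_oa[base]
    }


gain_from_cifm = oa_gain("HCL only", "CIFM + HCL")
gain_from_csda = oa_gain("CIFM + HCL", "CIFM + CSDA + HCL")
```

On these numbers, adding CIFM to the HCL-only baseline raises OA by 1.59, 0.63, and 1.09 points on IP, UP, and SA, and adding CSDA on top raises it by a further 2.02, 0.09, and 0.97 points.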

Table 10. Ablation results of different attention. Bold values indicate the best results among all compared methods.

CBAM	CSDA	Metrics	IP	UP	SA
×	×	OA (%)	94.76	97.68	98.56
×	×	Kappa	94.02	96.93	98.51
√	×	OA (%)	96.61	97.28	98.90
√	×	Kappa	96.13	96.38	98.77
×	√	OA (%)	96.78	97.77	99.53
×	√	Kappa	96.32	97.04	99.48

Table 11. Comparison of Params and FLOPs for different methods. Bold values indicate the best results among all compared methods.

Dataset	Metrics	Context	SSAN	3D CNN	A2S2K	Ours
IP	Params (M)	1.211	87.020	0.975	0.371	0.578
IP	FLOPs (M)	84.79	7062.96	30.284	170.45	50.98
UP	Params (M)	0.703	25.024	4.657	0.221	0.631
UP	FLOPs (M)	49.52	1982.81	2.298	87.30	43.561
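
For readers unfamiliar with how a "Params (M)" figure arises, the sketch below counts the learnable parameters of a single 3D convolution layer using the standard formula (kernel volume × input channels × output channels, plus one bias per output channel). The layer sizes are hypothetical and are not the DACINet configuration, which this section does not list.

```python
def conv3d_params(in_channels, out_channels, kernel_size, bias=True):
    """Number of learnable parameters in one 3D convolution layer."""
    depth, height, width = kernel_size
    weights = depth * height * width * in_channels * out_channels
    return weights + (out_channels if bias else 0)


# Hypothetical two-layer stack (not the paper's configuration).
layers = [
    conv3d_params(1, 8, (7, 3, 3)),
    conv3d_params(8, 16, (5, 3, 3)),
]
total_params = sum(layers)
total_params_millions = total_params / 1e6
```

Summing this count over every layer, including the 2D convolutions and the fully connected layer, and dividing by one million gives a figure in the units used in Table 11.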

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Hu, N.; Wang, Z.; Wang, M.; Zhao, Y. Double-Attention Context Interactive Network for Hyperspectral Image Classification. Remote Sens. 2026, 18, 1059. https://doi.org/10.3390/rs18071059
