1. Introduction
Hyperspectral images (HSIs), acquired by advanced sensors on satellites, aerial vehicles, or drones [1,2,3], provide continuous spectral curves for each pixel. This rich spectral information enables the precise identification of materials, making HSIs indispensable in mineral exploration [4], urban planning [5], precision agriculture [6], environmental monitoring [7], and military reconnaissance [8]. However, the very advantage of HSIs—their high dimensionality—also introduces the “curse of dimensionality” [9]. More critically, real-world HSI scenes are characterized by significant spatial–spectral heterogeneity, where the same material may exhibit varying spectral signatures due to environmental changes, and land covers often present irregular geometries and scale variations [10]. Consequently, effectively utilizing this complex data for accurate classification remains a significant challenge.
Deep learning has revolutionized HSI classification by automatically learning hierarchical representations [11,12], moving beyond handcrafted features. Among these techniques, Convolutional Neural Networks (CNNs) have become the dominant backbone. Early works, such as Li et al. [13], employed 1D CNNs to extract spectral signatures. Recognizing that HSIs are volumetric data, subsequent studies integrated spatial context. For instance, Zhang et al. [14] and Xu et al. [15] utilized dual-branch architectures to extract spectral and spatial features, respectively. While these methods improved performance, they typically rely on standard convolution operations defined on a fixed, rigid grid. This inherent rigidity assumes that relevant features always lie within a regular rectangular neighborhood, ignoring the fact that object boundaries in HSIs are often curved or fragmented. As a result, standard CNNs often fail to capture the intrinsic geometric deformations of objects, leading to feature misalignment at boundaries [16].
To address this, Li et al. [17] partitioned HSIs into multiple 3D cubes and applied 3D CNNs to perform convolutions along the spatial and spectral dimensions simultaneously. However, stacking 3D CNNs significantly increases the parameter count and leads to vanishing gradients [18,19,20]. Therefore, Roy et al. [21] proposed a hybrid 3D–2D convolution approach to reduce network complexity, and Zhong et al. [22] introduced residual structures [23] into 3D spectral and 2D spatial convolutions. To further address the geometric limitations of standard convolutions, Dai et al. [24] introduced deformable convolutions, which add learnable offsets to standard 2D convolutions to enable adaptive sampling. Yu and Koltun [25] proposed an efficient method to enlarge convolutional receptive fields. While these techniques alleviate the constraints of fixed grids, modeling long-range dependencies remains a critical challenge for CNNs.
In recent years, Transformer architectures have been introduced into computer vision and have achieved impressive results [26]. Unlike CNNs, which primarily capture local features, Transformers model long-range dependencies through global self-attention [27]. In the context of HSI classification, Hong et al. [28] reformulated the task as a sequence modeling problem and proposed a spectral Transformer that surpasses the classical ViT. Similarly, Qing et al. [29] exploited spectral attention and self-attention mechanisms, while Liu et al. [30] designed a hierarchical Transformer with shifted windows to enable multi-scale feature extraction with reduced computational redundancy. Moreover, interactive learning frameworks, such as the Center Transformer [31], have been developed to capture multi-scale spatial–spectral representations by exchanging information between center and surrounding regions.
In addition, hybrid CNN–Transformer architectures have gained popularity for combining local and global feature modeling. For instance, Sun et al. [32] proposed SSFTT, in which a Transformer encoder processes spectral–spatial features extracted by hierarchical 3D and 2D convolutional blocks. Similarly, Fu et al. [33] constructed parallel CNN and Transformer branches to integrate local and non-local features. Xu et al. [34] developed a novel Transformer architecture that incorporates embedded convolution modules to adaptively fuse features from diverse receptive fields. Roy et al. [35] designed learnable spectral and spatial morphological networks using morphological convolutions combined with attention mechanisms. Yang et al. [36] proposed a two-stream CNN using 2D and 3D convolutions for local feature extraction, followed by a Transformer to model global dependencies. To alleviate the computational burden and overfitting of Transformers, Woo et al. [37] introduced channel and spatial attention modules built from convolution operations to emulate attention effects. Furthermore, Zhang et al. [38] proposed a cascaded spatial cross-attention network that simultaneously captures local and global spatial contextual features via cross-attention. Beyond these architectures, the HSI classification field has recently progressed along several other directions. For instance, enhanced multiscale feature fusion networks [39] have been developed to capture robust spatial–spectral representations. To alleviate the reliance on massive labeled data, weakly supervised paradigms such as the ITER framework [40] generate effective image-to-pixel representations. Furthermore, with the advent of large-scale deep learning, vision-transformer-based foundation models such as HyperSIGMA [41] have emerged to unify HSI interpretation across diverse and complex scenes.
However, despite these advances, existing methods still face challenges. First, standard Transformers suffer from quadratic computational complexity ($\mathcal{O}(N^2)$), which restricts efficiency and scalability. Second, in HSI scenarios with limited samples, they are prone to overfitting. Third, standard downsampling methods in these hierarchical networks often cause the irreversible loss of fine-grained details, leading to the disappearance of small-scale objects.
To overcome these challenges—specifically geometric rigidity, high computational complexity, and information loss during downsampling—we propose an innovative hierarchical framework named Multiscale Deformable Spectral–Spatial Sequence Network (MDS3-Net). Unlike previous methods, MDS3-Net introduces a synergistic design that balances local adaptivity and global efficiency. Specifically, we design a Multiscale Spectral-Deformable Convolution (MSDC) module to simultaneously extract discriminative spectral features and adaptively align spatial features with irregular object boundaries. To resolve the quadratic complexity of Transformers, we introduce a Spectral–Spatial Sequence (S3) Encoder based on a gated convolutional mechanism, which captures long-range dependencies with linear complexity ($\mathcal{O}(N)$). Furthermore, a Dual-Path Feature Extraction (DPFE) module is proposed to perform dimension reduction while preserving salient spectral–spatial information.
The main contributions of this paper are summarized as follows:
- (1)
We propose MDS3-Net, a novel unified hierarchical framework that synergizes local geometric adaptability with efficient global modeling, achieving state-of-the-art HSI classification performance even under limited training samples.
- (2)
We design an MSDC module that decouples spectral and spatial feature extraction, enabling effective spectral discrimination and dynamic alignment with irregular object boundaries.
- (3)
We introduce an S3 Encoder that utilizes a gated large-kernel convolution mechanism to capture global long-range dependencies with linear computational complexity ($\mathcal{O}(N)$), overcoming the heavy computational burden of traditional self-attention.
- (4)
We propose a DPFE module as a semantics-preserving downsampling mechanism, which performs dimensionality reduction via spatial attention and spectral reorganization to prevent the loss of fine-grained details.
The remainder of the paper is organized as follows. Section 2 elaborates on the proposed MDS3-Net framework and its core components. Section 3 details the experimental setup, datasets, and comprehensive performance analysis. Section 4 presents a further discussion of the experimental results. Finally, Section 5 concludes the paper with a summary of findings and future directions.
2. Methodology
2.1. Overall Architecture
We present MDS3-Net, an innovative hierarchical framework designed for hyperspectral image classification, with the detailed architecture depicted in Figure 1. The MDS3-Net architecture incorporates three key synergistic components: the MSDC module, the S3 Encoder, and the DPFE module.
Prior to feature extraction, to mitigate the curse of dimensionality and spectral redundancy, we first perform Principal Component Analysis (PCA) on the raw hyperspectral imagery. Subsequently, we extract spatial neighborhoods from the dimension-reduced data, generating multiple 3D image patches denoted as $X \in \mathbb{R}^{S \times S \times B}$, where $S \times S$ represents the spatial dimensions and $B$ is the number of spectral bands. These patches serve as the input for the subsequent stage-wise hierarchical processing.
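For concreteness, the following is a minimal sketch of this preprocessing step, assuming scikit-learn's PCA and per-pixel patch extraction with reflective padding; the number of retained components (30) and the patch size S = 11 are illustrative placeholders, not values specified in this paper.

```python
import numpy as np
from sklearn.decomposition import PCA

def extract_patches(cube: np.ndarray, n_components: int = 30, S: int = 11):
    """cube: (H, W, bands) raw HSI; returns one (S, S, n_components) patch
    per pixel, stacked into an array of shape (H*W, S, S, n_components)."""
    H, W, _ = cube.shape
    flat = cube.reshape(-1, cube.shape[-1])
    reduced = PCA(n_components=n_components).fit_transform(flat)
    reduced = reduced.reshape(H, W, n_components)
    pad = S // 2  # reflective padding so border pixels get full patches
    padded = np.pad(reduced, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    patches = [padded[i:i + S, j:j + S]
               for i in range(H) for j in range(W)]
    return np.stack(patches)
```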
The proposed framework is designed to progressively extract and integrate spectral–spatial information through a pyramidal structure. Within each processing stage, we adopt a dual-branch strategy to simultaneously capture local and global information. Specifically, the MSDC module serves as the primary extractor, employing a decoupled strategy that combines spectral convolution for discriminative spectral features with deformable convolution for geometric adaptability. Simultaneously, the S3 Encoder operates along a parallel residual path, sharing the same input as the MSDC module. It models long-range sequential dependencies via a gated convolutional mechanism, achieving global receptive fields with linear computational complexity. The local features from the MSDC and the global context from the S3 Encoder are then fused via element-wise addition. Subsequently, this fused representation is fed into the DPFE module, which serves as a semantics-preserving downsampling mechanism. By prioritizing salient information preservation during resolution reduction, the DPFE effectively bridges adjacent stages.
Overall, this architectural design adheres to the principle of complementary feature learning, wherein each component performs a distinct yet cooperative role to strengthen the joint spectral–spatial representation. The MSDC emphasizes both spectral fidelity and local structural adaptability; the S3 Encoder facilitates global contextual awareness while maintaining high computational efficiency; and the DPFE functions as a critical filtering mechanism to ensure salient information preservation during spatial and spectral dimension reduction. Through the progressive integration of these complementary cues, MDS3-Net achieves a balanced synergy between local detail preservation, global semantic understanding, and model efficiency.
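The stage-wise data flow described above can be summarized in a short sketch. It assumes msdc, s3, and dpfe are instances of the modules detailed in the following subsections; the element-wise fusion of the two branches follows the description in Section 2.1.

```python
import torch.nn as nn

class Stage(nn.Module):
    """One MDS3-Net stage: parallel local/global branches bridged by DPFE."""
    def __init__(self, msdc: nn.Module, s3: nn.Module, dpfe: nn.Module):
        super().__init__()
        self.msdc, self.s3, self.dpfe = msdc, s3, dpfe

    def forward(self, x):
        fused = self.msdc(x) + self.s3(x)  # local features + global context
        return self.dpfe(fused)            # semantics-preserving downsampling
```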
2.2. MSDC
As the fundamental feature extraction unit of MDS3-Net, the MSDC module is engineered to address the inherent spectral–spatial coupling and geometric complexity of hyperspectral data. Figure 2 illustrates the internal structure of the MSDC block. It adopts a decoupled strategy combining spectral convolution and deformable spatial convolution, augmented with dual residual connections to facilitate feature reuse and gradient propagation.
The module first applies a spectral convolution whose spatial kernel size is $1 \times 1$, meaning the operation is performed exclusively in the spectral dimension to aggregate spectral information without altering the spatial structure. To preserve the original spectral fidelity and prevent network degradation, a residual connection is introduced. The intermediate output $x_{\mathrm{spe}}$ is formulated as:

$$x_{\mathrm{spe}} = \delta\left(\mathrm{BN}\left(f_{3\mathrm{D}}(x)\right)\right) + x$$

where $x$ is the input feature, $f_{3\mathrm{D}}$ denotes the 3D convolution with a spatial kernel size of $1 \times 1$, $\mathrm{BN}$ represents Batch Normalization [42], and $\delta$ denotes the ReLU activation function [43].
Subsequently, the spatial features $x_{\mathrm{spe}}$ are processed by a deformable convolution. Unlike standard convolutions that sample from a fixed grid, deformable convolution introduces learnable offsets to dynamically adjust the sampling positions [24]:

$$y(p_0) = \sum_{p_n \in \mathcal{R}} W(p_n) \cdot x(p_0 + p_n + \Delta p_n)$$

where $\mathcal{R}$ denotes the regular sampling grid, $W$ represents the convolution weights, and $\Delta p_n$ is the learnable offset for each sampling position $p_n$. Similar to the first stage, a second residual connection is applied to the spatial branch. The final output of the MSDC block is obtained by:

$$x_{\mathrm{out}} = \delta\left(\mathrm{BN}\left(f_{\mathrm{DConv}}(x_{\mathrm{spe}})\right)\right) + x_{\mathrm{spe}}$$

where $x_{\mathrm{out}}$ and $x_{\mathrm{spe}}$ denote the output and input feature maps of the residual connection, respectively, $f_{\mathrm{DConv}}$ represents the deformable convolution operation, and $\mathrm{BN}$ and $\delta$ denote the operations defined previously. Notably, the symbol $+$ denotes element-wise addition rather than feature concatenation. This operation acts as a standard residual connection to refine the representations while preserving the original channel dimensions, thereby avoiding the drastic increase in computational complexity that concatenation would cause in subsequent deep layers.
To capture features at varying scales and receptive fields, the MSDC module employs a hierarchical kernel configuration. In the shallow stages (Block 1 and Block 2), we utilize a smaller kernel size to capture fine-grained texture and local spectral variations; in the deeper stages (Block 3 and Block 4), the kernel size is increased to expand the receptive field and encapsulate broader semantic context. This multiscale design enables the network to effectively recognize objects of various sizes, ranging from small targets to large homogeneous regions.
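To make the decoupled design concrete, the following is a minimal PyTorch sketch of one MSDC block under stated assumptions: the spectral convolution is realized as nn.Conv3d with a (k, 1, 1) kernel, the spatial branch flattens the spectral axis into channels and applies torchvision.ops.DeformConv2d with offsets predicted by a plain convolution, and the default kernel sizes are illustrative rather than the per-stage values used in the paper.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class MSDCBlock(nn.Module):
    def __init__(self, channels, bands, k=3, sk=3):
        super().__init__()
        # Spectral branch: kernel spans only the spectral axis; padding
        # keeps the spectral length unchanged so the residual add is valid.
        self.spec_conv = nn.Conv3d(channels, channels, (k, 1, 1),
                                   padding=(k // 2, 0, 0))
        self.spec_bn = nn.BatchNorm3d(channels)
        # Spatial branch: 2 offsets (dy, dx) per kernel sampling location.
        c2d = channels * bands
        self.offset = nn.Conv2d(c2d, 2 * sk * sk, sk, padding=sk // 2)
        self.dconv = DeformConv2d(c2d, c2d, sk, padding=sk // 2)
        self.spat_bn = nn.BatchNorm2d(c2d)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                    # x: (N, C, bands, H, W)
        x_spe = self.act(self.spec_bn(self.spec_conv(x))) + x  # first residual
        n, c, b, h, w = x_spe.shape
        y = x_spe.reshape(n, c * b, h, w)    # fold spectra into channels
        y = self.act(self.spat_bn(self.dconv(y, self.offset(y)))) + y
        return y.reshape(n, c, b, h, w)      # second residual applied above
```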
2.3. S3 Encoder
While the MSDC module excels at extracting local spectral–spatial features, it inherently lacks the ability to model global contextual dependencies due to its limited receptive field [44]. To address this limitation, we introduce the S3 Encoder. Unlike traditional Transformers, which suffer from quadratic computational complexity with respect to token length, the S3 Encoder models long-range interactions with linear complexity via a gated convolutional mechanism.
As shown in Figure 3, the encoder block comprises two synergistic sub-modules: the Gated Spectral–Spatial Mixer (GS2M) and the Feed-Forward Network (FFN). Layer Normalization (LN) [45] is applied before each sub-module to normalize feature distributions, and residual connections are employed after each block. This design effectively alleviates the vanishing gradient problem during the training of deep networks.
2.3.1. GS2M
The GS2M is specifically designed to replace the computationally intensive Multi-Head Self-Attention (MHSA). As depicted in Figure 4, it adopts a large-kernel convolution combined with a gating mechanism to efficiently aggregate global context.
Given a normalized input feature map $X$, the module first projects it into a hidden representation using a $1 \times 1$ convolution. This representation is then split along the channel dimension into two parallel branches: the gating branch and the feature branch. The gating branch utilizes a depthwise convolution with a large kernel size to capture broad spatial cues, followed by a GELU activation to generate a spatial attention map. Simultaneously, the feature branch retains the local spectral details. The attention map then modulates the feature branch via element-wise multiplication. The mathematical formulation is defined as:

$$[X_g, X_f] = \mathrm{Split}\left(W_{\mathrm{in}}(X)\right), \qquad Y = W_{\mathrm{out}}\left(\phi\left(\mathrm{DWConv}(X_g)\right) \odot X_f\right)$$

where ⊙ denotes the element-wise multiplication, $X_g$ and $X_f$ are the gating and feature branches, $\phi$ is the GELU activation, $\mathrm{DWConv}$ is the large-kernel depthwise convolution, and $W_{\mathrm{out}}$ is the output projection layer. This gating design allows the model to adaptively select spectral–spatial features based on global context while maintaining a linear computational complexity of $\mathcal{O}(N)$, where $N$ is the number of pixels.
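A minimal sketch of the GS2M gating path is given below, assuming the hidden projection doubles the channel width so the split yields two equal halves, and using a hypothetical depthwise kernel size k = 11 in place of the unspecified large kernel.

```python
import torch.nn as nn

class GS2M(nn.Module):
    def __init__(self, dim, k=11):
        super().__init__()
        self.proj_in = nn.Conv2d(dim, 2 * dim, 1)   # W_in: 1x1 projection
        self.dwconv = nn.Conv2d(dim, dim, k, padding=k // 2,
                                groups=dim)          # large-kernel DWConv
        self.act = nn.GELU()                         # phi
        self.proj_out = nn.Conv2d(dim, dim, 1)       # W_out

    def forward(self, x):                 # x: (N, C, H, W), already normalized
        x_g, x_f = self.proj_in(x).chunk(2, dim=1)  # gating / feature branches
        gate = self.act(self.dwconv(x_g))            # spatial attention map
        return self.proj_out(gate * x_f)             # element-wise modulation
```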
2.3.2. FFN
As illustrated in Figure 3, the output of the GS2M is subsequently processed by the FFN. Standard FFNs in Transformers typically operate in a pixel-wise manner (using two $1 \times 1$ convolutions), which may overlook local structural details. To mitigate this, our FFN integrates a depthwise convolution within the expansion layer. This locality-enhanced design ensures that fine-grained texture information is preserved and refined during the channel mixing process. The FFN can be expressed as:

$$\mathrm{FFN}(X) = W_2\left(\phi\left(\mathrm{DWConv}(W_1(X))\right)\right)$$

where $W_1$ and $W_2$ denote the pointwise expansion and projection convolutions, $\mathrm{DWConv}$ is the depthwise convolution, and $\phi$ denotes the GELU activation. This modification effectively complements the global modeling capability of the GS2M, creating a comprehensive feature encoder.
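The locality-enhanced FFN can be sketched as follows, assuming an expansion ratio of 4 and a 3×3 depthwise kernel; neither value is specified above, so both are placeholders.

```python
import torch.nn as nn

class LocalFFN(nn.Module):
    def __init__(self, dim, ratio=4):
        super().__init__()
        hidden = dim * ratio
        self.net = nn.Sequential(
            nn.Conv2d(dim, hidden, 1),                              # W1: expand
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden), # DWConv
            nn.GELU(),                                              # phi
            nn.Conv2d(hidden, dim, 1),                              # W2: project
        )

    def forward(self, x):              # x: (N, C, H, W), already normalized
        return self.net(x)
```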
2.4. DPFE
Downsampling is a pivotal operation in hierarchical networks. However, standard pooling methods often result in the irreversible degradation of fine-grained details, leading to the disappearance of small-scale objects. To address this, we propose the DPFE module. Distinct from the S3 Encoder, which focuses on global feature modeling, the DPFE functions as a downsampling mechanism dedicated to preserving key spectral and spatial information. It is explicitly designed to filter background noise and minimize the loss of salient features during spatial and spectral dimension reduction.
As illustrated in Figure 5, the DPFE module operates through two parallel paths. The spectral reorganization path is designed to perform a linear spectral transformation. It employs a pointwise ($1 \times 1$) convolution to project the input spectral features onto the target dimension, followed by an Average Pooling layer that performs spatial downsampling. This path ensures that essential spectral context is efficiently transferred during the spatial and spectral dimension reduction process.
Simultaneously, the spatial squeeze path functions as a global attention filter. It first reduces channel dimensionality via a pointwise ($1 \times 1$) convolution to generate intermediate features $X'$. These features are then processed by a large-kernel depthwise convolution and a Sigmoid activation to generate a spatial attention map. This map modulates $X'$ via element-wise multiplication, effectively suppressing background noise and highlighting salient regions. Finally, the refined features undergo spatial downsampling identical to that of the spectral reorganization path. The outputs from both paths are fused via element-wise addition to integrate local spectral details with global salient semantics. The operation is summarized as:

$$F_{\mathrm{spe}} = W_{\mathrm{spe}}(X), \qquad F_{\mathrm{spa}} = \sigma\left(\mathrm{DWConv}(X')\right) \odot X', \quad X' = W_{\mathrm{spa}}(X),$$
$$Y = \mathrm{Pool}(F_{\mathrm{spe}}) + \mathrm{Pool}(F_{\mathrm{spa}}),$$

where $F_{\mathrm{spe}}$ and $F_{\mathrm{spa}}$ denote the intermediate features generated by the spectral reorganization path and the spatial squeeze path, respectively, $\mathrm{Pool}$ represents the Average Pooling operation, and $\sigma$ refers to the Sigmoid activation function. Specifically, $W_{\mathrm{spe}}$ and $W_{\mathrm{spa}}$ correspond to the distinct pointwise convolutions utilized in these two respective paths.
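The dual-path computation maps directly to a short module. The sketch below assumes 2×2 average pooling with stride 2 and a hypothetical depthwise kernel size k = 7; the actual pooling window and kernel size are not specified in the text.

```python
import torch
import torch.nn as nn

class DPFE(nn.Module):
    def __init__(self, in_ch, out_ch, k=7):
        super().__init__()
        self.w_spe = nn.Conv2d(in_ch, out_ch, 1)    # spectral reorganization
        self.w_spa = nn.Conv2d(in_ch, out_ch, 1)    # channel squeeze
        self.dwconv = nn.Conv2d(out_ch, out_ch, k, padding=k // 2,
                                groups=out_ch)       # large-kernel depthwise
        self.pool = nn.AvgPool2d(2)                  # halves spatial resolution

    def forward(self, x):                            # x: (N, C_in, H, W)
        f_spe = self.w_spe(x)                        # linear spectral projection
        x_sq = self.w_spa(x)
        attn = torch.sigmoid(self.dwconv(x_sq))      # spatial attention map
        f_spa = attn * x_sq                          # suppress background noise
        return self.pool(f_spe) + self.pool(f_spa)   # fuse at reduced resolution
```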
4. Discussion
4.1. Mechanism Analysis of Performance Superiority
The extensive experimental results in Section 3 validate that MDS3-Net outperforms current state-of-the-art methods. Beyond the numerical improvements, it is crucial to understand the underlying mechanisms driving this success. The superiority of MDS3-Net primarily stems from its ability to address the challenges of HSI classification along three synergistic dimensions: the unified extraction of local spatial–spectral features, the efficient modeling of global long-range dependencies, and the preservation of salient information during dimensionality reduction.
First and foremost, the MSDC module serves as the core engine for joint spectral–spatial feature extraction. Unlike standard CNNs that utilize fixed square kernels and treat spatial–spectral dimensions rigidly, the MSDC module introduces a dual-enhancement mechanism. Regarding spatial adaptability, land cover objects—such as the complex urban structures in HS2013 or winding roads in Trento—often exhibit irregular shapes that do not conform to fixed grids. The deformable mechanism in MSDC frees the receptive field from the rigid grid, allowing sampling locations to dynamically align with object boundaries, thereby effectively reducing “mixed pixel” interference at edges. Concurrently, in terms of spectral discrimination, the MSDC employs a cascaded strategy. It first applies multiscale spectral convolution to extract discriminative spectral signatures across different local band ranges. These spectrally refined features are then sequentially processed by the spatial deformable convolution. This serial design ensures that the network distinguishes between materials with subtle spectral discrepancies before performing geometric alignment. Therefore, the high classification accuracy of MDS3-Net in heterogeneous regions is a direct result of this synergy—MSDC refines features spectrally and then aligns them spatially.
Second, the ablation study (Table 6) confirms the necessity of the S3 Encoder. While MSDC excels at local spectral–spatial extraction, pure convolutional operations inherently struggle to capture long-range dependencies. The S3 Encoder compensates for this by utilizing a gated large-kernel mechanism to model global sequential relationships across the spectral–spatial domain. Unlike traditional Transformers that rely on computationally intensive self-attention, the S3 Encoder achieves this global modeling with linear complexity. Consequently, the architecture establishes a complementary hierarchy: localized, adaptive feature extraction through the MSDC, supplemented by efficient global context refinement through the S3 Encoder.
Third, the DPFE module plays a critical role in maintaining feature integrity during downsampling. In conventional hierarchical networks, standard pooling operations often lead to the irreversible loss of fine-grained details, causing small-scale objects to disappear in deeper layers. The DPFE addresses this by employing a dual-path strategy: a spectral reorganization path to strictly preserve spectral context and a spatial squeeze path to highlight salient regions via spatial attention. By filtering background noise while retaining key semantic information during dimension reduction, the DPFE effectively bridges adjacent stages, ensuring that the network maintains high distinctiveness even for small targets or complex boundaries.
4.2. Architectural Efficiency and Practicality
In the realm of HSI classification, achieving a balance between high accuracy and low computational cost is a pivotal consideration for practical deployment. The efficiency of the proposed MDS3-Net stems from its strategic architectural design.
Specifically, the efficiency and practicality of MDS3-Net are realized through the layer-wise synergistic operation of its three core components. First, within each processing stage, the MSDC module and the S3 Encoder work in a complementary manner. The MSDC utilizes decoupled convolutions to efficiently extract dense local features. Simultaneously, the S3 Encoder, strategically embedded in the residual path, captures global context to rectify these local representations. Distinct from standard ViTs, our S3 Encoder avoids the quadratic complexity ($\mathcal{O}(N^2)$) of self-attention, ensuring that the overhead of global modeling remains manageable.
Second, connecting these stages is the DPFE module, which acts as a semantics-preserving compressor. Unlike standard pooling layers that indiscriminately discard information, the DPFE employs a dual-path strategy: a spectral reorganization path to linearly project spectral dimensions and a spatial squeeze path to filter background noise via spatial attention. This hierarchical architecture optimizes the allocation of computational resources. By progressively reducing feature resolution while preserving salient information through DPFE, the network ensures that deeper layers operate on compact, high-level semantic embeddings. This “coarse-to-fine” processing flow allows MDS3-Net to retain powerful global modeling capabilities without incurring the prohibitive costs associated with full-resolution processing, thereby achieving an optimal balance between inference speed and classification accuracy.
4.3. Limitations
Despite the superior classification performance and competitive efficiency achieved by MDS3-Net, there remain limitations regarding model complexity that warrant further discussion. Although MDS3-Net is significantly faster than standard Transformer-based methods, it inevitably incurs higher storage and computational costs compared to extremely lightweight CNNs (such as SSRN). Specifically, the calculation of learnable offsets in the MSDC module and the large-kernel depthwise convolutions in the S3 Encoder require more floating-point operations than simple static convolutions. This reflects a necessary trade-off to achieve high-precision classification in complex scenes. To address this, future work will focus on developing lightweight versions of MDS3-Net, thereby further reducing the resource overhead to enhance deployability on resource-constrained platforms.
5. Conclusions
In this paper, we have proposed MDS3-Net, a novel hierarchical framework designed to address the dual challenges of geometric rigidity and computational inefficiency in HSI classification. By synergizing the MSDC with the S3 Encoder, our method effectively unifies the adaptive extraction of local spectral–spatial features and the modeling of global long-range dependencies with linear computational complexity. Additionally, the DPFE module functions as a critical bridge between stages, facilitating semantics-preserving dimensionality reduction through spectral reorganization and spatial attention mechanisms.
Extensive experiments on four benchmark datasets (UP, HS2013, LK, and UT) demonstrate that MDS3-Net consistently achieves state-of-the-art classification performance, particularly in scenes characterized by complex geometric boundaries and significant spectral variability. Quantitative comparisons and ablation studies further validate the necessity of each synergistic component—MSDC for joint spectral discrimination and geometric alignment, S3 Encoder for efficient global context modeling, and DPFE for semantics-preserving dimensionality reduction. Moreover, the complexity analysis confirms that MDS3-Net attains an optimal trade-off between accuracy and efficiency, significantly outperforming standard Transformer-based models in terms of training speed while maintaining superior precision.
In future work, we intend to focus on developing lightweight versions of MDS3-Net. This will aim to reduce the parameter count and computational overhead identified in the limitations, thereby further facilitating the deployment of high-performance HSI classification models on resource-constrained edge devices.