Article

EDTST: Efficient Dynamic Token Selection Transformer for Hyperspectral Image Classification

1 College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
2 College of Science, National University of Defense Technology, Changsha 410073, China
3 College of Advanced Interdisciplinary Studies, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2025, 17(18), 3180; https://doi.org/10.3390/rs17183180
Submission received: 29 July 2025 / Revised: 9 September 2025 / Accepted: 11 September 2025 / Published: 14 September 2025


Highlights

What are the main findings?
  • We propose EDTST, a novel and efficient Vision Transformer architecture that integrates large-kernel 3D convolution with a dynamic token selection mechanism for hyperspectral image classification.
  • EDTST achieves state-of-the-art classification accuracy with a 3% improvement in overall accuracy on the WHU-Hi-HanChuan dataset, while requiring the shortest training and inference time among recent models.
What is the implication of the main finding?
  • The model significantly enhances computational efficiency by reducing parameters and FLOPs through innovative architectural design, making it suitable for resource-constrained applications.
  • It establishes a new benchmark for balancing accuracy and efficiency in hyperspectral image analysis, providing a practical solution for real-world remote sensing tasks.

Abstract

Hyperspectral images, characterized by rich spectral information, enable precise pixel-level classification and are thus widely employed in remote sensing applications. Although convolutional neural networks (CNNs) have demonstrated effectiveness in hyperspectral image processing, their limited receptive fields constrain their capacity to capture long-range dependencies. Transformers excel at modeling long-range dependencies for hyperspectral image classification (HSIC), yet they often fail to represent local spectral–spatial characteristics effectively and incur computational redundancy from numerous classification-irrelevant tokens. To address these challenges, we propose EDTST, an efficient Vision Transformer architecture specifically designed for hyperspectral image classification. The model utilizes a large-kernel 3D convolution block to extract deep spectral–spatial features. A 2D convolution block further refines these features, followed by a novel attention mechanism with dynamic token pruning that substantially reduces the computational load by focusing on the most pertinent features. The process concludes with an adaptive average pooling layer and a fully connected layer for classification. Extensive experiments on four standard hyperspectral datasets demonstrate that EDTST achieves the highest classification accuracy, with a notable 3% improvement in overall accuracy on the WHU-Hi-HanChuan dataset, while requiring the shortest training and inference time among all compared state-of-the-art models from the past three years. These results validate the efficacy of our approach in achieving superior performance with markedly improved computational efficiency.

1. Introduction

Hyperspectral images capture rich spectral information for each pixel, enabling precise classification into distinct ground object categories [1]. This capability has made them invaluable in applications such as precision agriculture, remote sensing, and material analysis. As a fundamental step in hyperspectral data processing, accurate pixel-level classification is crucial for tasks such as terrain mapping and material identification.
The performance of hyperspectral image classification hinges on the extraction of discriminative features. Conventional machine learning approaches—including random forests [2], decision trees [3], and support vector machines (SVMs) [4]—rely on handcrafted features for classification. Zhang et al. [5] pioneered edge-guided feature extraction using Extended Relative Total Variation (ERTV), where spectral dimensionality reduction precedes structure extraction guided by learned edge probability maps. This approach demonstrated significant performance gains with limited training samples. However, these methods demand domain expertise to design effective feature extraction strategies. With the rapid advancement of hyperspectral imaging technology, the increasing diversity of sensors and the exponential growth of data volume have led to greater heterogeneity across datasets. Consequently, traditional feature-based methods often face stability limitations when handling large-scale, highly variable hyperspectral imagery.
In recent years, deep learning has emerged as a powerful alternative for hyperspectral image classification. By leveraging hierarchical feature learning through diverse network architectures, deep learning models can automatically extract robust spectral–spatial features without manual engineering. This data-driven paradigm has achieved significant progress in hyperspectral image processing, demonstrating superior adaptability and performance. Notably, convolutional neural networks (CNNs) have gained widespread adoption in hyperspectral image classification due to their ability to preserve the inherent spatial structure of input data while effectively extracting discriminative spectral–spatial features. This structural preservation enables CNNs to maintain the integrity of local spectral signatures and spatial contextual information throughout the feature extraction process. For example, a multi-scale 3D CNN model [6] was introduced for hyperspectral image classification. Following this, Roy et al. [7] proposed HybridSN, an integrated model leveraging both 2D spatial and 3D spectral–spatial convolutions for hyperspectral image classification.
However, CNNs may be limited in capturing long-range dependencies and the complex spectral–spatial correlations inherent in HSI [8]. Recently, Vision Transformers (ViTs) have emerged as a powerful alternative, leveraging self-attention mechanisms to model global relationships [9]. Nevertheless, standard ViTs are computationally intensive and require large amounts of training data, making them less suitable for HSI where labeled data is limited.
Recent developments in Transformer-based models, such as the Vision Transformer (ViT) [9], have shown promising results in image processing. Several efforts have been made to apply Transformer models to HSI classification. For instance, He et al. [8] proposed a spatial–spectral transformer (SST) model which utilizes a VGGNet-like structure [10] to extract spatial features and dense transformers to model spectral relationships. Hong et al. [11] developed SpectralFormer (SF), which learns spectral features from neighboring bands using cross-layer transformer encoders. More recently, Sun et al. [12] proposed a model that integrates a CNN backbone with transformers. The CNN captures low-level spectral–spatial features, while the transformers learn contextualized spectral relationships and incorporate spatial information. Li et al. [13] proposed a lightweight self-pooling Transformer called SPFormer. SPFormer reduces model complexity by employing a one-layer self-supervised autoencoder for dimensionality reduction and introduces parameter-free modules such as Channel Shuffle for Multihead Self-Pooling with Sparse Mapping (CSSM-MHSP) and a Central Token Mixer (CTM) to enhance spectral feature mapping and pixel-wise information interaction. Liang et al. [14] introduced a transformer–CNN hybrid (FTSCN) that integrates SimAM-based attention mechanisms with hierarchical dense networks. Their dual-branch architecture leverages spatial SimAM modules for discriminative pixel weighting and squeeze-enhanced axial transformers for global spectral dependencies, achieving state-of-the-art accuracy on four datasets while reducing training time by 40–60% compared to 3D-CNN baselines. Zhang et al. [15] proposed GMLMPTF, a pyramid texture filtering (PTF) framework extracting multi-scale global structures through parameter-varied PTF operations, while local features derive from superpixel-optimized PTF outputs. Their probabilistic fusion of global and local probabilities yielded > 4% OA improvement on agricultural datasets with minimal training samples. Similarly, Zhang et al. [16] developed PFS3F to mitigate segmentation inaccuracies by probabilistically fusing superpixel-optimized spatial features with semantic-aware structural features (S2Fs) constrained by edge information. Zhang et al. [17] introduced MSTRF using exponential windowed inherent variance (eWIV) maps to guide structural-textural-aware recursive filtering. This parameter-free attention mechanism dynamically weights spatial–spectral features, enabling > 92% OA on complex urban scenes with five labeled samples/class. Earlier foundational work by Zhang et al. [18] established contour structural profiles (CSPs) using edge-aware total variation models, where KPCA-fused multi-scale CSPs preserved field boundaries with 90.78% OA on Salinas. Xue et al. introduced an Attention-Gated Tuning strategy with a Triplet Transformer (Tri-Former), enabling effective knowledge transfer across different HSI and even RGB datasets [19]. For multimodal fusion, Huang et al. presented MCFTNet, a cross-layer fusion transformer that integrates LiDAR data as an external classification token to improve feature propagation [20]. Zhu et al. proposed HMAT, a hierarchical aggregation network combining multihead axial attention and pyramid convolution to fuse hyperspectral and LiDAR features [21]. Wu et al. further developed a Spectral Spatial Window Attention Transformer that uses cross-window attention to capture long-range dependencies [22].
Despite these advances, existing approaches face several limitations: (1) high computational complexity, (2) inefficient feature extraction from both spatial and spatial–spectral domains, and (3) suboptimal performance under limited data conditions. These limitations are further summarized by the inherent trade-offs between CNN and Transformer-based models as illustrated in Table 1.
To address these challenges, we introduce EDTST, a novel hybrid 3D-CNN and Vision Transformer architecture specifically crafted for efficient hyperspectral image classification that synergistically combines the strengths of both paradigms. The initial 3D/2D CNN layers serve as a local feature extractor, efficiently capturing discriminative spectral–spatial patterns from neighboring pixels. Subsequently, the Transformer encoder operates on this condensed feature set, modeling the long-range contextual dependencies that are crucial for disambiguating spectrally similar but semantically distinct categories. This design not only enhances classification performance by leveraging both local and global information but also significantly improves computational efficiency by reducing the sequence length of tokens presented to the Transformer. The primary contributions of our research are as follows:
  • To significantly reduce the parameter count and computational cost of 3D convolutions, we introduce a novel and efficient large-kernel 3D convolution block designed specifically for hyperspectral image processing. The architecture employs a 7 × 7 × 7 spectral–spatial convolution followed by two pointwise (1 × 1 × 1) convolutions. This design achieves a receptive field equivalent to that of three stacked 3 × 3 × 3 convolutions, while using only 39.8% of the parameters and 40% of the FLOPs. The module maintains strong representational capacity and is highly suitable for processing high-dimensional hyperspectral data due to its efficient parameter allocation and competitive computational efficiency.
  • To alleviate the computational redundancy associated with processing a large number of tokens in the Transformer block, we propose a dynamic token selection mechanism that significantly reduces computational complexity. By selectively retaining the 75% most informative tokens and pruning redundant ones, this strategy focuses computational resources on the most critical features, thereby maintaining or even enhancing the model’s classification accuracy while improving efficiency.
  • Extensive experimental evaluations on multiple benchmark hyperspectral datasets demonstrate that our proposed model, EDTST, consistently outperforms a range of state-of-the-art methods in classification accuracy while achieving superior computational efficiency. Notably, on the WHU-Hi-HanChuan dataset, EDTST attains a notable 3% improvement in overall accuracy compared to leading models proposed in recent years. Furthermore, EDTST requires the shortest training and inference time across all datasets tested, underscoring its practical utility for resource-constrained scenarios. These results conclusively validate the effectiveness of our architectural choices in balancing high performance with operational efficiency.
These contributions collectively enhance the applicability and effectiveness of Vision Transformers in hyperspectral image analysis, setting new benchmarks for future research in the field.
The rest of the paper is organized as follows. Section 2 introduces the related work. Section 3 provides a detailed description of the proposed framework. Section 4 presents experimental results on four hyperspectral datasets. The ablation analyses are reported in Section 5. Finally, Section 6 concludes the paper and summarizes our study.

2. Related Work

The attention mechanism in Transformer models [23] initially demonstrated remarkable success in natural language processing (NLP). Subsequent research extended this capability to computer vision, where Dosovitskiy et al. [9] pioneered Vision Transformer (ViT) for image classification. The long-range dependency modeling of ViT proves particularly advantageous for hyperspectral imagery (HSI), effectively capturing rich spectral information across hundreds of bands while overcoming the limited receptive fields of convolutional networks. This prompted further innovations: He et al. [24] developed HSI-BERT with bidirectional encoding, while He et al. [25] introduced Spa-Spe-TR, a dual-branch architecture with dedicated spectral (Spe-TR) and spatial (Spa-TR) transformer modules.
Recognizing limitations in pure transformer approaches, recent works explore hybrid CNN–transformer architectures [8,12,26]. For instance, Ouyang et al. [27] proposed HybridFormer, integrating convolutions with attention to capture global spectral–spatial dependencies. Similarly, Qi et al. [28] designed a global–local 3D convolutional transformer to model local spectral correlations within global sequences, and Zhao et al. [29] created CTFSN, fusing local–global features through additive and channel-superposition methods.
Recent advances continue to refine this hybrid paradigm. Zhao et al. proposed a Self- and Cross-Attention Enhanced Transformer (SCAET) that employs a dual-branch CNN for visible and thermal infrared HSI feature extraction, followed by interactive enhancement through attention mechanisms [30]. To improve center pixel discrimination, Jia et al. introduced CenterFormer, which employs a center spatial–spectral attention mechanism to emphasize label-relevant pixels [31], while Yu et al. developed a Center-Specific Perception Transformer (CP-Transformer) that uses label-guided attention to prioritize spectral and spatial features related to the center pixel [32].
Efficiency and deployment concerns have also been addressed through novel architectures. Hu et al. designed BinaryViT, a binary Vision Transformer that significantly reduces computational cost through adaptive binarization while maintaining competitive accuracy [33]. Wang et al. proposed an Efficient Attention Transformer Network (EATN) that incorporates self-similarity descriptors to enhance feature relevance and reduce redundancy [34].
While existing methods have made considerable progress in either improving accuracy or reducing computational overhead, few successfully balance both objectives. Many hybrid models still process a large number of tokens, leading to redundant computations, and often overlook the importance of dynamic, input-aware feature reduction. Some efficiency-oriented approaches resort to techniques such as network-wide architectural modifications, external self-similarity descriptors, or binarization. While these strategies can reduce inference cost to some extent, they often introduce additional training complexity or auxiliary computational burdens, without fundamentally addressing the core issue of token redundancy in the attention mechanism itself.
In contrast, our approach, EDTST, introduces a dynamic token selection mechanism within the transformer module to aggressively reduce sequence length without sacrificing discriminative power. Rather than relying on external aids or reduced-precision arithmetic, our method maintains full precision while adaptively focusing computation only on semantically rich regions. Furthermore, by combining 3D convolutional feature extraction with a transformer encoder, we achieve more effective local–global integration than purely attention-based or convolution-only models. This integrated design not only improves accuracy but also substantially decreases computational load during both training and inference, by directly targeting and pruning redundant tokens at the feature level.

3. Materials and Methods

The proposed EDTST architecture introduces three key innovations for efficient hyperspectral image classification: (1) a large-kernel 3D convolution block, (2) a 2D convolution block, and (3) a Transformer block. Figure 1 illustrates the overall architecture. In this section, we detail each component and their synergistic integration.

3.1. Large-Kernel 3D Convolution Block

The large-kernel 3D convolution block serves as the primary spectral–spatial feature extractor in our architecture, designed to process raw hyperspectral data cubes while preserving critical spectral information. As shown in Figure 2, this component operates directly on the input tensor $\mathbf{X} \in \mathbb{R}^{B \times 1 \times D \times H \times W}$, where $B$ denotes the batch size, $D$ the spectral depth (number of bands), and $H \times W$ the spatial dimensions. The block comprises three carefully designed operations.
Feature Extraction Layer: The core operation employs a multi-stage convolutional module featuring the following.
A 3D convolution with kernel size $7 \times 7 \times 7$ and padding 3 to maintain spatial–spectral dimensions, implemented as
$\mathbf{Y}_1 = \mathrm{Conv3D}_{7 \times 7 \times 7}(\mathbf{X})$
Group normalization with number of groups equal to input channels, ensuring stable training dynamics:
$\mathbf{Y}_2 = \gamma \cdot \dfrac{\mathbf{Y}_1 - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$
Channel expansion via $1 \times 1 \times 1$ convolution increasing dimensionality by an expansion ratio of $r = 4$:
$\mathbf{Y}_3 = \mathrm{Conv3D}_{1 \times 1 \times 1}(\mathbf{Y}_2)$
Gaussian Error Linear Unit (GELU) activation for non-linear transformation:
$\mathbf{Y}_4 = \dfrac{1}{2}\,\mathbf{Y}_3\left(1 + \mathrm{erf}\!\left(\dfrac{\mathbf{Y}_3}{\sqrt{2}}\right)\right)$
Channel compression via 1 × 1 × 1 convolution, reducing dimensionality to target channels:
$\mathbf{Y}_5 = \mathrm{Conv3D}_{1 \times 1 \times 1}(\mathbf{Y}_4)$
This hierarchical design enables efficient feature extraction across spectral–spatial domains while maintaining the tensor dimensions $(D, H, W)$.
Normalization Layer: The extracted features undergo 3D batch normalization:
$\mathbf{Z} = \dfrac{\mathbf{Y}_5 - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \cdot \gamma + \beta$
where $\mu_B$ and $\sigma_B$ are batch-wise statistics computed across spatial–spectral dimensions. This operation mitigates internal covariate shift and accelerates convergence.
Activation Layer: Finally, GELU activation enhances representation power:
$\mathbf{O} = \mathrm{GELU}(\mathbf{Z})$
producing the output tensor $\mathbf{O} \in \mathbb{R}^{B \times 8 \times D \times H \times W}$.
The design prioritizes parameter efficiency through the following.
Large-kernel spatial–spectral convolution ($7^3$) capturing wide contextual relationships;
Channel expansion strategy ($1 \to 4 \to 8$ channels) enhancing representational capacity;
Minimal parameters (410 total) enabling deployment in resource-constrained environments.
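To make the block concrete, the following PyTorch sketch assembles these operations. The channel widths (a 7 × 7 × 7 convolution with a single input and output channel, then pointwise 1 → 4 → 8 convolutions) are assumptions chosen so that the total parameter count matches the reported 410; the authors' implementation may differ in such details.

```python
import torch
import torch.nn as nn

class LargeKernel3DBlock(nn.Module):
    """Sketch of the large-kernel 3D convolution block (Section 3.1).
    Channel widths are assumptions consistent with the 410-parameter total."""
    def __init__(self, in_channels=1, expand_channels=4, out_channels=8):
        super().__init__()
        # 7x7x7 spectral-spatial convolution; padding 3 preserves (D, H, W)
        self.conv7 = nn.Conv3d(in_channels, in_channels, kernel_size=7, padding=3)
        # group normalization with one group per input channel
        self.gn = nn.GroupNorm(num_groups=in_channels, num_channels=in_channels)
        # pointwise channel expansion (ratio r = 4) and projection to the target channels
        self.expand = nn.Conv3d(in_channels, expand_channels, kernel_size=1)
        self.project = nn.Conv3d(expand_channels, out_channels, kernel_size=1)
        self.bn = nn.BatchNorm3d(out_channels)
        self.act = nn.GELU()

    def forward(self, x):                            # x: (B, 1, D, H, W)
        y = self.gn(self.conv7(x))                   # Y1, Y2
        y = self.project(self.act(self.expand(y)))   # Y3, Y4, Y5
        return self.act(self.bn(y))                  # Z, O: (B, 8, D, H, W)

block = LargeKernel3DBlock()
print(sum(p.numel() for p in block.parameters()))    # 410
print(block(torch.randn(2, 1, 40, 11, 11)).shape)    # torch.Size([2, 8, 40, 11, 11])
```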
As verified in Table 2, the block maintains spectral dimensionality D throughout processing, preserving critical hyperspectral information for downstream components. The output serves as enriched input to the 2D spatial processing block described in Section 3.2.

3.2. Two-Dimensional Convolution Block

Following the spectral–spatial feature extraction in Section 3.1, the 2D convolution block processes the output tensor $\mathbf{O} \in \mathbb{R}^{B \times 8 \times D \times H \times W}$ to extract discriminative spatial features while reducing computational complexity. As illustrated in Figure 1, this component transforms the volumetric representation into compact spatial embeddings suitable for subsequent transformer processing. As shown in Figure 3, the block employs a simple yet effective convolutional structure comprising three core operations.
Feature Transformation Layer: The primary operation utilizes a 2D spatial convolution with kernel size 3 × 3 to capture local spatial patterns:
$\mathbf{Y} = \mathrm{Conv2D}_{3 \times 3}(\mathbf{O}_{\mathrm{flat}})$
where $\mathbf{O}_{\mathrm{flat}} \in \mathbb{R}^{B \times C_{\mathrm{in}} \times H \times W}$ is formed by flattening the spectral dimension ($C_{\mathrm{in}} = 8 \times D$). The convolution uses stride 1 and padding 1 to maintain spatial dimensions, formally expressed as
$\mathbf{Y}_{b, c_{\mathrm{out}}, i, j} = \sum_{k=1}^{C_{\mathrm{in}}} \sum_{m=-1}^{1} \sum_{n=-1}^{1} \mathbf{W}_{c_{\mathrm{out}}, k, m, n}\, (\mathbf{O}_{\mathrm{flat}})_{b, k, i+m, j+n} + b_{c_{\mathrm{out}}}$
This operation effectively compresses spectral-channel information while preserving spatial relationships.
Normalization Layer: Batch normalization is applied to stabilize feature distributions:
$\mathbf{Z} = \dfrac{\mathbf{Y} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \cdot \gamma + \beta$
where $\mu_B$, $\sigma_B$ are batch-wise statistics computed across spatial positions, and $\gamma, \beta \in \mathbb{R}^{E}$ are learnable affine parameters. This operation mitigates covariate shift during training, particularly crucial when transitioning from 3D to 2D representations.
Activation Layer: GELU activation introduces non-linear transformation while maintaining gradient continuity:
$\mathbf{O}_{2D} = \mathrm{GELU}(\mathbf{Z})$
producing the output tensor $\mathbf{O}_{2D} \in \mathbb{R}^{B \times E \times H \times W}$.
As shown in Table 3, the block achieves linear complexity relative to input channels $C_{\mathrm{in}}$, making it scalable for high-dimensional hyperspectral data. The output $\mathbf{O}_{2D}$ serves as spatial token embeddings for the Transformer block described in Section 3.3, preserving spatial topology while encoding rich spectral–spatial features.
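A minimal PyTorch sketch of this block follows. The embedding dimension $E = 64$ (matching the token width $C$ used in Section 3.3) and the spectral depth $D = 40$ (the number of PCA components retained in Section 4.2.1) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Conv2DBlock(nn.Module):
    """Sketch of the 2D convolution block (Section 3.2)."""
    def __init__(self, spectral_depth=40, in_3d_channels=8, embed_dim=64):
        super().__init__()
        c_in = in_3d_channels * spectral_depth               # C_in = 8 * D
        self.conv = nn.Conv2d(c_in, embed_dim, kernel_size=3, stride=1, padding=1)
        self.bn = nn.BatchNorm2d(embed_dim)
        self.act = nn.GELU()

    def forward(self, o):                                    # o: (B, 8, D, H, W)
        b, c, d, h, w = o.shape
        o_flat = o.reshape(b, c * d, h, w)                   # fold spectral depth into channels
        return self.act(self.bn(self.conv(o_flat)))          # (B, E, H, W)

x = torch.randn(2, 8, 40, 11, 11)
print(Conv2DBlock()(x).shape)                                # torch.Size([2, 64, 11, 11])
```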

3.3. Transformer Block

The Transformer block serves as the final feature refinement stage in the EDTST architecture, processing spatial embeddings $\mathbf{O}_{2D} \in \mathbb{R}^{B \times E \times H \times W}$ from Section 3.2 to capture long-range dependencies. As shown in Figure 4, this component consists of three sequential operations.

3.3.1. Input Transformation

Spatial features are converted to token sequences compatible with Transformer processing:
$\mathbf{X}_0 = \mathrm{flatten}(\mathbf{O}_{2D}), \quad \mathbf{X}_0 \in \mathbb{R}^{B \times N \times E}$
where $N = H \times W$ is the token count. Learnable positional encodings $\mathbf{P} \in \mathbb{R}^{1 \times N \times E}$ preserve spatial relationships:
$\mathbf{X}_1 = \mathbf{X}_0 + \mathbf{P}$

3.3.2. Attention Mechanism

The core operation of the attention mechanism can be formulated as follows. The input is first projected into queries ($\mathbf{Q}$), keys ($\mathbf{K}$), and values ($\mathbf{V}$) using linear transformations, which are then segmented into $H$ attention heads. The normalized queries and keys are used to compute a raw attention score matrix $\mathbf{S}$:
$\mathbf{S} = \mathrm{L2Norm}(\mathbf{Q}) \cdot \mathrm{L2Norm}(\mathbf{K})^{\top} \cdot \tau \cdot \dfrac{1}{\sqrt{D_h}}$
where $\tau$ is a learnable temperature vector that scales the attention distribution per head, and $D_h = C / H$ is the feature dimension per head.
The distinctive step in the attention mechanism is the token selection process. Instead of using a full attention matrix, the attention mechanism constructs a sparse attention matrix $\tilde{\mathbf{S}}$ by retaining only the connections to the top-$k$ most relevant keys for each query, where $k = r \cdot N$ and $r$ is a pre-defined retention ratio (e.g., $r = 0.75$). This is achieved by applying a binary mask $\mathbf{M}$ that identifies the indices of the top-$k$ elements in $\mathbf{S}$ along the key dimension:
$\tilde{\mathbf{S}} = \mathrm{softmax}\big(\mathbf{S} \odot \mathbf{M} + (-\infty) \cdot (\mathbf{1} - \mathbf{M})\big)$
Here, $\odot$ denotes the Hadamard product, and the mask $\mathbf{M}$ is defined such that $\mathbf{M}_{i,j} = 1$ if key $j$ is among the top-$k$ keys for query $i$, and $\mathbf{M}_{i,j} = 0$ otherwise. The softmax operation is applied to the masked scores, effectively ignoring non-top-$k$ entries.
The final output for each head is computed as the weighted sum of the value projections using the sparsified attention weights, $\mathbf{O}_h = \tilde{\mathbf{S}} \mathbf{V}_h$. The outputs from all heads are then concatenated and linearly projected to form the final output $\mathbf{Z}$.
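A hedged PyTorch sketch of this top-k masked attention is shown below, assuming H = 8 heads, a retention ratio r = 0.75, and a fused QKV projection; the exact projection layout, scaling, and temperature initialization are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenSelectionAttention(nn.Module):
    """Sketch of the dynamic token-selection attention (Section 3.3.2)."""
    def __init__(self, dim=64, num_heads=8, retain_ratio=0.75):
        super().__init__()
        self.h, self.dh, self.r = num_heads, dim // num_heads, retain_ratio
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.tau = nn.Parameter(torch.ones(num_heads, 1, 1))   # learnable per-head temperature

    def forward(self, x):                                       # x: (B, N, C)
        b, n, c = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, n, self.h, self.dh).transpose(1, 2)       # (B, H, N, D_h)
        k = k.view(b, n, self.h, self.dh).transpose(1, 2)
        v = v.view(b, n, self.h, self.dh).transpose(1, 2)
        # cosine-style scores with learnable temperature
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        s = (q @ k.transpose(-2, -1)) * self.tau / (self.dh ** 0.5)   # (B, H, N, N)
        # keep only the top-k keys per query; mask the rest before softmax
        k_keep = max(1, int(self.r * n))
        topk_idx = s.topk(k_keep, dim=-1).indices
        mask = torch.zeros_like(s).scatter_(-1, topk_idx, 1.0)
        attn = s.masked_fill(mask == 0, float("-inf")).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)

tokens = torch.randn(2, 121, 64)                 # N = 11 * 11 tokens, C = 64
print(TokenSelectionAttention()(tokens).shape)   # torch.Size([2, 121, 64])
```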

3.3.3. Feature Refinement Module

The Feature Refinement Module incorporates spatial reasoning and gating mechanisms to refine feature representations through a combination of convolutional processing and adaptive feature selection.
Architecture Overview
The Feature Refinement Module processes an input tensor $\mathbf{X} \in \mathbb{R}^{B \times N \times C}$, where $B$ is the batch size, $N = 121$ is the number of tokens, and $C = 64$ is the feature dimension. The network consists of three main components: (1) a spatial feature reorganization block, (2) a gated feature transformation block, and (3) a projection layer.
Spatial Feature Reorganization
The Feature Refinement Module begins by reorganizing the token sequence into a spatial grid representation, leveraging the inherent spatial relationships between tokens. For an input sequence of N = 121 tokens, we reshape the tensor into an 11 × 11 spatial grid:
$\mathbf{X}_{\mathrm{spatial}} = \mathrm{Reshape}(\mathbf{X}) \in \mathbb{R}^{B \times C \times H \times W}$
where $H = W = 11$. This spatial reorganization allows the model to capture local structural patterns that may be lost in a purely sequential representation.
The spatial tensor is then split along the channel dimension into two components:
$\mathbf{X}_1, \mathbf{X}_2 = \mathrm{Split}(\mathbf{X}_{\mathrm{spatial}})$
where $\mathbf{X}_1 \in \mathbb{R}^{B \times C_{\mathrm{conv}} \times H \times W}$ and $\mathbf{X}_2 \in \mathbb{R}^{B \times C_{\mathrm{untouched}} \times H \times W}$, with $C_{\mathrm{conv}} = C_{\mathrm{untouched}} = 32$.
The first component $\mathbf{X}_1$ undergoes convolutional processing with a $3 \times 3$ kernel:
$\mathbf{X}_1' = \mathrm{Conv2D}_{3 \times 3}(\mathbf{X}_1)$
This convolutional operation enhances the model’s ability to capture local spatial dependencies. The processed features are then concatenated with the untouched features:
$\mathbf{X}_{\mathrm{recon}} = \mathrm{Concat}(\mathbf{X}_1', \mathbf{X}_2) \in \mathbb{R}^{B \times C \times H \times W}$
Finally, the spatial representation is flattened back to the original token sequence format:
$\mathbf{X}_{\mathrm{flat}} = \mathrm{Flatten}(\mathbf{X}_{\mathrm{recon}}) \in \mathbb{R}^{B \times N \times C}$
Gated Feature Transformation
The second stage of the Feature Refinement Module employs a gating mechanism to adaptively control information flow. The input is first projected to a higher-dimensional space:
$\mathbf{X}_{\mathrm{proj}} = \mathrm{Linear}_{C \to 4C}(\mathbf{X}_{\mathrm{flat}})$
The projected features are then split into two equal parts along the channel dimension:
$\mathbf{G}_1, \mathbf{G}_2 = \mathrm{Split}(\mathbf{X}_{\mathrm{proj}})$
One part ($\mathbf{G}_1$) undergoes spatial processing through a depthwise separable convolution:
$\mathbf{G}_1' = \mathrm{DepthwiseConv2D}(\mathbf{G}_1)$
This operation applies separate convolutional filters to each input channel, efficiently capturing spatial patterns while minimizing computational cost.
The final output is computed through a gating mechanism that combines the spatially processed features with the original projected features:
$\mathbf{X}_{\mathrm{gate}} = \mathbf{G}_1' \odot \mathbf{G}_2$
where ⊙ denotes element-wise multiplication.
Output Projection
The gated output is projected back to the original feature dimension:
$\mathbf{Z} = \mathrm{Linear}_{2C \to C}(\mathbf{X}_{\mathrm{gate}})$
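The sketch below traces both stages of the Feature Refinement Module for N = 121 tokens and C = 64 channels. The 3 × 3 depthwise kernel size and the reshaping of G1 back onto the 11 × 11 grid before the depthwise convolution are assumptions.

```python
import torch
import torch.nn as nn

class FeatureRefinementModule(nn.Module):
    """Sketch of the Feature Refinement Module (Section 3.3.3)."""
    def __init__(self, dim=64, grid=11):
        super().__init__()
        self.grid = grid
        half = dim // 2
        # stage 1: spatial reorganization, 3x3 conv on half of the channels
        self.spatial_conv = nn.Conv2d(half, half, kernel_size=3, padding=1)
        # stage 2: gated transformation, C -> 4C projection and depthwise conv
        self.proj_in = nn.Linear(dim, 4 * dim)
        self.dw_conv = nn.Conv2d(2 * dim, 2 * dim, kernel_size=3,
                                 padding=1, groups=2 * dim)      # depthwise
        self.proj_out = nn.Linear(2 * dim, dim)

    def forward(self, x):                                        # x: (B, N, C)
        b, n, c = x.shape
        s = self.grid
        # spatial feature reorganization
        xs = x.transpose(1, 2).reshape(b, c, s, s)               # (B, C, 11, 11)
        x1, x2 = xs.chunk(2, dim=1)
        x1 = self.spatial_conv(x1)
        x_flat = torch.cat([x1, x2], dim=1).reshape(b, c, n).transpose(1, 2)
        # gated feature transformation
        g1, g2 = self.proj_in(x_flat).chunk(2, dim=-1)           # each (B, N, 2C)
        g1 = g1.transpose(1, 2).reshape(b, 2 * c, s, s)
        g1 = self.dw_conv(g1).reshape(b, 2 * c, n).transpose(1, 2)
        return self.proj_out(g1 * g2)                            # (B, N, C)

print(FeatureRefinementModule()(torch.randn(2, 121, 64)).shape)  # torch.Size([2, 121, 64])
```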

3.3.4. Architectural Advantages

The proposed design provides:
Adaptive Computation: A novel token selection mechanism (with a retention ratio of $r = 0.75$, i.e., 25% of tokens pruned) dynamically reduces the effective sequence length during attention calculation, significantly lowering the quadratic attention cost based on input content.
Hybrid Representation Learning: The model synergistically combines the global contextualization of multihead self-attention (H = 8) with the local feature extraction prowess of 3 × 3 convolutions, effectively capturing both long-range dependencies and fine-grained local patterns.
Parallel Feature Processing: The Feature Refinement Module (FRM) employs a split–process–merge strategy, applying spatial convolutions to a subset of channels and channel-wise projections to the remainder, enabling simultaneous and efficient local and global feature refinement within a single module.
As verified in Table 4, the block achieves significant efficiency gains. The output $\mathbf{X}_L \in \mathbb{R}^{B \times N \times E}$ undergoes global average pooling before the classification layer, forming the final hyperspectral representation that integrates spectral–spatial features with global contextual relationships.
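For orientation, the following sketch composes the block sketches from the previous subsections into the overall pipeline of Figure 1 (large-kernel 3D block, 2D block, token-selection attention, feature refinement, global average pooling, and a linear classifier). The residual connections and the use of a single Transformer layer are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class EDTSTSketch(nn.Module):
    """Illustrative composition of the EDTST pipeline, reusing the block
    sketches defined above; only the overall data flow follows the paper."""
    def __init__(self, spectral_depth=40, embed_dim=64, num_classes=16, grid=11):
        super().__init__()
        self.conv3d = LargeKernel3DBlock()                        # (B,1,D,H,W) -> (B,8,D,H,W)
        self.conv2d = Conv2DBlock(spectral_depth, 8, embed_dim)   # -> (B,E,H,W)
        self.pos = nn.Parameter(torch.zeros(1, grid * grid, embed_dim))
        self.attn = TokenSelectionAttention(embed_dim)
        self.frm = FeatureRefinementModule(embed_dim, grid)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                                         # x: (B, 1, D, 11, 11)
        f = self.conv2d(self.conv3d(x))                           # (B, E, 11, 11)
        tokens = f.flatten(2).transpose(1, 2) + self.pos          # (B, N, E)
        tokens = tokens + self.attn(tokens)                       # residuals are assumptions
        tokens = tokens + self.frm(tokens)
        return self.head(tokens.mean(dim=1))                      # global average pooling

logits = EDTSTSketch()(torch.randn(2, 1, 40, 11, 11))
print(logits.shape)                                               # torch.Size([2, 16])
```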

4. Results

This section describes the datasets, baseline models, and result analysis. To evaluate the model performance, we adopted seven evaluation measures: overall accuracy (OA), average accuracy (AA), Kappa coefficient (Kappa), Matthews correlation coefficient (MCC), geometric mean (G-Mean), training time, and testing time. OA measures the ratio of correctly classified pixels over the total pixels. AA is the average accuracy of each class, and Kappa measures the consistency between the result and the truth. MCC comprehensively considers true positives, true negatives, false positives, and false negatives, providing a balanced measure especially useful in scenarios of class imbalance. G-Mean, the geometric mean of recall rates of all classes, reflects the model’s balanced performance across different categories and is particularly suitable for evaluating classification effectiveness on imbalanced datasets. The training time is the time required for model training, and the testing time is the time required to classify the test set.
The metrics are computed as follows. Let $n$ be the total number of samples, $C$ the number of classes, and $\mathbf{M}$ the $C \times C$ confusion matrix, where $M_{ij}$ represents the number of samples of class $i$ predicted as class $j$. We define the following:
  • Row sum: $R_i = \sum_{j=1}^{C} M_{ij}$ (actual samples in class $i$).
  • Column sum: $C_j = \sum_{i=1}^{C} M_{ij}$ (predicted samples in class $j$).
  • Overall Accuracy (OA) is the proportion of correctly classified samples:
    $\mathrm{OA} = \dfrac{\sum_{i=1}^{C} M_{ii}}{n} \times 100\%$
  • Average Accuracy (AA) is the mean of class-specific producer’s accuracies:
    $\mathrm{AA} = \dfrac{1}{C} \sum_{i=1}^{C} \dfrac{M_{ii}}{R_i} \times 100\%$
  • Kappa Coefficient measures agreement between predictions and ground truth:
    $\kappa = \dfrac{p_a - p_e}{1 - p_e} \times 100\%$
    where $p_a = \mathrm{OA}$ is the observed agreement, and $p_e$ is the expected agreement by chance:
    $p_e = \dfrac{1}{n^2} \sum_{i=1}^{C} (R_i \times C_i)$
  • Matthews Correlation Coefficient (MCC) for multi-class classification is calculated using the standard formula:
    $\mathrm{MCC} = \dfrac{n \times \sum_{k=1}^{C} M_{kk} - \sum_{k=1}^{C} (R_k \times C_k)}{\sqrt{\left(n^2 - \sum_{k=1}^{C} R_k^2\right) \times \left(n^2 - \sum_{k=1}^{C} C_k^2\right)}} \times 100\%$
    This formulation corresponds to the multi-class implementation in scikit-learn’s matthews_corrcoef function.
  • Geometric Mean (G-Mean) is the geometric mean of class-wise recall:
    $\text{G-Mean} = \left( \prod_{i=1}^{C} \dfrac{M_{ii}}{R_i} \right)^{1/C} \times 100\%$
    In implementation, we apply smoothing $\max\!\left(\dfrac{M_{ii}}{R_i}, 10^{-7}\right)$ to avoid zero values.
  • Training Time and Testing Time are recorded in seconds during model training and inference.
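As a reference implementation of these definitions, the snippet below computes the five accuracy metrics from predicted and ground-truth label vectors with NumPy and scikit-learn; the G-Mean smoothing follows the $\max(M_{ii}/R_i, 10^{-7})$ rule above, and the exact evaluation code used by the authors may differ.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score, matthews_corrcoef

def hsi_metrics(y_true, y_pred):
    """Compute OA, AA, Kappa, MCC, and G-Mean from label vectors (a minimal sketch)."""
    m = confusion_matrix(y_true, y_pred)                 # C x C, rows = actual classes
    n = m.sum()
    per_class_recall = np.diag(m) / m.sum(axis=1)        # M_ii / R_i
    oa = np.diag(m).sum() / n * 100.0
    aa = per_class_recall.mean() * 100.0
    kappa = cohen_kappa_score(y_true, y_pred) * 100.0
    mcc = matthews_corrcoef(y_true, y_pred) * 100.0
    # smoothing avoids a zero factor in the geometric mean
    gmean = np.exp(np.mean(np.log(np.maximum(per_class_recall, 1e-7)))) * 100.0
    return dict(OA=oa, AA=aa, Kappa=kappa, MCC=mcc, GMean=gmean)

y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0])
print(hsi_metrics(y_true, y_pred))
```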

4.1. Data Description

In this study, we utilized four distinct hyperspectral datasets to evaluate our image classification methods: QUH-Tangdaowan [35], WHU-Hi-HanChuan [36], Indian Pines, and Salinas. Each dataset was selected for its unique properties and relevance to specific aspects of hyperspectral imaging and classification.
The QUH-Tangdaowan dataset, acquired over Tangdao Bay National Wetland Park, Qingdao, China, on 18 May 2021, features ultra-high spatial resolution (0.15 m) imagery captured at a 300 m flight altitude. Comprising 1740 × 860 pixels across 176 spectral bands (400–1000 nm), this dataset encompasses 18 challenging land cover classes, including diverse vegetation (Coniferous pine, Buxus sinica, Populus, Ulmus pumila L, Ligustrum vicaryi, Photinia serrulata, Bulrush, Spiraea, and Grassland), artificial surfaces (rubber track, asphalt, boardwalk, flagging, and gravel road), and natural features (sandy, rocky shallows, bare soil, and seawater).
WHU-Hi-HanChuan, acquired on 17 June 2016, documents a diverse peri-urban landscape near Hanchuan, Hubei Province, China. It includes agricultural land interspersed with small settlements and natural vegetation, presenting a complex mosaic of land uses. The dataset features 274 spectral bands and spans an area of 1217 × 303 pixels, with a very high spatial resolution of about 0.109 m, enabling detailed analyses of both crop types and urban infrastructure.
The Indian Pines dataset, sourced from the AVIRIS sensor over northwestern Indiana, USA, is particularly noted for its coverage of mixed agricultural and forest land. The original data comprises 224 spectral bands. After removing bands covering the water absorption regions ([104–108], [150–163], and 220), the dataset used in this study consists of 200 spectral bands in the range of 0.4–2.5 μm and 145 × 145 pixels. This dataset captures a variety of perennial and seasonal vegetation as well as man-made features, offering a broad spectrum for algorithm testing and development in rural settings.
The Salinas dataset, also captured by the AVIRIS sensor, focuses on the Salinas Valley, California, a region known for its intensive agriculture, particularly vegetables and vineyards. The original imagery has 224 spectral bands. The water absorption bands ([108–112], [154–167], and 224) are discarded, resulting in a processed dataset of 204 spectral bands. It features a high spatial resolution of 3.7 m per pixel, covering an area of 512 lines by 217 samples. The Salinas dataset is ideal for analyzing agricultural health and productivity due to its detailed ground truth data and extensive spectral coverage.
Each dataset contributes uniquely to our study, providing a rich basis for exploring and refining hyperspectral image classification methods across different landscapes and land use types. The diverse range of environments from urban to agricultural and natural settings enhances the robustness and applicability of our findings in the hyperspectral image classification domain.

4.2. Experiment Settings

4.2.1. Implementation Details

To ensure the statistical reliability of our experimental results, we conducted 5 independent runs with different random seeds for both model initialization and training sample selection. In total, 25 samples per class were selected for training in the QUH-Tangdaowan, WHU-Hi-HanChuan and Salinas datasets, while 10 samples per class were used for the Indian Pines dataset due to its smaller size. Table 5 provides detailed information about the numbers of training and test samples for each dataset. The performance metrics (overall accuracy (OA), average accuracy (AA), Kappa coefficient, Matthews correlation coefficient (MCC), geometric mean (G-Mean), training time, and testing time) are reported as mean ± standard deviation across these 5 runs to demonstrate the robustness and stability of our proposed method. The confusion matrices for the five individual runs and the average confusion matrix of our proposed method are provided in Appendix A.
For data preprocessing, we applied Principal Component Analysis (PCA) to reduce the spectral dimensionality to 40 components while preserving the essential spectral information. PCA was selected for dimensionality reduction based on a comparative assessment of its advantages over alternative techniques in the context of hyperspectral data. Unlike supervised methods such as Linear Discriminant Analysis (LDA), which are prone to overfitting under the small-sample-size scenario typical of HSI classification, PCA operates in an unsupervised manner, avoiding this pitfall. Moreover, while non-linear methods such as autoencoders or t-SNE can model complex structures, they are often computationally intensive and prioritize visualization over classification clarity. PCA, in contrast, provides a computationally efficient solution that directly addresses the core issue of band multicollinearity by deriving orthogonal components. This transformation not only reduces dimensionality but also preserves the dominant, discriminative spectral variances essential for a classifier, establishing PCA as the preferred choice for this study. The input patch size was set to 11 × 11 × 40 for all experiments. All experiments were conducted on a server equipped with an NVIDIA RTX 6000 GPU and 128 GB RAM. The network was trained using the AdamW optimizer with an initial learning rate of 1 × 10−4 for 100 epochs, using a mini-batch size of 64. As a preprocessing step for model training, the input image patches were normalized to the range [0, 1].
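A minimal sketch of this preprocessing pipeline (PCA to 40 components, normalization to [0, 1], and extraction of 11 × 11 patches centered on labeled pixels) is given below. The reflect padding at image borders and the order of normalization relative to PCA are assumptions rather than the authors' exact procedure.

```python
import numpy as np
from sklearn.decomposition import PCA

def preprocess_cube(cube, n_components=40, patch=11):
    """Reduce a (H, W, bands) cube with PCA, normalize to [0, 1],
    and return a function that extracts centered spatial patches."""
    h, w, bands = cube.shape
    flat = cube.reshape(-1, bands).astype(np.float32)
    reduced = PCA(n_components=n_components).fit_transform(flat)
    reduced = (reduced - reduced.min()) / (reduced.max() - reduced.min())   # [0, 1]
    reduced = reduced.reshape(h, w, n_components)
    pad = patch // 2
    padded = np.pad(reduced, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")

    def patch_at(row, col):                       # patch centered on pixel (row, col)
        return padded[row:row + patch, col:col + patch, :]

    return patch_at

# Example with a random cube of Indian Pines dimensions (145 x 145 x 200 bands)
patch_fn = preprocess_cube(np.random.rand(145, 145, 200))
print(patch_fn(0, 0).shape)                       # (11, 11, 40)
```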

4.2.2. Baseline Methods

For comprehensive evaluation, we compared our proposed EDTST with several representative methods proposed in the past three years, including the following:
  • Three CNN-based models: 3D ASPP multi-scale feature fusion network (3A-MFFN) [37], spatial–spectral ConvNeXt (SS-ConvNeXt) [38], and dual-branch convolutional transformer network (DCTN) [39].
  • Four state-of-the-art transformer-based models: global–local multigranularity transformer (GLMGT) [40], multi-scale hierarchical conv-aided fourierformer (MHCFormer) [41], 3D-convolution guided spectral–spatial transformer (3D-ConvSST) [42], and dual selective fusion transformer (DSFormer) [43].
For fair comparison, all baseline methods were implemented using their original configurations as reported in their respective papers.

4.3. Classification Results

4.3.1. Classification Results of the QUH-Tangdaowan Dataset

As shown in Table 6 and Figure 5, the proposed EDTST achieves superior performance on the QUH-Tangdaowan dataset, attaining the highest OA of 85.75 % ± 1.62 % , alongside leading results in AA ( 91.75 % ± 0.58 % ), κ ( 84.07 ± 1.76 ), MCC ( 84.38 % ± 1.67 % ), and G-Mean ( 91.26 % ± 0.69 % ). The strong performance across all metrics, particularly MCC and G-Mean, indicates both excellent per-class balance and overall classification capability. Notably, EDTST delivers the best accuracy in several challenging classes, including C5 ( 98.16 % ± 1.16 % ), C7 ( 83.25 % ± 3.25 % ), C10 ( 99.92 % ± 0.05 % ), C13 ( 100.00 % ± 0.00 % ), C14 ( 100.00 % ± 0.00 % ), and C16 ( 70.56 % ± 4.99 % ). While certain methods excel in specific classes (e.g., GLMGT in C1 and C8, and MHCFormer in C3 and C4), EDTST maintains highly competitive performance across all categories. A significant advantage of our approach is its computational efficiency. EDTST requires only 12.15 ± 0.02 seconds for training and 59.92 ± 0.13 seconds for testing, substantially outperforming all competitors in both training and inference speed.

4.3.2. Classification Results of the WHU-Hi-HanChuan Dataset

Table 7 and Figure 6 confirm the strong generalization of EDTST on the WHU-Hi-HanChuan dataset. Our method achieves the highest OA ( 89.78 % ± 0.53 % ), AA ( 88.86 % ± 0.53 % ), κ ( 88.10 ± 0.61 ), MCC ( 88.16 % ± 0.61 % ), and G-Mean ( 88.33 % ± 0.64 % ), outperforming the second-best method by a considerable margin. The superior G-Mean score demonstrates the effectiveness of EDTST in handling class imbalance, while the high MCC value reflects a strong correlation between predicted and actual classifications. EDTST obtains the best accuracy in seven classes, most notably in C2 ( 82.62 % ± 5.78 % ), C8 ( 77.81 % ± 5.80 % ), C9 ( 81.88 % ± 7.37 % ), and C13 ( 72.68 % ± 5.72 % )—categories characterized by complex spatial–spectral characteristics. The computational efficiency is again remarkable, with EDTST completing training in 10.71 ± 0.01 seconds and testing in 27.81 ± 0.07 seconds, representing a 1.48 × and 1.09 × speedup over the fastest competitor, respectively.

4.3.3. Classification Results of the Indian Pines Dataset

The results on the Indian Pines dataset (Table 8) highlight a particularly interesting phenomenon: the AA ( 89.93 % ± 0.41 % ) significantly exceeds the OA ( 82.06 % ± 0.39 % ) across all methods, including our proposed EDTST. This discrepancy arises from the highly imbalanced class distribution inherent to the Indian Pines dataset. Several classes (e.g., C1, C7, C9, C13, and C16) contain very few samples (ranging from 10 to 195 test samples) but achieve near-perfect classification accuracy (often 100%). These high accuracies disproportionately elevate the AA, which is the arithmetic mean of per-class accuracies. In contrast, the OA is weighted by sample count and is heavily influenced by the performance on majority classes (e.g., C2, C3, C10, C11, C12), where accuracy is notably lower due to higher spectral confusion and within-class variability.
Despite these challenges, EDTST achieves the highest OA, AA, κ ( 79.80 ± 0.42 ), MCC ( 80.07 % ± 0.36 % ), and G-Mean ( 89.11 % ± 0.33 % ). The strong G-Mean performance is particularly noteworthy, as it demonstrates our method’s ability to maintain balanced sensitivity across both minority and majority classes. EDTST shows particularly strong gains in the majority classes: C2 ( 76.78 % ± 5.34 % ), C10 ( 82.91 % ± 4.80 % ), C11 ( 69.31 % ± 4.40 % ), and C12 ( 73.55 % ± 3.64 % ). EDTST also maintains the fastest computational time, requiring only 4.24 ± 0.07 seconds for training and 1.05 ± 0.01 seconds for testing. The classification maps in Figure 7 further validate the effectiveness of EDTST, showing fewer misclassified pixels in agriculturally dominant regions.

4.3.4. Classification Results of the Salinas Dataset

Results on the Salinas dataset (Table 9) demonstrate the robustness of EDTST in a controlled agricultural setting. Our method achieves the highest OA ( 97.36 % ± 0.82 % ), AA ( 98.93 % ± 0.27 % ), κ ( 97.07 ± 0.91 ), MCC ( 97.09 % ± 0.89 % ), and G-Mean ( 98.89 % ± 0.29 % ), with particularly notable performance in the more challenging classes: C8 ( 91.05 % ± 3.39 % ) and C15 ( 95.60 % ± 2.53 % ). The near-perfect G-Mean score ( 98.89 % ) indicates exceptional balance in class-wise sensitivity and specificity. The Salinas scene, with its extensive homogeneous regions, is generally easier to classify as reflected in the high accuracies across all methods. Nevertheless, EDTST consistently outperforms its counterparts, achieving perfect ( 100 % ) accuracy in five of the sixteen classes. Computationally, EDTST remains the most efficient, with training and testing times of 10.79 ± 0.01 and 5.80 ± 0.01 seconds, respectively. The classification maps in Figure 8 are visually similar across all advanced methods.

4.3.5. Overall Performance

The consistent superiority of EDTST across all four datasets and all evaluation metrics underscores its effectiveness in capturing discriminative spatial–spectral features for HSI classification. The integration of large-kernel 3D convolution and dynamic token selection allows the model to achieve state-of-the-art accuracy while significantly reducing the computational overhead. The observed phenomenon in the Indian Pines dataset, where AA exceeds OA, is a classic characteristic of imbalanced data and highlights the importance of evaluating multiple metrics: AA ensures good performance across all classes, while OA reflects performance on the dataset as a whole. The strong performance in MCC and G-Mean across all datasets demonstrates the robustness of EDTST in handling both balanced and imbalanced classification scenarios. The minimal visual differences in classification maps among the top-performing methods suggest that quantitative metrics provide a more discriminative evaluation than qualitative inspection alone. EDTST strikes an optimal balance between accuracy and efficiency, making it particularly suitable for practical applications requiring rapid and reliable analysis.

5. Discussion

In this section, we conduct a comprehensive ablation study to evaluate the contributions of key components in the proposed method, including the large-kernel 3D convolution block, the Transformer block, the patch size selection, and the use of PCA for dimensionality reduction. Performance is assessed using five widely recognized metrics: overall accuracy (OA), average accuracy (AA), Kappa coefficient ( κ ), Matthews correlation coefficient (MCC), and G-Mean.

5.1. Effect of Large-Kernel 3D Convolution Block and Transformer Block

The ablation results presented in Table 10 provide several important insights into the role of each component in our architecture:
The combined use of large-kernel 3D convolution block and Transformer block consistently yields the best performance across all datasets, indicating that these two components capture complementary types of features. The large-kernel 3D convolution block excels at extracting local spectral–spatial characteristics, while the Transformer block effectively models long-range dependencies within the data. This synergistic effect is particularly evident in the WHU-Hi-HanChuan dataset, where the full model achieves an OA of 89.78%, outperforming both individual components.
Notably, the large-kernel 3D convolution block demonstrates fundamental importance for hyperspectral data processing. Across three of the four datasets (WHU-Hi-HanChuan, Indian Pines, and Salinas), models incorporating 3D convolution show better performance than those using only the Transformer block. For instance, on the Salinas dataset, the large-kernel 3D convolution block alone achieves an OA of 97.22%, significantly higher than the Transformer-only configuration (95.76%). This performance gap highlights the indispensable role of explicitly modeling spectral–spatial relationships, which provides a stronger foundational feature representation for the transformer to refine.
For challenging scenarios with class imbalance such as Indian Pines, the combined model shows notable improvements in AA (89.93%) and G-Mean (89.11%), indicating better handling of minority classes.
Additionally, the full model exhibits enhanced stability during training as evidenced by reduced standard deviations across multiple metrics, particularly on the Indian Pines dataset. The consistent improvements in κ and MCC values further confirm that our integrated approach produces more reliable classifications across diverse hyperspectral scenes.

5.2. Influence of Patch Size

We systematically evaluate the impact of patch size on classification performance, with the results summarized in Table 11. A clear trend emerges across all datasets: larger patch sizes generally improve classification accuracy up to a certain point, beyond which gains diminish or become negligible.
For all four datasets, a patch size of 11 provides the optimal balance between classification performance and computational efficiency. On the QUH-Tangdaowan dataset, increasing the patch size from 5 to 11 improves OA by nearly 10 percentage points (76.46% to 85.75%), with corresponding improvements across all other metrics. Similar trends are observed for WHU-Hi-HanChuan and Indian Pines, where patch size 11 delivers the best performance (89.78% and 82.06% OA, respectively).
Notably, larger patch sizes not only improve accuracy but also enhance result stability. For Indian Pines, using a patch size of 11 reduces the standard deviation of OA to just ±0.39%, compared to ±3.16% for size 9. This increased stability is consistent across datasets and suggests that larger receptive fields help the model learn more robust features.
Based on these experiments, we establish patch size 11 as the optimal configuration for our method, providing excellent performance across diverse hyperspectral scenes while maintaining reasonable computational requirements.

5.3. Influence of PCA Dimensionality Reduction

The results of our PCA dimensionality reduction analysis are presented in Table 12. These experiments reveal that appropriate dimensionality reduction significantly enhances model performance compared to using raw spectral bands, with optimal results achieved at 40 principal components across all datasets.
Using PCA preprocessing consistently improves classification accuracy over using the full spectral dimension. For the WHU-Hi-HanChuan dataset, employing 40 PCA components yields an OA of 89.78%, compared to 84.08% without PCA—an improvement of more than 5 percentage points. Similar substantial gains are observed for QUH-Tangdaowan (85.75% vs. 71.95%) and Salinas (97.36% vs. 93.06%). This demonstrates that judicious dimensionality reduction removes redundant spectral information while preserving discriminative features.
Beyond improving accuracy, PCA preprocessing also enhances training stability as evidenced by reduced standard deviations across multiple runs. For Indian Pines, using 40 components reduces the OA standard deviation to just ±0.39%, compared to ±2.06% without PCA. This stabilizing effect is consistent across datasets and suggests that PCA helps the model learn more robust representations by eliminating noise and redundancy in the spectral domain.
These results confirm that PCA is an essential preprocessing step for our method, providing both computational efficiency through dimensionality reduction and improved classification performance through noise removal and feature enhancement.

5.4. Overall Discussion

Our comprehensive ablation study demonstrates that each component of the proposed method contributes significantly to its overall performance. The large-kernel 3D convolution block provides essential local spectral–spatial feature extraction capabilities, while the Transformer block effectively captures long-range contextual relationships. The combination of these components yields synergistic effects that outperform either individual approach.
The selection of an appropriate patch size and PCA dimensions further optimizes performance. A patch size of 11 provides sufficient spatial context for accurate classification without excessive computational burden. Similarly, reducing spectral dimensionality to 40 principal components preserves discriminative information while improving training stability.
These findings highlight the importance of considering both architectural innovations and appropriate preprocessing techniques when designing hyperspectral image classification systems. Our method strikes an effective balance between model complexity and performance, delivering state-of-the-art results across diverse hyperspectral datasets.

6. Conclusions

This paper presents EDTST, a novel Vision Transformer architecture that effectively addresses the challenges of hyperspectral image classification through several key innovations. By integrating 3D convolutions for initial spectral–spatial feature extraction, our model demonstrates superior feature learning capabilities while maintaining computational efficiency. The proposed dynamic token pruning strategy in our attention mechanism has proven particularly effective, pruning 25% of the tokens while preserving classification performance. This adaptive approach not only decreases computational complexity but also helps the model focus on the most informative spectral–spatial features.
Despite these advantages, the proposed method has certain limitations. First, the fixed token pruning ratio of 25% may not be optimal for all datasets, as the redundancy of spectral–spatial information can vary significantly across different scenes and sensors. Second, the pruning strategy relies solely on the self-attention weights to determine token importance, which may not always accurately reflect the semantic or physical relevance of tokens. A more nuanced criterion that incorporates domain knowledge about spectral characteristics could further improve the pruning effectiveness.
For future work, we plan to extend the EDTST architecture in several directions. We will investigate adaptive pruning strategies that automatically determine the optimal pruning ratio based on the redundancy degree of the input data. Additionally, we aim to develop more advanced token selection mechanisms, such as incorporating learnable gates or leveraging physical spectral properties to identify and retain the most informative tokens. We also envision extending EDTST to support multi-scale feature extraction and exploring self-supervised pre-training strategies to better leverage unlabeled hyperspectral data. Finally, we will study the interpretability of the model and the relationship between pruned tokens and physical spectral characteristics to provide deeper insights for hyperspectral image analysis.

Author Contributions

Conceptualization, T.Z. and X.H.; methodology, T.Z. and X.H.; software, X.H. and Z.Z.; validation, Z.Z., J.Z., L.Z., Y.T. and Y.P.; formal analysis, T.Z. and X.H.; investigation, T.Z., X.H., Z.Z., J.Z., L.Z. and Y.T.; resources, T.Z., X.H. and Z.Z.; data curation, T.Z., X.H., Z.Z., J.Z., L.Z., Y.T. and Y.P.; writing—original draft preparation, T.Z. and X.H.; writing—review and editing, T.Z. and X.H.; visualization, X.H. and Z.Z.; supervision, T.Z.; project administration, T.Z.; funding acquisition, T.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Independent Innovation Science Foundation (24-ZZCX-BC-01).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

To provide a comprehensive analysis of the classification performance and the consistency across different runs, this appendix presents the detailed confusion matrices for each dataset. For each of the four datasets—QUH-Tangdaowan, WHU-Hi-HanChuan, Indian Pines, and Salinas—we display the confusion matrices from all five independent training runs (subfigures a–e), followed by the average confusion matrix (subfigure f) calculated across these runs. The average matrix offers a consolidated view of the model’s overall performance and its stability, highlighting any consistent patterns of misclassification between classes. These visuals complement the aggregated performance metrics reported in the main paper by illustrating the per-class precision and recall in detail.
Figure A1. Confusion matrices of the QUH-Tangdaowan dataset. (a–e) The confusion matrices from all five independent training runs. (f) The average confusion matrix.
Figure A2. Confusion matrices of the WHU-Hi-HanChuan dataset. (a–e) The confusion matrices from all five independent training runs. (f) The average confusion matrix.
Figure A3. Confusion matrices of the Indian Pines dataset. (a–e) The confusion matrices from all five independent training runs. (f) The average confusion matrix.
Figure A4. Confusion matrices of the Salinas dataset. (a–e) The confusion matrices from all five independent training runs. (f) The average confusion matrix.

References

  1. Teke, M.; Deveci, H.S.; Haliloğlu, O.; Gürbüz, S.Z.; Sakarya, U. A short survey of hyperspectral remote sensing applications in agriculture. In Proceedings of the 2013 6th International Conference on Recent Advances in Space Technologies (RAST), Istanbul, Turkey, 12–14 June 2013; pp. 171–176. [Google Scholar] [CrossRef]
  2. Ham, J.; Chen, Y.; Crawford, M.M.; Ghosh, J. Investigation of the random forest framework for classification of hyperspectral data. IEEE Trans. Geosci. Remote Sens. 2005, 43, 492–501. [Google Scholar] [CrossRef]
  3. Delalieux, S.; Somers, B.; Haest, B.; Spanhove, T.; Borre, J.V.; Mücher, C. Heathland conservation status mapping through integration of hyperspectral mixture analysis and decision tree classifiers. Remote Sens. Environ. 2012, 126, 222–231. [Google Scholar] [CrossRef]
  4. Gualtieri, J.; Chettri, S. Support vector machines for classification of hyperspectral data. In Proceedings of the IGARSS 2000, IEEE 2000 International Geoscience and Remote Sensing Symposium, Taking the Pulse of the Planet: The Role of Remote Sensing in Managing the Environment, Proceedings (Cat. No. 00CH37120), Honolulu, HI, USA, 24–28 July 2000; Volume 2, pp. 813–815. [Google Scholar]
  5. Zhang, Y.; Duan, P.; Kang, X.; Mao, J. Edge Guided Structure Extraction for Hyperspectral Image Classification. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; pp. 5993–5996. [Google Scholar] [CrossRef]
  6. He, M.; Li, B.; Chen, H. Multi-scale 3D deep convolutional neural network for hyperspectral image classification. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3904–3908. [Google Scholar]
  7. Roy, S.K.; Krishna, G.; Dubey, S.R.; Chaudhuri, B.B. HybridSN: Exploring 3-D–2-D CNN feature hierarchy for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2019, 17, 277–281. [Google Scholar] [CrossRef]
  8. He, X.; Chen, Y.; Lin, Z. Spatial-Spectral Transformer for Hyperspectral Image Classification. Remote Sens. 2021, 13, 498. [Google Scholar] [CrossRef]
  9. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar] [CrossRef]
  10. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  11. Hong, D.; Han, Z.; Yao, J.; Gao, L.; Zhang, B.; Plaza, A.; Chanussot, J. SpectralFormer: Rethinking hyperspectral image classification with transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4509416. [Google Scholar] [CrossRef]
  12. Sun, L.; Zhao, G.; Zheng, Y.; Wu, Z. Spectral–Spatial Feature Tokenization Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5522214. [Google Scholar] [CrossRef]
  13. Li, Z.; Xue, Z.; Xu, Q.; Zhang, L.; Zhu, T.; Zhang, M. SPFormer: Self-Pooling Transformer for Few-Shot Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5502019. [Google Scholar] [CrossRef]
  14. Liang, L.; Zhang, Y.; Zhang, S.; Li, J.; Plaza, A.; Kang, X. Fast Hyperspectral Image Classification Combining Transformers and SimAM-Based CNNs. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5522219. [Google Scholar] [CrossRef]
  15. Zhang, Y.; Liang, L.; Mao, J.; Wang, Y.; Jia, L. From Global to Local: A Dual-Branch Structural Feature Extraction Method for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 1778–1791. [Google Scholar] [CrossRef]
  16. Zhang, Y.; Duan, P.; Liang, L.; Kang, X.; Li, J.; Plaza, A. PFS3F: Probabilistic Fusion of Superpixel-wise and Semantic-aware Structural Features for Hyperspectral Image Classification. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 8723–8737. [Google Scholar] [CrossRef]
  17. Zhang, Y.; Liang, L.; Li, J.; Plaza, A.; Kang, X.; Mao, J.; Wang, Y. Structural and Textural-Aware Feature Extraction for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2024, 21, 5502305. [Google Scholar] [CrossRef]
  18. Zhang, Y.; Duan, P.; Mao, J.; Kang, X.; Fang, L.; Ghamisi, P. Contour Structural Profiles: An Edge-Aware Feature Extractor for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5545914. [Google Scholar] [CrossRef]
  19. Xue, X.; Zhang, H.; Jing, H.; Tao, L.; Bai, Z.; Li, Y. Bridging Sensor Gaps via Attention-Gated Tuning for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 10075–10094. [Google Scholar] [CrossRef]
  20. Huang, W.; Wu, T.; Zhang, X.; Li, L.; Lv, M.; Jia, Z.; Zhao, X.; Ma, H.; Vivone, G. MCFTNet: Multimodal Cross-Layer Fusion Transformer Network for Hyperspectral and LiDAR Data Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 12803–12818. [Google Scholar] [CrossRef]
  21. Zhu, F.; Shi, C.; Shi, K.; Wang, L. Joint Classification of Hyperspectral and LiDAR Data Using Hierarchical Multimodal Feature Aggregation-Based Multihead Axial Attention Transformer. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5503817. [Google Scholar] [CrossRef]
  22. Wu, X.; Arshad, T.; Peng, B. Spectral Spatial Window Attention Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5519413. [Google Scholar] [CrossRef]
  23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Red Hook, NY, USA, 4–9 September 2017; pp. 6000–6010. [Google Scholar]
  24. He, J.; Zhao, L.; Yang, H.; Zhang, M.; Li, W. HSI-BERT: Hyperspectral Image Classification Using the Bidirectional Encoder Representation From Transformers. IEEE Trans. Geosci. Remote Sens. 2020, 58, 165–178. [Google Scholar] [CrossRef]
  25. He, X.; Chen, Y.; Li, Q. Two-Branch Pure Transformer for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6015005. [Google Scholar] [CrossRef]
  26. Yang, H.; Yu, H.; Zheng, K.; Hu, J.; Tao, T.; Zhang, Q. Hyperspectral Image Classification Based on Interactive Transformer and CNN With Multilevel Feature Fusion Network. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5507905. [Google Scholar] [CrossRef]
  27. Ouyang, E.; Li, B.; Hu, W.; Zhang, G.; Zhao, L.; Wu, J. When Multigranularity Meets Spatial–Spectral Attention: A Hybrid Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4401118. [Google Scholar] [CrossRef]
  28. Qi, W.; Huang, C.; Wang, Y.; Zhang, X.; Sun, W.; Zhang, L. Global–Local 3-D Convolutional Transformer Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5510820. [Google Scholar] [CrossRef]
  29. Zhao, F.; Li, S.; Zhang, J.; Liu, H. Convolution Transformer Fusion Splicing Network for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5501005. [Google Scholar] [CrossRef]
  30. Zhao, E.; Su, Y.; Qu, N.; Wang, Y.; Gao, C.; Zeng, J. Self- and Cross-Attention Enhanced Transformer for Visible and Thermal Infrared Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 13408–13422. [Google Scholar] [CrossRef]
  31. Jia, C.; Zhang, X.; Meng, H.; Xia, S.; Jiao, L. CenterFormer: A Center Spatial–Spectral Attention Transformer Network for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 5523–5539. [Google Scholar] [CrossRef]
  32. Yu, C.; Zhu, Y.; Wang, Y.; Zhao, E.; Zhang, Q.; Lu, X. Concern With Center-Pixel Labeling: Center-Specific Perception Transformer Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5514614. [Google Scholar] [CrossRef]
  33. Hu, X.; Liu, T.; Guo, Z.; Tang, Y.; Peng, Y.; Zhou, T. BinaryViT: Binary Vision Transformer for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 20469–20486. [Google Scholar] [CrossRef]
  34. Wang, Y.; Shu, Z.; Yu, Z. Efficient Attention Transformer Network With Self-Similarity Feature Enhancement for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 11469–11486. [Google Scholar] [CrossRef]
  35. Fu, H.; Sun, G.; Zhang, L.; Zhang, A.; Ren, J.; Jia, X.; Li, F. Three-dimensional singular spectrum analysis for precise land cover classification from UAV-borne hyperspectral benchmark datasets. ISPRS J. Photogramm. Remote Sens. 2023, 203, 115–134. [Google Scholar] [CrossRef]
  36. Zhong, Y.; Hu, X.; Luo, C.; Wang, X.; Zhao, J.; Zhang, L. WHU-Hi: UAV-borne hyperspectral with high spatial resolution (H2) benchmark datasets and classifier for precise crop identification based on deep convolutional neural network with CRF. Remote Sens. Environ. 2020, 250, 112012. [Google Scholar] [CrossRef]
  37. Zhu, T.; Liu, Q.; Zhang, L. 3D atrous spatial pyramid pooling based multi-scale feature fusion network for hyperspectral image classification. In Proceedings of the International Conference on Remote Sensing, Mapping, and Geographic Systems (RSMG 2023), SPIE, Kaifeng, China, 7–9 July 2023; Volume 12815, pp. 225–231. [Google Scholar]
  38. Zhu, Y.; Yuan, K.; Zhong, W.; Xu, L. Spatial–Spectral ConvNeXt for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 5453–5463. [Google Scholar] [CrossRef]
  39. Zhou, Y.; Huang, X.; Yang, X.; Peng, J.; Ban, Y. DCTN: Dual-Branch Convolutional Transformer Network With Efficient Interactive Self-Attention for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5508616. [Google Scholar] [CrossRef]
  40. Meng, Z.; Yan, Q.; Zhao, F.; Chen, G.; Hua, W.; Liang, M. Global-Local MultiGranularity Transformer for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 112–131. [Google Scholar] [CrossRef]
  41. Shi, H.; Zhang, Y.; Cao, G.; Yang, D. MHCFormer: Multiscale Hierarchical Conv-Aided Fourierformer for Hyperspectral Image Classification. IEEE Trans. Instrum. Meas. 2024, 73, 5501115. [Google Scholar] [CrossRef]
  42. Varahagiri, S.; Sinha, A.; Dubey, S.R.; Singh, S.K. 3D-Convolution Guided Spectral-Spatial Transformer for Hyperspectral Image Classification. In Proceedings of the 2024 IEEE Conference on Artificial Intelligence (CAI), Singapore, 25–27 June 2024; pp. 8–14. [Google Scholar]
  43. Xu, Y.; Wang, D.; Zhang, L.; Zhang, L. Dual selective fusion transformer network for hyperspectral image classification. Neural Netw. 2025, 187, 107311. [Google Scholar] [CrossRef]
Figure 1. The proposed EDTST architecture for hyperspectral image analysis.
Figure 2. The large-kernel 3D convolution block in the proposed EDTST model.
Figure 3. Architecture of the 2D convolution block.
Figure 4. The Transformer block in the proposed EDTST model.
Figure 5. Classification maps of the QUH-Tangdaowan dataset. (a) Ground-truth map. (b) 3A-MFFN. (c) SS-ConvNeXt. (d) DCTN. (e) GLMGT. (f) MHCFormer. (g) 3D-ConvSST. (h) DSFormer. (i) EDTST (Proposed).
Figure 6. Classification maps of the WHU-Hi-HanChuan dataset. (a) Ground-truth map. (b) 3A-MFFN. (c) SS-ConvNeXt. (d) DCTN. (e) GLMGT. (f) MHCFormer. (g) 3D-ConvSST. (h) DSFormer. (i) EDTST (Proposed).
Figure 7. Classification maps of the Indian Pines dataset. (a) Ground-truth map. (b) 3A-MFFN. (c) SS-ConvNeXt. (d) DCTN. (e) GLMGT. (f) MHCFormer. (g) 3D-ConvSST. (h) DSFormer. (i) EDTST (Proposed).
Figure 8. Classification maps of the Salinas dataset. (a) Ground-truth map. (b) 3A-MFFN. (c) SS-ConvNeXt. (d) DCTN. (e) GLMGT. (f) MHCFormer. (g) 3D-ConvSST. (h) DSFormer. (i) EDTST (Proposed).
Table 1. Comparison of CNN and transformer architectures for HSI classification.
Aspect | CNN-Based Models | Transformer-Based Models
Local Feature Extraction | ✓ Strong; inductive bias favors local patterns | × Weak; requires large data to learn locality
Global Dependency Modeling | × Limited; requires deep stacks or large kernels | ✓ Strong; innate self-attention mechanism
Computational Efficiency | ✓ High; linear in image size | × Quadratic in token sequence length
Data Efficiency | ✓ Moderate; benefits from convolutional priors | × Low; requires pre-training or large datasets
Table 2. Parameter distribution in the large-kernel 3D convolution block (B = 64, D = 40, H = 11, W = 11), showing that our convolutional kernel design reduces computational cost and parameter count compared to the traditional 3 × 3 × 3 combination (values given as traditional → proposed).
Layer Type | Output Shape | Kernel | Params
Input | [64, 1, 40, 11, 11] | – | –
Conv3D | [64, 1, 40, 11, 11] | 3 × 3 × 3 → 7 × 7 × 7 | 28 → 344
GroupNorm | [64, 1, 40, 11, 11] | – | 2
Conv3D | [64, 4, 40, 11, 11] | 3 × 3 × 3 → 1 × 1 × 1 | 112 → 8
GELU | [64, 4, 40, 11, 11] | – | 0
Conv3D | [64, 8, 40, 11, 11] | 3 × 3 × 3 → 1 × 1 × 1 | 872 → 40
BatchNorm3d | [64, 8, 40, 11, 11] | – | 16
GELU | [64, 8, 40, 11, 11] | – | 0
Total Parameters | 1030 → 410 (100% → 39.8%)
FLOPs | 4,990,040 → 1,989,240 (100% → 40%)
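To make the arithmetic behind Table 2 easy to check, the following PyTorch sketch builds the proposed kernel configuration (a 7 × 7 × 7 convolution followed by two 1 × 1 × 1 convolutions) and counts its parameters. The padding and bias settings are assumptions chosen so that the tabulated output shapes are preserved; this is not code from the released implementation.

```python
# Minimal sketch of the proposed kernel configuration listed in Table 2.
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv3d(1, 1, kernel_size=7, padding=3),  # 1*1*343 + 1 = 344 params
    nn.GroupNorm(1, 1),                         #               2 params
    nn.Conv3d(1, 4, kernel_size=1),             # 4*1*1 + 4   =   8 params
    nn.GELU(),
    nn.Conv3d(4, 8, kernel_size=1),             # 8*4*1 + 8   =  40 params
    nn.BatchNorm3d(8),                          #              16 params
    nn.GELU(),
)

total = sum(p.numel() for p in block.parameters())
print(total)  # 410, i.e. roughly 39.8% of the 1030 params of the 3x3x3 baseline

# Sanity check on the output shapes listed in Table 2 (B=64, D=40, H=W=11):
x = torch.randn(64, 1, 40, 11, 11)
print(block(x).shape)  # torch.Size([64, 8, 40, 11, 11])
```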
Table 3. Computational characteristics of the 2D convolution block.
Operation | FLOPs | Parameters
Conv2D 3 × 3 | 9·H·W·C_in·E | 9·C_in·E + E
BatchNorm2d | 2·E·H·W | 2·E
GELU | 3·E·H·W | 0
Total | O(H·W·C_in·E) | O(C_in·E)
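The closed-form costs in Table 3 can be evaluated directly. The helper below simply encodes those expressions; the example arguments (C_in = 8 and an 11 × 11 spatial size) are illustrative choices, not settings quoted from the table.

```python
# The per-layer costs from Table 3 written as a small helper function.
def conv2d_block_cost(H, W, C_in, E):
    """FLOPs and parameter count of the 3x3 Conv2D + BatchNorm2d + GELU block."""
    flops = 9 * H * W * C_in * E      # 3x3 convolution
    flops += 2 * E * H * W            # BatchNorm2d scale and shift
    flops += 3 * E * H * W            # GELU cost as given in Table 3
    params = 9 * C_in * E + E         # conv weights + bias
    params += 2 * E                   # BatchNorm2d gamma and beta
    return flops, params

# Illustrative example (assumed values): 11x11 patch, 8 input channels, E = 64.
print(conv2d_block_cost(H=11, W=11, C_in=8, E=64))
```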
Table 4. Computational complexity comparison (N = 121, E = 64).
Component | FLOPs
Standard Self-Attention | 2N²E + 4E²N
Proposed Attention | 1.75N²E + 4E²N
Total Saving | 6%
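Substituting N = 121 and E = 64 into the expressions of Table 4 reproduces the quoted saving; the short check below is a sketch of that calculation.

```python
# Plugging N = 121 and E = 64 into the expressions of Table 4 to verify the
# reported ~6% FLOPs saving of the proposed attention over standard self-attention.
N, E = 121, 64

standard = 2 * N**2 * E + 4 * E**2 * N      # 3,856,512 FLOPs
proposed = 1.75 * N**2 * E + 4 * E**2 * N   # 3,622,256 FLOPs

saving = 1 - proposed / standard
print(f"{saving:.1%}")  # ~6.1%, consistent with the "Total Saving" row
```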
Table 5. Training and testing set distribution for QUH-Tangdaowan, WHU-Hi-HanChuan, Indian Pines, and Salinas datasets.
QUH-Tangdaowan | WHU-Hi-HanChuan
NO. | Class | Train | Test | Class | Train | Test
C1 | Rubber track | 25 | 25,824 | Strawberry | 25 | 44,710
C2 | Flaggingv | 25 | 55,528 | Cowpea | 25 | 22,728
C3 | Sandy | 25 | 34,012 | Soybean | 25 | 10,262
C4 | Asphalt | 25 | 60,665 | Sorghum | 25 | 5328
C5 | Boardwalk | 25 | 1837 | Water spinach | 25 | 1175
C6 | Rocky shallows | 25 | 37,100 | Watermelon | 25 | 4508
C7 | Grassland | 25 | 14,102 | Greens | 25 | 5878
C8 | Bulrush | 25 | 64,062 | Trees | 25 | 17,953
C9 | Gravel road | 25 | 30,670 | Grass | 25 | 9444
C10 | Ligustrum vicaryi | 25 | 1758 | Red roof | 25 | 10,491
C11 | Coniferous pine | 25 | 21,211 | Gray roof | 25 | 16,886
C12 | Spiraea | 25 | 724 | Plastic | 25 | 3654
C13 | Bare soil | 25 | 1661 | Bare soil | 25 | 9091
C14 | Buxus sinica | 25 | 861 | Road | 25 | 18,535
C15 | Photinia serrulata | 25 | 13,995 | Bright object | 25 | 1111
C16 | Populus | 25 | 140,879 | Water | 25 | 75,376
C17 | Ulmus pumila L | 25 | 9777
C18 | Seawater | 25 | 42,250
Total | 450 | 556,916 | Total | 400 | 257,530
Indian Pines | Salinas
NO. | Class | Train | Test | Class | Train | Test
C1 | Alfalfa | 10 | 36 | Brocoli_green_weeds_1 | 25 | 1984
C2 | Corn-notill | 10 | 1418 | Brocoli_green_weeds_2 | 25 | 3701
C3 | Corn-mintill | 10 | 820 | Fallow | 25 | 1951
C4 | Corn | 10 | 227 | Fallow_rough_plow | 25 | 1369
C5 | Grass-pasture | 10 | 473 | Fallow_smooth | 25 | 2653
C6 | Grass-trees | 10 | 720 | Stubble | 25 | 3934
C7 | Grass-pasture-mowed | 10 | 18 | Celery | 25 | 3554
C8 | Hay-windrowed | 10 | 468 | Grapes_untrained | 25 | 11,246
C9 | Oats | 10 | 10 | Soil_vinyard_develop | 25 | 6178
C10 | Soybean-notill | 10 | 962 | Corn_senesced_green_weeds | 25 | 3253
C11 | Soybean-mintill | 10 | 2445 | Lettuce_romaine_4wk | 25 | 1043
C12 | Soybean-clean | 10 | 583 | Lettuce_romaine_5wk | 25 | 1902
C13 | Wheat | 10 | 195 | Lettuce_romaine_6wk | 25 | 891
C14 | Woods | 10 | 1255 | Lettuce_romaine_7wk | 25 | 1045
C15 | Buildings-Grass-Trees-Drives | 10 | 376 | Vinyard_untrained | 25 | 7243
C16 | Stone-Steel-Towers | 10 | 83 | Vinyard_vertical_trellis | 25 | 1782
Total | 160 | 10,089 | Total | 400 | 54,429
Table 6. The classification results of the QUH-Tangdaowan dataset. The number in parentheses after each class label is the number of test samples. The reported metrics are overall accuracy (OA, %), average accuracy (AA, %), Cohen’s Kappa coefficient (κ × 100), Matthews correlation coefficient (MCC, %), G-Mean (%), training time (seconds), and testing time (seconds). The best results are highlighted in bold.
Class (Test Samples) | 3A-MFFN [37] | SS-ConvNeXt [38] | DCTN [39] | GLMGT [40] | MHCFormer [41] | 3D-ConvSST [42] | DSFormer [43] | EDTST (Proposed)
C1 (25,824)97.31 ± 0.9099.18 ± 0.3199.54 ± 0.4399.58 ± 0.4099.39 ± 0.3699.49 ± 0.3899.33 ± 0.2398.82 ± 0.46
C2 (55,528)63.81 ± 3.2477.21 ± 5.4479.87 ± 13.3410.38 ± 9.7079.15 ± 5.2465.16 ± 5.8479.32 ± 6.4079.33 ± 7.58
C3 (34,012)75.48 ± 4.2487.07 ± 11.9289.23 ± 9.8447.36 ± 22.2394.31 ± 2.8076.62 ± 17.1292.67 ± 2.5891.33 ± 4.45
C4 (60,665)81.11 ± 7.6786.65 ± 5.1887.47 ± 7.8976.69 ± 10.4095.48 ± 3.2688.41 ± 5.5089.08 ± 2.4691.67 ± 4.23
C5 (1837)95.32 ± 1.0895.95 ± 0.5890.23 ± 9.5846.67 ± 35.4797.69 ± 1.7997.04 ± 2.7996.83 ± 1.2398.16 ± 1.16
C6 (37,100)68.31 ± 4.8764.84 ± 6.3675.62 ± 2.7971.48 ± 9.5677.36 ± 1.4666.92 ± 8.6781.39 ± 3.7080.89 ± 4.52
C7 (14,102)75.55 ± 1.9068.29 ± 5.1082.26 ± 8.1516.16 ± 13.3079.44 ± 4.5066.00 ± 12.1880.92 ± 5.8683.25 ± 3.25
C8 (64,062)94.75 ± 3.0797.96 ± 1.3997.99 ± 1.5198.96 ± 0.3098.09 ± 1.7596.65 ± 1.2498.35 ± 1.0698.65 ± 1.26
C9 (30,670)87.51 ± 4.6782.81 ± 14.4994.29 ± 2.7495.28 ± 4.4995.01 ± 3.6569.93 ± 17.0393.31 ± 4.3394.47 ± 3.77
C10 (1758)88.07 ± 6.2997.69 ± 2.1598.69 ± 1.1256.05 ± 33.2798.03 ± 2.7687.97 ± 7.9599.53 ± 0.4099.92 ± 0.05
C11 (21,211)58.56 ± 10.8155.31 ± 12.6055.50 ± 12.2064.88 ± 32.4787.44 ± 2.5552.07 ± 23.4781.30 ± 3.4585.20 ± 3.29
C12 (724)95.83 ± 1.5099.48 ± 0.4597.93 ± 1.6396.74 ± 2.4399.70 ± 0.4396.49 ± 6.7498.98 ± 1.5299.64 ± 0.35
C13 (1661)97.86 ± 1.02100.00 ± 0.0099.98 ± 0.05100.00 ± 0.00100.00 ± 0.0099.88 ± 0.2499.98 ± 0.05100.00 ± 0.00
C14 (861)91.50 ± 5.6897.58 ± 1.5799.86 ± 0.2396.35 ± 2.2299.91 ± 0.1495.35 ± 3.6399.44 ± 0.70100.00 ± 0.00
C15 (13,995)77.92 ± 1.5287.77 ± 1.2782.25 ± 3.8889.90 ± 3.9887.30 ± 2.2786.56 ± 1.1790.27 ± 1.5789.85 ± 2.52
C16 (140,879)28.51 ± 8.7749.21 ± 11.6646.82 ± 20.9311.92 ± 23.5255.43 ± 6.1445.34 ± 24.7168.21 ± 1.9070.56 ± 4.99
C17 (9777)68.46 ± 5.4385.41 ± 4.0258.77 ± 30.8473.47 ± 20.1587.75 ± 2.9184.14 ± 4.9091.25 ± 1.9991.24 ± 3.10
C18 (42,250)96.69 ± 1.2299.18 ± 0.4099.84 ± 0.2599.94 ± 0.0899.75 ± 0.4399.90 ± 0.1699.55 ± 0.7698.54 ± 1.97
OA (%)67.59 ± 1.3575.92 ± 1.5576.96 ± 4.9356.22 ± 4.9782.21 ± 1.9372.37 ± 5.2384.79 ± 0.7185.75 ± 1.62
AA (%)80.14 ± 0.6085.09 ± 1.4385.34 ± 2.3969.55 ± 3.8390.62 ± 0.5381.89 ± 2.3891.09 ± 0.4191.75 ± 0.58
κ × 100 64.76 ± 1.3273.42 ± 1.5074.66 ± 5.0653.10 ± 4.7180.32 ± 2.0569.64 ± 5.1783.01 ± 0.7784.07 ± 1.76
MCC (%)66.18 ± 1.1374.18 ± 1.1475.77 ± 4.2956.04 ± 3.7581.03 ± 1.8670.75 ± 4.1183.32 ± 0.7384.38 ± 1.67
G-Mean (%)77.09 ± 1.0082.97 ± 1.3781.90 ± 3.7929.55 ± 15.4289.70 ± 0.6977.14 ± 5.3090.56 ± 0.4791.26 ± 0.69
Training Time (s)29.65 ± 0.0246.03 ± 0.06137.77 ± 0.3614.20 ± 0.0318.03 ± 0.0718.89 ± 0.0424.03 ± 0.0212.15 ± 0.02
Testing Time (s)106.87 ± 0.03181.95 ± 0.37471.99 ± 0.4460.71 ± 0.8864.40 ± 0.4277.70 ± 0.40162.58 ± 0.1859.92 ± 0.13
Table 7. The classification results of the WHU-Hi-HanChuan dataset. The number in parentheses after each class label is the number of test samples. The reported metrics are overall accuracy (OA, %), average accuracy (AA, %), Cohen’s Kappa coefficient (κ × 100), Matthews correlation coefficient (MCC, %), G-Mean (%), training time (seconds), and testing time (seconds). The best results are highlighted in bold.
Class (Test Samples) | 3A-MFFN [37] | SS-ConvNeXt [38] | DCTN [39] | GLMGT [40] | MHCFormer [41] | 3D-ConvSST [42] | DSFormer [43] | EDTST (Proposed)
C1 (44,710)73.74 ± 5.9591.03 ± 2.6985.59 ± 12.2085.48 ± 5.9385.52 ± 7.9687.07 ± 7.8785.79 ± 3.2488.83 ± 3.58
C2 (22,728)48.09 ± 6.3658.56 ± 12.0077.80 ± 5.2046.31 ± 14.4277.65 ± 2.6779.39 ± 6.3574.42 ± 2.2082.62 ± 5.78
C3 (10,262)65.93 ± 5.2783.17 ± 3.1089.95 ± 5.3688.63 ± 5.0991.43 ± 4.8886.05 ± 8.3173.56 ± 6.8988.73 ± 3.86
C4 (5328)88.19 ± 2.9497.48 ± 0.9598.45 ± 1.3798.96 ± 0.6698.16 ± 0.7495.77 ± 3.5197.58 ± 1.0997.83 ± 1.75
C5 (1175)88.82 ± 2.7699.83 ± 0.1699.76 ± 0.48100.00 ± 0.00100.00 ± 0.0099.93 ± 0.0885.29 ± 4.4799.85 ± 0.20
C6 (4508)39.05 ± 2.2371.85 ± 5.0570.25 ± 9.9923.10 ± 11.0168.46 ± 6.5578.39 ± 5.0760.01 ± 5.5178.17 ± 8.36
C7 (5878)89.00 ± 2.5591.42 ± 2.6686.43 ± 18.2596.61 ± 1.9093.56 ± 1.7090.79 ± 5.7688.49 ± 3.7495.07 ± 0.97
C8 (17,953)53.90 ± 5.8557.38 ± 9.2765.56 ± 2.3027.83 ± 19.9762.99 ± 5.2669.87 ± 6.3060.41 ± 5.4377.81 ± 5.80
C9 (9444)48.24 ± 10.3067.99 ± 8.8773.81 ± 8.554.67 ± 3.5570.09 ± 2.6677.52 ± 13.8971.10 ± 7.1581.88 ± 7.37
C10 (10,491)84.17 ± 7.5794.04 ± 3.4193.25 ± 6.2197.39 ± 1.9896.45 ± 1.5583.85 ± 8.8296.20 ± 0.5496.31 ± 2.56
C11 (16,886)82.18 ± 11.0788.82 ± 5.1083.61 ± 8.1777.22 ± 29.2493.79 ± 1.5481.23 ± 10.3187.79 ± 4.2495.07 ± 1.66
C12 (3654)62.47 ± 7.4168.68 ± 10.5781.66 ± 11.8956.23 ± 24.1788.32 ± 3.2885.99 ± 20.8959.14 ± 1.2794.71 ± 3.82
C13 (9091)58.20 ± 3.9559.50 ± 3.0069.01 ± 5.8129.66 ± 10.5070.41 ± 6.9359.78 ± 8.7052.66 ± 6.7472.68 ± 5.72
C14 (18,535)70.60 ± 4.7860.12 ± 8.9783.39 ± 2.4058.19 ± 15.6078.74 ± 7.1777.33 ± 8.4066.68 ± 3.7781.99 ± 3.01
C15 (1111)76.71 ± 3.4991.54 ± 3.7094.01 ± 3.2996.15 ± 1.0096.31 ± 1.8595.59 ± 2.8990.98 ± 3.0392.58 ± 5.10
C16 (75,376)91.30 ± 6.1797.82 ± 1.7198.49 ± 0.5994.00 ± 3.0897.44 ± 1.7594.74 ± 3.0695.69 ± 2.4897.63 ± 2.37
OA(%)74.15 ± 1.8482.81 ± 2.0786.52 ± 3.8672.79 ± 0.9486.68 ± 1.4685.06 ± 2.3482.39 ± 1.1789.78 ± 0.53
AA(%)70.04 ± 1.4479.95 ± 2.1284.44 ± 3.5867.53 ± 2.5385.58 ± 0.7383.96 ± 2.0777.86 ± 0.8588.86 ± 0.53
κ × 100 70.22 ± 2.0280.02 ± 2.3584.25 ± 4.5368.61 ± 1.0684.52 ± 1.6582.68 ± 2.6379.55 ± 1.3288.10 ± 0.61
MCC (%)70.50 ± 1.9280.16 ± 2.3384.41 ± 4.4269.28 ± 1.0884.63 ± 1.6282.84 ± 2.5879.64 ± 1.2988.16 ± 0.61
G-Mean(%)67.63 ± 1.9478.05 ± 2.5383.39 ± 3.8149.81 ± 4.0684.56 ± 0.8382.76 ± 2.2576.29 ± 0.8688.33 ± 0.64
Training Time (s)42.17 ± 0.0642.36 ± 0.07163.91 ± 0.4615.34 ± 0.1115.81 ± 0.0324.31 ± 0.0221.15 ± 0.0110.71 ± 0.01
Testing Time (s)76.07 ± 0.1492.03 ± 0.11303.26 ± 0.9236.43 ± 0.4530.24 ± 0.0749.97 ± 0.1775.53 ± 0.0627.81 ± 0.07
Table 8. The classification results of the Indian Pines dataset. The number in parentheses after each class label is the number of test samples. The reported metrics are overall accuracy (OA, %), average accuracy (AA, %), Cohen’s Kappa coefficient (κ × 100), Matthews correlation coefficient (MCC, %), G-Mean (%), training time (seconds), and testing time (seconds). The best results are highlighted in bold.
Class (Test Samples) | 3A-MFFN [37] | SS-ConvNeXt [38] | DCTN [39] | GLMGT [40] | MHCFormer [41] | 3D-ConvSST [42] | DSFormer [43] | EDTST (Proposed)
C1 (36)89.44 ± 7.7499.44 ± 1.11100.00 ± 0.0071.67 ± 39.3099.44 ± 1.11100.00 ± 0.0097.78 ± 2.72100.00 ± 0.00
C2 (1418)30.47 ± 7.3347.46 ± 6.3260.48 ± 4.0324.33 ± 19.9067.45 ± 13.3969.48 ± 4.8450.06 ± 5.5176.78 ± 5.34
C3 (820)33.90 ± 6.6543.71 ± 7.8868.76 ± 6.8011.85 ± 12.2169.15 ± 4.5882.76 ± 2.4167.17 ± 7.4176.37 ± 7.20
C4 (227)40.18 ± 7.8893.57 ± 4.5091.37 ± 5.1992.33 ± 5.4491.72 ± 6.4596.56 ± 2.3989.78 ± 4.4696.12 ± 4.06
C5 (473)71.54 ± 7.3174.46 ± 7.0174.63 ± 5.4445.54 ± 25.7579.49 ± 8.5479.37 ± 8.4278.60 ± 8.6679.41 ± 7.89
C6 (720)79.44 ± 4.7595.22 ± 2.2195.17 ± 0.7797.75 ± 0.5997.58 ± 1.5898.08 ± 1.0698.56 ± 1.2695.94 ± 1.31
C7 (18)98.89 ± 2.22100.00 ± 0.00100.00 ± 0.00100.00 ± 0.00100.00 ± 0.00100.00 ± 0.00100.00 ± 0.00100.00 ± 0.00
C8 (468)89.36 ± 4.3797.31 ± 3.2499.74 ± 0.4190.47 ± 18.6499.57 ± 0.8599.83 ± 0.3493.33 ± 6.8699.87 ± 0.26
C9 (10)100.00 ± 0.00100.00 ± 0.00100.00 ± 0.0088.00 ± 24.00100.00 ± 0.00100.00 ± 0.00100.00 ± 0.00100.00 ± 0.00
C10 (962)56.53 ± 10.7468.92 ± 11.8675.82 ± 8.9046.30 ± 34.8182.56 ± 5.2482.06 ± 4.3175.70 ± 6.0382.91 ± 4.80
C11 (2445)50.82 ± 9.5063.98 ± 7.2966.04 ± 9.4464.33 ± 14.1365.73 ± 7.1867.43 ± 7.3357.59 ± 7.4669.31 ± 4.40
C12 (583)38.39 ± 8.9057.50 ± 4.2758.83 ± 5.9443.33 ± 16.6061.78 ± 4.8867.96 ± 3.8748.40 ± 8.3373.55 ± 3.64
C13 (195)97.74 ± 1.82100.00 ± 0.0099.90 ± 0.21100.00 ± 0.00100.00 ± 0.00100.00 ± 0.0099.59 ± 0.82100.00 ± 0.00
C14 (1255)72.24 ± 6.0893.34 ± 5.4493.40 ± 1.6195.41 ± 3.1389.85 ± 4.1394.17 ± 3.1491.54 ± 4.0295.54 ± 3.26
C15 (376)50.69 ± 2.8687.07 ± 4.5992.98 ± 1.0080.27 ± 6.9287.87 ± 8.4294.41 ± 3.0989.10 ± 5.2193.03 ± 4.01
C16 (83)97.83 ± 2.6899.76 ± 0.48100.00 ± 0.00100.00 ± 0.0098.80 ± 1.08100.00 ± 0.0099.28 ± 1.45100.00 ± 0.00
OA (%)55.20 ± 3.6170.77 ± 3.2076.17 ± 2.5260.41 ± 4.3877.67 ± 2.8580.73 ± 2.9971.61 ± 2.7782.06 ± 0.39
AA (%)68.59 ± 1.3582.61 ± 1.5686.07 ± 0.7571.97 ± 2.9486.94 ± 1.3889.51 ± 1.5583.53 ± 1.4789.93 ± 0.41
κ × 100 49.65 ± 3.7567.08 ± 3.4873.21 ± 2.7055.33 ± 4.5774.87 ± 3.1278.33 ± 3.2968.18 ± 2.9579.80 ± 0.42
MCC (%)50.01 ± 3.5367.40 ± 3.3773.58 ± 2.4156.60 ± 4.3775.28 ± 2.8778.64 ± 3.1268.64 ± 2.6380.07 ± 0.36
G-Mean (%)62.99 ± 2.5079.59 ± 2.2184.41 ± 0.9740.55 ± 16.8585.52 ± 1.8388.51 ± 1.8781.04 ± 1.9289.11 ± 0.33
Training Time (s)11.39 ± 0.1816.47 ± 0.1253.40 ± 0.154.79 ± 0.026.67 ± 0.0310.94 ± 0.038.81 ± 0.024.24 ± 0.07
Testing Time (s)2.04 ± 0.033.22 ± 0.039.19 ± 0.041.05 ± 0.011.17 ± 0.012.71 ± 0.012.96 ± 0.001.05 ± 0.01
Table 9. The classification results of the Salinas dataset. The number in parentheses after each class label is the number of test samples. The reported metrics are overall accuracy (OA, %), average accuracy (AA, %), Cohen’s Kappa coefficient (κ × 100), Matthews correlation coefficient (MCC, %), G-Mean (%), training time (seconds), and testing time (seconds). The best results are highlighted in bold.
Class (Test Samples) | 3A-MFFN [37] | SS-ConvNeXt [38] | DCTN [39] | GLMGT [40] | MHCFormer [41] | 3D-ConvSST [42] | DSFormer [43] | EDTST (Proposed)
C1 (1984)98.40 ± 1.01100.00 ± 0.0099.39 ± 0.9899.99 ± 0.0299.99 ± 0.0299.96 ± 0.0899.79 ± 0.37100.00 ± 0.00
C2 (3701)99.49 ± 0.41100.00 ± 0.00100.00 ± 0.0098.04 ± 2.2299.97 ± 0.0696.40 ± 7.14100.00 ± 0.00100.00 ± 0.00
C3 (1951)97.58 ± 1.3699.84 ± 0.3399.77 ± 0.3490.28 ± 14.42100.00 ± 0.0092.33 ± 10.1499.95 ± 0.06100.00 ± 0.00
C4 (1369)99.24 ± 0.5699.94 ± 0.1299.27 ± 1.2899.52 ± 0.4699.55 ± 0.5699.63 ± 0.3499.33 ± 0.3899.49 ± 0.55
C5 (2653)98.30 ± 0.3998.90 ± 0.5196.88 ± 3.3096.87 ± 1.3999.29 ± 0.4899.13 ± 0.6999.19 ± 0.3498.90 ± 0.95
C6 (3934)99.92 ± 0.14100.00 ± 0.00100.00 ± 0.0099.98 ± 0.0499.97 ± 0.06100.00 ± 0.00100.00 ± 0.0099.95 ± 0.09
C7 (3554)98.93 ± 0.4799.98 ± 0.0399.88 ± 0.2299.97 ± 0.0499.98 ± 0.0399.64 ± 0.3699.99 ± 0.0199.97 ± 0.03
C8 (11,246)71.78 ± 2.8177.71 ± 6.0475.82 ± 5.0663.10 ± 38.2886.73 ± 5.2679.66 ± 13.1485.27 ± 3.0991.05 ± 3.39
C9 (6178)97.08 ± 1.37100.00 ± 0.0197.12 ± 5.7699.67 ± 0.65100.00 ± 0.0099.97 ± 0.0499.99 ± 0.01100.00 ± 0.00
C10 (3253)91.42 ± 1.8395.96 ± 0.5794.85 ± 3.2094.33 ± 1.1597.72 ± 0.9196.71 ± 1.4498.40 ± 0.8298.83 ± 0.73
C11 (1043)95.59 ± 2.5999.64 ± 0.1599.85 ± 0.1796.16 ± 6.6499.96 ± 0.0899.60 ± 0.4699.90 ± 0.11100.00 ± 0.00
C12 (1902)99.86 ± 0.2299.99 ± 0.0299.85 ± 0.1896.48 ± 1.77100.00 ± 0.0099.82 ± 0.0999.27 ± 0.7299.79 ± 0.33
C13 (891)99.96 ± 0.09100.00 ± 0.0099.73 ± 0.34100.00 ± 0.0099.87 ± 0.1899.78 ± 0.3599.66 ± 0.5199.71 ± 0.38
C14 (1045)97.76 ± 0.6099.48 ± 0.3099.73 ± 0.1699.83 ± 0.1499.62 ± 0.2699.22 ± 0.8799.46 ± 0.3099.79 ± 0.18
C15 (7243)70.83 ± 9.6288.84 ± 9.3775.06 ± 4.7446.33 ± 43.5689.12 ± 6.4083.95 ± 12.5883.39 ± 6.1595.60 ± 2.53
C16 (1782)93.11 ± 3.1798.93 ± 0.5199.84 ± 0.2498.99 ± 0.8399.01 ± 0.5099.62 ± 0.6899.55 ± 0.2699.74 ± 0.22
OA (%)88.58 ± 1.5093.47 ± 0.3590.70 ± 1.0783.77 ± 3.4595.52 ± 0.5592.73 ± 2.1194.45 ± 0.2897.36 ± 0.82
AA (%)94.33 ± 0.6897.45 ± 0.2696.07 ± 0.8492.47 ± 1.2298.17 ± 0.1496.59 ± 0.9997.70 ± 0.2398.93 ± 0.27
κ × 100 87.31 ± 1.6792.75 ± 0.3989.66 ± 1.1981.96 ± 3.6595.02 ± 0.6191.92 ± 2.3293.83 ± 0.3197.07 ± 0.91
MCC (%)87.36 ± 1.7192.90 ± 0.3989.72 ± 1.1983.71 ± 2.6795.07 ± 0.5792.13 ± 2.0793.85 ± 0.3397.09 ± 0.89
G-Mean (%)93.76 ± 0.9197.20 ± 0.3095.65 ± 0.9082.86 ± 8.4098.06 ± 0.1696.17 ± 1.2897.53 ± 0.2798.89 ± 0.29
Training Time (s)31.19 ± 0.0440.86 ± 0.07132.83 ± 0.2513.10 ± 0.0415.68 ± 0.0418.80 ± 0.0821.27 ± 0.0610.79 ± 0.01
Testing Time (s)11.70 ± 0.0217.81 ± 0.3750.20 ± 0.166.11 ± 0.016.21 ± 0.038.18 ± 0.0315.74 ± 0.025.80 ± 0.01
Table 10. Analysis of the impact of the large-kernel 3D convolution block and the Transformer block in the proposed network architecture. The best results are highlighted in bold.
Dataset | 3D Conv | Transformer | OA (%) | AA (%) | κ × 100 | MCC (%) | G-Mean (%)
QUH-Tangdaowan 85.28 ± 0.4691.39 ± 0.7383.56 ± 0.5283.89 ± 0.5490.83 ± 0.73
84.46 ± 1.8091.52 ± 0.6482.71 ± 1.9383.15 ± 1.7990.88 ± 0.80
85.75 ± 1.6291.75 ± 0.5884.07 ± 1.7684.38 ± 1.6791.26 ± 0.69
WHU-Hi-HanChuan 88.32 ± 1.4386.98 ± 0.6886.41 ± 1.6386.48 ± 1.6186.32 ± 0.78
89.22 ± 0.9988.11 ± 0.5487.48 ± 1.1287.55 ± 1.0987.44 ± 0.62
89.78 ± 0.5388.86 ± 0.5388.10 ± 0.6188.16 ± 0.6188.33 ± 0.64
Indian Pines 79.56 ± 2.4089.02 ± 1.1577.03 ± 2.6377.38 ± 2.4687.97 ± 1.41
79.58 ± 2.3588.86 ± 1.2277.06 ± 2.5777.43 ± 2.3787.65 ± 1.57
82.06 ± 0.3989.93 ± 0.4179.80 ± 0.4280.07 ± 0.3689.11 ± 0.33
Salinas 95.76 ± 0.3098.26 ± 0.1595.28 ± 0.3395.33 ± 0.3198.16 ± 0.18
97.22 ± 0.5898.79 ± 0.2396.90 ± 0.6596.92 ± 0.6498.75 ± 0.25
97.36 ± 0.8298.93 ± 0.2797.07 ± 0.9197.09 ± 0.8998.89 ± 0.29
Table 11. The classification results of different patch sizes over four datasets. The best results are highlighted in bold.
Dataset | Patch Size | OA (%) | AA (%) | κ × 100 | MCC (%) | G-Mean (%)
QUH-Tangdaowan | 5 | 76.46 ± 2.43 | 86.51 ± 0.88 | 74.03 ± 2.55 | 74.80 ± 2.32 | 85.14 ± 1.19
 | 7 | 81.12 ± 1.49 | 89.24 ± 0.48 | 79.03 ± 1.60 | 79.57 ± 1.47 | 88.34 ± 0.65
 | 9 | 83.93 ± 2.23 | 90.80 ± 0.65 | 82.09 ± 2.39 | 82.47 ± 2.21 | 90.13 ± 0.76
 | 11 | 85.75 ± 1.62 | 91.75 ± 0.58 | 84.07 ± 1.76 | 84.38 ± 1.67 | 91.26 ± 0.69
WHU-Hi-HanChuan | 5 | 80.50 ± 0.87 | 77.43 ± 1.71 | 77.43 ± 1.00 | 77.62 ± 0.98 | 75.56 ± 2.30
 | 7 | 86.24 ± 1.35 | 84.26 ± 0.92 | 84.01 ± 1.54 | 84.10 ± 1.51 | 83.37 ± 1.06
 | 9 | 88.49 ± 1.34 | 86.86 ± 1.03 | 86.60 ± 1.53 | 86.67 ± 1.50 | 86.18 ± 1.24
 | 11 | 89.78 ± 0.53 | 88.86 ± 0.53 | 88.10 ± 0.61 | 88.16 ± 0.61 | 88.33 ± 0.64
Indian Pines | 5 | 70.98 ± 2.97 | 82.13 ± 1.84 | 67.35 ± 3.23 | 67.70 ± 2.97 | 79.89 ± 2.42
 | 7 | 76.84 ± 2.47 | 86.60 ± 1.53 | 73.88 ± 2.81 | 74.11 ± 2.79 | 85.21 ± 1.80
 | 9 | 79.72 ± 3.16 | 88.68 ± 1.61 | 77.18 ± 3.52 | 77.48 ± 3.39 | 87.61 ± 1.99
 | 11 | 82.06 ± 0.39 | 89.93 ± 0.41 | 79.80 ± 0.42 | 80.07 ± 0.36 | 89.11 ± 0.33
Salinas | 5 | 93.99 ± 0.30 | 97.42 ± 0.22 | 93.31 ± 0.33 | 93.37 ± 0.32 | 97.24 ± 0.27
 | 7 | 95.33 ± 0.49 | 98.14 ± 0.19 | 94.81 ± 0.54 | 94.87 ± 0.52 | 98.04 ± 0.21
 | 9 | 96.42 ± 0.50 | 98.59 ± 0.17 | 96.02 ± 0.56 | 96.07 ± 0.54 | 98.52 ± 0.18
 | 11 | 97.36 ± 0.82 | 98.93 ± 0.27 | 97.07 ± 0.91 | 97.09 ± 0.89 | 98.89 ± 0.29
Table 12. The classification performance with different PCA dimensions. The best results are highlighted in bold.
Dataset | PCA Dims | OA (%) | AA (%) | κ × 100 | MCC (%) | G-Mean (%)
QUH-Tangdaowan | 20 | 85.37 ± 1.94 | 91.43 ± 0.66 | 83.66 ± 2.10 | 83.97 ± 1.98 | 90.88 ± 0.79
 | 30 | 83.85 ± 3.75 | 91.28 ± 0.83 | 82.06 ± 3.96 | 82.56 ± 3.54 | 90.47 ± 1.30
 | 40 | 85.75 ± 1.62 | 91.75 ± 0.58 | 84.07 ± 1.76 | 84.38 ± 1.67 | 91.26 ± 0.69
 | None | 71.95 ± 4.70 | 83.21 ± 1.22 | 69.25 ± 4.77 | 70.43 ± 4.11 | 80.49 ± 2.21
WHU-Hi-HanChuan | 20 | 87.87 ± 1.52 | 86.86 ± 1.01 | 85.90 ± 1.73 | 85.98 ± 1.70 | 86.20 ± 1.21
 | 30 | 89.39 ± 1.04 | 87.80 ± 0.71 | 87.64 ± 1.20 | 87.70 ± 1.18 | 87.06 ± 0.76
 | 40 | 89.78 ± 0.53 | 88.86 ± 0.53 | 88.10 ± 0.61 | 88.16 ± 0.61 | 88.33 ± 0.64
 | None | 84.08 ± 4.60 | 82.62 ± 2.86 | 81.53 ± 5.17 | 81.70 ± 5.07 | 81.37 ± 3.31
Indian Pines | 20 | 76.62 ± 2.38 | 86.98 ± 0.71 | 73.79 ± 2.56 | 74.23 ± 2.37 | 85.55 ± 0.94
 | 30 | 79.31 ± 1.57 | 88.85 ± 0.68 | 76.76 ± 1.71 | 77.04 ± 1.64 | 87.78 ± 0.89
 | 40 | 82.06 ± 0.39 | 89.93 ± 0.41 | 79.80 ± 0.42 | 80.07 ± 0.36 | 89.11 ± 0.33
 | None | 77.67 ± 2.06 | 87.64 ± 0.93 | 74.93 ± 2.22 | 75.38 ± 2.00 | 86.26 ± 1.12
Salinas | 20 | 97.28 ± 0.95 | 98.83 ± 0.31 | 96.97 ± 1.05 | 97.01 ± 1.01 | 98.79 ± 0.33
 | 30 | 97.02 ± 1.21 | 98.54 ± 0.59 | 96.68 ± 1.34 | 96.74 ± 1.28 | 98.49 ± 0.63
 | 40 | 97.36 ± 0.82 | 98.93 ± 0.27 | 97.07 ± 0.91 | 97.09 ± 0.89 | 98.89 ± 0.29
 | None | 93.06 ± 1.76 | 96.89 ± 0.65 | 92.30 ± 1.93 | 92.47 ± 1.70 | 96.61 ± 0.84
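The "PCA Dims" column of Table 12 refers to reducing the spectral dimension before classification, with "None" corresponding to skipping the step. The snippet below is a minimal sketch of such a reduction with scikit-learn; the cube size and variable names are illustrative rather than taken from the paper's code.

```python
# Minimal sketch of the spectral PCA reduction varied in Table 12 ("PCA Dims").
import numpy as np
from sklearn.decomposition import PCA

def reduce_bands(cube: np.ndarray, n_components: int = 40) -> np.ndarray:
    """Project an (H, W, B) hyperspectral cube onto its first n_components PCs."""
    H, W, B = cube.shape
    flat = cube.reshape(-1, B)                      # one spectrum per pixel
    reduced = PCA(n_components=n_components).fit_transform(flat)
    return reduced.reshape(H, W, n_components)

# Example with a random cube standing in for a real scene (Indian Pines-sized):
cube = np.random.rand(145, 145, 200).astype(np.float32)
print(reduce_bands(cube, 40).shape)  # (145, 145, 40)
```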
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
