DVR: Towards Accurate Hyperspectral Image Classifier via Discrete Vector Representation

Jiangyun Li; Hao Wang; Xiaochen Zhang; Jing Wang; Tianxiang Zhang; Peixian Zhuang

doi:10.3390/rs17030351

Abstract

In recent years, convolutional neural network (CNN)-based and transformer-based approaches have made strides in improving the performance of hyperspectral image (HSI) classification tasks. However, misclassifications are unavoidable in the aforementioned methods, with a considerable number of these issues stemming from the overlapping embedding spaces among different classes. This overlap results in samples being allocated to adjacent categories, thus leading to inaccurate classifications. To mitigate these misclassification issues, we propose a novel discrete vector representation (DVR) strategy for enhancing the performance of HSI classifiers. DVR establishes a discrete vector quantification mechanism to capture and store distinct category representations in the codebook between the encoder and classification head. Specifically, DVR comprises three components: the Adaptive Module (AM), Discrete Vector Constraints Module (DVCM), and auxiliary classifier (AC). The AM aligns features derived from the backbone to the embedding space of the codebook. The DVCM employs category representations from the codebook to constrain encoded features for a rational feature distribution of distinct categories. To further enhance accuracy, the AC correlates discrete vectors with category information obtained from labels by penalizing these vectors and propagating gradients to the encoder. It is worth noting that DVR can be seamlessly integrated into HSI classifiers with diverse architectures to enhance their performance. Numerous experiments on four HSI benchmarks demonstrate that our DVR scheme improves the classifiers’ performance in terms of both quantitative metrics and visual quality of classification maps. We believe DVR can be applied to more models in the future to enhance their performance and provide inspiration for tasks such as sea ice detection and algal bloom prediction in the marine domain.

Keywords:

hyperspectral image; discrete vector representation; feature optimization; class distribution; accurate classification

1. Introduction

Hyperspectral image (HSI) is an advanced remote sensing technology that captures the electromagnetic radiation over a broad spectrum of wavelengths emitted from the Earth’s surface. This technology provides comprehensive surface information to facilitate various applications, such as precision agriculture, geological exploration, and marine environmental monitoring [1,2,3]. With the advancement of remote sensing technology, HSI classification has increasingly become a crucial research topic [4]. Nevertheless, accurate HSI classification [5,6] remains a challenging task due to its high dimensionality and intricate spectral–spatial relationships.

Traditional HSI classification methods [7,8,9] usually rely on manual feature extraction techniques or shallow classifiers, but they struggle to capture intricate spectral–spatial patterns present in the data. In order to address this limitation, deep learning-based techniques have gained popularity in the field of HSI classification [10,11,12]. Among these techniques, convolutional neural networks [13,14,15,16] (CNNs) have emerged as powerful tools for obtaining hierarchical representations of HSIs, leading to improved classification outcomes. Nevertheless, CNNs have inherent constraints [17] in both modeling long-range dependencies and capturing complex spectral–spatial relationships within hyperspectral data. These limitations are well addressed by the vision transformers (ViTs) [18,19,20,21,22] that leverage self-attention mechanisms to handle global dependencies and interactions. However, these above methods focus on enhancing classification performance through the redesign of model networks but overlook the fundamental cause of misclassification.

We aim to enhance HSI classification performance from a new perspective by analyzing the root causes of misclassification issues, identifying strategies to mitigate these errors, and reducing misclassifications. Specifically, deep learning classifiers typically comprise an encoder and a classification head. The encoder is responsible for capturing category representations, and these representations are subsequently used by the classification head for the accurate classification [23]. This demonstrates that encoded features play a decisive role in the accuracy of classification. Therefore, to analyze the causes of misclassification issues, we visualize the encoded features from a representative hyperspectral classification model SpectralFormer [18], using t-Distributed Stochastic Neighbor Embedding [24] (t-SNE, an effective approach for visualizing the distribution of high-dimensional data via dimension reduction). The t-SNE plot, as depicted in Figure 1, is based on the Pavia University (PU) dataset, with the model trained on 1% of the dataset and the t-SNE visualization generated using 98% of the dataset for testing. Due to overlapping distributions of distinct embedding features, SpectralFormer incorrectly categorizes instances labeled as blue (Figure 1b) to the yellow category (Figure 1a), resulting in heightened misclassification between categories. Likewise, SpectralFormer erroneously classifies the areas with true labels of blue and pink (Figure 1b), inaccurately grouping them into the red category (Figure 1a), as shown by the magnified portions. This occurrence significantly contributes to classification inaccuracies. Existing approaches do not take into account the encoded features, and solely supervise model training based on the classification outcomes. The absence of constraints on the encoded features frequently leads to inadequate clustering and overlaps between embedding spaces of different categories, as illustrated in Figure 2a. The presence of overlapping and disorganized embedding spaces poses a challenge for classifiers in discerning features belonging to different categories.

Figure 1. Misclassification due to encoded features overlap. (a) Category results of prediction. (b) Category results of label. The t-SNE visualization of encoded features from SpectralFormer [18] on the Pavia University dataset clearly demonstrates that SpectralFormer incorrectly categorizes instances (labeled as blue) as belonging to the yellow category, while also misclassifying instances (labeled as blue and pink) as belonging to the red category. The overlap of encoded features significantly contributes to classification inaccuracies.

Figure 2. Comparison between previous architecture and our DVR strategy. (a) Previous architecture. Previous architecture typically comprises an encoder and a classification head, however, it faces difficulties due to the disorderly distribution of encoded features, resulting in a decline in classification accuracy. (b) DVR. Our DVR approach integrates the discrete vector representation into the embedding space of encoded feature, aiming to optimize the distribution of encoded features by making features of the same category more compactly clustered and reduce the likelihood of misclassification by the classifier.

To address the above limitations, we investigate the discrete vector representation (DVR) strategy in order to optimize the distribution of class representation, thus boosting the performance of current HSI classification models. Discrete vectors have the ability to effectively represent essential features in low-dimensional space while preserving global structures of objects [25] and remaining stable when subjected to minor perturbations [26]. In contrast to designing a new network, as illustrated in Figure 2b, our methodology enforces discrete vector constraints on category representation from the encoder, with the goal of achieving a more rational embedding space. This strategy can be plug-and-play and easily integrated into existing HSI image classifiers to improve their classification accuracy. Specifically, DVR comprises three components: the Discrete Vector Constraints Module (DVCM), Adaptive Module (AM), and auxiliary classifier (AC). Initially, we establish a codebook in the DVCM to discretely represent the embedding space and store representative class features in a discrete vector manner. During the training phase, the AM aligns the features extracted by the encoder with the semantic space defined by the codebook. Simultaneously, the DVCM utilizes the vectors from the codebook to regulate the features extracted from the encoder, ensuring that features from the same class are clustered more closely together to prevent a confusing feature distribution. Subsequently, discrete vectors are chosen from the codebook by considering their resemblance to encoded features, and incorporated into the AC to enhance the accuracy of predictions. By implementing the aforementioned procedure, our DVR method efficiently optimizes the distribution of category representations, resulting in a notable improvement in the overall classification performance. The contributions of our work can be summarized as follows:

We propose a novel discrete vector representation (DVR) strategy. Distinguished from previous approaches of optimizing network structures, DVR offers a fresh perspective on optimizing the distribution of category features to mitigate the misclassification problem. Moreover, it can be effortlessly incorporated into various existing HSI classification methods, thus improving their classification accuracy.
We develop the AM, DVCM, and AC to form a complete DVR strategy. The AM aligns the encoded features with the semantic space of the codebook. The DVCM is able to capture essential and stable feature representations in its codebook. The AC enhances classification performance by utilizing representative code information from the codebook. These three components are integrated to improve the discriminability of feature categories and reduce misclassifications.
Our comprehensive evaluations demonstrate that the proposed DVR approach with feature distribution optimization can enhance the performance of HSI classifiers. Through extensive experiments and visual analyses conducted on different HSI benchmarks, our DVR approach consistently surpasses other state-of-the-art backbone networks in terms of both classification accuracy and model stability, while requiring merely a minimal increase in parameters.

The remainder of this paper is organized as follows. In Section 2, we review related work on HSI classification methods and schemes for enhancing model performance. Section 3 presents the proposed methodology, which includes the details of our DVR framework and its training process. In Section 4, we describe experimental results of our approach in comparison to baseline methods. Section 5 discusses the limitations of DVR as well as potential directions for future improvements and applications. Finally, Section 6 concludes the article.

2. Related Work

In this section, we overview existing HSI classification approaches, encompassing convolutional neural networks, vision transformers, and schemes for enhancing model performance.

2.1. Convolutional Neural Networks for HSI Classification

With the advancement of deep learning, convolutional neural networks (CNNs) have emerged as powerful tools for HSI classification [13,27,28,29,30,31,32,33,34]. These CNN-based methods have demonstrated impressive achievements by leveraging convolutional layers to extract discriminative features from HSI data. Initially, two-dimensional (2-D) CNNs [27,28] employed convolutional and pooling layers to capture spatial dependencies within HSIs. A pioneering 2-D CNN architecture [27] was proposed for automated high-level feature extraction in HSI classification. Mei et al. [28] concentrated on memory-efficient 2-D CNNs to accelerate the forward step of the network. Then, Song et al. [29] introduced a fusion-based model to aggregate multi-layer features and leverage complementary HSI information. Moreover, Zhao et al. [30] introduced a dual-tunnel CNN to enforce the spatial consistency within deeper network layers. To account for the three-dimensional (3-D) nature of HSIs, many researchers explored 3-D CNNs [13,31,32,33,34] to incorporate spectral and spatial signatures simultaneously. Chen et al. [31] and He et al. [32] proposed an end-to-end multiscale 3-D deep CNN architecture to capture both multiscale spatial and spectral characteristics. To emphasize the importance of spectral–spatial integration, Zhong et al. [33] introduced a spectral–spatial residual network, while Hamida et al. [13] devised a joint spectral–spatial information processing approach. In addition, Mei et al. [34] proposed an unsupervised spatial–spectral feature learning strategy, enabling 3-D convolutional autoencoder networks to extract meaningful features without pixel-wise annotations. Although CNN-based methods show proficiency in extracting distinctive features using 2-D or 3-D structures to enhance feature representation, they generally demand substantial computational resources and fail to capture long-range dependencies of HSI data.

2.2. Vision Transformers for HSI Classification

The limitations have prompted researchers to explore alternative architectures. Recently, ViTs [19] have gained significant attentions for modeling global dependencies in long-range positions and bands of HSI pixels. These transformers [18,20,21,35], equipped with multi-head self-attention mechanisms, show great promise for HSI classification tasks. Hong et al. [18] proposed a backbone network based on the transformer architecture and utilized attention mechanisms to capture subtle spectral differences. Xue et al. [35] introduced a local transformer to incorporate a partial partition restore module for global context dependencies. Sun et al. [20] designed a Spectral–Spatial Feature Tokenization Transformer (SSFTT) to capture spectral–spatial features and high-level semantic features. Additionally, Mei et al. [21] proposed a Group-Aware Hierarchical Transformer (GAHT) for HSI classification, and enhanced the model’s ability to capture local relationships within HSI spectral channels while maintaining a global understanding of spatial–spectral context. Despite the proficiency of ViTs in modeling global dependencies, they usually lack distribution control over the embedding space, thus leading to the cross-aggregation of features.

2.3. Schemes for Enhancing Model Performance

To enhance the model performance, common techniques such as data augmentation and regularization are employed, and vector quantized-variational autoencoder (VQ-VAE) [25] methods leverage discrete representations to enhance the performance. Data augmentation involves transforming or expanding training data to increase the number and diversity of data samples, reducing the model’s reliance on specific data and enhancing its generalization capability. Various techniques, including random rotation, translation, scaling, and noise addition, are employed for data augmentation. Several carefully designed data augmentations were designed by [36,37]. Moreover, regularization techniques constrain the model’s complexity to prevent overfitting to the training data [38,39]. Common regularization methods include adding L1 or L2 norm penalties on loss functions to limit the magnitude of weights. Furthermore, VQ-VAE introduces a codebook learned through a vector-quantized autoencoder model, using an encoder–decoder architecture to transform images into discrete latent codes for enhancing model robustness. Mao et al. [26] demonstrated the use of discrete representations to strengthen model robustness by preserving the overall structure of an image while disregarding minor local details. Hu et al. [40] designed a discrete codebook for encoded feature representation, which helps combats semantic noise with reduced transmission overhead. These researchers demonstrated that discrete representation is an efficient scheme to achieve satisfactory performance. Data augmentation and regularization techniques have been commonly used in existing hyperspectral classification models to enhance model performance from the perspective of data or the model network. However, the current research in HSI classification mainly focuses on the design of new models [41], neglecting the improvement of model performance. Our proposed strategy aims to incorporate discrete representation schemes from the perspective of optimizing the feature space to boost model performance, thereby addressing the existing research gap in techniques.

3. Methods

3.1. Overall Architecture

The architecture overview of the proposed approach is shown in Figure 3. Based on the encoder and classifier, our DVR incorporates the Adaptive Module (AM), DVCM, and auxiliary classifier (AC) into the existing classification model by optimizing the feature distribution to improve its classification performance. Specifically, given the original HSI data

{x} \in R^{H \times W \times C}

(C denotes the spectral bands and

H \times W

is the spatial resolution), we divide it into N patches

{x_{i}^{p}}_{i = 1}^{N}

in the preprocessing stage. Firstly, we established a codebook to discretize the embedding space and facilitate the extraction and storage of category representation vectors using our designed DVCM. Subsequently, the encoder processed the inputted patch data to obtain class representation, and the AM fine-tuned the encoded feature to align with the embedding space of the codebook. The top k nearest codes to the encoded feature were chosen and averaged to generate the auxiliary class descriptor. This descriptor was then fed into an AC to assist in the prediction. By leveraging the DVCM and AC during the gradual training process, encoded features of the same class are clustered closely together, while maintaining a clear separation between features of different classes. This distinctive attribute allowed the existing HSI classification model to capture more robust and representative class features, leading to significant performance improvements. The DVR and its training process are elaborated below. Table 1 details the definition of notations used in the proposed DVR.

Figure 3. The proposed DVR framework for HSI classification. Firstly, the encoder in the model extracts spatial–spectral features from each patch, and these features are then adjusted by the Adaptive Module to align with the embedding space defined by the codebook. This codebook comprises multiple discrete vectors that represent different classes, which are refined through the DVCM (between AM and AC) during training iterations. Subsequently, the framework calculates the auxiliary (Aux) class descriptor by averaging the top k nearest vectors from the codebook. The descriptor is employed by the Aux classifier to predict the class of each input patch. Ultimately, this prediction combines with the output of primary classifier to generate a classified image as final output.

Table 1. Definition of notations used in DVR.

3.2. Discrete Vector Representation Strategy

Following the structure illustrated in Figure 3, we employed the encoder

f (\cdot)

to transform a patch

x_{i}^{p} \in R^{P \times P \times C}

into a feature vector

e \in R^{D}

, where P and D denote the patch size and the dimension of the encoded feature.

Adaptive Module (AM): The AM is composed of a layer normalization step, a Gaussian error Linear Unit (GeLU) activation function and a linear layer. Layer normalization is a widely used normalization technique in deep learning models that standardizes and rescales the outputs of each neuron. This helps in reducing internal covariate shift, improving training stability and convergence speed, and enhancing the generalization capabilities of the model. The GeLU activation function imitates the behavior of stochastic neurons by multiplying the input x with the value from the cumulative distribution function of the standard normal distribution. This simulation enables the network to adjust to various input distributions, thereby improving its robustness. Additionally, the GeLU activation function provides excellent adaptability and flexibility to accommodate the diversity of codes within the codebook, which is defined as

GELU (x) = x Φ (x) = x \cdot \frac{1}{2} [1 + \erf (x / \sqrt{2})]

(1)

where

Φ (x)

represents the standard Gaussian cumulative distribution function, and

\erf (x) = \int_{0}^{x} e^{- t^{2}} d t

.

The AM is capable of aligning the extracted features from the encoder with the semantic framework defined by the codebook, as well as adjusting the feature dimensions to match the codebook dimension. The feature vector e is processed through an Adaptive Module to produce h.

Discrete Vector Constraint Module (DVCM): The DVCM introduces a codebook that leverages discrete vector quantification for capturing category representation. After aligning the embedding features with a created codebook, the codebook of the DVCM is able to represent the embedding space in a discrete format and retain representative features of classes as discrete vectors. Specifically, after h has been

ℓ_{2}

-normalized, the vector quantizer looks up the top k nearest neighbor codes in the codebook. These selected codes are then averaged to determine the quantized code for the patch feature. Let

{v_{1}, v_{2}, \dots, v_{K}}

(

v_{j} \in R^{D_{c}}

) represent the codes in the codebook, where K and

D_{c}

denote the number and dimension of discrete vector, respectively. For each patch feature h, its quantized code

\bar{v}

is determined by

{z_{1}, \dots, z_{k}} = Topk \min | | ℓ_{2} (h) - ℓ_{2} (v_{j}) {| |}_{2},

(2)

\bar{v} = \frac{1}{k} \sum_{j \in {z_{1}, \dots, z_{k}}} ℓ_{2} (v_{j}),

(3)

where the

ℓ_{2}

normalization is employed for the codebook lookup and

{z_{1}, \dots, z_{k}} \in {1, 2, \dots, K}

presents the index of top k nearest vectors in the codebook. Furthermore, Topkmin refers to the selection of the k smallest-distance items from a set based on specified criteria. Due to the non-differentiable nature of the quantization process in Equation (2), the gradient is directly copied from the input of the auxiliary classifier to the encoder output, as depicted in Figure 3. Intuitively, the quantizer identifies the nearest code for each encoder output, and the gradient of the codebook embedding indicates the useful direction for optimizing the encoder. To ensure the codebook captures representative features, the codebook embeddings are updated using an exponential moving average (EMA) [25], which offers enhanced stability for training discrete vectors. The typical formula for updating with momentum is expressed by

v_{j} = γ v_{j} + (1 - γ) h,

(4)

where

γ

represents the decay factor, with a value typically close to 1, that determines the weight of the previous code value in the updated calculation. The training objective for updating the codebook vectors is formulated as

\min {∥sg [ℓ_{2} (h)] - ℓ_{2} (\bar{v})∥}_{2}^{2},

(5)

where the symbol

sg [\cdot]

denotes the stop-gradient operator, which is an identity in the forward pass and yielding zero gradients in the backward pass.

Additionally, the utilization of clustering techniques (K-means) divides the feature vector space into several regions, where each region’s centroid is represented by a vector in the codebook. These vectors effectively encapsulate the entire feature space and extract crucial information. The classification process consists of mapping feature vectors to the nearest codebook vector, thereby converting continuous data into discrete codified forms. These discrete representations not only optimize the feature distribution, but also boost the efficiency of the representation, ultimately enhancing the performance of classification algorithms.

Auxiliary Classifier (AC): After generating the codebook, we utilize the discrete vectors to improve the classification procedure. This enhancement entails combining the classification outcomes of codebook features with the features extracted by the encoder models. This collaborative strategy enhances the discriminative capability of the feature set, thus improving classification performance in HSI tasks.

To clearly describe the role of a codebook in assisting with classification, we visualized the meanings of the encodings present in the codebook, as illustrated in Figure 4. The different codes in the codebook represent feature information for different categories, for instance, in the Salinas (SA) dataset, code 21 represents the category “Vinyard_untrained”. By utilizing representative code information from the codebook, samples can be classified more accurately. During the validation phase, we enhanced the primary classification procedure by incorporating outcomes obtained from the codebook features. As illustrated in Figure 5, we selected the top five closest codes within the embedding space and averaged them to generate our auxiliary class descriptor. By merging predictions from both the primary and codebook-based classifications, we harnessed the complementary information within the codebook features to enhance classification performance. This dual classification strategy improved the model’s capacity to accurately classify diverse and intricate data instances. The output o of the classification results following the combination of scores is as follows:

o = λ p + β a,

(6)

where p denotes the output of the primary classifier (PC) using the encoded features, and a represents the output from the AC using the codebook features. We carefully tuned parameters

λ

and

β

to balance the contributions of primary and auxiliary classifiers, where

λ = 0.75

and

β = 0.25

.

Figure 4. Codebook visualization. The different codes in the codebook represent the feature information for various categories. For instance, in the SA dataset, code 21 corresponds to the category “Vinyard untrained”.

Figure 5. Dual classification strategy. We select the top five closest codes within the embedding space and average them to generate our auxiliary class descriptor. By merging predictions from both the primary and codebook-based classifications, we leverage the complementary information within the codebook features to improve classification performance.

Building upon the integration of codebook features with encoder models, our approach introduces a novel loss function tailored to optimize the collaborative utilization of both feature sets. This loss function crucially underpins the dual-classification strategy, and guarantees that each element of the feature representation contributes optimally to the ultimate classification accuracy. Our loss function consists of three components:

L = L_{C E} (t, p) + {∥ℓ_{2} (h) - sg [ℓ_{2} (\bar{v})]∥}_{2}^{2} + L_{C E} (t, a),

(7)

where t represent the ground truth. In addition, we adopted the Cross-Entropy (CE) loss function [42] to calculate the classification loss:

L_{C E} (y_{c}, {\hat{y}}_{c}) = - \sum_{c = 1}^{C} y_{c} \log ({\hat{y}}_{c}),

(8)

where Equation (8)

y_{c}

denotes the ground truth, and

{\hat{y}}_{c}

denotes the model’s output probability for the i th patch belonging to the class c. Furthermore, C represents the total number of classes.

We summarize the pseudocode for the DVR inference process in Algorithm 1.

Algorithm 1: Inference of DVR

3.3. Train Strategy

We adopted a two-stage training strategy. Initially, we trained the model without the inclusion of codebook features. This initial phase allowed the model to learn basic patterns within the data. After a certain number of epochs, we incorporated the codebook features. The codebook was initialized with a number of samples equivalent to its capacity. These samples were batch-processed through the encoder to extract their features, which were then collectively used to initialize the codebook in the quantizer. This initialization step was crucial as it enhanced the utilization of the codebook, and improved the efficiency of the training process. Furthermore, introducing these features at a later stage enabled us to leverage their discriminative capabilities. This two-stage training strategy ensured that the model first learns simple features and then progressively refines its understanding by incorporating more representative discrete vector features.

4. Experiment Results

In this section, we evaluate the effectiveness of our DVR by employing four standard HSI datasets including Salinas (SA), Pavia University (PU), HyRANK-Loukia (HR-L), and WHU-Hi-HanChuan (HC) [43], which are extensively utilized for classification tasks. Then, we present the implementation details and evaluation metrics. Next, we conduct both qualitative and quantitative analyses compared to the state-of-the-art (SOTA) results. Last, we perform ablation experiments to gauge the impact of different modules and hyper-parameters on classification accuracy.

4.1. Data Description

We allocated varying proportions of labeled samples across different datasets. Specifically, for the SA and PU datasets, we randomly selected 1% of the labeled samples for training, 1% for validation, and 98% for testing. For the HR-L dataset, we designated 3% of the labeled samples for training, 3% for validation, and 94% for testing. As for the HC dataset, we used 0.2% of the samples for training, 0.2% for validation, and 99.6% for testing. The fixed number of training and testing samples can be found in Table 2.

Table 2. The numbers of training, validation, and testing samples in the SA dataset, the PU dataset, the HR-L dataset, and the HC Dataset.

4.1.1. Salinas

The SA dataset was captured through the use of the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Salinas Valley, California, USA. It is composed of 204 spectral bands after discarding the 20 water absorption bands, covering range from 400 nm to 2500 nm. The image size is 512 × 217 pixels with a ground sampling distance of 3.7 m. It includes 16 different land cover classes. (https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes#Salinas (accessed on 18 January 2025)).

4.1.2. Pavia University

The PU dataset was acquired utilizing the Reflective Optics System Imaging Spectrometer (ROSIS) sensor over the area of Pavia University and its surroundings in Italy. It comprises 103 spectral bands, spanning range from 430 nm to 860 nm. The image size is 610 × 340 pixels with a ground sampling distance of 1.3 m, encompassing nine different land cover categories (https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes#Pavia_University_scene (accessed on 18 January 2025)).

4.1.3. HyRANK-Loukia

The HR-L dataset was sourced by the Hyperion sensor on the Earth Observing-1 satellite. It encompasses a total of 176 spectral bands, spanning range from 400 nm to 2500 nm. The image size is 249 × 945 pixels with a ground sampling distance of 30m within this dataset, and there are 14 distinct land cover classes (https://zenodo.org/records/1222202 (accessed on 18 January 2025)).

4.1.4. WHU-Hi-HanChuan

The HC dataset was collected using the Headwall Nano-Hyperspec sensor mounted on a UAV. It contains 274 spectral bands ranging from 400 nm to 1000 nm, with a spatial resolution of 0.109 m. The imagery size is 1217 × 303 pixels, and the dataset includes seven crop species along with other land cover types such as buildings and water bodies (http://rsidea.whu.edu.cn/resource_WHUHi_sharing.htm (accessed on 18 January 2025)).

4.2. Experiment Setups

4.2.1. Implementation Details

In our experimental setup, we maintained the settings of the encoder models unchanged, while integrating our codebook to assist in the classification process. We implemented our approach using the PyTorch framework and trained it on an NVIDIA GeForce GTX 2080 Ti GPU with 11 GB of memory. The batch size and epoch count were configured at 64 and 300, respectively. To attain the best performance, as reported in [13,18,20,21], both the optimizer and scheduler were maintained at their default configurations. Additionally, data augmentation was employed to mitigate the issue of insufficient training samples in all approaches. For each batch of data in an iteration, one of the five augmentation techniques (vertical flip, horizontal flip, 90° rotation, 180° rotation, and 270° rotation) is randomly chosen with an equal probability.

4.2.2. Evaluation Metrics

To quantitatively evaluate the performance of HSI classification, we employ four metrics: overall accuracy (OA), average accuracy (AA), kappa coefficient (

κ

), and per-class accuracies. OA indicates the percentage of correctly predicted samples out of the total samples. AA represents the mean classification accuracy of each class. The

κ

coefficient measures the agreement between the ground truth and classification maps. To minimize experimental variability, we randomly split the labeled samples five times, and reported the mean values and standard deviations for these metrics. A lower standard deviation indicates a higher reliability and consistency.

4.2.3. Baseline Models

To demonstrate the effectiveness of the suggested DVR, a number of representative methods are selected for comparative experiments: the 3D-CNN [13], SpectralFormer (SF) [18], SSFTT [20], and GAHT [21]. The 3D-CNN employs a exclusively convolutional architecture, while the SpectralFormer is based on transformer architectures. The SSFTT and GAHT combine convolution and transformer elements in a hybrid architecture. The universality of DVR is better demonstrated through comparative experiments across various architectural types.

4.3. Comparative Experiments

4.3.1. Quantitative Assessment

Table 3, Table 4, Table 5 and Table 6 present the results for the OA, AA, kappa, and each class accuracy using various methods on the Salinas, Pavia University, and HyRANK-Loukia datasets, respectively. The optimal results are highlighted in bold. As the results illustrated, our method outperforms other SOTA methods across all four benchmark datasets. On the SA datasets, our DVR based on respective models achieved significantly higher OA compared with the 3D-CNN and SF, with a difference of 2.39% and 1.22%, respectively. It is obvious that our method also demonstrates lower standard deviations in both OA and specific accuracy for each class. On the PU datasets, our method incorporating DVR with the 3D-CNN achieved the highest improvement of 7.58% in OA. Meanwhile, the kappa value improved from 83.70% ± 1.77% to 94.06% ± 0.46%, indicating a substantial enhancement in model reliability and classification consistency. Similarly, our method based on the SSFTT and GAHT, respectively, showed performance enhancements. On the challenging HR-L dataset, DVR performed much better than the other methods. It is noted that the modified baseline model exhibited greater potential in classifying challenging datasets. This trend highlights the effectiveness of DVR in enhancing the reliability of the classification outcomes under constrained training scenarios.

Table 3. Classification performance of various methods on the SA dataset using only 1% training samples.

Table 4. Classification performance of various methods on the PU dataset using only 1% training samples.

Table 5. Classification performance of various methods on the HR-L dataset using only 3% training samples.

Table 6. Classification performance of various methods on the HC dataset using only 0.2% training samples.

4.3.2. Visual Evaluation

Figure 6, Figure 7, Figure 8 and Figure 9 display the classification maps obtained from various comparison methods on the Salinas, Pavia University, HyRANK-Loukia, and WHU-Hi-HanChuan datasets. We chose the results with the highest OA values from five trials to visualize the predicted samples using different methods for model comparison. Based on the visual comparisons, it is evident that the DVR strategy produces more accurate and less-noise classification maps, which more closely resembles the ground truth.

Figure 6. Classification maps by different methods (3D-CNN [13], SF [18], SSFTT [20], GAHT [21]) on the SA dataset with 1% training samples.

Figure 7. Classification maps by different methods (3D-CNN [13], SF [18], SSFTT [20], GAHT [21]) on the PU dataset with 1% training samples.

Figure 8. Classification maps by different methods (3D-CNN [13], SF [18], SSFTT [20], GAHT [21]) on the HR-L dataset with 3% training samples.

Figure 9. Classification maps by different methods (3D-CNN [13], SF [18], SSFTT [20], GAHT [21]) on the HC dataset with 0.2% training samples.

Furthermore, Figure 10 displays the t-SNE visualization [24] of hidden features from different methods on four distinct datasets. Compared to other methods, DVR demonstrates a more cohesive distribution, with less cross-aggregation and fewer instances of misclassifications among categories.

Figure 10. The t-SNE visualization results of encoded features by different methods (3D-CNN [13], SF [18], SSFTT [20], GAHT [21]) on the four datasets. Compared to other methods, DVR demonstrates a more cohesive distribution, with less cross-aggregation and fewer instances of misclassifications among categories.

4.4. Ablation Study

In this subsection, we employ a representative HSI classification approach (SpectralFormer [18]) to conduct extensive ablation experiments, focusing on key components and parameters of DVR that impact classification performance. We investigate the significance of the DVCM and the AC; then, we examine the effects of codebook size, codebook dimension, and the top-k nearest vectors from the codebook. To assess the impact of individual hyper-parameters on classification performance, we employ a systematic grid search [44] approach and alter one hyper-parameter at a time while fixing the values of others. Table 7 provides a summary of the hyper-parameter configurations that yields the highest classification accuracy across the four datasets.

Table 7. Hyper-parameter settings of four datasets.

4.4.1. Impact of DVCM

Table 8 displays the results from the ablation study of the DVCM on the PU dataset using the SpectralFormer backbone. The baseline model attains an OA of 88.80% ± 0.94%. By incorporating the AM and AC, our model improves the OA to 89.32% ± 0.85%. Taking into account these results, the inclusion of the DVCM enabled our model to achieve the highest OA of 90.90% ± 0.50%, underscoring the substantial performance enhancement facilitated by the codebook.

Table 8. Analysis of the DVCM on the PU dataset.

4.4.2. Codebook Size

As shown in Table 9, the analysis results demonstrate the impact of codebook size on OA. In the case of the SA dataset, we increased the codebook size from 70 to 100, and the OA was improved from 91.99% to 92.19%. However, when the size further increased to 150, there was a slight decrease in OA to 91.93%. Similarly, for the HR-L dataset, an initial increase in OA from 74.02% to 74.26% was observed when the codebook size was enlarged from 70 to 100. Nevertheless, when the codebook size reached 150, the OA dropped to 73.90%. It is noted that beyond a certain threshold (100), larger codebooks result in inefficiencies or over-parameterization in the model. This suggests that a moderate enlargement in the codebook size can help in capturing more intricate data characteristics and slightly enhance model performance. For the PU dataset, a slight variation appeared in the trend, with the highest OA of 90.90% achieved using a codebook size of 70. As the codebook size increased from 70 to 100 and then to 150, the OA decreased to 90.63% and 90.38%, respectively. This trend highlights that a smaller codebook size (70) is more appropriate for the PU dataset as it aligns with its intrinsic characteristics featuring fewer categories.

Table 9. Analysis of codebook size on the four datasets.

4.4.3. Codebook Dimension

We analyze the impact of the codebook dimension on the OA of the SpectralFormer model enhanced by our methodology on the PU dataset. Table 10 clearly demonstrates that varying the codebook dimension leads to subtle differences in model performance. Specifically, the codebook dimension was varied across five different sizes: 32, 64, 128, 256, and 512. The highest OA, at 90.90% ± 0.50%, was achieved with a codebook dimension of 64. This indicates an optimal setting at the size of 64, where the model was capable of effectively capturing essential features without excessive redundancy. As the dimension increased from 64 to 128, there was a slight decrease in OA to 90.59% ± 0.40%. This trend continued with further increments to 256 and 512, resulting in OA dropping slightly to 90.30% ± 0.60% and 90.58% ± 0.60%, respectively. These observations suggest that larger codebook dimensions do not confer better performance, as informative features become diluted within a larger embedding space. The consistent OA across all settings highlights the stability of the DVR model configuration. Our model maintains high performance regardless of substantial changes in the codebook dimension.

Table 10. Analysis of codebook dimension on the PU dataset.

4.4.4. Top-k Selection

We investigate the impact of the Top-k parameter on the performance of our model, as detailed in Table 11. Here, Top-k denotes the k codes from the codebook that were closest to the encoder features. These codes were averaged before being fed into the AC. All experiments were carried out on the PU dataset with a consistent configuration where the codebook size was 70 and the codebook dimension was 64. With Top-k = 1, our model achieved an OA of 90.79% ± 0.58%, indicating a highly focused representation based on the most relevant code. Increasing to 5, the OA was slightly improved to 90.90% ± 0.50%, and it suggests that incorporating additional relevant codes can enhance model performance by providing a richer feature representation. However, expanding to 10 led to a slight decrease in OA to 90.75% ± 0.51%, indicating that including too many codes may dilute the feature representation, potentially introducing noise or less relevant information. These results demonstrate that the Top-k parameter has a nuanced impact on model accuracy, highlighting that a moderate number of codes offers a better balance between accuracy and feature representation.

Table 11. Analysis of Top-k on the PU dataset.

4.4.5. Impact of AC

We evaluate the impact of AC on the OA of our model. We conducted experiments on the PU dataset using the SpectralFormer backbone with and without the AC, as shown in Table 12. The naive SpectralFormer achieved an OA of 88.80% ± 0.94%. When incorporating our modifications into the SpectralFormer without AC, we observed an improvement in OA to 90.66% ± 0.81%. Furthermore, the inclusion of the AC in the modified SpectralFormer led to a further enhancement in OA to 90.90% ± 0.50%. These results clearly demonstrate the positive contribution of the AC to both model accuracy and stability.

Table 12. Analysis of the auxiliary classifier on the PU dataset.

4.5. Robustness Evaluation

Figure 11 displays the results of OA achieved by different methods with different proportions of training samples. To assess the stability and robustness of our proposed method, we randomly selected 1%, 2%, and 4% of labeled samples for the SA and the PU datasets, and 3%, 4%, and 5% for the HR-L dataset. Our method consistently outperformed other methods in all scenarios, which highlights the robustness of our model. The OA of the 3D-CNN was obviously low when training data were limited. However, when our method was integrated, it significantly enhanced the performance of the 3D-CNN. Varying degrees of improvement were also observed in other methods. The most significant enhancement was pronounced in the HR-L dataset. As the volume of training data increased, our method maintained higher accuracy than the other baselines. As the OA approached 100%, the rate of improvement diminished, and the observed marginal effects could be logically explained.

Figure 11. OA of different models (3D-CNN [13], SF [18], SSFTT [20], GAHT [21]) with different percentages of training samples.

4.6. Computational Cost

This subsection evaluates the incremental computational cost of enhancing the SpectralFormer model with various components within our methodology. We analyzed the impact of integrating the AM, DVCM, and AC on the total number of parameters, trainable parameters, and FLOPs (floating point operations per second). In Table 13, the baseline configuration of SpectralFormer includes 352,405 parameters, all of which are trainable, with a computational workload of 16.235776 million FLOPs. To be specific, with adding AM, the parameters and computational cost experienced a slight increase by 1.18% and 0.025%, respectively. Furthermore, the introduction of the DVCM led to a total parameter increase of 6.31% and a 0.053% rise in FLOPs, while maintaining the trainable parameters constant. This highlights its role as a static feature extractor. Incorporating AC further raised both total and trainable parameters by 6.61% and 1.48%, respectively. The computational costs were minimally increased by 0.059%. It was emphasized that the increases in parameter count and computational load brought about by our proposed DVR were extremely negligible.

Table 13. Analysis of computational cost.

5. Discussion

The proposed DVR method, while effective, has certain limitations. The performance shows slight sensitivity to codebook parameters, such as its size and dimension, which may require moderate tuning to achieve optimal results. An oversized codebook may increase computational costs, while an undersized one might fail to capture feature diversity effectively. Future work could focus on optimizing the codebook’s efficiency through advanced encoding algorithms, as well as leveraging automated hyper-parameter tuning methods to reduce the reliance on manual adjustments. Exploring dynamic codebook adjustment could further enhance the scalability and applicability of DVR. In future ocean applications, the visual or spectral differences between classes (e.g., in tasks such as sea ice detection or algal bloom prediction) can be very subtle, making it potentially difficult for existing models to distinguish between them effectively. DVR can leverage the codebook to regulate features, thereby enlarging the distance between features of different classes. This will enable more effective and accurate classification in such challenging scenarios.

6. Conclusions

To mitigate the common misclassification issues in current models for HSI classification, this article introduces an innovative DVR strategy that leverages discrete vectors from the codebook to regulate embedding features. This plug-and-play method enables models to attain a more robust aggregated distribution in the embedding space, thereby enhancing the overall performance of HSI classification. Experimental results conducted on four HSI benchmarks confirm the superiority of our proposed method on both the visual quality of classification maps and quantitative metrics compared to baseline models. Specifically, our DVR improves the OA of the 3D-CNN by 7.58% on the PU dataset and enhances the OA of SpectralFormer by more than 1% across all four datasets. Additionally, integrating DVR into SpectralFormer increases its trainable parameters by only 1.48% and computational cost by just 0.059%. In future work, we will extend the applications of our method to a wider range of models to further enhance performance and explore the potential of our approach in the marine domain, such as sea ice detection and algal bloom prediction.

Author Contributions

Conceptualization, J.L., H.W., X.Z. and J.W.; methodology, J.L., H.W., X.Z. and J.W.; software, H.W., X.Z. and J.W.; validation, J.L. and P.Z.; formal analysis, P.Z.; investigation, J.L. and H.W.; resources, P.Z.; data curation, T.Z.; writing—original draft preparation, J.L., H.W., X.Z. and J.W.; writing—review and editing, H.W., X.Z., T.Z. and P.Z.; visualization, H.W.; supervision, J.L. and P.Z.; project administration, J.L. and P.Z.; funding acquisition, J.L., P.Z. and T.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 62171252, in part by the Fundamental Research Funds for the Central Universities under Grant 00007764, in part by the Natural Science Foundation of China under Grant 42201386, in part by the Interdisciplinary Research Project for Young Teachers of USTB (Fundamental Research Funds for the Central Universities: FRF-IDRY-22-018), and Fundamental Research Funds for the Central Universities of USTB: FRF-TP-24-060A.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Bioucas-Dias, J.M.; Plaza, A.; Camps-Valls, G.; Scheunders, P.; Nasrabadi, N.; Chanussot, J. Hyperspectral remote sensing data analysis and future challenges. IEEE Geosci. Remote Sens. Mag. 2013, 1, 6–36. [Google Scholar] [CrossRef]
Hu, C. Hyperspectral reflectance spectra of floating matters derived from Hyperspectral Imager for the Coastal Ocean (HICO) observations. Earth Syst. Sci. Data 2022, 14, 1183–1192. [Google Scholar] [CrossRef]
Grøtte, M.E.; Birkeland, R.; Honoré-Livermore, E.; Bakken, S.; Garrett, J.L.; Prentice, E.F.; Sigernes, F.; Orlandić, M.; Gravdahl, J.T.; Johansen, T.A. Ocean color hyperspectral remote sensing with high resolution and low latency—The HYPSO-1 CubeSat mission. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–19. [Google Scholar] [CrossRef]
Li, S.; Song, W.; Fang, L.; Chen, Y.; Ghamisi, P.; Benediktsson, J.A. Deep learning for hyperspectral image classification: An overview. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6690–6709. [Google Scholar] [CrossRef]
Kumar, B.; Dikshit, O.; Gupta, A.; Singh, M.K. Feature extraction for hyperspectral image classification: A review. Int. J. Remote Sens. 2020, 41, 6248–6287. [Google Scholar] [CrossRef]
Yang, X.; Cao, W.; Lu, Y.; Zhou, Y. QTN: Quaternion transformer network for hyperspectral image classification. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 7370–7384. [Google Scholar] [CrossRef]
Fauvel, M.; Benediktsson, J.A.; Chanussot, J.; Sveinsson, J.R. Spectral and spatial classification of hyperspectral data using SVMs and morphological profiles. IEEE Trans. Geosci. Remote Sens. 2008, 46, 3804–3814. [Google Scholar] [CrossRef]
Pal, M.; Foody, G.M. Feature selection for classification of hyperspectral data by SVM. IEEE Trans. Geosci. Remote Sens. 2010, 48, 2297–2307. [Google Scholar] [CrossRef]
Fan, J.; Chen, T.; Lu, S. Superpixel guided deep-sparse-representation learning for hyperspectral image classification. IEEE Trans. Circuits Syst. Video Technol. 2017, 28, 3163–3173. [Google Scholar] [CrossRef]
Li, J.; Zhao, X.; Li, Y.; Du, Q.; Xi, B.; Hu, J. Classification of hyperspectral imagery using a new fully convolutional neural network. IEEE Geosci. Remote Sens. Lett. 2018, 15, 292–296. [Google Scholar] [CrossRef]
Hu, J.F.; Huang, T.Z.; Deng, L.J.; Jiang, T.X.; Vivone, G.; Chanussot, J. Hyperspectral image super-resolution via deep spatiospectral attention convolutional neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 7251–7265. [Google Scholar] [CrossRef] [PubMed]
Pu, C.; Huang, H.; Shi, X.; Wang, T. Semisupervised spatial-spectral feature extraction with attention mechanism for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
Hamida, A.B.; Benoit, A.; Lambert, P.; Amar, C.B. 3-D deep learning approach for remote sensing image classification. IEEE Trans. Geosci. Remote Sens. 2018, 56, 4420–4434. [Google Scholar] [CrossRef]
Cao, X.; Xu, L.; Meng, D.; Zhao, Q.; Xu, Z. Integration of 3-dimensional discrete wavelet transform and Markov random field for hyperspectral image classification. Neurocomputing 2017, 226, 90–100. [Google Scholar] [CrossRef]
Cao, X.; Zhou, F.; Xu, L.; Meng, D.; Xu, Z.; Paisley, J. Hyperspectral image classification with Markov random fields and a convolutional neural network. IEEE Trans. Image Process. 2018, 27, 2354–2367. [Google Scholar] [CrossRef] [PubMed]
Xie, J.; He, N.; Fang, L.; Ghamisi, P. Multiscale densely-connected fusion networks for hyperspectral images classification. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 246–259. [Google Scholar] [CrossRef]
Ran, R.; Deng, L.J.; Zhang, T.J.; Chang, J.; Wu, X.; Tian, Q. KNLConv: Kernel-space non-local convolution for hyperspectral image super-resolution. IEEE Trans. Multimed. 2024, 26, 8836–8848. [Google Scholar] [CrossRef]
Hong, D.; Han, Z.; Yao, J.; Gao, L.; Zhang, B.; Plaza, A.; Chanussot, J. SpectralFormer: Rethinking hyperspectral image classification with transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–15. [Google Scholar] [CrossRef]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Sun, L.; Zhao, G.; Zheng, Y.; Wu, Z. Spectral–spatial feature tokenization transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5522214. [Google Scholar] [CrossRef]
Mei, S.; Song, C.; Ma, M.; Xu, F. Hyperspectral image classification using group-aware hierarchical transformer. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5539014. [Google Scholar] [CrossRef]
Song, L.; Feng, Z.; Yang, S.; Zhang, X.; Jiao, L. Interactive Spectral-Spatial Transformer for Hyperspectral Image Classification. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 8589–8601. [Google Scholar] [CrossRef]
Hong, D.; Gao, L.; Hang, R.; Zhang, B.; Chanussot, J. Deep encoder–decoder networks for classification of hyperspectral and LiDAR data. IEEE Geosci. Remote Sens. Lett. 2020, 19, 1–5. [Google Scholar] [CrossRef]
Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Van Den Oord, A.; Vinyals, O.; Kavukcuoglu, K. Neural discrete representation learning. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 6309–6318. [Google Scholar]
Mao, C.; Jiang, L.; Dehghani, M.; Vondrick, C.; Sukthankar, R.; Essa, I. Discrete Representations Strengthen Vision Transformer Robustness. In Proceedings of the International Conference on Learning Representations (ICLR), Online, 25–29 April 2022. [Google Scholar]
Makantasis, K.; Karantzalos, K.; Doulamis, A.; Doulamis, N. Deep supervised learning for hyperspectral data classification through convolutional neural networks. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Milan, Italy, 26–31 July 2015; pp. 4959–4962. [Google Scholar]
Mei, S.; Chen, X.; Zhang, Y.; Li, J.; Plaza, A. Accelerating convolutional neural network-based hyperspectral image classification by step activation quantization. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–12. [Google Scholar]
Song, W.; Li, S.; Fang, L.; Lu, T. Hyperspectral image classification with deep feature fusion network. IEEE Trans. Geosci. Remote Sens. 2018, 56, 3173–3184. [Google Scholar] [CrossRef]
Zhao, X.; Tao, R.; Li, W.; Li, H.C.; Du, Q.; Liao, W.; Philips, W. Joint classification of hyperspectral and LiDAR data using hierarchical random walk and deep CNN architecture. IEEE Trans. Geosci. Remote Sens. 2020, 58, 7355–7370. [Google Scholar] [CrossRef]
Chen, Y.; Jiang, H.; Li, C.; Jia, X.; Ghamisi, P. Deep feature extraction and classification of hyperspectral images based on convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 2016, 54, 6232–6251. [Google Scholar] [CrossRef]
He, M.; Li, B.; Chen, H. Multi-scale 3D deep convolutional neural network for hyperspectral image classification. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3904–3908. [Google Scholar]
Zhong, Z.; Li, J.; Luo, Z.; Chapman, M. Spectral–spatial residual network for hyperspectral image classification: A 3-D deep learning framework. IEEE Trans. Geosci. Remote Sens. 2017, 56, 847–858. [Google Scholar] [CrossRef]
Mei, S.; Ji, J.; Geng, Y.; Zhang, Z.; Li, X.; Du, Q. Unsupervised spatial–spectral feature learning by 3D convolutional autoencoder for hyperspectral classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6808–6820. [Google Scholar] [CrossRef]
Xue, Z.; Xu, Q.; Zhang, M. Local transformer with spatial partition restore for hyperspectral image classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 4307–4325. [Google Scholar] [CrossRef]
Hendrycks, D.; Basart, S.; Mu, N.; Kadavath, S.; Wang, F.; Dorundo, E.; Desai, R.; Zhu, T.; Parajuli, S.; Guo, M.; et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 8340–8349. [Google Scholar]
Steiner, A.; Kolesnikov, A.; Zhai, X.; Wightman, R.; Uszkoreit, J.; Beyer, L. How to train your vit? data, augmentation, and regularization in vision transformers. arXiv 2021, arXiv:2106.10270. [Google Scholar]
Wang, H.; Ge, S.; Lipton, Z.; Xing, E.P. Learning robust global representations by penalizing local predictive power. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32, pp. 10506–10518. [Google Scholar]
Huang, Z.; Wang, H.; Xing, E.P.; Huang, D. Self-challenging improves cross-domain generalization. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 124–140. [Google Scholar]
Hu, Q.; Zhang, G.; Qin, Z.; Cai, Y.; Yu, G.; Li, G.Y. Robust semantic communications with masked VQ-VAE enabled codebook. IEEE Trans. Wirel. Commun. 2023, 22, 8707–8722. [Google Scholar] [CrossRef]
Xi, B.; Li, J.; Diao, Y.; Li, Y.; Li, Z.; Huang, Y.; Chanussot, J. Dgssc: A deep generative spectral-spatial classifier for imbalanced hyperspectral imagery. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 1535–1548. [Google Scholar] [CrossRef]
Mao, A.; Mohri, M.; Zhong, Y. Cross-entropy loss functions: Theoretical analysis and applications. In Proceedings of the International Conference on Machine Learning (ICML), Honolulu, HI, USA, 23–29 July 2023; pp. 23803–23828. [Google Scholar]
Zhong, Y.; Hu, X.; Luo, C.; Wang, X.; Zhao, J.; Zhang, L. WHU-Hi: UAV-borne hyperspectral with high spatial resolution (H2) benchmark datasets and classifier for precise crop identification based on deep convolutional neural network with CRF. Remote Sens. Environ. 2020, 250, 112012. [Google Scholar] [CrossRef]
Brito, J.A.; McNeill, F.E.; Webber, C.E.; Chettle, D.R. Grid search: An innovative method for the estimation of the rates of lead exchange between body compartments. J. Environ. Monit. 2005, 7, 241–247. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Misclassification due to encoded features overlap. (a) Category results of prediction. (b) Category results of label. The t-SNE visualization of encoded features from SpectralFormer [18] on the Pavia University dataset clearly demonstrates that SpectralFormer incorrectly categorizes instances (labeled as blue) as belonging to the yellow category, while also misclassifying instances (labeled as blue and pink) as belonging to the red category. The overlap of encoded features significantly contributes to classification inaccuracies.

Figure 2. Comparison between previous architecture and our DVR strategy. (a) Previous architecture. Previous architecture typically comprises an encoder and a classification head, however, it faces difficulties due to the disorderly distribution of encoded features, resulting in a decline in classification accuracy. (b) DVR. Our DVR approach integrates the discrete vector representation into the embedding space of encoded feature, aiming to optimize the distribution of encoded features by making features of the same category more compactly clustered and reduce the likelihood of misclassification by the classifier.

Figure 3. The proposed DVR framework for HSI classification. Firstly, the encoder in the model extracts spatial–spectral features from each patch, and these features are then adjusted by the Adaptive Module to align with the embedding space defined by the codebook. This codebook comprises multiple discrete vectors that represent different classes, which are refined through the DVCM (between AM and AC) during training iterations. Subsequently, the framework calculates the auxiliary (Aux) class descriptor by averaging the top k nearest vectors from the codebook. The descriptor is employed by the Aux classifier to predict the class of each input patch. Ultimately, this prediction combines with the output of primary classifier to generate a classified image as final output.

Figure 4. Codebook visualization. The different codes in the codebook represent the feature information for various categories. For instance, in the SA dataset, code 21 corresponds to the category “Vinyard untrained”.

Figure 5. Dual classification strategy. We select the top five closest codes within the embedding space and average them to generate our auxiliary class descriptor. By merging predictions from both the primary and codebook-based classifications, we leverage the complementary information within the codebook features to improve classification performance.

Figure 6. Classification maps by different methods (3D-CNN [13], SF [18], SSFTT [20], GAHT [21]) on the SA dataset with 1% training samples.

Figure 7. Classification maps by different methods (3D-CNN [13], SF [18], SSFTT [20], GAHT [21]) on the PU dataset with 1% training samples.

Figure 8. Classification maps by different methods (3D-CNN [13], SF [18], SSFTT [20], GAHT [21]) on the HR-L dataset with 3% training samples.

Figure 9. Classification maps by different methods (3D-CNN [13], SF [18], SSFTT [20], GAHT [21]) on the HC dataset with 0.2% training samples.

Figure 10. The t-SNE visualization results of encoded features by different methods (3D-CNN [13], SF [18], SSFTT [20], GAHT [21]) on the four datasets. Compared to other methods, DVR demonstrates a more cohesive distribution, with less cross-aggregation and fewer instances of misclassifications among categories.

Figure 11. OA of different models (3D-CNN [13], SF [18], SSFTT [20], GAHT [21]) with different percentages of training samples.

Table 1. Definition of notations used in DVR.

Parameters	Definition
x	An original HSI data
$x_{i}^{p}$	An HSI data patch
$f (\cdot)$	Encoder
e	The feature vector from encoder
$Φ (x)$	The standard Gaussian cumulative distribution function
h	The feature vector from the Adaptive Module
v	The code in the codebook
z	The index of top k nearest vectors in the codebook
$Topkmin$	The k minest-distance items
$γ$	The decay factor
sg[·]	The stop-gradient operator
o	The final output combining two classifiers
a	The output of the auxiliary classifier
p	The output of the primary classifier
t	The ground truth
$λ$	The weight of primary classifier
$β$	The weight of auxiliary classifier

Table 2. The numbers of training, validation, and testing samples in the SA dataset, the PU dataset, the HR-L dataset, and the HC Dataset.

SA (1% for Training)					PU (1% for Training)
NO.	Class Name	Train	Val.	Test	NO.	Class Name	Train	Val.	Test
1	Broccoli_green_weeds_1	40	40	1929	1	Asphalt	132	133	6366
2	Broccoli_green_weeds_2	74	75	3577	2	Meadows	373	373	17,903
3	Fallow	40	39	1897	3	Gravel	42	42	2015
4	Fallow_rough_plow	28	28	1338	4	Trees	62	61	2940
5	Fallow_smooth	53	54	2571	5	Metal Sheets	27	27	1291
6	Stubble	79	79	3801	6	Bare Soil	100	101	4828
7	Celery	71	72	3436	7	Bitumen	27	26	1277
8	Grapes_untrained	225	226	10,820	8	Bricks	73	74	3535
9	Soil_vinyard_develop	124	124	5955	9	Shadows	19	19	909
10	Corn_sensced_green_weeds	65	66	3147	-	-	-	-	-
11	Lettuce_romaine_4wk	22	21	1025	-	-	-	-	-
12	Lettuce_romaine_5wk	39	38	1850	-	-	-	-	-
13	Lettuce_romaine_6wk	19	18	879	-	-	-	-	-
14	Lettuce_romaine_7wk	22	21	1027	-	-	-	-	-
15	Vinyard_untrained	145	146	6977	-	-	-	-	-
16	Vinyard_vertical_trellis	36	36	1735	-	-	-	-	-
Total	-	1082	1083	51,964	-	-	855	856	41,065
HR-L (3% for Training)					HC (0.2% for Training)
NO.	Class Name	Train	Val.	Test	NO.	Class Name	Train	Val.	Test
1	Dense Urban Fabric	8	9	271	1	Strawberry	89	90	44,556
2	Mineral Extraction Sites	2	2	63	2	Cowpea	46	45	22,662
3	Non Irrigated Arable Land	17	16	509	3	Soybean	21	20	10,246
4	Fruit Trees	3	2	74	4	Sorghum	10	11	5332
5	Olive Groves	42	42	1317	5	Water spinach	2	3	1195
6	Broad-leaved Forest	6	7	210	6	Watermelon	9	9	4515
7	Coniferous Forest	15	15	470	7	Greens	12	12	5879
8	Mixed Forest	32	32	1008	8	Trees	36	36	17,906
9	Dense Sclerophyllous Vegetation	114	114	3565	9	Grass	19	19	9431
10	Sparse Sclerophyllous Vegetation	84	84	2635	10	Red roof	21	21	10,474
11	Sparcely Vegetated Areas	12	12	380	11	Gray roof	34	34	16,843
12	Rocks and Sand	14	15	458	12	Plastic	8	7	3664
13	Water	42	42	1309	13	Bare soil	18	18	9080
14	Coastal Water	14	13	424	14	Road	37	37	18,486
-	-	-	-	-	15	Bright object	2	2	1132
-	-	-	-	-	16	Water	151	151	75,099
Total	-	405	405	12,693	-	-	515	515	256,500

Note: The colors of each category in the table correspond to the colors used in the visualization.

Table 3. Classification performance of various methods on the SA dataset using only 1% training samples.

	3D-CNN [13]	3D-CNN + DVR	SF [18]	SF + DVR	SSFTT [20]	SSFTT + DVR	GAHT [21]	GAHT + DVR
Class	3D-CNN [13]	3D-CNN + DVR	SF [18]	SF + DVR	SSFTT [20]	SSFTT + DVR	GAHT [21]	GAHT + DVR
1	96.48 ± 1.12	98.75 ± 1.04	95.80 ± 0.83	96.91 ± 0.18	99.87 ± 0.26	99.47 ± 0.85	100.00 ± 0.00	99.93 ± 0.14
2	99.92 ± 0.03	99.78 ± 0.36	99.06 ± 0.45	99.03 ± 0.44	99.67 ± 0.27	99.85 ± 0.12	99.99 ± 0.01	100.00 ± 0.00
3	91.47 ± 2.28	96.29 ± 0.52	94.42 ± 2.81	91.85 ± 3.38	96.02 ± 3.74	97.44 ± 1.91	99.40 ± 0.28	99.03 ± 1.24
4	98.07 ± 0.39	98.73 ± 0.69	93.44 ± 1.70	95.80 ± 1.52	99.39 ± 0.41	99.08 ± 0.69	98.55 ± 1.66	98.70 ± 1.49
5	95.60 ± 2.34	95.50 ± 2.73	92.31 ± 3.04	90.18 ± 2.60	97.07 ± 2.51	98.28 ± 1.23	99.06 ± 0.61	99.28 ± 0.63
6	99.34 ± 0.45	99.82 ± 0.18	99.51 ± 0.57	99.45 ± 0.54	99.42 ± 0.97	99.86 ± 0.17	99.99 ± 0.02	100.00 ± 0.00
7	99.45 ± 0.27	99.80 ± 0.17	98.87 ± 0.46	98.64 ± 0.67	99.09 ± 0.58	99.56 ± 0.42	99.89 ± 0.19	99.99 ± 0.01
8	84.13 ± 2.04	87.86 ± 2.42	85.58 ± 2.32	85.88 ± 1.81	89.52 ± 2.39	88.18 ± 3.05	93.66 ± 1.29	93.64 ± 1.15
9	97.33 ± 1.46	98.79 ± 1.06	96.99 ± 0.91	99.12 ± 0.81	99.15 ± 0.61	99.27 ± 0.55	99.96 ± 0.06	99.96 ± 0.06
10	90.59 ± 1.68	93.01 ± 2.10	88.20 ± 2.43	89.60 ± 1.32	92.76 ± 2.19	94.70 ± 0.81	96.71 ± 2.25	97.98 ± 2.03
11	81.57 ± 9.03	94.71 ± 3.37	89.02 ± 1.11	91.90 ± 2.51	98.15 ± 1.09	96.91 ± 2.03	98.76 ± 0.73	99.31 ± 0.38
12	99.34 ± 0.38	98.49 ± 1.32	98.76 ± 0.65	98.91 ± 0.67	99.90 ± 0.10	99.79 ± 0.23	99.83 ± 0.20	99.74 ± 0.21
13	97.77 ± 2.50	94.08 ± 6.89	93.54 ± 7.05	96.39 ± 5.40	96.44 ± 4.04	98.44 ± 1.50	97.84 ± 2.22	98.57 ± 2.04
14	96.19 ± 3.14	97.06 ± 1.78	95.54 ± 2.02	96.97 ± 1.38	95.94 ± 2.51	95.27 ± 2.70	97.98 ± 1.71	98.02 ± 1.07
15	75.29 ± 3.66	80.50 ± 3.47	76.63 ± 4.94	82.88 ± 2.76	78.62 ± 4.27	84.85 ± 3.04	89.69 ± 1.32	91.76 ± 2.61
16	90.83 ± 4.59	93.38 ± 5.32	93.50 ± 4.55	92.64 ± 2.71	96.75 ± 1.55	95.45 ± 3.46	97.72 ± 2.04	98.61 ± 0.35
OA (%)	90.89 ± 0.55	93.28 ± 0.32	91.04 ± 0.63	92.26 ± 0.11	93.69 ± 0.34	94.49 ± 0.33	96.80 ± 0.41	97.20 ± 0.21
AA (%)	93.33 ± 0.67	95.41 ± 0.44	93.20 ± 0.61	94.13 ± 0.29	96.11 ± 0.42	96.65 ± 0.22	98.07 ± 0.18	98.41 ± 0.36
$κ$ × 100	89.86 ± 0.61	92.51 ± 0.35	90.03 ± 0.71	91.39 ± 0.12	92.97 ± 0.39	93.86 ± 0.37	96.43 ± 0.45	96.88 ± 0.24
Train time(s)	173.65 ± 0.44	184.17 ± 1.37	192.51 ± 1.08	221.42 ± 1.05	171.98 ± 3.26	193.48 ± 1.71	222.26 ± 1.33	243.29 ± 1.31
Test time(s)	6.98 ± 0.21	8.52 ± 0.05	19.29 ± 0.05	22.43 ± 0.19	6.91 ± 0.12	8.68 ± 0.12	10.12 ± 0.45	11.61 ± 0.38

Table 4. Classification performance of various methods on the PU dataset using only 1% training samples.

	3D-CNN [13]	3D-CNN + DVR	SF [18]	SF + DVR	SSFTT [20]	SSFTT + DVR	GAHT [21]	GAHT + DVR
Class	3D-CNN [13]	3D-CNN + DVR	SF [18]	SF + DVR	SSFTT [20]	SSFTT + DVR	GAHT [21]	GAHT + DVR
1	93.48 ± 1.07	95.38 ± 1.68	87.87 ± 1.79	90.81 ± 0.94	95.84 ± 0.94	97.20 ± 0.90	97.67 ± 1.16	97.43 ± 0.69
2	97.35 ± 1.71	99.12 ± 0.36	96.29 ± 1.94	98.07 ± 0.39	99.25 ± 0.77	99.39 ± 0.31	99.69 ± 0.09	99.67 ± 0.18
3	47.78 ± 7.32	81.17 ± 3.20	68.40 ± 8.13	67.77 ± 5.64	89.23 ± 3.27	89.48 ± 3.03	90.23 ± 4.01	90.00 ± 2.58
4	94.45 ± 1.36	96.42 ± 1.20	89.70 ± 3.27	90.90 ± 2.00	97.86 ± 0.79	97.81 ± 0.60	97.19 ± 0.87	97.83 ± 0.33
5	99.74 ± 0.23	99.89 ± 0.15	100.00 ± 0.00	99.97 ± 0.04	99.98 ± 0.03	99.94 ± 0.09	100.00 ± 0.00	100.00 ± 0.00
6	56.84 ± 4.53	92.62 ± 2.19	79.94 ± 6.92	83.43 ± 1.42	98.50 ± 0.77	99.23 ± 0.23	98.25 ± 1.25	99.48 ± 0.40
7	66.66 ± 8.18	82.10 ± 5.90	58.25 ± 6.65	55.36 ± 5.36	92.66 ± 3.83	91.34 ± 3.52	92.82 ± 5.66	98.27 ± 0.78
8	90.91 ± 1.29	91.32 ± 1.83	80.69 ± 1.18	87.27 ± 1.25	90.08 ± 4.58	95.33 ± 1.65	94.15 ± 1.94	95.39 ± 1.09
9	99.01 ± 0.75	99.38 ± 0.36	95.82 ± 1.40	92.18 ± 0.46	99.81 ± 0.13	99.61 ± 0.33	99.68 ± 0.10	99.63 ± 0.27
OA (%)	87.95 ± 1.31	95.53 ± 0.35	88.80 ± 0.94	90.90 ± 0.50	97.08 ± 0.64	97.85 ± 0.22	97.89 ± 0.30	98.29 ± 0.12
AA (%)	82.91 ± 1.93	93.04 ± 0.49	84.11 ± 1.86	85.09 ± 0.93	95.91 ± 0.90	96.59 ± 0.52	96.63 ± 0.59	97.52 ± 0.07
$κ$ × 100	83.70 ± 1.77	94.06 ± 0.46	85.09 ± 1.26	87.84 ± 0.66	96.14 ± 0.8	97.16 ± 0.29	97.19 ± 0.41	97.74 ± 0.16
Train time(s)	161.01 ± 1.09	179.40 ± 2.79	221.72 ± 11.13	237.73 ± 0.66	171.68 ± 0.34	187.51 ± 1.27	207.99 ± 0.99	220.48 ± 1.52
Test time(s)	9.05 ± 0.45	11.92 ± 0.09	29.38 ± 7.23	33.62 ± 0.28	9.15 ± 0.37	12.48 ± 0.50	17.63 ± 0.42	18.79 ± 0.19

Table 5. Classification performance of various methods on the HR-L dataset using only 3% training samples.

	3D-CNN [13]	3D-CNN + DVR	SF [18]	SF + DVR	SSFTT [20]	SSFTT + DVR	GAHT [21]	GAHT + DVR
Class	3D-CNN [13]	3D-CNN + DVR	SF [18]	SF + DVR	SSFTT [20]	SSFTT + DVR	GAHT [21]	GAHT + DVR
1	40.07 ± 3.94	47.08 ± 7.23	46.49 ± 10.08	40.37 ± 7.08	64.28 ± 6.01	53.58 ± 6.20	64.06 ± 10.27	71.14 ± 5.08
2	87.62 ± 8.60	83.81 ± 16.69	96.83 ± 4.92	96.83 ± 6.35	86.35 ± 8.61	79.37 ± 6.02	90.48 ± 10.43	90.79 ± 5.35
3	76.54 ± 6.78	81.69 ± 5.55	64.95 ± 6.10	65.78 ± 5.99	80.51 ± 7.05	84.83 ± 1.93	84.09 ± 5.91	85.78 ± 2.23
4	3.78 ± 4.63	44.05 ± 9.65	2.97 ± 3.86	5.95 ± 4.41	38.11 ± 14.29	42.70 ± 14.32	35.41 ± 13.63	33.24 ± 4.32
5	77.86 ± 2.67	84.87 ± 2.36	71.28 ± 5.70	77.01 ± 2.74	87.64 ± 2.87	91.94 ± 0.66	89.61 ± 4.25	89.96 ± 3.10
6	40.29 ± 5.82	62.29 ± 7.79	24.48 ± 10.40	18.29 ± 9.40	55.05 ± 9.13	53.62 ± 4.82	42.48 ± 12.53	49.24 ± 4.91
7	48.21 ± 8.23	60.89 ± 5.53	41.62 ± 4.38	47.62 ± 2.47	65.53 ± 5.40	66.21 ± 5.04	61.66 ± 5.25	60.81 ± 6.70
8	61.53 ± 9.13	74.13 ± 2.70	63.81 ± 5.04	63.19 ± 3.17	70.42 ± 5.28	71.17 ± 7.46	71.31 ± 7.90	68.69 ± 6.24
9	82.19 ± 2.53	78.59 ± 1.35	68.59 ± 2.19	74.42 ± 3.34	81.91 ± 3.28	85.80 ± 2.27	83.94 ± 2.10	83.36 ± 2.42
10	76.70 ± 2.54	81.18 ± 0.86	71.51 ± 2.58	75.29 ± 2.34	81.65 ± 2.57	82.09 ± 1.22	76.51 ± 3.31	82.05 ± 2.45
11	43.42 ± 15.16	61.63 ± 4.54	53.63 ± 11.62	57.58 ± 5.41	65.00 ± 10.93	71.00 ± 4.40	72.84 ± 5.65	72.42 ± 4.49
12	91.75 ± 1.57	92.45 ± 1.19	90.04 ± 6.52	90.57 ± 3.48	93.97 ± 1.03	92.49 ± 1.01	92.93 ± 3.25	92.88 ± 2.85
13	100.00 ± 0.00	100.00 ± 0.00	100.00 ± 0.00	100.00 ± 0.00	100.00 ± 0.00	100.00 ± 0.00	100.00 ± 0.00	100.00 ± 0.00
14	100.00 ± 0.00	99.86 ± 0.19	100.00 ± 0.00	99.72 ± 0.57	100.00 ± 0.00	99.95 ± 0.09	99.81 ± 0.23	99.67 ± 0.35
OA (%)	77.07 ± 0.63	80.69 ± 0.41	71.12 ± 0.47	74.26 ± 0.24	82.22 ± 0.76	83.97 ± 0.19	81.98 ± 0.62	83.07 ± 0.59
AA (%)	66.43 ± 0.74	75.18 ± 1.39	64.01 ± 1.25	65.19 ± 0.81	76.46 ± 1.30	76.77 ± 1.09	76.08 ± 2.20	77.15 ± 0.58
$κ$ × 100	72.46 ± 0.72	77.08 ± 0.46	65.66 ± 0.52	69.27 ± 0.22	78.82 ± 0.86	80.87 ± 0.22	78.51 ± 0.70	79.79 ± 0.67
Train time	165.27 ± 7.74	175.99 ± 2.25	257.05 ± 1.01	261.42 ± 3.28	185.41 ± 0.34	195.18 ± 1.05	219.04 ± 0.97	230.73 ± 2.03
Test time	10.94 ± 1.41	12.13 ± 0.13	53.05 ± 0.54	58.35 ± 1.63	13.73 ± 0.30	15.45 ± 0.27	21.95 ± 0.56	22.50 ± 0.14

Table 6. Classification performance of various methods on the HC dataset using only 0.2% training samples.

	3D-CNN [13]	3D-CNN + DVR	SF [18]	SF + DVR	SSFTT [20]	SSFTT + DVR	GAHT [21]	GAHT + DVR
Class	3D-CNN [13]	3D-CNN + DVR	SF [18]	SF + DVR	SSFTT [20]	SSFTT + DVR	GAHT [21]	GAHT + DVR
1	92.14 ± 3.07	90.99 ± 2.77	90.57 ± 1.54	93.39 ± 2.15	94.27 ± 1.54	92.41 ± 3.98	96.35 ± 0.49	95.42 ± 0.97
2	79.04 ± 1.47	78.48 ± 1.97	70.53 ± 3.35	66.02 ± 9.33	80.10 ± 4.21	81.49 ± 2.44	86.13 ± 1.73	84.97 ± 3.84
3	58.81 ± 6.87	59.98 ± 4.90	61.89 ± 9.70	65.80 ± 8.61	58.05 ± 6.82	56.30 ± 10.70	79.62 ± 4.41	80.00 ± 6.30
4	67.01 ± 9.84	77.48 ± 3.08	49.48 ± 13.68	59.50 ± 12.21	70.55 ± 11.83	78.56 ± 6.22	90.32 ± 5.85	92.34 ± 2.13
5	5.56 ± 7.86	8.82 ± 7.08	8.02 ± 4.94	18.56 ± 10.94	13.84 ± 5.70	7.25 ± 6.24	7.50 ± 13.47	28.44 ± 12.78
6	1.79 ± 2.25	9.52 ± 7.55	7.15 ± 4.43	5.44 ± 3.24	14.05 ± 4.99	3.39 ± 3.88	10.17 ± 6.58	26.82 ± 5.03
7	84.74 ± 8.08	80.73 ± 6.04	47.31 ± 12.25	50.02 ± 11.56	64.48 ± 20.83	75.23 ± 16.35	73.70 ± 5.03	78.28 ± 7.55
8	66.40 ± 2.72	61.88 ± 2.10	65.11 ± 2.05	65.76 ± 2.38	69.33 ± 8.39	76.31 ± 3.94	73.61 ± 4.71	73.93 ± 4.08
9	23.92 ± 6.13	32.48 ± 4.94	21.18 ± 5.47	29.11 ± 8.46	48.11 ± 17.59	27.40 ± 13.18	49.87 ± 7.43	66.21 ± 5.78
10	37.82 ± 10.60	54.09 ± 7.42	77.56 ± 3.68	68.90 ± 2.85	87.80 ± 8.67	86.33 ± 4.66	92.33 ± 4.08	92.09 ± 3.77
11	52.83 ± 15.64	62.28 ± 7.31	71.08 ± 3.52	79.54 ± 4.91	70.01 ± 14.16	83.43 ± 7.05	84.58 ± 5.08	87.59 ± 8.24
12	1.92 ± 3.51	0.04 ± 0.07	19.11 ± 10.26	21.27 ± 13.21	15.13 ± 12.45	15.03 ± 7.96	12.91 ± 11.95	30.84 ± 11.22
13	26.81 ± 10.61	32.51 ± 8.36	36.12 ± 7.87	44.36 ± 2.80	36.60 ± 13.85	40.51 ± 16.04	50.91 ± 4.65	60.85 ± 5.75
14	75.46 ± 4.03	79.92 ± 2.65	82.53 ± 5.47	79.30 ± 6.19	74.67 ± 6.09	82.70 ± 4.69	90.03 ± 2.54	88.15 ± 4.97
15	33.50 ± 19.37	30.09 ± 22.28	24.98 ± 12.50	22.33 ± 12.02	47.60 ± 22.82	39.89 ± 15.42	0.30 ± 0.60	17.70 ± 21.52
16	98.43 ± 1.25	99.04 ± 0.67	98.37 ± 0.50	98.06 ± 0.78	98.43 ± 1.05	97.30 ± 2.50	99.23 ± 0.48	99.12 ± 0.87
OA (%)	74.64 ± 1.44	76.66 ± 0.52	76.28 ± 0.51	77.34 ± 0.45	79.74 ± 1.65	80.56 ± 1.59	85.13 ± 0.62	86.75 ± 0.77
AA (%)	50.39 ± 2.98	53.65 ± 1.09	51.94 ± 1.55	54.21 ± 1.06	58.94 ± 2.42	58.97 ± 1.89	62.35 ± 1.68	68.92 ± 2.14
$κ$ × 100	69.88 ± 1.81	72.40 ± 0.59	72.07 ± 0.59	73.34 ± 0.53	76.12 ± 1.98	77.20 ± 1.80	82.53 ± 0.74	84.46 ± 0.88
Train time(s)	150.60 ± 1.09	161.40 ± 1.93	272.96 ± 1.45	286.50 ± 2.16	165.61 ± 1.44	181.70 ± 1.45	214.08 ± 0.80	229.91 ± 1.58
Test time(s)	28.18 ± 0.07	32.67 ± 0.10	118.27 ± 0.43	125.34 ± 0.68	28.04 ± 0.07	32.16 ± 0.10	27.66 ± 0.11	33.88 ± 1.34

Table 7. Hyper-parameter settings of four datasets.

Dataset	Codebook Size	Codebook Dim	Top-k
SA	100	64	5
PU	70	64	5
HR-L	100	64	5
HC	100	64	5

Table 8. Analysis of the DVCM on the PU dataset.

Backbone	AM	DVCM	AC	OA (%)
SpectralFormer	×	×	×	88.80 ± 0.94
SpectralFormer	✓	×	✓	89.32 ± 0.85
SpectralFormer	✓	✓	✓	90.90 ± 0.50

Table 9. Analysis of codebook size on the four datasets.

Backbone	Dataset	Codebook Size	OA (%)
SpectralFormer	SA	70	91.99 ± 0.55
SpectralFormer	SA	100	92.26 ± 0.11
SpectralFormer	SA	150	91.93 ± 0.48
SpectralFormer	PU	70	90.90 ± 0.50
SpectralFormer	PU	100	90.63 ± 0.71
SpectralFormer	PU	150	90.38 ± 0.94
SpectralFormer	HR-L	70	74.02 ± 0.61
SpectralFormer	HR-L	100	74.26 ± 0.24
SpectralFormer	HR-L	150	73.90 ± 0.48
SpectralFormer	HC	70	77.05 ± 0.51
SpectralFormer	HC	100	77.34 ± 0.45
SpectralFormer	HC	150	77.10 ± 0.62

Table 10. Analysis of codebook dimension on the PU dataset.

Backbone	Dataset	Codebook Dim	OA (%)
SpectralFormer	PU	32	90.52 ± 0.60
SpectralFormer	PU	64	90.90 ± 0.50
SpectralFormer	PU	128	90.59 ± 0.40
SpectralFormer	PU	256	90.30 ± 0.60
SpectralFormer	PU	512	90.58 ± 0.60

Table 11. Analysis of Top-k on the PU dataset.

Backbone	Dataset	Top-k	OA (%)
SpectralFormer	PU	1	90.79 ± 0.58
SpectralFormer	PU	5	90.90 ± 0.50
SpectralFormer	PU	10	90.75 ± 0.51

Table 12. Analysis of the auxiliary classifier on the PU dataset.

Method	Dataset	AC	OA (%)
SpectralFormer	PU	×	88.80 ± 0.94
SpectralFormer+DVR	PU	×	90.66 ± 0.81
SpectralFormer+DVR	PU	✓	90.90 ± 0.50

Table 13. Analysis of computational cost.

Backbone	AM	DVCM	AC	Total Params	Trainable Params	FLOPs
SpectralFormer	×	×	×	352,405	352,405	16.235776M
SpectralFormer	✓	×	×	356,565 (1.18%)	356,565 (1.18%)	16.239872M (0.025%)
SpectralFormer	✓	✓	×	374,625 (6.31%)	356,565 (1.18%)	16.244352M (0.053%)
SpectralFormer	✓	✓	✓	375,665 (6.61%)	357,605 (1.48%)	16.245376M (0.059%)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Article Metrics

Citations

Article Access Statistics

Journal Statistics

Multiple requests from the same IP address are counted as one view.