Transfer Learning-Based Hyperspectral Image Classification Using Residual Dense Connection Networks

The extraction of effective classification features from high-dimensional hyperspectral images, impeded by the scarcity of labeled samples and uneven sample distribution, represents a formidable challenge within hyperspectral image classification. Traditional few-shot learning methods confront the dual dilemma of limited annotated samples and the necessity for deeper, more effective features from complex hyperspectral data, often resulting in suboptimal outcomes. The prohibitive cost of sample annotation further exacerbates the challenge, making it difficult to rely on a scant number of annotated samples for effective feature extraction. Prevailing high-accuracy algorithms require abundant annotated samples and falter in deriving deep, discriminative features from limited data, compromising classification performance for complex substances. This paper advocates for an integration of advanced spectral–spatial feature extraction with meta-transfer learning to address the classification of hyperspectral signals amidst insufficient labeled samples. Initially trained on a source domain dataset with ample labels, the model is transferred to a target domain with minimal samples, utilizing dense connection blocks and three-dimensional convolutional residual connections to enhance feature extraction and maximize spatial and spectral information retrieval. This approach, validated on three diverse hyperspectral datasets (IP, UP, and Salinas), significantly surpasses existing classification algorithms and small-sample techniques in accuracy, demonstrating its applicability to high-dimensional signal classification under label constraints.


Introduction
Hyperspectral imaging (HSI) systems amass extensive spatial and spectral data across a broad array of spectral bands, presenting a rich tapestry of information [1,2]. This bounty has catalyzed advancements across varied domains, such as precision agriculture [3], environmental surveillance [4,5], and disaster mitigation [6,7], signifying its interdisciplinary impact. The realm of hyperspectral image classification, a pivotal segment of hyperspectral analysis, has elicited considerable scholarly interest [8,9]. Yet, the classification endeavors for hyperspectral remote-sensing imagery confront persistent obstacles. A critical imperative lies in the more profound exploration of the intrinsic deep features within hyperspectral images. Addressing the paucity of training samples and enhancing classification efficacy in high-dimensional contexts with limited data remain pressing challenges. These hurdles underscore the substantial prospects for continued research and advancements in the field.
In traditional classification methods, the classification of hyperspectral images has focused on manual feature extraction [9][10][11][12] and the use of traditional shallow classifiers, including K-Nearest Neighbor (KNN) [13], Support Vector Machine (SVM) [14], logistic regression [15], the manifold learning method [16], among others. These conventional methods can only extract shallow feature information and neglect deep feature information.
Sensors 2024, 24, 2664
Classification performance relies significantly on prior knowledge, manual parameter adjustments, and feature selection. However, this approach lacks the adaptability required to address classification tasks in complex scenarios.
Deep learning methods possess the ability to acquire discriminative features from extensive annotated data and apply these features to classification tasks. As a result, deep learning methods have emerged as a promising approach to hyperspectral image (HSI) classification, offering substantial advantages over traditional methods. Chen et al. [17] utilized deep stacked autoencoders to extract spatial and spectral features from hyperspectral images. This approach effectively captured contextual spatial information and spectral information from HSIs, leading to a successful classification and good performance. To address the distinct characteristics of hyperspectral image data cubes, Li et al. [18] employed a 3D convolutional neural network (3D-CNN) for hyperspectral image classification. Thompson et al. used deep belief networks to extract features at a deep level for hyperspectral image classification [19]. Zhong et al. [20] introduced a supervised spectral-spatial residual network (SSRN) to iteratively acquire discriminative features from the abundant spectral characteristics and spatial contexts of hyperspectral images (HSI). The goal was to extract integrated spatial-spectral information and identify significant spectral-spatial features for classification purposes.
The performance of conventional supervised deep learning methods depends on a significant number of labeled samples for model training. Nevertheless, the exorbitant cost of annotation leads to a severely restricted number of labeled samples for hyperspectral images as a whole. Therefore, using traditional deep learning models for hyperspectral image classification with insufficient training samples can easily lead to overfitting and suboptimal classification performance. To overcome this challenge, researchers have proposed various approaches to tackle the issue of hyperspectral image classification in scenarios with limited sample sizes. Some approaches [21,22] employ data augmentation to generate additional training samples for deep learning models such as CNNs, thus expanding data size and improving the model's generalizability. Several semi-supervised approaches [23,24] involve the combination of a limited number of labeled samples with unlabeled samples during training. These methods leverage the information from unlabeled samples to obtain feature representations that are more robust and highly generalized. Transfer learning-based approaches [25,26] employ a model that has been pre-trained on a large-scale dataset. The weights of the pre-trained model serve as initialization parameters, which are subsequently fine-tuned on a small sample dataset. By harnessing the feature extraction capabilities of the pre-trained model, this approach effectively enhances the classification performance using small-sample datasets.
Taking into account the challenges in hyperspectral image classification, the limited availability of labeled training samples in hyperspectral images poses a significant constraint on the learning and feature extraction capacity of deep neural network models. Furthermore, the high-dimensional characteristics of hyperspectral images present a challenge to models trained on a small number of annotated samples regarding the extraction of an adequate set of features. As a consequence, the extraction of intrinsic deep-level features from hyperspectral images becomes arduous, leading to a diminished classification accuracy in hyperspectral image classification tasks. Therefore, the construction of deep neural network models for hyperspectral image classification in scenarios with limited training samples poses a significant research challenge. ResNet, through residual blocks, enables inter-layer connections that reinforce feature reuse and alleviate the vanishing gradient problem. In DenseNet structures, each layer is directly connected to all subsequent layers, which further mitigates the vanishing gradient problem and allows deeper features to be extracted effectively. Therefore, to more effectively extract deep features from the spectral and spatial dimensions of hyperspectral images under conditions of limited samples and to enhance the performance of hyperspectral classification, this paper presents a meta-transfer framework for few-shot hyperspectral image classification based on a three-dimensional Residual Dense Connection Network (ResDenseNet). The primary contributions of this paper are summarized as follows.
(1) The proposition of a meta-transfer few-shot learning classification (MFSC) method aimed at surmounting the hurdle of scarce annotated samples: The method employs a meta-learning training strategy to harmonize data from disparate class samples within a unified feature space, facilitating the prediction of categories for unlabeled samples through similarity between the support set and query set within this feature domain.
(2) The introduction of a novel hyperspectral image classification network, dubbed ResDenseNet, designed to address the underutilization of spectral and spatial information within hyperspectral images: This architecture synergizes the DenseNet (Densely Connected Convolutional Networks) [27,28] and ResNet (Residual Network) [29] frameworks. An enhanced spectral dense block is deployed for the assimilation of spatial-spectral features, complemented by a three-dimensional residual block for the further extraction of spatial and spectral attributes. Classification is achieved through a multilayer perceptron (MLP). The ResDenseNet architecture comprehensively mines deep features within the proximal space of samples, extracting more discriminative attributes to bolster the classification acumen of hyperspectral images.
The remainder of this study is structured as follows: Section 2 provides an overview of the existing cross-domain few-shot hyperspectral classification algorithm for transfer learning. In Section 3, we present the framework of our proposed MFSC approach, which aims to tackle the issue of limited labeled samples in hyperspectral images. Section 4 presents the experimental results of our methods, along with our analysis. Finally, Section 5 concludes our work.

Related Work
In the context of transfer learning [30][31][32], a model is initially trained on a source dataset, which comprises abundant annotated data from multiple classes known as source classes. Subsequently, the model parameters and features are adapted to the target dataset with a limited number of labeled samples, where the classes are non-overlapping. This process allows the model to be transferred and adjusted to handle the target dataset, which contains only a small number of labeled samples. Koch et al. [32] proposed an early technique known as Deep Convolutional Siamese Networks. This method performs feature extraction on a pair of samples using the same network and employs the Euclidean distance to measure similarity for classification. However, despite its simplicity and intuitiveness, this approach often fails to achieve satisfactory results in complex scenarios. Based on this, Vinyals et al. [33] introduced Matching Networks, which integrate bidirectional LSTM networks with feature metric learning. By calculating the cosine distance between output features, it captures the similarity between support set and query set images, thereby achieving the classification objective. Nevertheless, this approach encounters difficulties when dealing with intricate and irregular spatial structures. Although this method performs well when the distribution of the source domain data is close to that of the target domain data, existing transfer learning methods struggle to effectively generalize the model from the source domain to the target domain when there is a significant difference in data distributions. Therefore, research is conducted on cross-domain small-sample classification techniques for situations where the source and target domain data distributions differ, aiming to bolster the transfer learning model's capacity for generalization.
To address the challenges posed by cross-task learning, researchers have proposed a range of meta-learning techniques [34][35][36][37][38][39], which can be classified into two main categories: metric-based and optimization-based approaches. Metric-based methods focus on acquiring a robust feature space by employing the Euclidean distance to gauge the likeness between unlabeled samples and labeled samples of each class. Conversely, optimization-based meta-learning strategies aim to train a universal model capable of swiftly converging to an effective solution for new tasks through a limited number of gradient descent iterations. Nevertheless, when dealing with scant training samples, these methods are susceptible to overfitting, and their weight-update process tends to be relatively sluggish. Consequently, there is a pressing need to enhance and refine these meta-learning techniques to ensure their practicality and efficacy within the realm of few-shot learning.
On the other hand, given the high-dimensional characteristics of hyperspectral images, combining more efficient hyperspectral feature extraction methods with small-sample learning techniques has become a pivotal approach to tackle the challenge of limited annotated samples in hyperspectral data. Liu et al. [39] introduced a Deep Few-shot Learning (DFSL) method that explores the impact of various feature extraction methods on the metric space for classification outcomes. However, this approach still faces limitations when dealing with similarity issues within the metric space. Reference [40] proposes a novel and compact framework based on the Transformer, called the Spectral Coordinate Transformer (SCFormer), which employs two mask patterns (Random Mask and Sequential Mask) in SCFormer-R and SCFormer-S, respectively, aiming to generate more distinguishable spectral features using the existing spectral priors.
To tackle the challenges posed by the characteristics of high-dimensional data features in hyperspectral images and the limited number of labeled training samples, which make it difficult to thoroughly explore the deep-level features of hyperspectral images and subsequently result in suboptimal classification accuracy, this paper proposes a novel approach: the meta-transfer few-shot classification method. Furthermore, to enhance the classification of hyperspectral images, a residual dense connection network is introduced. On the one hand, this method facilitates the transfer of the transferable knowledge acquired from a source domain dataset to the target domain with a limited number of samples. This addresses the issue of restricted training samples that hinder the accuracy of classification in deep learning models. On the other hand, by taking advantage of the capabilities of the residual dense connection network, features are used more effectively, and the exchange of features between convolutional layers is intensified, ultimately contributing to an overall improvement in classification accuracy.

Proposed MFSC Framework
The entire process flow diagram is shown in Figure 1. It comprises two main components: the cross-domain few-shot learning strategy and the residual dense connection feature extraction and classification network. Arrows indicate the flow of feature vectors in the algorithm, with red arrows representing feature vectors originating from the target domain, while black arrows represent feature vectors coming from the source domain.
The few-shot learning strategy, based on metric learning-based meta-transfer, leverages the transferable feature knowledge trained from the source domain dataset and transfers this knowledge to the target domain with a small number of labeled samples. These two types of small-sample learning are conducted simultaneously. Model weights trained on the source domain dataset are used to initialize the weights of the feature extraction network. This is performed to enhance the hyperspectral image (HSI) classification accuracy, addressing the issue of limited training samples that constrain the classification accuracy in deep learning models.
By utilizing the mapping layer and the residual dense connection network, features from the source domain and the target domain are mapped to a feature space. This ensures that samples from the same class have a similar distribution in the feature space, while samples from different classes are distributed as far apart as possible in the feature space. The residual dense connection network allows for the more comprehensive extraction of spatial-spectral features and enhances direct feature transfer between convolutional layers, thus improving classification accuracy.

Cross-Domain Few-Shot Learning and Training Strategy
The entire process flowchart for the few-shot learning is shown in Figure 2. From the original HSI datasets of the source and target classes, C classes are randomly selected from each, with each class containing K labeled samples to create the source domain support set $S_s = \{(x_i^s, y_i^s)\}_{i=1}^{C \times K}$ and the target domain support set $S_t = \{(x_i^t, y_i^t)\}_{i=1}^{C \times K}$. Then, N unlabeled samples per class are randomly selected from the remaining data in both the source and target domains to create the source domain query set, $Q_s = \{(x_j^s, y_j^s)\}_{j=1}^{C \times N}$, and the target domain query set, $Q_t = \{(x_j^t, y_j^t)\}_{j=1}^{C \times N}$. This entire selection process is referred to as a C-way K-shot task. Each time the support and query sets are selected for model training, it constitutes an episode.
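The episode construction described above can be sketched in plain Python. This is only an illustrative sketch, not the authors' implementation: the function name `sample_episode`, the toy label list, and the fixed random seed are assumptions introduced here, and samples are represented by their indices rather than by data cubes.

```python
import random
from collections import defaultdict

def sample_episode(labels, n_way, k_shot, n_query, rng=None):
    """Build one C-way K-shot episode.

    Returns (support, query), each a list of (sample_index, class_label) pairs.
    The function name and signature are hypothetical, chosen for illustration.
    """
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    # Only classes with enough samples for both sets are eligible.
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= k_shot + n_query]
    classes = rng.sample(eligible, n_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(by_class[c], k_shot + n_query)
        support += [(i, c) for i in picked[:k_shot]]   # K labeled samples
        query += [(i, c) for i in picked[k_shot:]]     # N query samples
    return support, query

# Toy data (assumed): 5 classes with 30 samples each.
toy_labels = [c for c in range(5) for _ in range(30)]
support, query = sample_episode(toy_labels, n_way=3, k_shot=1, n_query=19,
                                rng=random.Random(0))
print(len(support), len(query))  # 3 57
```

With 1 support sample and 19 query samples per class, the sketch mirrors the per-class counts quoted in the experimental settings; support and query indices never overlap because both are drawn from a single sample without replacement.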
In each training episode, the model is first trained on the source domain dataset. The source domain support set, $S_s = \{(x_i^s, y_i^s)\}_{i=1}^{C \times K}$, is fed into the network to extract features, and the feature vector, $c_k^s$, of the k-th class in the support set is computed in the feature space. The source domain query set samples, $x_j^s$, are then passed through the feature network to extract embedded features, $f_\varphi(x_j^s)$. The Euclidean distance, $d(f_\varphi(x_j^s), c_k^s)$, between the feature vector of the query set sample, $x_j^s$, and the feature vector, $c_k^s$, of the class to which the support set samples belong is calculated in the feature space [41]. Subsequently, the probability that a query set sample, $x_j^s$, belongs to class k in the support set is computed using the following SoftMax function:

$$P(y_j = k \mid x_j^s \in Q_s) = \frac{\exp\left(-d\left(f_\varphi(x_j^s), c_k^s\right)\right)}{\sum_{k'=1}^{C} \exp\left(-d\left(f_\varphi(x_j^s), c_{k'}^s\right)\right)}$$

Here, $f_\varphi$ represents a mapping layer and a spatial-spectral feature extraction network with learnable parameters denoted as $\varphi$, $y_j$ represents the true class label of the sample $x_j^s$, and C is the number of classes in each episode. The training loss in each episode is calculated as the sum of the negative log probabilities of the true class labels over all query set samples:

$$L_s(\varphi) = -\sum_{(x_j^s, y_j) \in Q_s} \log P(y = y_j \mid x_j^s)$$

Then, the model continues training using the target domain data. The support set data, $S_t = \{(x_i^t, y_i^t)\}_{i=1}^{C \times K}$, from the target domain dataset are fed into the model trained on the source domain data. This calculates the feature vector, $c_k^t$, for the k-th class in the feature space. Similarly, the samples $x_j^t$ from the target domain query set, $Q_t = \{(x_j^t, y_j^t)\}_{j=1}^{C \times N}$, are input into the feature extraction network, extracting embedded features, $f_\varphi(x_j^t)$, for the query set samples. The Euclidean distance, $d(f_\varphi(x_j^t), c_k^t)$, between the query sample $x_j^t$ and the feature vector of the samples belonging to class k in the feature space is computed. The probability that the query sample $x_j^t$ belongs to class k is calculated through the SoftMax function. On this basis, the loss value for the query sample is also computed.
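The distance-based SoftMax classification and the episode loss described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the class feature vector is assumed to be the mean of the embedded support samples (the text does not state how it is computed), embeddings are represented as plain lists of floats standing in for the network output, and all function names are hypothetical.

```python
import math

def prototype(vectors):
    """Class feature vector; ASSUMED here to be the mean of the support embeddings."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def class_probabilities(query_vec, prototypes):
    """SoftMax over negative Euclidean distances to each class feature vector."""
    logits = [-euclidean(query_vec, c) for c in prototypes]
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def episode_loss(query_vecs, query_labels, prototypes):
    """Sum of negative log-probabilities of the true classes over the query set."""
    return -sum(math.log(class_probabilities(q, prototypes)[y])
                for q, y in zip(query_vecs, query_labels))

# Toy 2-D embeddings (assumed values) for a 2-way 2-shot episode.
support_embeddings = {0: [[0.0, 0.0], [0.2, 0.0]], 1: [[1.0, 1.0], [1.0, 0.8]]}
protos = [prototype(support_embeddings[k]) for k in sorted(support_embeddings)]
probs = class_probabilities([0.1, 0.1], protos)
loss = episode_loss([[0.1, 0.1], [0.9, 1.0]], [0, 1], protos)
print(probs, loss)
```

The same functions apply unchanged to the target domain episode, since the text computes the target probabilities and loss in the same way from the target support and query sets.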
The data from the source domain and the target domain are randomly selected to form a training dataset that includes support and query sets. The model is trained by minimizing the loss function and optimizing the parameters of the model. This ensures that the features $f_\varphi(x_j^s)$ and $f_\varphi(x_j^t)$ of the query samples from the source domain and target domain, respectively, are as close as possible to the corresponding support set features, $c_k^s$ and $c_k^t$, of the class to which each sample belongs. The minimization of the loss function, $J(\varphi)$, is calculated using Equation (4).
After multiple rounds of training with multiple episodes and models, when the loss function in the target domain meets the termination condition, the training is concluded.

Spatial-Spectral Feature Extraction Module Based on ResDenseNet Network
The proposed algorithm workflow is illustrated in Figure 1, which shows the MFSC framework. It mainly consists of three parts: the mapping layer module, the ResDenseNet feature extractor, and the multilayer perceptron module.

Mapping Layer Module
In the mapping module, 9 × 9 × $S_c$ data cubes, $D_S$, are first selected from the source dataset as the network's input, where 9 × 9 represents the spatial dimensions, and $S_c$ represents the number of spectral bands. For the target domain dataset, 9 × 9 × $T_c$ data cubes, $D_T$, are selected as input for network testing, where $T_c$ represents the number of spectral bands. Mapping layers are used to reduce the dimensionality of the input samples, ensuring that the input dimensions are the same. Due to the large number of spectral bands in HSI and strong correlations between adjacent bands, mapping layers use a 1 × 1 × 100 convolutional kernel to reduce the number of spectral bands in both the source and target domains, reducing the data to 100 dimensions for convenience in subsequent convolution calculations. The final output of the mapping layer is a support feature vector or a query set feature vector with a size of 9 × 9 × 100.
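The effect of the mapping layer on tensor shapes can be illustrated with a small pure-Python sketch. It treats the 1 × 1 convolution as the same linear projection applied to the spectrum of every pixel, which is the standard interpretation of a 1 × 1 kernel. Several details are assumptions made for illustration: the weights are random rather than learned, each domain is given its own weight matrix because the band counts differ, the band counts 128 and 200 are placeholder values, and the helper names are hypothetical.

```python
import random

def mapping_layer(cube, weights):
    """Project each pixel's spectrum with a shared weight matrix.

    cube:    H x W x B nested lists (B spectral bands)
    weights: D x B nested lists (D output dimensions)
    returns: H x W x D nested lists
    """
    return [[[sum(w * v for w, v in zip(w_row, pixel)) for w_row in weights]
             for pixel in row]
            for row in cube]

def random_cube(height, width, bands, rng):
    return [[[rng.random() for _ in range(bands)] for _ in range(width)]
            for _ in range(height)]

def random_weights(out_dim, bands, rng):
    # Random stand-in for learned convolution weights.
    return [[rng.uniform(-0.1, 0.1) for _ in range(bands)] for _ in range(out_dim)]

rng = random.Random(0)
OUT_DIM = 100
SOURCE_BANDS, TARGET_BANDS = 128, 200   # illustrative band counts (assumed)

source_out = mapping_layer(random_cube(9, 9, SOURCE_BANDS, rng),
                           random_weights(OUT_DIM, SOURCE_BANDS, rng))
target_out = mapping_layer(random_cube(9, 9, TARGET_BANDS, rng),
                           random_weights(OUT_DIM, TARGET_BANDS, rng))

print(len(source_out), len(source_out[0]), len(source_out[0][0]))  # 9 9 100
print(len(target_out), len(target_out[0]), len(target_out[0][0]))  # 9 9 100
```

Both cubes leave the mapping layer with the same 9 × 9 × 100 shape, which is what lets a single feature extractor process source and target samples.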

ResDenseNet Feature Extractor
The ResDenseNet feature extractor is used as the spatial-spectral feature extraction network; it consists mainly of a DenseNet module and a ResidualNet module. In order to address the loss of feature information due to gradient vanishing, amplify feature propagation, and extract feature vectors more effectively, the algorithm initially employs the DenseNet module for model training.
The spectral dense block consists of four sets of convolutional kernels, with each set containing 8 filters of size 3 × 3 × 3. These are combined with Mish activation functions and batch normalization (BN) to perform non-linear transformations on the feature maps. In DenseNet, each layer is concatenated with all preceding layers along the channel dimension, combining feature maps from all previous layers as input for the next layer to achieve feature reuse and enhance efficiency:

$$x_l = DH_l\left([x_0, x_1, \ldots, x_{l-1}]\right)$$

where $DH(\cdot)$ is a non-linear transformation function, which uses the structure of Convolution 3 × 3 × 3 (Conv), batch normalization (BN), Mish, and concatenation operations. The subscript l denotes the layer number. The ReLU function causes some neurons to have an output of 0, resulting in network sparsity. The Mish [42] function, $f(x) = x \tanh(\ln(1 + e^x))$, unlike the ReLU function, has a softer zero boundary and smoother characteristics, allowing for a better flow of information into deep neural networks and better preservation of information, thus producing enhanced accuracy and generalization. Positive values are unbounded and can become arbitrarily large, so the output does not saturate because of an upper cap. Therefore, Mish is used as the activation function in this paper. The output feature map from the last layer of the dense connection block undergoes average pooling, yielding a vector, DenseFV, of dimensions 8 × 7 × 7 × 100. Subsequently, this vector is fed into the three-dimensional ResidualNet module.
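The contrast between Mish and ReLU discussed above can be checked numerically with a short sketch. It implements the formula $f(x) = x \tanh(\ln(1 + e^x))$ directly on scalars; the numerically stable `softplus` helper is an implementation choice made here, not something taken from the paper.

```python
import math

def softplus(x):
    # Numerically stable ln(1 + e^x); avoids overflow for large |x|.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x):
    # Mish activation: x * tanh(softplus(x)).
    return x * math.tanh(softplus(x))

def relu(x):
    return max(x, 0.0)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  mish={mish(x):.4f}")
```

For negative inputs ReLU returns exactly zero, while Mish returns a small negative value, which is the softer zero boundary the text refers to; for large positive inputs the two functions nearly coincide and neither is capped.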
In the ResidualNet module, there are four sets of non-linear transformation functions. Each set of non-linear transformation functions includes 16 filters of size 3 × 3 × 3, batch normalization (BN), and Mish activation. It employs a shortcut connection structure, creating a skip connection between the input of the first layer and the output of the last layer. This design allows the network to concentrate on learning the disparity between input and output, streamlining the learning objectives and challenges. The output feature map of the residual block is of size 16 × 7 × 7 × 100. After undergoing average pooling, max pooling, and a set of 32 filters of size 3 × 3 × 3, the feature map is flattened to a 1 × 1 × 160 vector (ResidualFV). This vector is then processed through a fully connected layer and a SoftMax activation function. Additionally, it undergoes a multilinear mapping as input to the MLP. The number of nodes in the fully connected layer corresponds to the number of classes in the dataset.

Multilayer Perceptron Module
The feature vector ultimately extracted from the multilinear mapping is fed into the MLP for classification. This MLP consists of five fully connected layers, with the first four layers each containing 1024 nodes. The final fully connected layer has only one node. ReLU activation functions and dropout are incorporated between adjacent fully connected layers. The ultimate output of the multilayer perceptron is employed to compute the loss value following Formula (4), after which the classification process is executed.
Through training, the loss function of the spatial-spectral feature extraction network model is minimized. This optimization of parameters in the residual dense connection module allows it to extract features from the input sample data, mapping them into feature space. In this space, the feature vectors of samples with the same class are closer to each other, resulting in smaller intraclass distances, while the feature vectors of samples from different classes are farther apart, leading to larger interclass distances.

Experiments

Experimental Dataset
To validate the effectiveness of our approach, we utilized the hyperspectral Chikusei dataset as the source domain dataset, and the Indian Pines, Pavia University, and Salinas datasets [43,44] as the target domain datasets. The pseudo-color images and real land cover maps of this experimental dataset are shown in Figures 3 and 4.

The image size for this dataset is 512 × 217 pixels and includes 224 spectral bands. However, due to the impact of water vapor absorption on certain bands, only 204 bands are retained. This dataset covers 16 different categories of agricultural land cover, including, but not limited to, corn, wheat, soybeans, grasslands, and vineyards. The Pavia University dataset's spectral wavelength range is 430-860 nm, with a spatial resolution of approximately 1.3 m. After preprocessing, the dataset has a total of 115 spectral bands, with 13 noisy bands removed. Land cover types in this region consist of nine classes, including asphalt roads, meadows, gravel, trees, metals, bare land, asphalt roofs, bricks, and shadows.

Experimental Settings
To evaluate the effectiveness of the MFSC method, 9 × 9 × C data cubes were selected as input for the network from the Chikusei source domain dataset, where 9 × 9 represents the spatial dimensions, and C is the number of spectral bands. For the target domain datasets, namely Indian Pines, Pavia University, and Salinas, 9 × 9 × L cubes were chosen as the input for testing, where L is the number of spectral bands. The model was trained for 10,000 episodes, and for each episode iteration, following the few-shot training method, 1 labeled sample and 19 unlabeled samples from each class were randomly selected to form the source dataset for model training. The Adam optimizer was used, and to balance convergence speed and accuracy, the model learning rate was set to 0.001. Furthermore, to account for the impact of random sample selection on model training, all experimental results were averaged over 10 trials. The hardware environment used for this experiment is a laptop equipped with an Intel Core i7-4810MQ 8-core 2.80 GHz processor, 16 GB of memory, and an NVIDIA GeForce RTX 2060 graphics card with 6 GB RAM, while the software environment utilized Python 3.8 and PyTorch 1.7.1 running on Windows 10.

Experimental Results and Analysis
To validate the effectiveness of the proposed method in the paper, it was compared with non-few-shot learning methods and few-shot learning methods. In experiments comparing the proposed method with non-few-shot learning methods, the proposed method was compared with SVM, 3D-CNN [45], and SSRN [46]. In experiments comparing the proposed method with other few-shot learning methods, the proposed method was compared with the DFSL + NN [37], DFSL + SVM [47,48], RN-FSL [49], Gai-CFSL [50], DPGN [51], DCFSL [52], SCFormer-R, and SCFormer-S [41] methods. In each comparison experiment, the same training approach as the few-shot methods was employed. Five labeled samples from each class in the target domain dataset were randomly selected for transferring the model trained in the source domain to the target domain, with the remaining target domain samples used as test data. For the small-sample learning methods in comparison, we randomly selected 200 labeled source domain samples from each class to learn transferable knowledge, following the same setup for comparison. To verify the effectiveness of the Mish function and batch normalization (BN) added to the model in the paper, a comparative performance analysis was performed using the DCFSL method. In this comparison, the Mish + BN part was removed, while keeping the rest of the network structure consistent, serving as a set of ablation experiments. The results of the ablation experiments are presented in the "MFSC" row of the tables, where the activation function used is the Softmax activation function, consistent with the DCFSL method. In contrast, the experimental data in the "Ours" row were obtained under the MFSC algorithm framework, incorporating Mish + BN and replacing the original Softmax activation function. For the IP, UP, and Salinas datasets, the study compared the classification performance of different methods. The evaluation was carried out using three metrics: overall accuracy (OA), average accuracy (AA), and Kappa coefficient. Specific comparative results are shown in Tables 1-3. Tables 1-3 present the results of comparative experiments on the target datasets, IP, UP, and Salinas, with each class having five labeled samples. From the tables, it can be observed that the methods based on few-shot learning achieve higher overall accuracy compared to non-few-shot methods. This indicates that the episodic training strategy is better suited for classification tasks with limited labeled samples. In the IP dataset, the proposed few-shot learning method shows significant improvements over the traditional SVM classification method, with an increase of 25.64% in OA, 21.95% in AA, and a 28.13% increase in Kappa.
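For readers unfamiliar with the three evaluation metrics, the sketch below computes OA, AA, and the Kappa coefficient from a confusion matrix using their standard definitions. The function name and the toy confusion matrix are assumptions introduced for illustration; the numbers are not taken from the paper's tables.

```python
def classification_metrics(confusion):
    """Compute (OA, AA, Kappa) from a square confusion matrix.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    oa = correct / total                                   # overall accuracy
    per_class = [confusion[i][i] / sum(confusion[i]) for i in range(n_classes)]
    aa = sum(per_class) / n_classes                        # mean per-class accuracy
    row_tot = [sum(row) for row in confusion]
    col_tot = [sum(confusion[i][j] for i in range(n_classes))
               for j in range(n_classes)]
    # Expected agreement by chance, from the row and column marginals.
    pe = sum(r * c for r, c in zip(row_tot, col_tot)) / total ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

# Toy 3-class confusion matrix (assumed values).
toy = [[50, 5, 0],
       [10, 30, 5],
       [0, 5, 20]]
oa, aa, kappa = classification_metrics(toy)
print(f"OA={oa:.4f}  AA={aa:.4f}  Kappa={kappa:.4f}")
```

OA weights every test sample equally, whereas AA weights every class equally, so the two can diverge noticeably on datasets with uneven class sizes; Kappa additionally discounts the agreement expected by chance.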
On the IP, UP, and Salinas datasets, when compared to deep learning-based methods such as 3D-CNN and SSRN, the proposed method achieves significant increases in OA when the number of labeled samples is five, with improvements of 16.73%, 19.35%, and 6.34% in IP; and 10.13%, 8.83%, and 4.15% in UP and Salinas, respectively. The relatively low performance of the non-few-shot learning methods in Tables 1-3 shows that these methods extract shallow features with weak discriminative capability for the different target categories; the limited labeled samples are insufficient for them to train an effective classification model. In contrast, the meta-learning training strategy enables the model to learn transferable knowledge and features from the source-class data, which aids prediction on the target-class data.

Among the few-shot classification methods, the proposed method also demonstrates significant improvements in classification accuracy. Compared to the DFSL + NN, DFSL + SVM, RN-FSC, Gia-CFSL, DCFSL, SCFormer-R, and SCFormer-S methods, the proposed method achieves improvements in OA of 12.95%, 10.91%, 14.43%, 8.83%, 5.79%, 7.59%, and 7.65% on IP; 8.27%, 6.39%, 5.84%, 2.9%, 2.37%, 3.71%, and 2.19% on UP; and 3.92%, 4.02%, 6.86%, 3.14%, 1.63%, 1.67%, and 2.15% on Salinas, respectively, when there are few labeled samples in the target domain. With only a small number of labeled samples in the target domain, the proposed method uses the ResDenseNet network to reduce data distribution differences and learn a more discriminative feature space. Compared to the other methods, this yields a better feature space, which improves the classification performance on the target domain samples. On the IP, UP, and Salinas datasets, the proposed method achieves an overall accuracy (OA) of 72.60%, 86.02%, and 90.97%, respectively. This confirms the effectiveness and robustness of the ResDenseNet model in the few-shot classification of high-dimensional spectral data. Additionally, the incorporation of the Mish function and batch normalization (BN) not only mitigates the vanishing gradient problem but also enhances the model's generalization capability. Furthermore, the Mish function is smoother than the ReLU function, which improves training stability and average accuracy.
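To make the smoothness argument concrete, the sketch below evaluates Mish next to ReLU using the standard published definition Mish(x) = x · tanh(softplus(x)). It is a plain-Python illustration rather than the network code used in the experiments, and the helper names are chosen here for illustration.

```python
import math


def softplus(x):
    # Numerically stable form of log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    # Mish(x) = x * tanh(softplus(x)): smooth everywhere, slightly negative for x < 0.
    return x * math.tanh(softplus(x))


def relu(x):
    # ReLU has a kink at 0 and outputs exactly 0 for every negative input.
    return max(x, 0.0)


for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  mish={mish(x):.4f}")
```

Unlike ReLU, Mish passes a small negative signal for negative inputs, so its gradient there is not identically zero, while for large positive inputs it approaches the identity.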
Tables 4-6 report the detailed classification results of the different classification algorithms on the UP, Salinas, and IP datasets, respectively. The last columns of the tables present the classification accuracy and standard deviation for each class in the dataset based on multiple experiments. Table 4 shows that, compared to the other algorithms, the proposed method achieved the highest recognition rates in three of nine categories. It also performed well in accurately classifying the "Bricks", "Bitumen", "Metal sheets", and "Trees" categories, which were challenging for other methods. For three categories in the UP dataset, "Gravel", "Meadows", and "Asphalt", the proposed method falls somewhat short of the best results among the compared methods. The UP dataset has the highest spatial resolution of the three datasets but the lowest spectral resolution, and these three categories are the most prone to containing different substances with similar spectra. Tables 5 and 6 show that, compared to the other algorithms, the proposed method achieved the highest recognition rates in 11 out of 16 categories and 10 out of 16 categories, respectively. It significantly improved the classification accuracy for categories such as "Grapes_untrained", "Vinyard_untrained", and "Soil_vinyard_develop" in the Salinas dataset, where other methods had relatively low accuracy. It also substantially increased the classification accuracy of categories such as "Grass-pasture", "Corn", "Corn-mintill", "Corn-notill", and "Woods" in the IP dataset.

Figures 5-7 display the classification results of the proposed method and the comparative methods on the IP, UP, and Salinas datasets. The figures show that the proposed method produces fewer misclassifications, whereas the SVM-based method shows more misclassified objects. Compared to the SVM-based method, the 3D-CNN and SSRN methods have fewer misclassifications, mainly due to the stronger representation learning capabilities of deep learning methods. However, deep learning methods require a large number of training samples, and when the number of training samples is reduced, they experience a significant decrease in classification accuracy. This indicates that, when labeled samples are limited, the extracted features are not effective enough, leading to lower accuracy when classifying objects with similar spectral characteristics. In the case of few-shot data, using a few-shot learning approach to construct ResDenseNet significantly improves the classification accuracy compared to the SVM method and deep learning methods such as 3D-CNN and SSRN.

In complex scenes, objects within a specific area are rarely composed of just one type of material. Typically, varying amounts of other material categories are present, which introduces spectral noise from other categories into the spectral characteristics of the primary material. Additionally, at the boundaries between two different land cover types, there is inevitably interference from the spectral feature vectors of the neighboring land cover categories. This makes it difficult to accurately extract both the spatial and spectral information of land cover, resulting in only subtle differences between different types of land cover. It can also lead to significant differences within the same type of land cover, causing the misclassification of certain land cover areas at the boundaries. In the case of few-shot data, while methods such as DFSL + NN, DFSL + SVM, and RN-FSC consider the scarcity of labeled samples in hyperspectral imagery, their performance in accurately classifying challenging classes still lags behind that of the method proposed in this paper.
The experimental results shown in the figures indicate that, when land cover features are relatively easy to distinguish and the feature vectors are distinct, the classification method employed in this paper, as well as other few-shot learning methods, can achieve good classification results. For example, in Figure 5, for the IP dataset, classes such as "Oats" and "Grass-trees"; in Figure 6, for the UP dataset, classes such as "Asphalt" and "Shadows"; and in Figure 7, for the Salinas dataset, classes such as "Celery", "Stubble", "Fallow_smooth", "Lettuce_romaine_5wk", and "Brocoli_green_weeds_1" have feature vectors in the feature space that are relatively easy to differentiate. In situations with only a small number of labeled samples, traditional machine learning methods, such as SVM, and general few-shot learning methods can also achieve good classification results for these classes. In contrast, deep learning methods that require a large number of training samples are prone to overfitting, leading to a lower classification accuracy.
For land cover categories with similar features and small feature vector distances that tend to produce classification errors, such as "Meadows" and "Alfalfa" in the UP dataset; "Vinyard_untrained", "Vinyard_vertical_trellis", and "Corn_senesced_green_weeds" in the Salinas dataset; and "Stone-Steel-Towers", "Hay-windrowed", "Woods", and "Soybean-mintill" in the IP dataset, the classification results rely more heavily on the effective extraction of land cover features. The classification results show that the method proposed in this paper achieves a relatively good classification accuracy for such categories. MFSC follows, and DCFSL has fewer misclassifications than SVM, 3D-CNN, and SSRN. This indicates, on the one hand, that meta-learning training strategies are advantageous for knowledge transfer and improved classification performance. On the other hand, it also demonstrates that the residual dense connection network designed in this paper can reduce data distribution differences, leading to a better feature space with higher interclass discriminability. Under small-sample training conditions, the effectiveness and robustness of the proposed method are superior to those of other methods. Furthermore, the method proposed in this paper has fewer misclassification points than DCFSL, indicating that this network model generalizes well, can extract deeper and more discriminative features, and can achieve better classification results for classes that are difficult to classify accurately.

Conclusions
To address the contradiction between the limited number of training samples in HSI (hyperspectral imaging) and the need for a large number of annotated samples for effective deep learning, as well as the trade-off between a small number of labeled samples and the extraction of more effective feature vectors, this paper proposes a hyperspectral image classification method based on the residual dense connection network in the metric learning framework. The main contributions are as follows:

Improved ResDenseNet network: In comparison to traditional residual networks, this paper introduces a dense connection structure in the three-dimensional convolutional block of the improved ResDenseNet network. This structure can fully explore deep features in the spatial neighborhood of samples, effectively extract spatial and spectral features, and complement the original spectral features. It obtains more representative features, which benefits hyperspectral image classification.

Activation function and batch normalization: Building on the original network, the ReLU activation function is replaced with the Mish function, and batch normalization (BN) is introduced. This not only alleviates the vanishing gradient problem but also enhances the model's generalization ability. Additionally, the Mish function is smoother than the ReLU function, which improves training stability and average accuracy.

The experimental results demonstrate that the proposed method, when compared to classical hyperspectral image classification methods and other classic few-shot learning methods, exhibits strong generalization capability on three datasets: IP, UP, and Salinas. When only a limited number of labeled samples are available, the proposed method achieves a higher recognition accuracy than the algorithms used in the control experiments. Future work will focus on accurately identifying ground objects in the presence of mixed substances and on investigating Transformer learning strategies that can mine the spatial-spectral features of hyperspectral images more effectively, thereby enhancing the classification accuracy of complex ground objects.

Training of the few-shot learning model consists of two stages. First, the model is trained on a set of data called the source class data, for which an abundant number of samples is available. Then, training and testing are carried out on the target class data, whose classes do not overlap with the source classes and for which only a small number of labeled samples are available. These two stages alternate until the model converges. From the original HSI datasets of the source and target classes, C classes are randomly selected from each, with each class containing K labeled samples, to create the support set. N unlabeled samples are then randomly selected from the remaining data in both the source and target domains to create the query set. This entire selection process is referred to as a C-way K-shot task. Each selection of support and query sets for model training constitutes an episode. In each training episode, the model is first trained on the source domain dataset. The source domain support set is fed into the network to extract features, and the feature vector, $c_k^s$, of the k-th class in the support set is computed in the feature space. The source domain query set samples, $x_j^s$, are then passed through the feature network to extract the embedded features, $f_\phi(x_j^s)$, and the distance, $d(\cdot,\cdot)$, between the embedded feature of each query set sample, $x_j^s$, and the feature vector, $c_k^s$, of each class in the support set is calculated [41]. Subsequently, the probability that a query set sample, $x_j^s$, belongs to class k in the support set is computed using the following SoftMax function:

$$p\left(y_j^s = k \mid x_j^s\right) = \frac{\exp\left(-d\left(f_\phi(x_j^s),\, c_k^s\right)\right)}{\sum_{k'=1}^{C} \exp\left(-d\left(f_\phi(x_j^s),\, c_{k'}^s\right)\right)}$$
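The episode computation described above (class feature vectors from the support set, then a SoftMax over distances for each query sample) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that each class feature vector is the mean of that class's embedded support samples and that the distance is the squared Euclidean distance, as in prototypical networks. The feature vectors stand in for the outputs of the embedding network, and all function names are hypothetical.

```python
import math


def class_prototypes(support_features, support_labels):
    """Average the embedded support features of each class (assumed prototype rule)."""
    sums, counts = {}, {}
    for feat, label in zip(support_features, support_labels):
        if label not in sums:
            sums[label] = [0.0] * len(feat)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], feat)]
        counts[label] += 1
    return {k: [s / counts[k] for s in sums[k]] for k in sums}


def squared_euclidean(a, b):
    # Assumed distance; the text only states that a distance is computed.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def query_probabilities(query_feature, prototypes):
    """SoftMax over negative distances from one query to every class prototype."""
    logits = {k: -squared_euclidean(query_feature, c)
              for k, c in prototypes.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}


# Toy 2-way 2-shot episode with 2-D embedded features (illustrative values only).
support = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8]]
labels = [0, 0, 1, 1]
protos = class_prototypes(support, labels)
probs = query_probabilities([0.1, 0.1], protos)
loss = -math.log(probs[0])  # negative log-likelihood if the true class is 0
print(probs, loss)
```

During training, the negative log-probability of the true class is averaged over the query set and minimized, so the embedding is pushed to place each query close to the feature vector of its own class.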

Figure 2. Flowchart of the cross-domain few-shot learning algorithm. In each episode, during the training process, $f_\phi$ represents a mapping layer and a spatial-spectral feature extraction network with learnable parameters denoted as $\phi$.

Figure 3. Chikusei and Indian Pines datasets. (a) False color image of the Chikusei dataset. (b) Ground-truth map of the Chikusei dataset. (c) False color image of the Indian Pines dataset. (d) Ground-truth map of the Indian Pines dataset.

Figure 4. Pavia University and Salinas datasets. (a) False color image of the Pavia University dataset. (b) Ground-truth map of the Pavia University dataset. (c) False color image of the Salinas dataset. (d) Ground-truth map of the Salinas dataset.

9 × 9 represents the spatial dimensions, and C is the number of spectral bands. For the target domain datasets, namely Indian Pines, Pavia University, and Salinas, 9 × 9 × L cubes were chosen as the input for testing, where L is the number of spectral bands. The model was trained for 10,000 episodes, and for each episode iteration, following the few-shot training method, 1 labeled sample and 19 unlabeled samples from each class were randomly selected to form the source dataset for model training. The Adam optimizer was used, and to balance convergence speed and accuracy, the model learning rate was set to 0.001. Furthermore, to account for the impact of random sample selection on model training, all

The Chikusei dataset has a spectral wavelength range of 343-1080 nm, a spatial resolution of approximately 2.5 m, and a data size of 2571 × 2335 pixels. It consists of 128 spectral bands and includes 77,592 ground pixels, categorized into 19 distinct land cover classes. The Indian Pines dataset covers a spectral wavelength range of 400-2500 nm, with a spatial resolution of about 20 m. The image data size is 145 × 145 pixels and comprises 200 spectral bands. It encompasses a total of 16 land cover classes. The Salinas dataset has a spectral wavelength range of 400-2500 nm and a spatial resolution of approximately 3.7 m. The image size for this dataset is 512 × 217 pixels and includes 224 spectral bands. However, due to the impact of water vapor absorption on certain bands, only 204 bands are retained. This dataset covers 16 different categories of agricultural land cover, including, but not limited to, corn, wheat, soybeans, grasslands, and vineyards. The Pavia University dataset's spectral wavelength range is 430-860 nm, with a spatial resolution of approximately 1.3 m. After preprocessing, the dataset has a total of 115 spectral bands, with 13 noisy bands removed. Land cover types in this region consist of nine classes, including asphalt roads, meadows, gravel, trees, metals, bare land, asphalt roofs, bricks, and shadows.
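For any of these data cubes, the network input is a 9 × 9 spatial neighborhood around a labeled pixel, and each training episode draws 1 labeled and 19 unlabeled samples per class. The sketch below shows one way to cut such neighborhoods and draw an episode. It is an illustration under stated assumptions: the cube is a nested list indexed `cube[row][col][band]`, border pixels are handled by reflecting indices (the padding scheme is not specified in the text), and `extract_patch` and `sample_episode` are hypothetical helper names.

```python
import random


def extract_patch(cube, row, col, size=9):
    """Return a size x size x bands neighborhood centered on (row, col).

    Indices outside the image are reflected at the border (assumed scheme).
    The image must be larger than size // 2 in both spatial dimensions.
    """
    height, width = len(cube), len(cube[0])
    half = size // 2

    def reflect(i, n):
        # Mirror an out-of-range index back into [0, n - 1].
        if i < 0:
            return -i
        if i >= n:
            return 2 * (n - 1) - i
        return i

    return [[cube[reflect(r, height)][reflect(c, width)]
             for c in range(col - half, col + half + 1)]
            for r in range(row - half, row + half + 1)]


def sample_episode(pixels_by_class, n_way, k_shot, n_query, rng):
    """Draw a C-way K-shot episode: K support and n_query query pixels per class."""
    classes = rng.sample(sorted(pixels_by_class), n_way)
    support, query = [], []
    for label in classes:
        chosen = rng.sample(pixels_by_class[label], k_shot + n_query)
        support += [(pos, label) for pos in chosen[:k_shot]]
        query += [(pos, label) for pos in chosen[k_shot:]]
    return support, query


# Toy 12 x 12 cube with 4 bands and 3 classes laid out in horizontal stripes.
cube = [[[r, c, r + c, 0] for c in range(12)] for r in range(12)]
pixels_by_class = {k: [(r, c) for r in range(4 * k, 4 * k + 4) for c in range(12)]
                   for k in range(3)}

rng = random.Random(0)
support, query = sample_episode(pixels_by_class, n_way=3, k_shot=1, n_query=19, rng=rng)
corner_patch = extract_patch(cube, 0, 0)
print(len(support), len(query), len(corner_patch), len(corner_patch[0]))
```

Sampling without replacement within each class keeps the support and query pixels of an episode disjoint, which is what makes the query loss a fair estimate of generalization within the episode.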

Table 1. Comparison of the classification performance of different methods on the Indian Pines dataset with K = 5 labeled samples per class.

Table 2. Comparison of the classification performance of different methods on the Pavia University dataset with K = 5 labeled samples per class.

Table 3. Comparison of the classification performance of different methods on the Salinas dataset with K = 5 labeled samples per class.

Table 4. Class-specific classification accuracy (%) of different methods for the UP target-scene dataset (five labeled samples from TD).

Table 5. Class-specific classification accuracy (%) of different methods for the Salinas target-scene dataset (five labeled samples from TD).

Table 6. Class-specific classification accuracy (%) of different methods for the Indian Pines target-scene dataset (five labeled samples from TD).