A Novel Knowledge Distillation Method for Self-Supervised Hyperspectral Image Classification

Abstract: Using deep learning to classify hyperspectral images (HSI) when only a few labeled samples are available is a challenge. Recently, knowledge distillation based on soft label generation has been used to solve classification problems with a limited number of samples. Unlike normal labels, soft labels express the probability of a sample belonging to each category and are therefore more informative for classification. Existing soft label generation methods for HSI classification cannot fully exploit the information in the unlabeled samples. To solve this problem, we propose a novel self-supervised learning method with knowledge distillation for HSI classification, termed SSKD. The main motivation is to exploit more valuable information for classification by adaptively generating soft labels for unlabeled samples. First, similarity discrimination is performed using all unlabeled and labeled samples by considering both spatial distance and spectral distance. Then, an adaptive nearest neighbor matching strategy is applied to the generated data. Finally, a probabilistic judgment of the category is performed to generate soft labels. Compared to the state-of-the-art method, our method improves the classification accuracy by 4.88%, 7.09% and 4.96% on three publicly available datasets, respectively.


Introduction
Hyperspectral images (HSI) are three-dimensional data consisting of hundreds of spectral channels, containing rich spectral and spatial information [1]. This information is acquired by imaging spectrometers across hundreds of narrow bands, and because different types of materials absorb and reflect the spectrum differently, they produce different spectral profiles. It is therefore possible to classify different types of materials on a pixel-by-pixel basis according to their spectral properties. HSI classification is an important issue in the field of hyperspectral applications, and it has a wide range of applications in agricultural surveying, ecological control, environmental science, marine exploration, etc. [2][3][4]. The high cost of obtaining labeled samples has made HSI classification with only a limited number of labeled samples a hot research topic [5,6].
Traditional HSI classification methods have focused on feature extraction for labeled samples [7,8]. To extract spectral features, methods such as Principal Component Analysis (PCA) [9], Maximum Noise Fraction (MNF) [10] and Independent Component Analysis (ICA) [11] are used, and the extracted features are fed into a classifier. The classifiers used mainly include Support Vector Machine (SVM) [12], K-Nearest Neighbor (KNN) [13] and Random Forest (RF) [14]. Classification based on spectral feature extraction is simple to implement but fails to capture the spatial variations in HSI. In [15,16], the feature extraction phase considers both spatial and spectral information. Using both spatial and spectral information yields more discriminative features than using spectral information alone. Most traditional hyperspectral classification methods are based on shallow models and handcrafted features and are highly dependent on domain-specific a priori knowledge. In addition, HSI is high-dimensional data containing a large amount of redundant information; coupled with the limited number of samples, this leads to the Hughes phenomenon [17] and poses a huge challenge for HSI classification.
In recent years, the development of deep learning has driven the advancement of HSI classification [18][19][20][21]. Deep neural networks automatically extract valuable features in a hierarchical manner and provide a high level of abstraction of the input data [22,23]. Starting from the spectral domain, Chen et al. used a deep stacked autoencoder (SAE) to extract spectral features and verified the feasibility of using SAE to extract HSI spectral features [24]. Based on convolutional neural networks, Hu et al. constructed a 1DCNN to obtain the spectral features of HSIs and used logistic regression for classification [25]; because convolutional neural networks use local connections and shared weights, this greatly reduced the number of parameters compared to fully connected neural networks. Heming et al. used principal component analysis for HSI dimensionality reduction and a 2DCNN to obtain spatial information, and finally used sparse coding to obtain classification results [26]. Chen et al. constructed a 3DCNN model for HSI classification, which consists of two 3D convolutional layers, two pooling layers and an output layer, and also used dropout, L2 regularization and data augmentation strategies to alleviate overfitting [27]. Instead of forward convolution layers, dense blocks are used in [28] to fuse information between different layers and extract multi-scale features, addressing the problem that a single scale may not reflect the complex spatial structure information in HSI.
Deep-learning-based HSI classification still faces some difficulties [8,29,30]. Labels for hyperspectral data are mainly acquired via manual annotation, which is costly and time consuming, so labeled samples are difficult to obtain. Therefore, how to achieve HSI classification using a small number of labeled samples is a meaningful problem. Deep neural networks contain a large number of parameters and often require as many training samples as possible. With only a small number of labeled samples available (e.g., only five labeled samples per class), deep neural network training is prone to overfitting, resulting in very low test accuracy. To solve this problem, it is common to treat HSI classification as a few-shot task in deep learning [6,31]. Liu et al. [32] learned a metric space on the training dataset and generalized it to the classes of the test dataset, achieving better classification accuracy with only a small number of samples. In contrast, in [31], the similarity relationship between samples is learned through a relation network rather than simply measured with the Euclidean distance. Cao et al. [33] combined active learning and deep learning to reduce the required labeling cost.
Another way to address the lack of labeled samples is to introduce self-supervised learning into HSI classification [5,34]. Self-supervised learning mainly uses an auxiliary (pretext) task to mine supervisory information from large-scale unlabeled data, and the entire network is trained with this constructed supervisory information so that it can learn valuable representations for downstream tasks [35]. The above methods alleviate the problem of inadequate samples to some extent, but do not make deeper use of unlabeled samples. In an HSI, only a small proportion of the samples are labeled, and most are unlabeled. Most existing methods focus on how to make full use of the information contained in the labeled samples and neglect the information in the far more numerous unlabeled samples; therefore, how to make better use of these unlabeled samples is a key issue for improving classification accuracy.
In order to better utilize the unlabeled samples, knowledge distillation was introduced into HSI classification. The main idea of knowledge distillation is that a teacher model first generates soft labels by learning from the labels, and the student model is then trained using a metric computed against these soft labels, so that the outputs of the network encourage it once again to learn more feature information. The soft label generation method is therefore a key part of training the whole network [36]. Recently, a number of knowledge distillation techniques have been proposed for HSI classification. In [37], a complex multi-scale teacher network and a simple single-scale student network are combined to implement knowledge distillation, explaining the relationship between multi-scale features and HSI categories. To address possible catastrophic forgetting, a knowledge distillation strategy is incorporated into the model in [38], enabling the model to recognize new categories while maintaining the ability to recognize old ones. Soft labels were first used with self-supervised learning for HSI classification in [5], where they serve knowledge distillation. However, current research on soft label generation is still in a black box state, and it is difficult to guarantee the quantity and quality of the generated soft labels. Generating wrong soft labels for unlabeled samples negatively affects the knowledge distillation.
Based on the above analysis, the main contributions of this paper are as follows:
1. A novel deep-learning method, SSKD, which combines knowledge distillation and self-supervised learning, is proposed to achieve HSI classification in an end-to-end way with only a small number of labeled samples;
2. A novel adaptive soft label generation method is proposed, in which the similarity between labeled and unlabeled samples is first calculated from spectral and spatial perspectives, and then the nearest-neighbor distance ratio between labeled and unlabeled samples is calculated to filter the available samples. The proposed adaptive soft label generation achieves a significant improvement in classification accuracy compared to state-of-the-art methods;
3. We introduce the concept of soft label quality for hyperspectral data for the first time and provide a simple measure of it: the soft label generation algorithm is applied to existing labeled samples, and the generated soft labels are evaluated against the known sample labels.
The remainder of this paper is organized as follows. Section 2 provides details of the materials and methods, followed by the experimental results and analysis in Section 3. A discussion of the experimental results is provided in Section 4. The paper is concluded in Section 5.

Materials and Methods

2.1. Related Work

2.1.1. Self-Supervised Learning
Self-supervised learning focuses on obtaining supervisory information from the unlabeled data of a dataset using pretext tasks. By utilizing this supervisory information, the network can be trained without labels, and valuable representations can be extracted from the unlabeled data. Since self-supervised learning does not require the data to carry labels, it has a wide range of applications in various fields [39][40][41][42][43]. The main auxiliary tasks in the image domain include jigsaw puzzles [44], image colorization [45], image rotation [46], image restoration [47], image fusion [48] and so on. In the case of image rotation, for example, the unlabeled image can be rotated by four angles and each rotation given a label; the rotated images and the original image are fed into the network, which predicts the rotation angle. With this self-supervised approach, it is possible to learn the discriminative information of the image even in the absence of labels. Designing pretext tasks is a difficult and central part of self-supervised learning. Most self-supervised pretext tasks designed a priori may face ambiguity problems; for example, in rotation angle prediction, some objects do not have a usual orientation [49]. Given the advantages of self-supervised learning, a number of methods have been proposed that use it for HSI classification [50][51][52]. The existing methods verify the feasibility of self-supervised learning in the field of HSI classification. When only a small number of labeled samples are available, self-supervised learning can further improve spectral image classification accuracy by constructing supervisory information.
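As a minimal illustration of the rotation pretext task described above, the following sketch rotates a small 2-D patch by the four angles and attaches the rotation index as a self-generated label. The function names and the list-of-rows patch format are illustrative assumptions and are not taken from any of the cited methods.

```python
def rotate90(patch):
    """Rotate a 2-D patch (a list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]


def rotation_pretext(patch):
    """Return (rotated_patch, label) pairs for 0, 90, 180 and 270 degrees.

    The label is the rotation index (0..3); no human annotation is needed.
    """
    samples = []
    current = [list(row) for row in patch]
    for label in range(4):
        samples.append((current, label))
        current = rotate90(current)
    return samples


# Hypothetical 2 x 2 single-band patch used only for illustration.
patch = [[1, 2],
         [3, 4]]
pairs = rotation_pretext(patch)
```

A network trained on such pairs has to predict the label from the rotated patch alone, which is what forces it to learn orientation-sensitive features.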

Knowledge Distillation
As networks grow deeper, current deep-learning models are becoming more and more complex, and the computational resources they consume are becoming increasingly large. To alleviate this problem, Hinton et al. proposed the knowledge distillation method [53]. Traditional knowledge distillation methods train a teacher model on a known dataset and then supervise the training of a student model using the soft labels of the teacher model as well as the real labels. In general, the higher the training accuracy of the teacher model compared to the student model, the more effective the distillation is [36,54]. Building on this traditional method, a series of novel distillation models have been proposed [55][56][57][58]. Traditional knowledge distillation between models often suffers from inefficient knowledge transfer and requires a lot of experimentation to find the optimal teacher model. For this reason, a novel approach called self-distillation was proposed, in which the network itself acts as both the teacher model and the student model. In this case, knowledge distillation usually takes place between different layers of the network. In [59], a self-distillation strategy is proposed that improves computational efficiency by designing a new network structure for knowledge distillation at each layer of the network. Additionally, in [60], the simultaneous use of soft labels and feature maps to achieve knowledge distillation is proposed. Since knowledge distillation enables the knowledge contained in the teacher model to be transferred to the student model, the trained student model can achieve good classification results using only a small number of labeled samples.
Given the advantages of knowledge distillation on datasets with limited samples, we added it to the training. Our approach differs from traditional methods, which use teacher models to generate soft labels. In summary, we implement hierarchical prediction by adding a fully connected layer to each layer of the network and combining it with soft labels to achieve knowledge distillation. Since no teacher model is used, this can be regarded as a self-distillation approach. In contrast to existing methods, which pay less attention to soft label quality, we propose the concept of soft label quality for hyperspectral data and devise a simple and effective way to measure the quality of soft labels generated from unlabeled samples.

Methodology
This section elaborates our proposed SSKD. Section 2.2.1 presents the overall network, followed by the soft label generation in Section 2.2.2. The operation of knowledge distillation is described in Section 2.2.3.

Self-Supervised Learning Network
To overcome the problem of the limited number of labeled samples, we treated the HSI in two parts: a small number of labeled samples and a large number of unlabeled samples. Firstly, for the small number of labeled samples, the HSI data were extended by geometric transformations in the spectral and spatial domains in order to make fuller use of their information [5]. In the spatial domain, we rotated each original HSI image by 0°, 90°, 180° and 270°, after which we performed a mirror flip operation on these four images, resulting in eight transformed images. In the spectral domain, a spectral inversion operation was performed on the HSI, defining the task of predicting the spectral sequence order, through which information related to the spectral domain of the images was learned. Through these spatial and spectral transformations, it is possible to make full use of a small number of labeled samples. In fact, the process described above is implemented using a self-supervised learning approach, i.e., the rotation and spectral flip applied to the input image are given labels and compared with the network output to obtain the self-supervised loss. For the unlabeled samples, soft labels are generated, the unlabeled samples with soft labels are fed into the network for training, and the distillation loss is calculated by comparing the outputs with the soft labels. The detailed soft label generation procedure is described in Section 2.2.2.
The overall network structure of the proposed SSKD is shown in Figure 1. In the feature extraction part, we use a progressive convolutional neural network model. The network is characterized by the fact that the output of each layer is used as input to the following layers. Through progressive image feature accumulation, the convolutional layers can effectively learn multi-scale feature information. The feature extractor embedded in the network is shown in Figure 2, and the training process is as follows. The training set is denoted as H, and its width, height and depth are denoted as w, h and d. To ensure that the output of each layer can be concatenated with the input data, the input data are first padded and the convolution kernel size is chosen accordingly; the output data of each layer are denoted by F. The outputs of layer 0 to layer n − 1 are combined as the input to layer n:
$F_n = \mathrm{Conv}([F_0, F_1, \ldots, F_{n-1}])$
where $F_n$ stands for the output of the nth layer, $[\cdot]$ denotes the combination of the preceding outputs, and Conv denotes the convolution operation of each layer. As the network deepens, the receptive field becomes larger, so that multi-scale HSI features can be effectively extracted by this network structure. The amount of information contained in each layer increases with depth. To make full use of the information in each layer, a multilayer information fusion strategy is used, as shown in Figure 2. The fusion strategy F in Figure 2 is described in detail below. The convolution results of each layer are passed to fully connected layers to generate the category predictions. Specifically, for the kth layer, since each input sample undergoes the spatial-spectral transformations, the original and transformed samples are fed into the network to obtain the result set $Q_k$. Feeding $Q_k$ into the softmax function and averaging gives the fusion result $B_k$ for the kth layer. The same fusion strategy is applied to each layer to obtain $T = \{B_k \mid k \in [1, 2, \ldots, R_k]\}$, which is generated from the first to the $R_k$th convolutional layer, with $R_k$ denoting the total number of convolutional layers. Each $B_k$ is normalized using softmax, and the results are averaged. The final pixel-level HSI label is the class with the maximum value. In addition, to prevent overfitting and improve the robustness of the network, the ReLU activation function and a dropout strategy are used in the network.
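The multilayer fusion described above can be sketched as follows. This is a simplified illustration in plain Python, assuming that each layer k yields one logit vector per view (the original sample plus its transformed copies); the names `fuse_layer` and `fuse_network` and the toy logits are hypothetical and are not taken from the released SSKD code.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fuse_layer(q_k):
    """B_k: softmax each view's logits in Q_k, then average over the views."""
    return mean_vectors([softmax(logits) for logits in q_k])


def fuse_network(t):
    """Softmax-normalize every B_k, average over layers, pick the largest class."""
    fused = mean_vectors([softmax(b_k) for b_k in t])
    return fused.index(max(fused)), fused


# Toy example: two layers, two views per layer, three classes (made-up logits).
q_1 = [[0.1, 2.0, 0.3], [0.2, 1.5, 0.1]]
q_2 = [[0.0, 3.0, 0.5], [0.4, 2.2, 0.3]]
t = [fuse_layer(q_1), fuse_layer(q_2)]
predicted_class, fused = fuse_network(t)
```

Averaging probabilities rather than raw logits keeps every view and every layer on the same scale, so no single layer dominates the final decision.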

Soft Label Generation
To make better use of the information contained in the unlabeled samples, we propose a novel algorithm for adaptively generating soft labels. As shown in Figure 3, the generated soft labels are added to the network training using a cross-entropy function, from which self-supervised knowledge is extracted. In Figure 3, $(x_1, x_2)$ and $(y_1, y_2)$ denote the position coordinates of a and b, respectively, ent is shorthand for entropy, and the details of the Selector are given in Equations (4)-(7).
The spatial distance $D_a$ between labeled and unlabeled samples is calculated using the Euclidean distance, $D_a = \sqrt{(x_l - x_u)^2 + (y_l - y_u)^2}$, where $(x_l, y_l)$ and $(x_u, y_u)$ are the two-dimensional spatial coordinates of the labeled and unlabeled samples on the HSI, respectively. The spectral distance $D_e$ is calculated using a commonly used spectral similarity measure, the Kullback-Leibler divergence [61], in which the entropy is evaluated between the two spectra, with $l$ and $u$ being the labeled and unlabeled spectral vectors, respectively. Combining the above two distances gives the total spatial-spectral distance $D_t$ between the labeled and unlabeled samples. Due to the high degree of similarity between HSI spectra and the problems of homospectrality and heterospectrality, the distance obtained above can deviate from the real data. In order to obtain accurate and reliable spectral vectors, we add an adaptive comparison judgment to select the optimal data for generating soft labels. First, we find the minimum distance within each category, where $s$ denotes the number of labeled samples selected and the min(·) function takes the minimum value of $D_t$ in the current category. Collecting the minimum values of all categories gives the set of per-category minima of $D_t$, where $c$ is the number of HSI categories. For this set, the smin(·) function takes the second smallest value $D_m$, and a category judgment is added. In the selection process, $\alpha$ and $\beta$ are preset thresholds: the minimum distance must be smaller than $\alpha$, and the ratio of the minimum value to the second smallest value must be smaller than $\beta$. Both conditions must hold simultaneously. These criteria allow for the selection of data with a high confidence level. For the selected unlabeled samples, the distance $D_u$ between the unlabeled sample and each category is defined according to the distance $D_t$ between the samples, where $n$ is the index of the data in $D_t$ sorted from smallest to largest, and $C$ is the size of the label set. Finally, the generated $D_u$ is fed into the softmax function to generate the probability $P$ that the unlabeled sample belongs to each class, and the soft label is composed of the vector $P$.
With regard to whether the soft labels generated for a sample represent its information well, we propose the concept of soft label quality and provide a simple method to measure it. Firstly, all the labeled samples $L$ in the dataset are selected, from which a small number of samples are drawn to form the labeled sample set $L'$, and the remaining samples are denoted as $U'$. Since the soft label generated by the algorithm for a sample is a vector, the label corresponding to the position of the largest value in the vector should be the same as the true label of that sample. Based on this idea, soft labels are generated by the algorithm for all samples in $U'$; because the true labels of these samples are known, comparing the labels implied by the generated soft labels with the true labels yields both the number of generated soft labels that agree with the true labels and the correctness rate. The higher the correctness rate, the more accurate the generated soft labels. The higher the number, the more widely the algorithm can search for similar samples and generate soft labels for them. The strength of a soft labeling algorithm is therefore measured by both the number and the accuracy of the soft labels it generates: the higher both are, the better the algorithm.
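The generation procedure of this section (distance computation, selection by the thresholds α and β, then softmax) can be sketched as below. Several details are assumptions made only for illustration and are not confirmed by the text: the spectral distance is taken as a symmetric Kullback-Leibler divergence between sum-normalized, strictly positive spectra, the spatial and spectral distances are combined by multiplication, and the negated per-class minimum distances stand in for $D_u$ before the softmax. The function names and toy data are hypothetical.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def normalize(spectrum):
    """Scale a strictly positive spectrum so that it sums to one."""
    total = sum(spectrum)
    return [v / total for v in spectrum]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for strictly positive p, q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def total_distance(pos_l, spec_l, pos_u, spec_u):
    d_a = math.hypot(pos_l[0] - pos_u[0], pos_l[1] - pos_u[1])  # Euclidean
    p, q = normalize(spec_l), normalize(spec_u)
    d_e = kl_divergence(p, q) + kl_divergence(q, p)  # assumed symmetric form
    return d_a * d_e  # assumed combination rule


def soft_label(unlabeled, labeled, num_classes, alpha, beta):
    """Return a soft label for one unlabeled sample, or None if it is rejected.

    unlabeled: (position, spectrum); labeled: list of (position, spectrum, class).
    """
    per_class = [math.inf] * num_classes
    for pos, spec, cls in labeled:
        d = total_distance(pos, spec, unlabeled[0], unlabeled[1])
        per_class[cls] = min(per_class[cls], d)  # minimum distance per class
    d_min, d_second = sorted(per_class)[:2]
    # Keep the sample only if it is close enough AND clearly closest to one class.
    if d_second == 0 or d_min >= alpha or d_min / d_second >= beta:
        return None
    return softmax([-d for d in per_class])  # assumed stand-in for D_u


labeled = [((0, 0), [1.0, 2.0, 3.0, 4.0], 0),
           ((5, 5), [4.0, 3.0, 2.0, 1.0], 1)]
accepted = soft_label(((0, 1), [1.1, 2.0, 3.0, 3.9]), labeled, 2, 0.15, 0.5)
rejected = soft_label(((2, 3), [2.5, 2.5, 2.5, 2.5]), labeled, 2, 0.15, 0.5)
```

In this toy setup the first unlabeled sample lies next to the class-0 sample and has a nearly identical spectrum, so it passes both tests; the second is equally far from both classes and fails the α threshold, so no soft label is produced for it.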

Knowledge Distillation
Knowledge distillation is achieved by introducing a soft target associated with the teacher network as part of the overall loss to guide the training of the student network for knowledge transfer. Knowledge is extracted from unlabeled samples by adding them to the network training together with the generated soft labels. This allows the network to learn discriminative image information from a large number of unlabeled samples, beyond the limitation of having only a small number of labeled samples. In the total loss of the network, $R_k$ denotes the number of layers of the network and $L$ denotes the loss of the whole network, which consists of three components: $L_h$ is the loss between the network output prediction and the hard label; $L_s$ is the loss between the network output prediction and the soft label (the labeled samples are not involved in this term); and $L_q$ is the self-supervised learning loss in the spectral domain. Here, $L_h$ and $L_s$ are obtained by calculating the cross-entropy between the network output and the hard and soft labels, respectively, and $L_q$ is obtained by calculating the cross-entropy between the predicted spectral sequence order and the defined label.
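The three loss components can be illustrated with the sketch below. It assumes, purely for illustration, that the total loss is the unweighted sum of $L_h$, $L_s$ and $L_q$ over all layers; the actual weighting is not recoverable from the text, and the dictionary keys and toy logits are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Cross-entropy between softmax(logits) and a target probability vector."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def total_loss(layers):
    """Sum L_h + L_s + L_q over all layers (equal weights are an assumption)."""
    loss = 0.0
    for layer in layers:
        n_cls = len(layer["labeled_logits"])
        n_ord = len(layer["order_logits"])
        l_h = cross_entropy(layer["labeled_logits"],
                            one_hot(layer["hard_label"], n_cls))  # hard labels
        l_s = cross_entropy(layer["unlabeled_logits"],
                            layer["soft_label"])                  # soft labels
        l_q = cross_entropy(layer["order_logits"],
                            one_hot(layer["order_label"], n_ord))  # spectral order
        loss += l_h + l_s + l_q
    return loss


# Toy layer with uninformative (all-zero) logits and two classes.
layer = {
    "labeled_logits": [0.0, 0.0], "hard_label": 0,
    "unlabeled_logits": [0.0, 0.0], "soft_label": [0.5, 0.5],
    "order_logits": [0.0, 0.0], "order_label": 1,
}
loss = total_loss([layer, layer])
```

With uniform predictions every term equals ln 2, so the two-layer toy loss is 6 ln 2; the soft-label term is the only one that involves unlabeled samples.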

Datasets
Four commonly used hyperspectral datasets were selected for the experiment: Indian Pines (IP), University of Pavia (UP), Kennedy Space Center (KSC) and Botswana (Bot), all of which can be downloaded for free from http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes (accessed on 26 January 2022). False color images and ground truth images for the IP, UP and KSC datasets are shown in Figures 4-6, and the category names and the number of labeled samples per category for these datasets are listed in Tables 1-3.
Indian Pines dataset: The Indian Pines dataset was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Indian Pines test site in Indiana, USA, in 1992. The spectral range is 400~2500 nm and the image size is 145 × 145 pixels, as shown in Figure 4. Of these, 10,249 pixels contain ground objects, and the spatial resolution is 20 m. After removing the bands affected by noise, the remaining 200 bands can be used for classification. The dataset is annotated with 16 land-cover classes.
University of Pavia dataset: The University of Pavia dataset was acquired by the German Reflective Optics System Imaging Spectrometer (ROSIS) over the city of Pavia, Italy, in 2003. The spectral range is 430~860 nm and the image size is 610 × 340 pixels, as shown in Figure 5. There are 42,776 pixels containing ground objects, and the spatial resolution is 1.3 m. After removing the bands affected by noise, the remaining 103 bands can be used for classification. The dataset is annotated with 9 land-cover classes.
Kennedy Space Center dataset: The KSC dataset was acquired by NASA AVIRIS at the Kennedy Space Center in Florida on 23 March 1996. Its spectral range is 400~2500 nm and its image size is 512 × 614 pixels, as shown in Figure 6, including 5211 ground object pixels. The spatial resolution is 18 m, and 176 bands remain after removing the bands affected by water vapor noise. The dataset is annotated with 13 land-cover classes.
In our experiments, we used three widely used evaluation metrics for HSI classification, namely overall accuracy (OA), average accuracy (AA) and the Kappa coefficient. OA is the number of correctly classified samples in the test set as a proportion of the total number of test samples, AA is the average of the per-category accuracies on the test set, and the Kappa coefficient is a robust measure of the degree of agreement between the classification results and the ground truth. In addition, to ensure the reliability of the results, each experiment was repeated 10 times and the final classification result is the average of the 10 runs. The experimental platform used an Intel(R) Xeon(R) Gold 5118 CPU and an NVIDIA GeForce RTX 2080 Ti graphics card, and the method was implemented in PyTorch. Our source code is available at https://github.com/qiangchi/SSKD (accessed on 28 August 2022).
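For reference, the three metrics can be computed from a confusion matrix as in the following sketch. The function name and the toy label lists are illustrative only; the Kappa coefficient is computed in its standard form, (OA − p_e) / (1 − p_e), where p_e is the agreement expected by chance.

```python
def classification_metrics(y_true, y_pred, num_classes):
    """Return (OA, AA, Kappa) for integer class labels in [0, num_classes)."""
    # Confusion matrix: rows are true classes, columns are predicted classes.
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1

    total = len(y_true)
    oa = sum(cm[i][i] for i in range(num_classes)) / total

    # Per-class accuracy, skipping classes absent from the test set.
    per_class = [cm[i][i] / sum(cm[i])
                 for i in range(num_classes) if sum(cm[i]) > 0]
    aa = sum(per_class) / len(per_class)

    # Chance agreement from the row and column marginals.
    p_e = sum(sum(cm[i]) * sum(row[i] for row in cm)
              for i in range(num_classes)) / (total * total)
    kappa = (oa - p_e) / (1 - p_e)
    return oa, aa, kappa


# Toy test set with three classes (made-up labels).
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 0, 2]
oa, aa, kappa = classification_metrics(y_true, y_pred, 3)
```

Because AA weights every class equally, it is more sensitive than OA to errors on small classes, which matters when only a few samples per class are labeled.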

Experimental Setup
The experiment was designed to test the quality of soft labels in order to quantitatively analyze the resulting data. Using the UP dataset as an example, all the labeled samples in it are denoted as $L$. Five samples from each class in $L$ were selected to form the sample set $L'$, and the remaining samples formed the sample set $U'$. The algorithm was used to generate soft labels for all samples in $U'$, and the results were compared with the true labels to obtain the number of correct soft labels and the accuracy. The soft label generation algorithm of [5] is selected here for comparison with our method; it has a parameter γ = 0.085 set empirically for the judgment condition. The two parameters set empirically in our proposed algorithm are α = 0.15 and β = 0.5, where α and β are defined in Equation (7). The number of labeled samples in each category was five. The experiment was run five times and the results were averaged. The experiments were conducted on the IP, UP and KSC datasets, and the results are shown in Table 4. The experimental results on the quality of the soft labels show that our proposed method outperforms the comparison method in terms of the number of correct soft labels and accuracy on all three datasets. This shows that our method generates soft labels for more unlabeled samples and explores the information of unlabeled samples more extensively. An accuracy of 100% is achieved on the UP dataset, allowing for more accurate soft labeling of unlabeled samples and thus reducing the occurrence of errors. More accurate soft labels carry more image information, so the network can learn more about the features of the HSI during training, resulting in more accurate classification results.
For the feature extraction network, the number of layers was set to 3 and the dropout rate to 0.5. Adam stochastic optimization was used for ease of comparison.
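The soft label quality test described above amounts to counting how many soft labels are generated and how many of them agree with the known labels. A minimal sketch follows, assuming that a rejected sample is represented by `None` and that class labels are integer indices; the function name and toy values are hypothetical.

```python
def soft_label_quality(soft_labels, true_labels):
    """Return (generated, correct, accuracy) for a list of soft labels.

    soft_labels[i] is a probability vector, or None if no soft label was
    generated for sample i; true_labels[i] is the known class index.
    """
    generated = 0
    correct = 0
    for soft, truth in zip(soft_labels, true_labels):
        if soft is None:
            continue  # sample rejected by the selection step
        generated += 1
        if soft.index(max(soft)) == truth:  # class with the largest probability
            correct += 1
    accuracy = correct / generated if generated else 0.0
    return generated, correct, accuracy


# Toy example: four held-out samples, one of them rejected (made-up values).
soft_labels = [[0.9, 0.1], [0.2, 0.8], None, [0.6, 0.4]]
true_labels = [0, 1, 1, 1]
generated, correct, accuracy = soft_label_quality(soft_labels, true_labels)
```

Reporting the count alongside the accuracy matters: a very strict selector can reach high accuracy while labeling only a handful of samples.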

Classification Maps and Categorized Results
To validate the feasibility of the method in scenarios with only a small number of labeled samples, we conducted experiments on three datasets, taking five labeled samples from each class for training and the rest for testing. The comparison methods selected for the experiments include the traditional method SVM, the deep-learning-based methods 2DCNN and 3DCNN [27], the deep few-shot learning methods DFSL [32] and RN-FSC [31] and the soft-label-based method SSAD [5]. The quantitative comparisons of these methods are shown in Tables 5-7, and the best results in each table are shown in bold. To further evaluate the compared methods, classification maps are shown in Figures 7-9, where five labeled samples were selected for each class. It can be clearly seen from Figures 7-9 that SSKD has the fewest misclassified locations. Among all the compared methods, the proposed SSKD is the closest to the ground truth.

Compared with Different Number of Training Samples
To explore the effect of the number of samples on classification accuracy, we randomly selected between one and five samples from each class in the UP dataset to form the training set. All methods were compared using the same number of training samples. The results are displayed in Figure 10. We can observe from Figure 10 that all methods achieved an increase in accuracy as the sample size increased. Of these, SSKD achieved the best performance at each sample size, which also demonstrates the adaptability of the method to different sample sizes.
With only a small number of samples, the commonly used deep-learning methods 2DCNN and 3DCNN and the traditional method SVM each have advantages and disadvantages: the deep-learning methods perform better on the IP and UP datasets, while SVM performs better on the KSC dataset. In a direct comparison, 3DCNN performs slightly better than 2DCNN.

Ablation Study
To further validate the effectiveness of the proposed method, we designed ablation experiments. The UP dataset was selected for testing, with the number of samples selected per class ranging from 1 to 5. To validate the impact of self-supervised learning and knowledge distillation, these two modules were removed separately in the experiments. In addition, the spatial and spectral transformations of the samples were separated to see their individual effects. The classification results are displayed in Table 8, where -SS and -KD denote the removal of the self-supervised learning module and the knowledge distillation module, respectively, and -SPA and -SPE denote the removal of the spatial and spectral transformations of the samples, respectively, while the rest of the proposed method remains unchanged. From Table 8, it can be concluded that both the SS and KD modules are beneficial to classification accuracy. The KD module with soft labels has a greater impact on classification accuracy than the SS module: when tested with five labeled samples, removing the SS module and the KD module reduced accuracy by 1.8% and 11.91%, respectively, relative to the accuracy achieved by SSKD. In the case of a very small number of labels, generating soft labels for a large number of unlabeled samples allows more image information to be extracted and used for training. In addition, Table 8 shows that the spatial transformation has a greater impact on accuracy than the spectral transformation. This may be because the spatial transformation involves four angular changes, whereas the spectral transformation involves only a flip, so more knowledge is extracted in the network from the large number of unlabeled samples. Furthermore, the spatial-spectral transformation of the samples plays an important role in the multilayer information fusion strategy. By fusing information from samples at different angles, the model can obtain more discriminative information.

Efficiency Comparison
To compare the execution efficiency of the methods, the test time of each method is given in Table 9. The results show that SVM and 2DCNN took the shortest time in testing. The test times of SSKD and SSAD were similar and did not differ greatly from those of the other methods.
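
As a rough illustration of how such test times might be measured, the sketch below times a prediction function with Python's `time.perf_counter`. The helper name `time_inference`, the number of repeats, and the use of the best run are assumptions for illustration; the protocol behind Table 9 is not specified here.

```python
import time
import numpy as np

def time_inference(predict_fn, samples, repeats=3):
    """Return the best wall-clock time (seconds) over several prediction runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(samples)
        best = min(best, time.perf_counter() - start)
    return best

# Hypothetical stand-in classifier: assign each sample to the nearest class mean.
rng = np.random.default_rng(0)
class_means = rng.normal(size=(9, 103))      # assumed: 9 classes, 103 bands
test_samples = rng.normal(size=(1000, 103))  # assumed: 1000 test pixels

def nearest_mean_predict(x):
    dists = np.linalg.norm(x[:, None, :] - class_means[None, :, :], axis=2)
    return dists.argmin(axis=1)

elapsed = time_inference(nearest_mean_predict, test_samples)
```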

Discussion
From the experimental results shown in Tables 5-7, we can conclude the following:

(1) The deep-learning methods 2DCNN and 3DCNN outperform the traditional method SVM on the IP and UP datasets. SVM is limited by its inherently shallow structure, which makes it difficult to extract deeper features. Deep learning can extract deeper discriminative image features through deep neural networks and can therefore achieve better classification performance. For example, the deep-learning methods 2DCNN and 3DCNN improved the overall accuracy over SVM by 6.5% and 7.33%, respectively, on the UP dataset.

(2) The few-shot-learning-based approaches DFSL and RN-FSC achieved better classification results on all three datasets than the deep-learning approaches 2DCNN and 3DCNN. The deep-learning methods (2DCNN, 3DCNN) require a large number of training samples, so they do not perform well with only a few samples. The few-shot-learning approaches enable the models to acquire transferable visual analysis abilities through a meta-learning training strategy, which allows them to perform better than general deep network models when only a small number of labeled samples are provided.

(3) The soft-label-based approaches SSAD and SSKD performed better overall than the traditional method SVM, the deep-learning methods (2DCNN, 3DCNN), and the few-shot-learning approaches. The previous methods use only the small number of labeled samples and ignore the unlabeled samples. SSAD and SSKD, on the other hand, generate soft labels for the unlabeled samples and feed them into the network for training, fully exploiting the information contained in the unlabeled samples. The actual number of samples used is higher than in the other deep-learning approaches, which allows the model to extract more discriminative features from the images and achieve better classification results. The problem of a limited number of samples is thus overcome effectively.

(4) The proposed method SSKD outperformed SSAD on the three test datasets. In terms of overall accuracy, the performance was improved by 4.88%, 7.09% and 4.96% on the three test datasets, respectively. SSKD outperforms SSAD in both the number and the accuracy of the soft labels generated, so SSKD can use more unlabeled samples to train the network and can achieve efficient classification with a limited number of samples.
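
To make the idea of soft labels for unlabeled samples concrete, the following is a simplified sketch that combines spectral and spatial distances to the labeled samples and converts the per-class nearest distances into class probabilities. The weighting `alpha`, the `temperature`, the softmax conversion, and the entropy-based confidence filter are all assumptions made for illustration; they do not reproduce the Selector defined in Equations (4)-(7).

```python
import numpy as np

def soft_labels(unl_spec, unl_pos, lab_spec, lab_pos, lab_y, n_classes,
                alpha=0.5, temperature=1.0):
    """Toy soft-label generator (illustrative, not the paper's exact rule).

    unl_spec: (U, B) spectra of unlabeled pixels; unl_pos: (U, 2) coordinates.
    lab_spec: (L, B) spectra of labeled pixels;  lab_pos: (L, 2) coordinates.
    lab_y:    (L,) integer class labels. Returns (U, n_classes) probabilities.
    """
    # Pairwise Euclidean distances between unlabeled and labeled samples.
    d_spec = np.linalg.norm(unl_spec[:, None, :] - lab_spec[None, :, :], axis=2)
    d_spat = np.linalg.norm(unl_pos[:, None, :] - lab_pos[None, :, :], axis=2)
    # Scale both distances to [0, 1] so that they are comparable.
    d_spec = d_spec / (d_spec.max() + 1e-12)
    d_spat = d_spat / (d_spat.max() + 1e-12)
    d = alpha * d_spec + (1.0 - alpha) * d_spat
    # Distance from each unlabeled sample to its nearest labeled sample per class.
    class_dist = np.full((unl_spec.shape[0], n_classes), np.inf)
    for c in range(n_classes):
        mask = lab_y == c
        if mask.any():
            class_dist[:, c] = d[:, mask].min(axis=1)
    # Softmax over negative distances: closer classes get higher probability.
    logits = -class_dist / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def entropy(probs):
    """Shannon entropy of each probability row; low entropy means confident."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

# Tiny example: two classes, two labeled and two unlabeled pixels.
lab_spec = np.array([[0.0, 0.0], [10.0, 10.0]])
lab_pos = np.array([[0.0, 0.0], [5.0, 5.0]])
lab_y = np.array([0, 1])
unl_spec = np.array([[0.1, 0.1], [9.9, 9.9]])
unl_pos = np.array([[0.0, 1.0], [5.0, 4.0]])
probs = soft_labels(unl_spec, unl_pos, lab_spec, lab_pos, lab_y, n_classes=2)
keep = entropy(probs) < 0.65  # hypothetical confidence threshold
```

Samples flagged by `keep` would be the ones passed to training together with their soft labels; a lower threshold keeps fewer but more confident samples.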

Conclusions
In this paper, we propose an adaptive soft label generation method named SSKD, which is used in conjunction with knowledge distillation. To solve the problem of low classification accuracy with limited samples, the generated soft labels are combined with self-supervised learning: self-supervised learning uses image rotation as the supervisory signal, and the soft labels extract feature information from the HSI through knowledge distillation to achieve pixel-level classification. The qualitative and quantitative results demonstrate the effectiveness of the proposed SSKD.
The proposed method still has limitations in unlabeled sample selection. In the future, we may consider using sample information from other datasets across domains and combining it with knowledge distillation to achieve knowledge transfer. In addition, domain adaptation methods may be used to alleviate the domain shift problem in cross-domain scenarios.

Figure 3. Soft label generation. Here, a and b are the HSI data corresponding to the ground truth image, and (x_1, x_2) and (y_1, y_2) denote the position coordinates of a and b, respectively. "ent" is shorthand for entropy, and the details of the Selector are given in Equations (4)-(7).

Figure 4. False color image and ground truth image of the IP dataset. (a) False color image. (b) Ground truth image. (c) Color coding for each category.

Figure 10. Sample size variation and overall accuracy.

Table 1. Number of samples per category in the IP dataset.

Table 2. Number of samples per category in the PaviaU (UP) dataset.

Table 3. Number of samples per category in the KSC dataset.

Table 4. Soft label generation results obtained with the two judgment methods.

Table 5. Classification accuracy of the compared methods on the IP dataset, with five labeled samples per category. The best accuracy is shown in bold.

Table 6. Classification accuracy of the compared methods on the UP dataset, with five labeled samples per category. The best accuracy is shown in bold.

Table 7. Classification accuracy of the compared methods on the KSC dataset, with five labeled samples per category. The best accuracy is shown in bold.

Table 8. Overall classification accuracy for the ablation study conducted on the UP dataset, with one to five labeled samples selected from each category.

Table 9. Efficiency comparison of the compared methods in the testing phase (UP dataset, five labeled samples per class).