Article

A Novel Knowledge Distillation Method for Self-Supervised Hyperspectral Image Classification

Faculty of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
* Author to whom correspondence should be addressed.
Remote Sens. 2022, 14(18), 4523; https://doi.org/10.3390/rs14184523
Submission received: 3 August 2022 / Revised: 2 September 2022 / Accepted: 6 September 2022 / Published: 10 September 2022
(This article belongs to the Special Issue Advances in Hyperspectral Remote Sensing: Methods and Applications)

Abstract

Using deep learning to classify hyperspectral images (HSI) with only a few labeled samples available is a challenge. Recently, knowledge distillation methods based on soft label generation have been used to solve classification problems with a limited number of samples. Unlike hard labels, soft labels represent the probability that a sample belongs to each category and are therefore more informative for classification. The existing soft label generation methods for HSI classification cannot fully exploit the information in the available unlabeled samples. To solve this problem, we propose a novel self-supervised learning method with knowledge distillation for HSI classification, termed SSKD. The main motivation is to exploit more valuable information for classification by adaptively generating soft labels for unlabeled samples. First, similarity discrimination is performed using all unlabeled and labeled samples by considering both spatial and spectral distance. Then, an adaptive nearest neighbor matching strategy is applied to the generated data. Finally, probabilistic category judgment is performed to generate the soft labels. Compared to the state-of-the-art method, our method improves the classification accuracy by 4.88%, 7.09% and 4.96% on three publicly available datasets, respectively.

Graphical Abstract

1. Introduction

Hyperspectral images (HSI) are three-dimensional data cubes consisting of hundreds of spectral channels and containing rich spectral and spatial information [1]. The data are acquired by imaging spectrometers, which measure the radiation reflected from the surface in hundreds of narrow bands; because different materials absorb and reflect the spectrum differently, each material yields a distinct spectral profile. It is therefore possible to classify different types of materials on a pixel-by-pixel basis according to their spectral properties. HSI classification is an important issue in the field of hyperspectral applications and has a wide range of uses in agricultural surveying, ecological control, environmental science, marine exploration, etc. [2,3,4]. Because obtaining labeled samples is expensive, achieving HSI classification with only a limited number of labeled samples has become a hot research topic [5,6].
Traditional HSI classification methods have focused on feature extraction from labeled samples [7,8]. To extract spectral features, methods such as Principal Component Analysis (PCA) [9], Maximum Noise Fraction (MNF) [10] and Independent Component Analysis (ICA) [11] are used, and the extracted features are fed into a classifier. The classifiers used mainly include the Support Vector Machine (SVM) [12], K-Nearest Neighbor (KNN) [13] and Random Forest (RF) [14]. Classification based purely on spectral feature extraction is simple to implement but fails to capture the spatial variations in HSI. In [15,16], the feature extraction phase considers both spatial and spectral information, which yields more discriminative features than using spectral information alone. Most traditional hyperspectral classification methods are based on shallow models and handcrafted features and depend heavily on domain-specific prior knowledge. In addition, HSI is high-dimensional data containing a large amount of redundant information; coupled with the limited number of samples, this leads to the Hughes phenomenon [17] and poses a huge challenge for HSI classification.
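The classical pipeline described above can be summarized as spectral feature extraction followed by a shallow classifier. The following sketch illustrates such a baseline with scikit-learn (PCA followed by an SVM); it is only an illustration of the traditional approach, not the method proposed in this paper, and the data shapes are stand-ins.

```python
# Minimal sketch of the classical spectral-feature pipeline: PCA features
# fed into an SVM classifier. Illustrative only; shapes/parameters are assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def classify_spectral(train_pixels, train_labels, test_pixels, n_components=30):
    """train_pixels/test_pixels: (N, bands) arrays of spectral vectors."""
    model = make_pipeline(PCA(n_components=n_components), SVC(kernel="rbf"))
    model.fit(train_pixels, train_labels)
    return model.predict(test_pixels)

# Random stand-in data; a real HSI cube would be reshaped to (H*W, bands).
X_train = np.random.rand(80, 200)          # e.g., 5 labeled pixels x 16 classes
y_train = np.repeat(np.arange(16), 5)
X_test = np.random.rand(1000, 200)
pred = classify_spectral(X_train, y_train, X_test)
```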
In recent years, the development of deep learning has driven the advancement of HSI classification [18,19,20,21]. Deep neural networks automatically extract valuable features in a hierarchical manner, providing a high level of abstraction of the input data [22,23]. Starting from the spectral domain, Chen et al. used a deep stacked autoencoder (SAE) to extract spectral features and verified the feasibility of SAE for extracting HSI spectral features [24]. Based on convolutional neural networks, Hu et al. constructed a 1DCNN to obtain the spectral features of HSIs and used logistic regression for classification [25]; because convolutional neural networks use local connections and shared weights, this greatly reduced the number of parameters compared to fully connected neural networks. Liang et al. used principal component analysis for HSI dimensionality reduction and a 2DCNN to obtain spatial information, and finally applied sparse coding to obtain the classification results [26]. Chen et al. constructed a 3DCNN model for HSI classification, consisting of two 3D convolutional layers, two pooling layers and an output layer, and also used dropout, L2 regularization and data augmentation to alleviate overfitting [27]. Instead of plain forward convolution layers, dense blocks are used in [28] to fuse multi-scale information between different layers and extract multi-scale features, addressing the problem that a single scale may not reflect the complex spatial structure in HSI.
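To make the 3DCNN idea concrete, the sketch below shows a rough PyTorch model of the kind described in [27]: two 3D convolution layers, two pooling layers, dropout and an output layer. All layer sizes and the patch size are illustrative assumptions, not the configuration used in [27].

```python
# Rough sketch of a small 3D CNN for HSI patches; sizes are assumptions.
import torch
import torch.nn as nn

class Simple3DCNN(nn.Module):
    def __init__(self, num_classes=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=(7, 3, 3)), nn.ReLU(),
            nn.MaxPool3d((2, 1, 1)),
            nn.Conv3d(8, 16, kernel_size=(7, 3, 3)), nn.ReLU(),
            nn.MaxPool3d((2, 1, 1)),
            nn.Dropout(0.5),
        )
        self.classifier = nn.LazyLinear(num_classes)   # infers the flattened size

    def forward(self, x):                 # x: (batch, 1, bands, height, width)
        x = self.features(x)
        return self.classifier(x.flatten(1))

logits = Simple3DCNN()(torch.randn(2, 1, 200, 9, 9))   # 9x9 spatial patches, 200 bands
```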
Deep-learning-based HSI classification still faces some difficulties [8,29,30]. Labels for hyperspectral data are mainly acquired via manual annotation, which is costly and time consuming, making labeled samples hard to obtain. Therefore, how to achieve HSI classification using a small number of labeled samples is a meaningful problem. Deep neural networks contain a large number of parameters and often require as many training samples as possible. With only a small number of labeled samples available (e.g., only five labeled samples per class), deep neural network training is prone to overfitting, resulting in very low test accuracy. To address this, HSI classification is commonly treated as a few-shot task in deep learning [6,31]. Liu et al. [32] learned a metric space on the training dataset and generalized it to the classes of the test dataset, achieving better classification accuracy with only a small number of samples. In contrast, in [31], the similarity relationship between samples is learned through a relation network, replacing the simple use of Euclidean distance. Cao et al. [33] combined active learning and deep learning to reduce the required labeling cost. Another way to address the lack of labeled samples is to introduce self-supervised learning into HSI classification [5,34]. Self-supervised learning mainly uses auxiliary (pretext) tasks to mine supervised information from large-scale unsupervised data, and the entire network is trained by this constructed supervision so that it learns representations valuable for downstream tasks [35]. The above methods alleviate the problem of insufficient samples to some extent but do not make deeper use of unlabeled samples. In HSI, labeled samples account for only a small proportion of the pixels; most pixels are unlabeled. Most existing methods focus on making full use of the information contained in labeled samples while neglecting the far more numerous unlabeled samples; therefore, how to make better use of these unlabeled samples is a key issue for improving classification accuracy.
In order to better utilize the unlabeled samples, knowledge distillation has been introduced into HSI classification. The main idea of knowledge distillation is that a teacher model first learns from the labels and generates soft labels, and the student model is then trained with a loss measured against these soft labels; the resulting supervision encourages the network to learn additional feature information. Soft label generation is therefore one of the key components in training the whole network [36]. A number of recent knowledge distillation techniques have been proposed for HSI classification. In [37], a complex multi-scale teacher network and a simple single-scale student network are combined to implement knowledge distillation, explaining the relationship between multi-scale features and HSI categories. To address the possible catastrophic forgetting problem, a knowledge distillation strategy is incorporated into the model in [38], enabling it to recognize new categories while maintaining the ability to recognize old ones. Soft labels were first combined with self-supervised learning for HSI classification in [5], where they are used for knowledge distillation. However, current research on soft label generation is still a black box, and it is difficult to guarantee the quantity and quality of the generated soft labels. Generating wrong soft labels for unlabeled samples negatively affects the knowledge distillation.
Based on the above analysis, the main contributions of this paper are as follows:
  • A novel deep-learning method SSKD with combined knowledge distillation and self-supervised learning is proposed to achieve HSI classification in an end-to-end way with only a small number of labeled samples;
  • A novel adaptive soft label generation method is proposed, in which the similarity between labeled and unlabeled samples is first calculated from spectral and spatial perspectives, and then the nearest-neighbour distance ratio between labeled and unlabeled samples is calculated to filter the available samples. The proposed adaptive soft label generation achieves a significant improvement in classification accuracy compared to state-of-the-art methods;
  • We present the first concept of soft label quality for hyperspectral imagery and provide a simple measure of it: the soft label generation algorithm is applied to existing labeled samples, and the quality of the generated soft labels is measured against the known labels of those samples.
The remainder of this paper is organized as follows. Section 2 provides details of Materials and Methods, followed by the experimental results and analysis in Section 3. Discussion of the experimental results is provided in Section 4. The paper is concluded in Section 5.

2. Materials and Methods

2.1. Related Work

2.1.1. Self-Supervised Learning

Self-supervised learning focuses on obtaining supervised information from the unlabeled data of a dataset using pretext tasks. By utilizing this supervised information, the network can be trained without labels, and valuable representational information can be extracted from the unsupervised data. Since self-supervised learning does not require the data to carry labels, it has a wide range of applications in various fields [39,40,41,42,43]. The main auxiliary tasks in the image domain include jigsaw puzzles [44], image colorization [45], image rotation [46], image restoration [47], image fusion [48] and so on. In the case of image rotation, for example, an unlabeled image is rotated by four angles and assigned corresponding labels, and the rotated and original images are fed into the network to predict the rotation angle. With this self-supervised approach, it is possible to learn discriminative information about the image even in the absence of labels. Designing pretext tasks is the difficult and central element of self-supervised learning. Many pretext tasks designed a priori can face ambiguity problems, for example in rotation angle prediction, where some objects do not have a canonical orientation [49]. Given these advantages, a number of methods have been proposed that use self-supervised learning for HSI classification [50,51,52]. Existing methods verify the feasibility of self-supervised learning in the field of HSI classification. With only a small number of labeled samples, self-supervised learning can further improve hyperspectral image classification accuracy by constructing supervised information.
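The rotation pretext task mentioned above can be stated very compactly: each unlabeled image is rotated by 0°, 90°, 180° and 270°, and the network is trained to predict the rotation index. The following sketch illustrates this data construction; the names and shapes are illustrative, not taken from the paper.

```python
# Sketch of the rotation pretext task: build rotated copies and rotation labels.
import torch

def make_rotation_batch(images):
    """images: (N, C, H, W) tensor -> (4N, C, H, W) rotated batch and labels 0-3."""
    rotated, labels = [], []
    for k in range(4):                                     # k quarter-turns
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.shape[0],), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

x = torch.randn(8, 3, 32, 32)
batch, rot_labels = make_rotation_batch(x)   # feed to a classifier with 4 outputs
```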

2.1.2. Knowledge Distillation

As the number of network layers grows, current deep-learning models are becoming more and more complex, and the computational resources they consume grow accordingly. To alleviate this problem, Hinton et al. proposed the knowledge distillation method [53]. Traditional knowledge distillation trains a teacher model on a known dataset and then supervises the training of a student model using the soft labels of the teacher model together with the real labels. In general, the higher the training accuracy of the teacher model relative to the student model, the more effective the distillation [36,54]. Building on this traditional formulation, a series of novel distillation models have been proposed [55,56,57,58]. Traditional knowledge distillation between separate models often suffers from inefficient knowledge transfer and requires extensive experimentation to find the optimal teacher model. For this reason, a novel approach called self-distillation has been proposed, in which the network itself acts as both teacher and student, and distillation usually takes place between different layers of the network. In [59], a self-distillation strategy improves computational efficiency by designing a new network structure that performs knowledge distillation at each layer. Additionally, in [60], the simultaneous use of soft labels and feature maps for knowledge distillation is proposed. Since knowledge distillation transfers the knowledge contained in the teacher model to the student model, the trained student model can achieve good classification results using only a small number of labeled samples. Given these advantages on limited-sample datasets, we add knowledge distillation to the training. Unlike traditional methods, which use a teacher model to generate soft labels, we implement hierarchical prediction by adding a fully connected layer to each layer of the network and combining it with soft labels to achieve knowledge distillation; since no teacher model is used, this can be regarded as a self-distillation approach. In contrast to existing methods, which pay little attention to soft label quality, we propose the concept of soft label quality for hyperspectral imagery and devise a simple and effective way to measure the quality of soft labels generated from unlabeled samples.
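For readers unfamiliar with the mechanics, the generic soft-label distillation loss introduced by Hinton et al. [53] can be written as a weighted sum of a hard-label term and a temperature-softened soft-label term. The sketch below shows this generic formulation; the temperature and weighting are illustrative, and this is not the exact loss used by SSKD (see Section 2.2.3).

```python
# Generic teacher-student distillation loss (Hinton et al. [53]); illustrative values.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, T=4.0, alpha=0.5):
    soft_targets = F.softmax(teacher_logits / T, dim=1)          # softened teacher output
    soft_loss = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                         soft_targets, reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

loss = distillation_loss(torch.randn(8, 9), torch.randn(8, 9),
                         torch.randint(0, 9, (8,)))
```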

2.2. Methodology

This section elaborates our proposed SSKD. Section 2.2.1 presents the overall network, followed by the soft label generation in Section 2.2.2. The operation of knowledge distillation is described in Section 2.2.3.

2.2.1. Self-Supervised Learning Network

To overcome the limited number of labeled samples, we treat the HSI as two parts: a small number of labeled samples and a large number of unlabeled samples. First, for the small number of labeled samples, the HSI data were extended by geometric transformations in the spectral and spatial domains in order to make fuller use of their information [5]. In the spatial domain, each original HSI patch was rotated by 0°, 90°, 180° and 270°, after which a mirror flip was applied to these four images, resulting in eight transformed images. In the spectral domain, a spectral inversion operation was performed on the HSI, with the task of predicting the spectral sequence order, through which information about the spectral domain of the images is learned. Through these spatial and spectral transformations, full use can be made of the small number of labeled samples. The process described above is implemented in a self-supervised manner, i.e., the rotated and spectrally flipped inputs are assigned labels, which are compared to the network outputs to obtain the self-supervised loss. For the unlabeled samples, soft labels are generated, the samples with generated soft labels are fed into the network for training, and the distillation loss is calculated by comparing the outputs with the soft labels. The detailed soft label generation procedure is described in Section 2.2.2. The overall network structure of the proposed SSKD is shown in Figure 1. In the feature extraction part, we use a progressive convolutional neural network model, in which the output of each layer is used as part of the input to the next layer. Through this progressive accumulation of image features, the convolutional layers can effectively learn multi-scale feature information. The feature extractor embedded in the network is shown in Figure 2, and the training process is as follows.
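The spatial and spectral transformations described above can be sketched as follows: four rotations plus their mirror flips give eight spatial views of a patch, and a reversal of the band order provides the spectral pretext sample. The exact ordering and labeling conventions below are assumptions for illustration.

```python
# Sketch of the spatial (8 views) and spectral (band reversal) transformations.
import torch

def spatial_views(patch):
    """patch: (bands, H, W) -> list of 8 transformed patches."""
    views = []
    for k in range(4):
        rot = torch.rot90(patch, k, dims=(1, 2))
        views.append(rot)
        views.append(torch.flip(rot, dims=(2,)))   # mirror flip of each rotation
    return views

def spectral_flip(patch):
    """Reverse the band order to build the spectral-sequence pretext sample."""
    return torch.flip(patch, dims=(0,))

p = torch.randn(103, 9, 9)        # e.g., a UP patch with 103 bands (assumed size)
eight = spatial_views(p)
flipped = spectral_flip(p)
```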
The input and output of the network at layer n are calculated as follows, with the training set denoted as H and its width, height and depth denoted as w, h and d. To ensure that the output of each layer can be concatenated with the input data, the input is first padded, and the size of the convolution kernel is chosen as (3 × 3, d). The output of each layer is denoted by F, and the outputs of layers 0 to n − 1 are combined with the input as the input to the next layer. The formula is described as follows:
$$F_n = \begin{cases} \mathrm{Conv}(H), & n = 0 \\ \mathrm{Conv}([H, F_0, \ldots, F_{n-1}]), & n > 0, \end{cases}$$
where $F_n$ stands for the output of the $n$th layer and Conv denotes the convolution operation of each layer. As the number of layers in the network deepens, the receptive field becomes larger, so that multi-scale HSI features can be effectively extracted by this structured network.
The amount of information contained in each layer increases as the number of layers increases. To make full use of the information in each layer, a multilayer information fusion strategy is used, as shown in Figure 2. The fusion strategy F in Figure 2 is described in detail below. The result after convolution in each layer is connected to a fully connected layer to generate category predictions. Specifically, for the $k$th layer, since each input sample is spectral-spatial transformed, the original and transformed samples are fed into the network to obtain the result set $Q_k$. Feeding $Q_k$ into the softmax function and averaging gives the fusion result $B_k$ for the $k$th layer. The same fusion strategy is applied to each layer to obtain $T = \{B_k \mid k \in 1, 2, \ldots, R_k\}$, generated from the first to the $R_k$th convolutional layer, with $R_k$ denoting the total number of convolutional layers. The $B_k$ are normalized using softmax and the results are averaged, and the final HSI pixel-level label is the one with the maximum logit value. In addition, to prevent overfitting and improve the robustness of the network, the ReLU activation function and a dropout strategy are used.
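The following simplified PyTorch sketch illustrates the progressive feature extractor and the layer-wise fusion described above: each layer convolves the concatenation of the padded input and all previous outputs, and every layer has its own fully connected head whose softmax outputs are averaged to form the fused prediction. Channel counts, patch size and the dropout placement are illustrative assumptions, not the exact SSKD configuration.

```python
# Simplified sketch of the progressive extractor with multilayer fusion; sizes assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressiveExtractor(nn.Module):
    def __init__(self, bands=103, width=9, channels=32, layers=3, num_classes=9):
        super().__init__()
        self.convs, self.heads = nn.ModuleList(), nn.ModuleList()
        in_ch = bands
        for _ in range(layers):
            # 'same' padding keeps the spatial size so outputs can be concatenated
            self.convs.append(nn.Conv2d(in_ch, channels, kernel_size=3, padding=1))
            self.heads.append(nn.Linear(channels * width * width, num_classes))
            in_ch += channels            # next layer sees the input plus all earlier outputs
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):                # x: (batch, bands, H, W)
        feats, per_layer_probs = [x], []
        for conv, head in zip(self.convs, self.heads):
            out = F.relu(conv(torch.cat(feats, dim=1)))
            feats.append(out)
            per_layer_probs.append(F.softmax(head(self.dropout(out).flatten(1)), dim=1))
        return torch.stack(per_layer_probs).mean(0)   # fused class probabilities

probs = ProgressiveExtractor()(torch.randn(4, 103, 9, 9))
```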

2.2.2. Soft Label Generation

To make better use of the information contained in the unlabeled samples, we propose a novel algorithm for adaptively generating soft labels. As shown in Figure 3, the generated soft labels are added to the network training using a cross-entropy function from which self-supervised knowledge is extracted.
The spatial distance $D_a$ between labeled and unlabeled samples is calculated using the Euclidean distance:
$$D_a = \sqrt{(x_l - x_u)^2 + (y_l - y_u)^2},$$
where $(x_l, y_l)$ and $(x_u, y_u)$ are the two-dimensional spatial coordinates of the labeled and unlabeled samples on the HSI, respectively. The spectral distance $D_e$ is calculated using a commonly used spectral similarity measure, the Kullback–Leibler divergence [61]:
$$D_e = \mathrm{entropy}(l, u) + \mathrm{entropy}(u, l),$$
where entropy(·,·) denotes the relative entropy between the two spectra, with l and u being the labeled and unlabeled spectral vectors, respectively. Combining the above two distances, the total spatial-spectral distance between the labeled and unlabeled samples is defined as:
$$D_t = D_a \cdot D_e.$$
Due to the high similarity between HSI spectra and the problems of identical objects having different spectra and different objects having identical spectra, the distance obtained above may deviate from the real situation. In order to obtain accurate and reliable spectral vectors, we add an adaptive comparison judgement below to select the optimal data for generating soft labels. First, we find the minimum distance within each category:
$$D_t = \min(D_t), \quad D_t = (d_1, d_2, \ldots, d_s),$$
where s denotes the number of labeled samples selected and the min(·) function takes the minimum value within the current category $D_t$. After taking the minimum value for every category, $D_t$ can be expressed as $D_t = [d_1, d_2, \ldots, d_c]$, where c is the number of HSI categories. For the obtained $D_t$, the smin(·) function takes the second smallest value $D_m$, and a category judgement is added. The following formulas give the selection process:
$$D_m = \mathrm{smin}(D_t),$$
$$\begin{cases} D_t \le \alpha \\ D_t / D_m \le \beta, \end{cases}$$
where α and β are the set thresholds: the minimum distance value needs to be smaller than the parameter α, while the ratio of the minimum value to the second smallest value needs to be smaller than the parameter β. The curly brace indicates that both conditions must hold simultaneously. These criteria allow the selection of data with a high confidence level. For the selected unlabeled samples, the distance between an unlabeled sample and each category is defined according to the distance $D_t$ between the samples:
$$D_u = D_{t_{C-n}},$$
$$P = \mathrm{softmax}(D_u),$$
where n is the index of the data in $D_t$ sorted from smallest to largest, and C marks the size of the label set. Finally, the generated $D_u$ is fed into the softmax function to produce the probability P that the unlabeled sample belongs to each class, and the soft label is composed of the vector P.
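To make the procedure above more tangible, the sketch below implements it in NumPy: the spatial and spectral distances are combined, the α/β criterion filters out unreliable unlabeled samples, and a softmax over class scores forms the soft label. The per-class score is simplified here to a rank-based value, which is an assumption rather than the paper's exact definition; the KL term also assumes non-negative spectra.

```python
# Hedged sketch of adaptive soft label generation; the class score is a simplification.
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def kl_symmetric(p, q, eps=1e-12):
    p = p / p.sum() + eps             # assumes non-negative spectral vectors
    q = q / q.sum() + eps
    return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

def soft_label(u_xy, u_spec, labeled, alpha=0.15, beta=0.5):
    """labeled: list of (class_id, (x, y), spectrum); returns a soft label or None."""
    classes = sorted({c for c, _, _ in labeled})
    d_min = {c: np.inf for c in classes}          # minimum combined distance per class
    for c, (x, y), spec in labeled:
        d_a = np.hypot(u_xy[0] - x, u_xy[1] - y)
        d_t = d_a * kl_symmetric(u_spec, spec)
        d_min[c] = min(d_min[c], d_t)
    dists = np.array([d_min[c] for c in classes])
    order = np.argsort(dists)
    d1, d2 = dists[order[0]], dists[order[1]]
    if d1 > alpha or d1 / d2 > beta:              # adaptive selection: reject unreliable samples
        return None
    scores = np.empty(len(classes))
    scores[order] = np.arange(len(classes), 0, -1)   # nearest class gets the largest score
    return softmax(scores)
```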
With regard to whether the soft labels generated for a sample represent its information well, we propose the concept of soft label quality and provide a simple method to measure it. First, all the labeled samples L in the dataset are selected, a small number of samples are drawn from them to form the labeled sample set L′, and the remaining samples are denoted as U′. Since the soft label generated for a sample is a vector, the label corresponding to the position with the largest value of the vector should coincide with the true label of that sample. Based on this idea, soft labels are generated by the algorithm for all samples in U′; since the labels of these samples are known, the correctness rate is obtained by comparing the label implied by each generated soft label with the true label of the corresponding sample, and the number of generated soft labels that agree with the true labels is also counted. A higher correctness rate means that more of the generated soft labels are identical to the true labels of the samples and are thus more accurate. A higher number means the algorithm can search more widely for similar samples and generate soft labels for them. The strength of a soft labeling algorithm is therefore measured by both the number of generated soft labels and their accuracy: the higher both are, the better the algorithm.
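The quality measure just described amounts to holding out labeled samples, generating soft labels for them, and comparing the arg-max of each soft label with the known ground truth. A short sketch, with `generate_soft_label` standing for any soft label algorithm (e.g., the sketch above):

```python
# Sketch of the soft label quality measure: count and accuracy on held-out labeled samples.
import numpy as np

def soft_label_quality(held_out, generate_soft_label):
    """held_out: list of (sample, true_class_index); returns (count, accuracy)."""
    correct, produced = 0, 0
    for sample, true_class in held_out:
        p = generate_soft_label(sample)      # probability vector or None if rejected
        if p is None:
            continue
        produced += 1
        correct += int(np.argmax(p) == true_class)
    accuracy = correct / produced if produced else 0.0
    return produced, accuracy
```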

2.2.3. Knowledge Distillation

Knowledge distillation is achieved by introducing a soft target associated with the teacher network as part of the overall loss to guide the training of the student network for knowledge transfer. The extraction of knowledge from unlabeled samples is achieved by adding unlabeled samples to the network training and incorporating the generated soft labels. This allows the network to learn image discriminative information from a large number of unlabeled samples, beyond the limitation of having only a small number of labeled samples. The total loss of the network is defined as:
$$L = \frac{1}{R_k} \sum_{k=1}^{R_k} \left( L_k^h + L_k^s + L_k^q \right),$$
where $R_k$ denotes the number of layers of the network and L expresses the loss of the whole network, which consists of three components: $L^h$ is the loss between the network output prediction and the hard label; $L^s$ denotes the loss between the network output prediction and the soft label (the labeled samples are not involved in this term); $L^q$ represents the loss of self-supervised learning in the spectral domain. Here, $L^h$ and $L^s$ are obtained by calculating the cross-entropy between the network output and the hard and soft labels, respectively, and $L^q$ is obtained by calculating the cross-entropy between the network's output sequence prediction and the defined label.
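A schematic sketch of this total loss is given below: for each of the $R_k$ layer-wise predictions, the cross-entropy against hard labels (labeled samples), against generated soft labels (unlabeled samples) and against the spectral pretext labels are summed, and the result is averaged over layers. The tensor shapes and the masking convention are illustrative assumptions.

```python
# Schematic sketch of the layer-averaged total loss; shapes/masking are assumptions.
import torch
import torch.nn.functional as F

def soft_cross_entropy(logits, soft_targets):
    return -(soft_targets * F.log_softmax(logits, dim=1)).sum(1).mean()

def total_loss(layer_logits, layer_pretext_logits, hard_labels, labeled_mask,
               soft_labels, pretext_labels):
    losses = []
    for logits, pre_logits in zip(layer_logits, layer_pretext_logits):
        l_h = F.cross_entropy(logits[labeled_mask], hard_labels)      # labeled samples
        l_s = soft_cross_entropy(logits[~labeled_mask], soft_labels)  # unlabeled samples
        l_q = F.cross_entropy(pre_logits, pretext_labels)             # spectral pretext task
        losses.append(l_h + l_s + l_q)
    return torch.stack(losses).mean()
```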

3. Results

3.1. Datasets

Four commonly used hyperspectral datasets were selected for the experiments: Indian Pines (IP), University of Pavia (UP), Kennedy Space Center (KSC) and Botswana (Bot), all of which can be downloaded for free from http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes, accessed on 26 January 2022. False color images and ground truth images for the three datasets used are shown in Figure 4, Figure 5 and Figure 6, and the category names and number of labeled samples per category are given in Table 1, Table 2 and Table 3.
Indian Pines dataset: The Indian Pines dataset was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Indian Pines test site in Indiana, USA, in 1992. The spectral range is 400~2500 nm and the image size is 145 × 145 pixels, as shown in Figure 4. The scene contains 10,249 labeled ground-object pixels with a spatial resolution of 20 m. After removing the bands affected by noise, the remaining 200 bands can be used for classification. The dataset annotates 16 land-cover classes.
University of Pavia dataset: The University of Pavia dataset was acquired by the German airborne Reflective Optics System Imaging Spectrometer (ROSIS) over the city of Pavia, Italy, in 2003. The spectral coverage is 430~860 nm and the image size is 610 × 340 pixels, as shown in Figure 5. The scene contains 42,776 labeled ground-object pixels with a spatial resolution of 1.3 m. After removing the bands affected by noise, the remaining 103 bands can be used for classification. The dataset annotates 9 land-cover classes.
Kennedy Space Center dataset: The KSC dataset was acquired by the NASA AVIRIS sensor over the Kennedy Space Center in Florida on 23 March 1996. Its spectral coverage is 400~2500 nm and its image size is 512 × 614 pixels, as shown in Figure 6, including 5211 labeled ground-object pixels. The spatial resolution is 18 m, and 176 bands remain after removing water-vapor and noisy bands. The dataset annotates 13 land-cover classes.
In our experiments, we used three widely used evaluation metrics for HSI classification, namely overall accuracy (OA), average accuracy (AA) and the Kappa coefficient. OA is the proportion of correctly classified samples among all samples in the test set, AA is the average of the per-class accuracies on the test set, and the Kappa coefficient is a robust measure of the degree of agreement between predictions and ground truth. In addition, to ensure reliable results, each experiment was run 10 times and the final classification result is the average of the 10 runs. The experimental platform used an Intel(R) Xeon(R) Gold 5118 CPU and an NVIDIA GeForce RTX 2080 Ti graphics card, and the methods were implemented with the PyTorch platform. Our source code is available at https://github.com/qiangchi/SSKD, accessed on 28 August 2022.
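The three metrics can be computed directly from a confusion matrix; the small sketch below uses scikit-learn as a convenience, and the paper does not prescribe a particular implementation.

```python
# Sketch of OA, AA and the Kappa coefficient computed from predictions.
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score

def hsi_metrics(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    oa = np.trace(cm) / cm.sum()                      # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))        # average per-class accuracy
    kappa = cohen_kappa_score(y_true, y_pred)
    return oa, aa, kappa
```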

3.2. Experimental Setup

The first experiment was designed to test the quality of the soft labels in order to quantitatively analyze the generated data. Taking the UP dataset as an example, all its labeled samples are denoted as L. Five samples from each class in L were selected to form the sample set L′, and the remaining samples formed the sample set U′. The algorithm was used to generate soft labels for all samples in U′, and the results were compared with the true labels to obtain the number of correct soft labels and the accuracy. The soft label algorithm of [5] is selected here for comparison with our method; its soft label generation algorithm has a parameter γ = 0.085 set empirically for the judgement condition. The two parameters set empirically in our proposed algorithm are α = 0.15 and β = 0.5, where α and β are defined in Equation (7). The number of labeled samples per category was five. The experiment was run five times and the results were averaged. The experiments were conducted on the IP, UP and KSC datasets, and the results are shown in Table 4. The results show that our proposed method outperforms the comparison method in terms of both the number of correct soft labels and the accuracy on all three datasets. This illustrates that our method generates soft labels for more unlabeled samples and explores the information in the unlabeled samples more extensively. An accuracy of 100% is achieved on the UP dataset, which means the unlabeled samples receive more accurate soft labels and fewer errors are introduced. More accurate soft labels carry more of the information in the image, so the network can be trained to obtain more feature information about the HSI, resulting in more accurate classification. For the feature extraction network, the number of layers was set to 3 and dropout to 0.5, and the Adam optimizer was used for ease of comparison.

3.3. Classification Maps and Categorized Results

To validate the feasibility of the method in scenarios with only a small number of labeled samples, we conducted experiments on three datasets, taking five labeled samples from each class for training and the rest for testing. The comparison methods selected for the experiments include the traditional method SVM, the deep-learning-based methods 2DCNN and 3DCNN [27], the deep few-shot learning methods DFSL [32] and RN-FSC [31] and the soft-label-based method SSAD [5]. The quantitative comparisons of these compared methods are shown in Table 5, Table 6 and Table 7, and the best results in each table are shown in bold.
To further evaluate the compared methods, classification images are shown in Figure 7, Figure 8 and Figure 9, where five labeled samples were selected for each class. It can be clearly seen from Figure 7, Figure 8 and Figure 9 that SSKD achieved the least number of misclassified locations. Among all the compared methods, the proposed SSKD is the closest to the ground truth.

3.4. Comparison with Different Numbers of Training Samples

To explore the effect of variation in sample size on classification accuracy, we randomly selected between one and five samples from each class in the UP dataset to form the training set. All methods were compared under the same number of training sets. The results are displayed in Figure 10.
We can observe from Figure 10 that all methods achieved an increase in accuracy as the sample size increased. Of these, the SSKD achieved the best performance at each sample size, which also demonstrates the adaptability of the methods at different sample sizes.
With only a small number of samples, the commonly used deep-learning methods 2DCNN and 3DCNN and the traditional method SVM each have advantages and disadvantages: the deep-learning methods perform better on the IP and UP datasets, while SVM performs better on the KSC dataset. In a direct comparison between 2DCNN and 3DCNN, 3DCNN is slightly better than 2DCNN.

3.5. Ablation Study

To further validate the effectiveness of the proposed method, we designed ablation experiments. The UP dataset was selected for testing, with the number of samples per class ranging from 1 to 5. To validate the impact of self-supervised learning and knowledge distillation, these two modules were removed separately in the experiments. In addition, the spatial and spectral transformations of the samples were separated to see their individual effects. The classification results are displayed in Table 8, where -SS and -KD denote the removal of the self-supervised learning module and the knowledge distillation module, respectively, and -SPA and -SPE remove the spatial and spectral transformations of the samples, respectively, while the rest of the proposed method remains unchanged. From Table 8, it can be concluded that both the SS and KD modules are beneficial to classification accuracy. The KD module with soft labels has a greater impact on classification accuracy than the SS module: with five labeled samples, removing the SS module and the KD module reduced accuracy by 1.8% and 11.91%, respectively, from the accuracy achieved by SSKD. With only a very small number of labels, generating soft labels for a large number of unlabeled samples allows more image information to be extracted and used for training. In addition, Table 8 shows that the spatial transformation has a greater impact on accuracy than the spectral transformation. This may be because the spatial transformation involves four angular changes compared to the single spectral flip, so the network extracts more knowledge from the large number of unlabeled samples. Furthermore, the spatial-spectral transformation of the samples plays an important role in the multilayer information fusion strategy: by fusing information from samples at different angles, the model can obtain more discriminative information.

3.6. Efficiency Comparison

To compare the efficiency of each method, the test times are given in Table 9. From the results, it can be seen that the SVM and 2DCNN tests took the shortest time. The test times of SSKD and SSAD were similar and did not differ greatly from those of the other methods.

4. Discussion

From the experimental results shown in Table 5, Table 6 and Table 7, we can conclude the following:
(1)
The deep-learning methods 2DCNN and 3DCNN always outperform the traditional SVM. The traditional SVM is limited by its inherently shallow structure, making it difficult to extract deep features, whereas deep learning can extract deeper discriminative image features through deep neural networks and thus achieve better classification performance. For example, the deep-learning methods 2DCNN and 3DCNN improved the overall accuracy over SVM by 6.5% and 7.33%, respectively, on the UP dataset.
(2)
The few-shot-learning-based approaches DFSL and RN-FSC achieved better classification results on all three datasets than the deep-learning approaches 2DCNN and 3DCNN. The deep-learning methods (2DCNN, 3DCNN) require a large number of training samples, so they do not perform well with only a few samples. The few-shot-learning approaches enable the models to acquire transferable visual analysis abilities through a meta-learning training strategy, which allows them to outperform general deep network models when only a small number of labeled samples are provided.
(3)
The soft-label-based approaches SSAD and SSKD performed better overall than the traditional SVM, the deep-learning methods (2DCNN, 3DCNN) and the few-shot-learning approaches. With only a small number of labeled samples, the previous methods utilize only the limited labeled samples and ignore the unlabeled ones. SSAD and SSKD, on the other hand, generate soft labels for the unlabeled samples and feed them into the network for training, fully exploiting the information contained in the unlabeled samples. The actual number of samples used is higher than for the other deep-learning approaches, allowing the model to extract more discriminative features from the images and achieve better classification results; the problem of a limited number of samples is thus effectively overcome.
(4)
The proposed method SSKD outperformed SSAD on the three test datasets. In terms of overall accuracy, the performance was improved by 4.88%, 7.09% and 4.96% on the three test datasets, respectively. The SSKD outperforms the SSAD in terms of the number and accuracy of soft labels generated, so the SSKD can use more unlabeled samples to train the network and can achieve efficient classification with a limited number of samples.

5. Conclusions

In this paper, we propose an adaptive soft label generation method, named SSKD, which is used in conjunction with knowledge distillation. To solve the problem of low classification accuracy with limited samples, the generated soft labels are combined with self-supervised learning: self-supervised learning uses image rotation as supervised information, while the soft labels extract feature information from the HSI through knowledge distillation to achieve pixel-level classification. The qualitative and quantitative results demonstrate the effectiveness of the proposed SSKD.
The proposed method still has limitations in unlabeled sample selection, and in the future, we may consider using sample information from other datasets across domains and combining it with knowledge distillation to achieve knowledge transfer. In addition, domain adaptation methods may be utilized to alleviate the domain shift problem in cross-domain scenarios.

Author Contributions

Conceptualization, Q.C. and G.L.; methodology, G.L.; software, Q.C.; validation, Q.C.; formal analysis, Q.C.; investigation, Q.C.; resources, Q.C. and G.Z.; writing—original draft preparation, Q.C.; writing—review and editing, G.L. and Q.C.; visualization, Q.C.; supervision, G.L. and X.D.; funding acquisition, G.L. and X.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of Shandong Province, China, under Grant No. ZR2020MF041 and the National Natural Science Foundation of China under Grant Nos. 62076143 and 11901325.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Audebert, N.; Le Saux, B.; Lefèvre, S. Deep learning for classification of hyperspectral data: A comparative review. IEEE Geosci. Remote Sens. Mag. 2019, 7, 159–173. [Google Scholar] [CrossRef]
  2. Bioucas-Dias, J.M.; Plaza, A.; Camps-Valls, G.; Scheunders, P.; Nasrabadi, N.; Chanussot, J. Hyperspectral remote sensing data analysis and future challenges. IEEE Geosci. Remote Sens. Mag. 2013, 1, 6–36. [Google Scholar] [CrossRef]
  3. Jänicke, C.; Okujeni, A.; Cooper, S.; Clark, M.; Hostert, P.; van der Linden, S. Brightness gradient-corrected hyperspectral image mosaics for fractional vegetation cover mapping in northern California. Remote Sens. Lett. 2020, 11, 1–10. [Google Scholar] [CrossRef]
  4. Ozdemir, A.; Polat, K. Deep learning applications for hyperspectral imaging: A systematic review. J. Inst. Electron. Comput. 2020, 2, 39–56. [Google Scholar] [CrossRef]
  5. Yue, J.; Fang, L.; Rahmani, H.; Ghamisi, P. Self-supervised learning with adaptive distillation for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–13. [Google Scholar] [CrossRef]
  6. Jia, S.; Jiang, S.; Lin, Z.; Li, N.; Xu, M.; Yu, S. A survey: Deep learning for hyperspectral image classification with few labeled samples. Neurocomputing 2021, 448, 179–204. [Google Scholar] [CrossRef]
  7. Jiang, J.; Ma, J.; Chen, C.; Wang, Z.; Cai, Z.; Wang, L. SuperPCA: A superpixelwise PCA approach for unsupervised feature extraction of hyperspectral imagery. IEEE Trans. Geosci. Remote Sens. 2018, 56, 4581–4593. [Google Scholar] [CrossRef]
  8. Ahmad, M.; Shabbir, S.; Roy, S.K.; Hong, D.; Wu, X.; Yao, J.; Khan, A.M.; Mazzara, M.; Distefano, S.; Chanussot, J. Hyperspectral image classification—Traditional to deep models: A survey for future prospects. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2021, 15, 968–999. [Google Scholar] [CrossRef]
  9. Licciardi, G.; Marpu, P.R.; Chanussot, J.; Benediktsson, J.A. Linear versus nonlinear PCA for the classification of hyperspectral data based on the extended morphological profiles. IEEE Geosci. Remote Sens. Lett. 2011, 9, 447–451. [Google Scholar] [CrossRef]
  10. Green, A.A.; Berman, M.; Switzer, P.; Craig, M.D. A transformation for ordering multispectral data in terms of image quality with implications for noise removal. IEEE Trans. Geosci. Remote Sens. 1988, 26, 65–74. [Google Scholar] [CrossRef]
  11. Hyvärinen, A.; Oja, E. Independent component analysis: Algorithms and applications. Neural Netw. 2000, 13, 411–430. [Google Scholar] [CrossRef]
  12. Melgani, F.; Bruzzone, L. Classification of hyperspectral remote sensing images with support vector machines. IEEE Trans. Geosci. Remote Sens. 2004, 42, 1778–1790. [Google Scholar] [CrossRef]
  13. Guo, Y.; Han, S.; Li, Y.; Zhang, C.; Bai, Y. K-Nearest Neighbor combined with guided filter for hyperspectral image classification. Procedia Comput. Sci. 2018, 129, 159–165. [Google Scholar] [CrossRef]
  14. Gislason, P.O.; Benediktsson, J.A.; Sveinsson, J.R. Random forests for land cover classification. Pattern Recognit. Lett. 2006, 27, 294–300. [Google Scholar] [CrossRef]
  15. Huang, H.; Chen, M.; Duan, Y. Dimensionality reduction of hyperspectral image using spatial-spectral regularized sparse hypergraph embedding. Remote Sens. 2019, 11, 1039. [Google Scholar] [CrossRef]
  16. Shah, C.; Du, Q. Spatial-Aware Collaboration–Competition Preserving Graph Embedding for Hyperspectral Image Classification. IEEE Geosci. Remote. Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  17. Hughes, G. On the mean accuracy of statistical pattern recognizers. IEEE Trans. Inf. Theory 1968, 14, 55–63. [Google Scholar] [CrossRef]
  18. Mei, X.; Pan, E.; Ma, Y.; Dai, X.; Huang, J.; Fan, F.; Du, Q.; Zheng, H.; Ma, J. Spectral-spatial attention networks for hyperspectral image classification. Remote Sens. 2019, 11, 963. [Google Scholar] [CrossRef]
  19. Shen, Y.; Zhu, S.; Chen, C.; Du, Q.; Xiao, L.; Chen, J.; Pan, D. Efficient deep learning of nonlocal features for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 6029–6043. [Google Scholar] [CrossRef]
  20. Liu, B.; Yu, A.; Yu, X.; Wang, R.; Gao, K.; Guo, W. Deep multiview learning for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 7758–7772. [Google Scholar] [CrossRef]
  21. Liu, S.; Shi, Q.; Zhang, L. Few-shot hyperspectral image classification with unknown classes using multitask deep learning. IEEE Trans. Geosci. Remote Sens. 2020, 59, 5085–5102. [Google Scholar] [CrossRef]
  22. Wang, Q.; Liu, Y.; Xiong, Z.; Yuan, Y. Hybrid Feature Aligned Network for Salient Object Detection in Optical Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5624915. [Google Scholar] [CrossRef]
  23. Wang, P.; Han, K.; Wei, X.S.; Zhang, L.; Wang, L. Contrastive learning based hybrid networks for long-tailed image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2021), Virtual, 19–25 June 2021; pp. 943–952. [Google Scholar]
  24. Chen, Y.; Lin, Z.; Zhao, X.; Wang, G.; Gu, Y. Deep learning-based classification of hyperspectral data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 2094–2107. [Google Scholar] [CrossRef]
  25. Hu, W.; Huang, Y.; Wei, L.; Zhang, F.; Li, H. Deep convolutional neural networks for hyperspectral image classification. J. Sens. 2015, 2015, 258619. [Google Scholar] [CrossRef]
  26. Liang, H.; Li, Q. Hyperspectral imagery classification using sparse representations of convolutional neural network features. Remote Sens. 2016, 8, 99. [Google Scholar] [CrossRef]
  27. Chen, Y.; Jiang, H.; Li, C.; Jia, X.; Ghamisi, P. Deep feature extraction and classification of hyperspectral images based on convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 2016, 54, 6232–6251. [Google Scholar] [CrossRef]
  28. Xie, J.; He, N.; Fang, L.; Ghamisi, P. Multiscale densely-connected fusion networks for hyperspectral images classification. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 246–259. [Google Scholar] [CrossRef]
  29. Paoletti, M.; Haut, J.; Plaza, J.; Plaza, A. Deep learning classifiers for hyperspectral imaging: A review. ISPRS J. Photogramm. Remote Sens. 2019, 158, 279–317. [Google Scholar] [CrossRef]
  30. Li, S.; Song, W.; Fang, L.; Chen, Y.; Ghamisi, P.; Benediktsson, J.A. Deep learning for hyperspectral image classification: An overview. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6690–6709. [Google Scholar] [CrossRef]
  31. Gao, K.; Liu, B.; Yu, X.; Qin, J.; Zhang, P.; Tan, X. Deep relation network for hyperspectral image few-shot classification. Remote Sens. 2020, 12, 923. [Google Scholar] [CrossRef]
  32. Liu, B.; Yu, X.; Yu, A.; Zhang, P.; Wan, G.; Wang, R. Deep few-shot learning for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2018, 57, 2290–2304. [Google Scholar] [CrossRef]
  33. Cao, X.; Yao, J.; Xu, Z.; Meng, D. Hyperspectral image classification with convolutional neural network and active learning. IEEE Trans. Geosci. Remote Sens. 2020, 58, 4604–4616. [Google Scholar] [CrossRef]
  34. Wang, Y.; Mei, J.; Zhang, L.; Zhang, B.; Li, A.; Zheng, Y.; Zhu, P. Self-supervised low-rank representation (SSLRR) for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2018, 56, 5658–5672. [Google Scholar] [CrossRef]
  35. Misra, I.; Maaten, L.v.d. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognitionn, Seattle, WA, USA, 14–19 June 2020; pp. 6707–6717. [Google Scholar]
  36. Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge distillation: A survey. Int. J. Comput. Vis. 2021, 129, 1789–1819. [Google Scholar] [CrossRef]
  37. Shi, C.; Fang, L.; Lv, Z.; Zhao, M. Explainable scale distillation for hyperspectral image classification. Pattern Recognit. 2022, 122, 108316. [Google Scholar] [CrossRef]
  38. Xu, M.; Zhao, Y.; Liang, Y.; Ma, X. Hyperspectral Image Classification Based on Class-Incremental Learning with Knowledge Distillation. Remote Sens. 2022, 14, 2556. [Google Scholar] [CrossRef]
  39. Ma, J.; Tang, L.; Fan, F.; Huang, J.; Mei, X.; Ma, Y. SwinFusion: Cross-domain Long-range Learning for General Image Fusion via Swin Transformer. IEEE/CAA J. Autom. Sin. 2022, 9, 1200–1217. [Google Scholar] [CrossRef]
  40. Zbontar, J.; Jing, L.; Misra, I.; LeCun, Y.; Deny, S. Barlow twins: Self-supervised learning via redundancy reduction. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 12310–12320. [Google Scholar]
  41. Liu, Y.; Jin, M.; Pan, S.; Zhou, C.; Zheng, Y.; Xia, F.; Yu, P. Graph self-supervised learning: A survey. IEEE Trans. Knowl. Data Eng. 2022. [Google Scholar] [CrossRef]
  42. Shurrab, S.; Duwairi, R. Self-supervised learning methods and applications in medical imaging analysis: A survey. PeerJ Comput. Sci. 2022, 8, e1045. [Google Scholar] [CrossRef]
  43. Akbari, H.; Yuan, L.; Qian, R.; Chuang, W.H.; Chang, S.F.; Cui, Y.; Gong, B. Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. Adv. Condens. Matter Phys. 2021, 34, 24206–24221. [Google Scholar]
  44. Noroozi, M.; Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Cham, Switzerland, 2016; pp. 69–84. [Google Scholar]
  45. Treneska, S.; Zdravevski, E.; Pires, I.M.; Lameski, P.; Gievska, S. GAN-Based Image Colorization for Self-Supervised Visual Feature Learning. Sensors 2022, 22, 1599. [Google Scholar] [CrossRef]
  46. Gidaris, S.; Singh, P.; Komodakis, N. Unsupervised representation learning by predicting image rotations. arXiv 2018, arXiv:1803.07728. [Google Scholar]
  47. Paredes-Vallés, F.; de Croon, G. Back to Event Basics: Self-Supervised Learning of Image Reconstruction for Event Cameras via Photometric Constancy. arXiv 2020, arXiv:2009.08283. [Google Scholar]
  48. Ma, J.; Le, Z.; Tian, X.; Jiang, J. SMFuse: Multi-focus image fusion via self-supervised mask-optimization. IEEE Trans. Comput. Imaging 2021, 7, 309–320. [Google Scholar] [CrossRef]
  49. Feng, Z.; Xu, C.; Tao, D. Self-supervised representation learning by rotation feature decoupling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognitionn, Long Beach, CA, USA, 15–20 June 2019; pp. 10364–10374. [Google Scholar]
  50. Wang, Y.; Mei, J.; Zhang, L.; Zhang, B.; Zhu, P.; Li, Y.; Li, X. Self-supervised feature learning with CRF embedding for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2018, 57, 2628–2642. [Google Scholar] [CrossRef]
  51. Zhu, M.; Fan, J.; Yang, Q.; Chen, T. SC-EADNet: A Self-Supervised Contrastive Efficient Asymmetric Dilated Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–17. [Google Scholar] [CrossRef]
  52. Song, L.; Feng, Z.; Yang, S.; Zhang, X.; Jiao, L. Self-Supervised Assisted Semi-Supervised Residual Network for Hyperspectral Image Classification. Remote Sens. 2022, 14, 2997. [Google Scholar] [CrossRef]
  53. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  54. Wang, L.; Yoon, K.J. Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3048–3068. [Google Scholar] [CrossRef] [PubMed]
  55. Park, W.; Kim, D.; Lu, Y.; Cho, M. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–21 June 2019; pp. 3967–3976. [Google Scholar]
  56. Tung, F.; Mori, G. Similarity-preserving knowledge distillation. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 1365–1374. [Google Scholar]
  57. Zhu, Y.; Wang, Y. Student customized knowledge distillation: Bridging the gap between student and teacher. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 5057–5066. [Google Scholar]
  58. Beyer, L.; Zhai, X.; Royer, A.; Markeeva, L.; Anil, R.; Kolesnikov, A. Knowledge distillation: A good teacher is patient and consistent. arXiv 2021, arXiv:2106.05237. [Google Scholar]
  59. Zhang, L.; Song, J.; Gao, A.; Chen, J.; Bao, C.; Ma, K. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–3 November 2019; pp. 3713–3722. [Google Scholar]
  60. Ji, M.; Shin, S.; Hwang, S.; Park, G.; Moon, I.C. Refine Myself by Teaching Myself: Feature Refinement via Self-Knowledge Distillation. arXiv 2021, arXiv:2103.08273. [Google Scholar]
  61. Van Erven, T.; Harremos, P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820. [Google Scholar] [CrossRef]
Figure 1. Overall network structure of the proposed SSKD. SSL in the figure is an abbreviation for self-supervised learning.
Figure 2. Feature extractor. The figure shows the structure of a 3-layer feature extractor, with the width, height and depth of the input data denoted by w, h and d. K_size and K_num denote the size and number of convolutional kernels, respectively. The F denotes multi-layer information fusion. The output of each layer of the network is represented by a new color.
Figure 3. Soft label generation. In the figure, a and b are the HSI data corresponding to the ground truth image, and $x_1$, $x_2$, $y_1$, $y_2$ denote the position coordinates of a and b, respectively. ent is shorthand for entropy, and the details of the Selector are given in Equations (4)–(7).
Figure 4. False color image and ground truth image of the IP dataset. (a) False color image. (b) Ground truth image. (c) Color coding for each category.
Figure 5. False color image and ground truth image of the UP dataset. (a) False color image. (b) Ground truth image. (c) Color coding for each category.
Figure 6. False color image and ground truth image of the KSC dataset. (a) False color image. (b) Ground truth image. (c) Color coding for each category.
Figure 7. Graph of classification results of IP dataset. The highlighted white box in each subfigure is to compare the classification results more clearly. (a) Ground truth (GT). (b) SVM. (c) 2DCNN. (d) 3DCNN. (e) DFSL. (f) RN-FSC. (g) SSAD. (h) SSKD.
Figure 8. Graph of classification results of UP dataset. Similar to Figure 7, the highlighted white box in each subfigure is used to compare the classification results more clearly. (a) Ground truth (GT). (b) SVM. (c) 2DCNN. (d) 3DCNN. (e) DFSL. (f) RN-FSC. (g) SSAD. (h) SSKD.
Figure 9. Graph of classification results of KSC dataset. The colored regions are too small to visually distinguish details; therefore, the boxed regions have been enlarged for a clear comparison. (a) Ground truth (GT). (b) SVM. (c) 2DCNN. (d) 3DCNN. (e) DFSL. (f) RN-FSC. (g) SSAD. (h) SSKD.
Figure 10. Variation of overall accuracy with the number of labeled samples.
Table 1. Information on the number of samples per category in the IP dataset.
No. | Class Name | Number
1 | Alfalfa | 46
2 | Corn-notill | 1428
3 | Corn-mintill | 830
4 | Corn | 237
5 | Grass-pasture | 483
6 | Grass-trees | 730
7 | Grass-pasture-mowed | 28
8 | Hay-windrowed | 478
9 | Oats | 20
10 | Soybean-notill | 972
11 | Soybean-mintill | 2455
12 | Soybean-clean | 593
13 | Wheat | 205
14 | Woods | 1265
15 | Buildings-Grass-Trees-Drives | 386
16 | Stone-Steel-Towers | 93
Total | | 10,249
Table 2. Information on the number of samples per category in the PaviaU dataset.
No. | Class Name | Number
1 | Asphalt | 6631
2 | Meadows | 18,649
3 | Gravel | 2099
4 | Trees | 3064
5 | Painted metal sheets | 1345
6 | Bare Soil | 5029
7 | Bitumen | 1330
8 | Self-Blocking Bricks | 3682
9 | Shadows | 947
Total | | 42,776
Table 3. Information on the number of samples per category in the KSC dataset.
No. | Class Name | Number
1 | Scrub | 761
2 | Willow swamp | 243
3 | Cabbage palm hammock | 256
4 | Cabbage palm/oak hammock | 252
5 | Slash pine | 161
6 | Oak/broadleaf hammock | 229
7 | Hardwood swamp | 105
8 | Graminoid marsh | 431
9 | Spartina marsh | 520
10 | Cattail marsh | 404
11 | Salt marsh | 419
12 | Mud flats | 503
13 | Water | 927
Total | | 5211
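The experiments reported below draw only a handful of labeled pixels per class from these datasets (five per category in Tables 5–7) and treat the remaining pixels as unlabeled. A typical way to draw such a split from a ground-truth map is sketched below; this is a generic protocol for illustration, and the authors' exact sampling procedure may differ.

```python
import numpy as np

def split_per_class(gt, n_labeled=5, seed=0):
    """Pick n_labeled pixel coordinates per class from a ground-truth map
    (0 = background/unlabeled); the rest become the unlabeled pool."""
    rng = np.random.default_rng(seed)
    labeled, unlabeled = {}, {}
    for cls in np.unique(gt[gt > 0]):
        coords = np.argwhere(gt == cls)          # (count, 2) row/col positions
        rng.shuffle(coords)                      # shuffle along the first axis
        labeled[cls] = coords[:n_labeled]
        unlabeled[cls] = coords[n_labeled:]
    return labeled, unlabeled
```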
Table 4. Soft label results generated using the two judgment methods.
Datasets | Conditions | Soft Label Number | Correct Number | Precision
UP | γ ≤ 0.085 | 4062 | 4060 | 99.95%
UP | α ≤ 0.15 or β > 0.5 | 13,034 | 13,034 | 100%
IP | γ ≤ 0.085 | 14,705 | 13,978 | 95.06%
IP | α ≤ 0.15 or β > 0.5 | 18,936 | 18,721 | 98.92%
KSC | γ ≤ 0.085 | 4,401 | 4,174 | 94.84%
KSC | α ≤ 0.15 or β > 0.5 | 9338 | 9315 | 99.75%
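The last column of Table 4 is simply the share of generated soft labels whose class agrees with the ground truth, i.e., the correct number divided by the soft label number. For the first UP row, for instance:

```python
# Precision of the generated soft labels = correct / generated.
# Values taken from the first UP row of Table 4 (gamma <= 0.085).
generated, correct = 4062, 4060
print(f"{100 * correct / generated:.2f}%")   # -> 99.95%
```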
Table 5. The classification accuracy of several different methods on the IP dataset, with five labeled samples for each category. The best accuracy is shown in bold.
Class | SVM | 2DCNN | 3DCNN | DFSL | RN-FSC | SSAD | SSKD
1 | 32.86 | 71.95 | 44.55 | 43.80 | 15.57 | 98.37 | 97.56
2 | 36.53 | 31.36 | 39.18 | 46.67 | 51.45 | 69.59 | 71.80
3 | 23.80 | 35.36 | 36.57 | 39.27 | 45.46 | 66.75 | 78.43
4 | 37.79 | 30.60 | 29.46 | 27.40 | 36.29 | 94.54 | 95.8
5 | 50.78 | 67.52 | 56.73 | 77.39 | 62.65 | 73.71 | 83.48
6 | 74.32 | 74.34 | 74.92 | 94.59 | 93.75 | 99.63 | 98.97
7 | 28.55 | 92.39 | 26.56 | 32.31 | 18.54 | 100 | 100
8 | 75.67 | 74.37 | 78.53 | 97.66 | 80.84 | 99.65 | 100
9 | 14.91 | 90.00 | 33.18 | 20.38 | 9.62 | 100 | 100
10 | 39.39 | 43.77 | 42.79 | 58.08 | 79.27 | 77.7 | 86.89
11 | 44.36 | 45.45 | 48.38 | 71.46 | 83.71 | 71.43 | 85.16
12 | 26.56 | 21.51 | 35.03 | 32.16 | 44.85 | 80.67 | 67.82
13 | 79.23 | 91.75 | 80.45 | 72.61 | 79.34 | 100 | 100
14 | 69.86 | 50.44 | 85.42 | 93.59 | 83.44 | 97.7 | 96.57
15 | 28.84 | 35.10 | 46.89 | 54.48 | 45.55 | 93.52 | 89.96
16 | 84.93 | 86.93 | 79.07 | 85.99 | 49.13 | 100 | 100
OA(%) | 46.12 ± 5.02 | 46.99 ± 1.98 | 52.28 ± 3.09 | 59.41 ± 4.06 | 62.22 ± 3.12 | 80.99 ± 2.76 | 85.87 ± 1.88
AA(%) | 59.69 ± 3.58 | 58.93 ± 3.77 | 63.65 ± 4.33 | 59.24 ± 3.64 | 54.97 ± 4.41 | 88.95 ± 1.71 | 90.82 ± 1.63
Kappa | 40.16 ± 5.29 | 40.70 ± 2.28 | 46.67 ± 3.77 | 54.77 ± 4.23 | 58.06 ± 3.44 | 78.33 ± 2.41 | 83.88 ± 2.21
Table 6. The classification accuracy of several different methods on the UP dataset, with five labeled samples for each category. The best accuracy is shown in bold.
Class | SVM | 2DCNN | 3DCNN | DFSL | RN-FSC | SSAD | SSKD
1 | 63.44 | 49.44 | 77.08 | 88.17 | 86.67 | 79.69 | 87.44
2 | 61.99 | 65.51 | 68.96 | 91.95 | 98.26 | 77.47 | 93.16
3 | 38.98 | 77.38 | 67.35 | 62.82 | 54.97 | 92.07 | 80.4
4 | 66.44 | 71.55 | 75.28 | 91.79 | 74.30 | 95.12 | 91.27
5 | 93.93 | 99.44 | 99.32 | 99.99 | 97.87 | 100 | 100
6 | 40.49 | 57.50 | 43.85 | 45.18 | 44.10 | 89.05 | 83.15
7 | 39.53 | 92.28 | 61.36 | 52.79 | 70.76 | 98.55 | 99.62
8 | 63.90 | 63.43 | 68.16 | 68.85 | 90.05 | 91.88 | 98.21
9 | 99.78 | 99.34 | 97.18 | 98.47 | 77.54 | 99.47 | 99.98
OA(%) | 59.08 ± 4.26 | 65.55 ± 5.09 | 66.41 ± 1.95 | 76.73 ± 2.03 | 77.59 ± 3.61 | 84.24 ± 2.07 | 91.33 ± 1.26
AA(%) | 69.53 ± 2.88 | 75.10 ± 2.10 | 77.24 ± 1.94 | 77.78 ± 1.53 | 77.17 ± 2.52 | 91.48 ± 1 | 92.57 ± 0.84
Kappa | 50.04 ± 3.65 | 56.89 ± 5.08 | 58.77 ± 2.97 | 70.20 ± 2.52 | 71.65 ± 3.41 | 79.75 ± 2.49 | 88.50 ± 1.5
Table 7. The classification accuracy of several different methods on the KSC dataset, with five labeled samples for each category. The best accuracy is shown in bold.
Class | SVM | 2DCNN | 3DCNN | DFSL | RN-FSC | SSAD | SSKD
1 | 58.32 | 58.47 | 48.18 | 98.45 | 54.07 | 98.63 | 99.47
2 | 41.22 | 54.82 | 72.69 | 90.60 | 72.77 | 89.92 | 90.13
3 | 48.25 | 32.93 | 55.58 | 82.92 | 58.25 | 93.89 | 95.82
4 | 18.51 | 11.21 | 31.68 | 62.17 | 49.07 | 76.65 | 92.81
5 | 27.23 | 20.99 | 33.81 | 46.40 | 68.57 | 83.12 | 70.51
6 | 21.05 | 24.57 | 15.18 | 68.85 | 58.30 | 82.74 | 91.29
7 | 43.03 | 51.25 | 80.50 | 61.40 | 100 | 100 | 99.50
8 | 35.36 | 49.58 | 54.40 | 76.84 | 87.19 | 93.27 | 96.19
9 | 66.06 | 79.97 | 89.03 | 92.64 | 60.24 | 99.74 | 96.41
10 | 33.53 | 46.80 | 53.07 | 97.71 | 84.02 | 35.51 | 81.83
11 | 83.26 | 98.24 | 93.78 | 100 | 54.33 | 99.11 | 97.83
12 | 29.32 | 45.33 | 49.30 | 96.23 | 93.03 | 74.03 | 80.02
13 | 83.96 | 88.60 | 87.74 | 99.28 | 99.25 | 100 | 100
OA(%) | 56.13 ± 3.03 | 61.47 ± 1.95 | 63.49 ± 4.71 | 87.77 ± 1.41 | 70.62 ± 2.51 | 88.48 ± 3.96 | 93.44 ± 2.18
AA(%) | 48.88 ± 2.28 | 55.94 ± 3.38 | 58.84 ± 4.88 | 82.58 ± 1.63 | 72.24 ± 4.36 | 86.66 ± 3.44 | 91.68 ± 2.71
Kappa | 51.13 ± 3.28 | 57.3 ± 2.29 | 59.79 ± 4.98 | 86.42 ± 1.56 | 67.02 ± 3.92 | 87.16 ± 4.42 | 92.69 ± 2.43
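Tables 5–7 report per-class accuracy together with overall accuracy (OA), average accuracy (AA) and the kappa coefficient. For reference, a minimal NumPy sketch of the standard way these three metrics are computed from a confusion matrix is given below; it is the textbook formulation (returning fractions rather than percentages), not code from the paper.

```python
import numpy as np

def oa_aa_kappa(conf):
    """OA, AA and Cohen's kappa from a square confusion matrix whose rows
    are true classes and columns are predicted classes."""
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    oa = np.trace(conf) / total                          # overall accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)         # per-class recall
    aa = per_class.mean()                                # average accuracy
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1 - pe)                         # chance-corrected agreement
    return oa, aa, kappa

# Toy 3-class example.
print(oa_aa_kappa([[50, 2, 3], [5, 40, 5], [2, 3, 45]]))
```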
Table 8. The overall classification accuracy for the ablation study conducted on the UP dataset with one to five samples selected from each category.
Methods | 1 | 2 | 3 | 4 | 5
SSKD | 71.35 ± 3.74 | 78.76 ± 2.5 | 86.56 ± 3.89 | 89.41 ± 2.41 | 91.33 ± 1.26
SSKD(-SS) | 68.26 ± 3.74 | 77.32 ± 3.22 | 83.44 ± 3.62 | 87.92 ± 2.81 | 89.53 ± 2.25
SSKD(-KD) | 63.61 ± 3.11 | 70.82 ± 3.02 | 73.64 ± 3.95 | 76.33 ± 3.48 | 79.42 ± 2.82
SSKD(-SPA) | 66.83 ± 2.47 | 74.71 ± 3.94 | 83.13 ± 2.14 | 84.21 ± 3.21 | 86.88 ± 4.84
SSKD(-SPE) | 68.10 ± 4.62 | 76.01 ± 2.41 | 84.63 ± 2.82 | 88.14 ± 4.05 | 89.21 ± 4.82
Table 9. Efficiency comparison of the compared methods in the testing phase (UP dataset, five labeled samples per class).
Methods | SVM | 2DCNN | 3DCNN | DFSL | RN-FSC | SSAD | SSKD
Testing (s) | 0.61 | 0.84 | 6.75 | 5.71 | 48.41 | 10.93 | 11.21
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Chi, Q.; Lv, G.; Zhao, G.; Dong, X. A Novel Knowledge Distillation Method for Self-Supervised Hyperspectral Image Classification. Remote Sens. 2022, 14, 4523. https://doi.org/10.3390/rs14184523

AMA Style

Chi Q, Lv G, Zhao G, Dong X. A Novel Knowledge Distillation Method for Self-Supervised Hyperspectral Image Classification. Remote Sensing. 2022; 14(18):4523. https://doi.org/10.3390/rs14184523

Chicago/Turabian Style

Chi, Qiang, Guohua Lv, Guixin Zhao, and Xiangjun Dong. 2022. "A Novel Knowledge Distillation Method for Self-Supervised Hyperspectral Image Classification" Remote Sensing 14, no. 18: 4523. https://doi.org/10.3390/rs14184523

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop