Training Small Networks for Scene Classification of Remote Sensing Images via Knowledge Distillation

Scene classification, which aims to identify the land-cover categories of remotely sensed image patches, is now a fundamental task in the field of remote sensing image analysis. Deep-learning-based algorithms are widely applied in scene classification and achieve remarkable performance, but these high-level methods are computationally expensive and time-consuming. Consequently, in this paper, we introduce a knowledge distillation framework, currently a mainstream model compression method, into remote sensing scene classification to improve the performance of smaller and shallower network models. Our knowledge distillation training method makes the high-temperature softmax output of a small and shallow student model match that of the large and deep teacher model. In our experiments, we evaluate the knowledge distillation training method for remote sensing scene classification on four public datasets: the AID dataset, the UCMerced dataset, the NWPU-RESISC dataset, and the EuroSAT dataset. Results show that our proposed training method was effective and increased overall accuracy (3% in AID experiments, 5% in UCMerced experiments, 1% in NWPU-RESISC and EuroSAT experiments) for small and shallow models. We further explored the performance of the student model on small and unbalanced datasets. Our findings indicate that knowledge distillation can improve the performance of small network models on datasets with lower spatial resolution images, numerous categories, and fewer training samples.


Introduction
With the rapid development of remote sensing (RS) techniques, a large number of algorithms have been proposed to automatically process massive earth observation data. Scene classification is one of the fundamental procedures of RS image analysis and is of great importance in many RS applications, such as land-use/land-cover (LULC) [1][2][3][4], agriculture [5][6][7][8], forestry [9,10], and hydrology [11].
The core task of scene classification is to identify the land-cover type of remotely sensed image patches automatically. Numerous supervised machine learning algorithms have been used in scene classification. These algorithms can be categorized into the following three types [12]: low-level, mid-level and high-level methods. Low-level methods first extract low-level hand-crafted features, including SIFT (scale invariant feature transform) [13,14], HOG (histogram of oriented gradients) [15], structural texture similarity [16], LBP (local binary patterns) [17], Gabor descriptors [18], etc. The features extracted by these methods are then used to train shallow classifiers such as Support Vector Machines (SVMs) [19] or K-Nearest Neighbor algorithms (KNNs) [20] to identify the categories of scene images. These scene classification methods based on low-level features are efficient on some structures and arrangements, but cannot easily depict the highly diverse and non-homogeneous spatial distributions in images [21]. Mid-level methods build scene representations by coding low-level local attributes. The bag-of-visual-words (BoVW) model is one of the most frequently used approaches [22]. To improve the classification accuracy, various low-level local descriptors are combined as complementary features in the standard BoVW model [23], the Gaussian mixture model (GMM) [24], and other pyramid-based models [25][26][27] for scene classification. In addition, topic models are introduced to combine visual semantic information and to encode higher-order spatial information between low-level local visual words [28][29][30][31][32].
High-level methods are based on deep learning (DL) models. DL models achieve state-of-the-art results in image recognition, speech recognition, semantic segmentation, object detection, natural language processing [33][34][35][36], and RS image scene classification. Many classic DL models in the field of computer vision (CV) have been shown to be effective in RS scene classification [12][37][38][39][40][41][42][43][44]. Most are based on deep convolutional neural network (CNN) models, such as AlexNet [45], CaffeNet [46], VGGNet [47], deep residual networks (ResNet) [48] and DenseNets [49]. Among these approaches, the CNN-based high-level models achieve state-of-the-art performance for scene classification tasks in remote sensing [50]. They can deal with scenes that are more complex and achieve higher overall accuracy by learning deep visual features from large training datasets, in contrast to shallow models and low-level methods that rely on manual feature extraction [51].
However, deep CNNs contain more parameters to train, so they cost more computational resources and time for training and prediction. For example, a 102-convolutional-layer CNN model, which contains 42.4 M parameters, takes 14 ms to classify a 224 × 224 × 3 scene image, while a simple 4-convolutional-layer CNN model takes 8.77 ms and contains only 1 M parameters, as detailed in Section 3.1 of this paper. This is an unacceptable cost in time and storage space in special situations, such as on embedded devices [52][53][54] or during on-orbit processing [55]. In contrast, a small and shallow model is fast and uses little space, but will not yield accurate and precise results when trained directly on ground truth data [33].
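The cost figures quoted above can be turned into throughput and size ratios with a few lines of arithmetic. The sketch below only restates the numbers given in this paragraph (14 ms and 42.4 M parameters for the deep model, 8.77 ms and 1 M parameters for the shallow one); the variable names are illustrative and not taken from the paper.

```python
# Per-image latency (ms) and parameter counts (millions) quoted in the text.
deep_latency_ms, deep_params_m = 14.0, 42.4
small_latency_ms, small_params_m = 8.77, 1.0

# Frames per second is the inverse of the per-image latency.
deep_fps = 1000.0 / deep_latency_ms
small_fps = 1000.0 / small_latency_ms

# Relative speed and relative size of the small model.
speedup = small_fps / deep_fps  # same as deep_latency_ms / small_latency_ms
param_ratio = small_params_m / deep_params_m

print(f"deep model:  {deep_fps:.1f} FPS")
print(f"small model: {small_fps:.1f} FPS")
print(f"speedup:     {speedup:.2f}x")
print(f"parameters:  {param_ratio:.1%} of the deep model")
```

Under these numbers the small model is roughly 1.6 times as fast and holds about 2.4% of the parameters, which is consistent with the speed and size comparison reported with the AID results.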
Under these circumstances, model compression techniques become imperative. Generalized model compression improves the performance of a shallow and fast model by learning from a cumbersome but better-performing model, or by simplifying the structure of the cumbersome network. There are three mainstream types of model compression algorithms: network pruning, network quantization and Teacher-Student Training (TST). Network pruning is a technique that reduces the size of networks by removing neurons or weights that are less important according to a certain criterion [52,[56][57][58], while network quantization attempts to reduce the precision of weights or features [59][60][61]. In contrast, TST methods impart knowledge from a teacher model to a student model by learning the distributions or outputs of some specific layers [62][63][64][65][66][67][68][69][70].
TST is easily confused with transfer learning. In transfer learning, we first train a base model on a certain dataset and task, and then transfer the learned features to another target network to be trained on a different target dataset and task [71,72]. A common use of transfer learning in the field of remote sensing is to fine-tune an ImageNet-pretrained model on a remote sensing dataset [2,40,73]. In a TST process, however, the teacher and student models are trained on the same dataset.
Knowledge distillation (KD) is one kind of TST method, first defined in [63]. In that paper, the authors distill knowledge from an ensemble of models into a single smaller model via high-temperature softmax training. In this paper, we introduce KD into remote sensing scene classification for the first time to improve the performance of small and shallow network models. We then conduct experiments on several public datasets to verify the effectiveness of KD and perform a quantitative analysis to explore the optimum parameter settings. We also discuss the performance of KD on different types of datasets. In addition, we test whether knowledge can still be distilled in the absence of a certain type of training sample or in the absence of sufficient training data. For convenience, we simplify the KD training process by learning from only one cumbersome model. The rest of this paper is organized as follows. Section 2 describes the teacher-student training method and our knowledge distillation framework. In Section 3, the results and analysis of experiments on several datasets are detailed. Our conclusions and future work are discussed in Section 4.

Method
As a special case of teacher-student training (TST), knowledge distillation (KD), in which the student imitates the high-temperature softmax output of the cumbersome teacher model, serves as our training framework. In this section, we first describe the different TST methods and then introduce the KD training method we adopted in our research.

Teacher-Student Training
TST is one of the mainstream model compression methods. In TST processing, a cumbersome pre-trained model is regarded as the teacher, and the untrained small and shallow model as the student. The student model not only learns the hard targets from the ground truth data, but also matches its output to the output of the teacher model. This is because the output of a softmax layer (soft target) contains more information than a one-hot labeled dataset (hard target).
A general TST process first trains the cumbersome model directly on the dataset and then trains the student model by minimizing the following loss function using the mini-batch stochastic gradient descent (MSGD) [74] method:

$$L_{TST}(X) = \lambda H_1(S(X), Y_{GT}) + H_2(S^{*}(X), T^{*}(X)) \tag{1}$$

where X denotes a batch of input data, S(X) is the softmax output of the student model, $Y_{GT}$ is the ground truth label corresponding to the input X, $S^{*}(X)$ and $T^{*}(X)$ are the outputs of a certain layer in the student model and the teacher model, and λ is a non-negative constant. The first term in this loss function is a ground truth constraint. If λ = 0, the supervised information is provided only by the teacher model, instead of the ground truth data. As λ decreases, the output probability distribution of the trained student model becomes more like that of the teacher model. $H_1(X, Y)$ and $H_2(X, Y)$ in Equation (1) can be any common loss functions, such as mean squared error (MSE) or categorical cross-entropy (CE):

$$H_{MSE}(X, Y) = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{d} (x_{ik} - y_{ik})^2 \tag{2}$$

$$H_{CE}(X, Y) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{d} y_{ik} \log x_{ik} \tag{3}$$

where m denotes the batch size, d denotes the size of the input vector x, and $x_{ik}$ indicates the kth element of the ith input sample.
The simplest, naive TST method matches the output probability distribution of the last softmax layer (MS) of the student model to that of the teacher model. In this case, $S^{*}(X) = S(X)$ and $T^{*}(X) = T(X)$ is the softmax output of the teacher model. Thus, the loss function of the MS mode can be defined as:

$$L_{MS}(X) = \lambda H_{CE}(S(X), Y_{GT}) + H_{CE}(S(X), T(X)) \tag{4}$$

To better impart knowledge from teacher to student, Bucilua, C., et al. [62] proposed the matching logits (ML) method. Logits are the inputs to the final softmax layer of a network. Here, the discrepancy among categories is more significant in the logits form than in the probability form. Because the value of a logit can be any real number, the CE in the second term of the loss function is replaced by MSE. The loss function in the ML mode now becomes:

$$L_{ML}(X) = \lambda H_{CE}(S(X), Y_{GT}) + H_{MSE}(S_{logits}(X), T_{logits}(X)) \tag{5}$$

where $S_{logits}(X)$ and $T_{logits}(X)$ are the logits outputs of the student model and the teacher model, respectively. In that work, the authors conducted several experiments and verified that the ML mode could improve small models by learning from model ensembles in simple machine learning tasks.
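The two TST variants described above, matching softmax outputs (MS) and matching logits (ML), can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function names, the example batch, and the choice to sum each loss over classes and average it over the batch are assumptions made here for the sake of a runnable example.

```python
import math

def softmax(logits):
    """Numerically stable softmax of one logit vector."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(pred, target, eps=1e-12):
    """Categorical cross-entropy, summed over classes and averaged over the batch."""
    total = 0.0
    for p_row, t_row in zip(pred, target):
        total -= sum(t * math.log(max(p, eps)) for p, t in zip(p_row, t_row))
    return total / len(pred)

def mean_squared_error(x, y):
    """Squared error, summed over vector elements and averaged over the batch."""
    total = 0.0
    for x_row, y_row in zip(x, y):
        total += sum((a - b) ** 2 for a, b in zip(x_row, y_row))
    return total / len(x)

def tst_loss(student_logits, teacher_logits, y_true, lam, mode="MS"):
    """lam * (ground-truth term) + (teacher-matching term)."""
    student_prob = [softmax(z) for z in student_logits]
    hard_term = cross_entropy(student_prob, y_true)
    if mode == "MS":    # match softmax outputs with cross-entropy
        teacher_prob = [softmax(z) for z in teacher_logits]
        soft_term = cross_entropy(student_prob, teacher_prob)
    elif mode == "ML":  # match logits with mean squared error
        soft_term = mean_squared_error(student_logits, teacher_logits)
    else:
        raise ValueError("mode must be 'MS' or 'ML'")
    return lam * hard_term + soft_term

# Hypothetical batch of two samples with three classes.
teacher = [[4.0, 1.0, 0.5], [0.2, 3.0, 1.0]]
student = [[2.5, 1.2, 0.3], [0.5, 2.0, 1.5]]
labels = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
for mode in ("MS", "ML"):
    print(mode, round(tst_loss(student, teacher, labels, lam=0.1, mode=mode), 4))
```

Setting `lam=0` removes the ground-truth term, so the labels no longer influence the loss and the student is supervised by the teacher alone.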
In addition to the methods mentioned above, various researchers in the field have proposed other means of TST training. Huang, Z., et al. [69] proposed matching the distributions of neuron selectivity patterns (NST) between two networks, with a new loss function designed to match these distributions by minimizing the maximum mean discrepancy (MMD). In [70], the authors compressed wide and shallow networks into thin but deeper networks, the FitNet, by learning intermediate-level hints from the teacher's hidden layers. Yim [68] transfers the knowledge distilled from the flow between layers, which is computed by the inner product between features and expressed as the FSP matrix. Different from the previous methods, Chen, T., et al. [64] accelerate the training of a larger network by transferring knowledge from an existing network to the new larger network. Lopez-Paz, D., et al. [66] combine knowledge distillation with privileged information [75], yielding generalized distillation.

Knowledge from Probability Distribution
The last output layer of a neural network is the softmax layer, which transforms the logits $z_i$ into probabilities $p_i$ via:

$$p_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)} \tag{6}$$

However, the normal softmax output tends to produce an approximately one-hot vector. An example is shown in Table 1 and Figure 1, which indicates that normal softmax (temperature = 1) makes the probability of class C2 tend to one and the others tend to zero. The entropy of the output also tends to zero. In practice, a remote sensing scene image consists of several categories of pixels. Thus, information about the non-maximum probability categories provides additional supervision for training.
To improve the discrimination ability and generalizability of a model, Hinton, G., et al. [63] introduced the high-temperature softmax function instead of normal softmax or logits. The high-temperature softmax function was first used in the field of reinforcement learning [76], and is denoted as:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} \tag{7}$$

where T is the temperature. The normal softmax is the special case T = 1. Softmax with a high temperature increases the entropy of the categorical vector, which helps the student to learn more knowledge from the probability distribution of a complex scene. An example of the categorical probability distributions of the high-temperature softmax output at different temperatures is shown in Table 1 and Figure 1. C1∼C6 denote six categories, and the example input logit data are listed in the last row of the table. When T = 1 (normal softmax), the probability of C2 tends to 1 and the others tend to 0. As T increases, the entropy of the categorical distribution becomes higher. It can be inferred from this example that if the student model learns the high-temperature softmax output from the teacher model, it will distill more knowledge of the categorical probability distribution. The experiments in Section 3 will verify this inference.
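The effect of the temperature on the output distribution can be checked numerically. The following sketch applies a temperature-scaled softmax to a set of hypothetical logits for six categories (these are illustrative values, not the logits of Table 1) and prints the resulting probabilities and entropy for several temperatures; the function names are assumptions of this example.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits divided by a temperature T (T = 1 is the normal softmax)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probabilities):
    """Shannon entropy (natural log) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)

# Hypothetical logits for six categories C1..C6 (not the values of Table 1).
logits = [2.0, 9.0, 4.0, 1.0, 3.0, 0.5]
for temperature in (1, 5, 10, 20, 50, 100):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 4) for p in probs], round(entropy(probs), 4))
```

For these logits the second category dominates at T = 1, while at large T the distribution approaches the uniform one, whose entropy is log 6; the ranking of the categories is unchanged at every temperature.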

KD Training Process
Having introduced the high-temperature softmax in the previous subsection, we divide the whole training process of knowledge distillation into two procedures. First, the teacher model is trained directly on the dataset, as shown in Figure 2a. The target is to let T(X), the softmax output of the teacher model, fit the ground truth $Y_{GT}$. Then, the knowledge is distilled via high-temperature softmax, as shown in Figure 2b. The student model has two output branches: the high-temperature softmax output distills knowledge from the teacher model, and the normal softmax output learns to match the ground truth label. Thus, the total loss of the KD process $L_{KD}(X)$ is:

$$L_{KD}(X) = \lambda H_{CE}(S(X), Y_{GT}) + H_{CE}(S_T(X), T_T(X)) \tag{8}$$

where $S_T(X)$ and $T_T(X)$ denote the T-temperature softmax outputs of the student model and the teacher model, respectively. In extreme conditions, such as a lack of training samples, the teacher's high-temperature output can even provide supervision to the student model without any ground truth data (by setting λ = 0). In prediction or in a production environment, the trained student model outputs only the normal softmax result, as shown in Figure 2c. As the high-temperature softmax output of the teacher model contains different information from the ground truth dataset, our KD framework provides the student model with more categorical information in scene classification tasks.
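The two-branch loss described above can be sketched as follows. This is an illustrative plain-Python version, not the Keras implementation used in the paper: it assumes cross-entropy for both branches, averages over the batch, and allows the labels to be omitted when λ = 0. All names and example values are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax of one logit vector."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(pred, target, eps=1e-12):
    """Cross-entropy between one predicted and one target distribution."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(pred, target))

def kd_loss(student_logits, teacher_logits, y_true, lam, temperature):
    """lam * CE(normal softmax, labels) + CE(high-T student, high-T teacher)."""
    hard_term = soft_term = 0.0
    for i, (z_s, z_t) in enumerate(zip(student_logits, teacher_logits)):
        soft_term += cross_entropy(softmax(z_s, temperature), softmax(z_t, temperature))
        if lam > 0:  # labels are only needed when the ground-truth branch is active
            hard_term += cross_entropy(softmax(z_s), y_true[i])
    return (lam * hard_term + soft_term) / len(student_logits)

# Hypothetical batch of two samples with three classes.
teacher = [[4.0, 1.0, 0.5], [0.2, 3.0, 1.0]]
student = [[2.5, 1.2, 0.3], [0.5, 2.0, 1.5]]
labels = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

print(round(kd_loss(student, teacher, labels, lam=0.1, temperature=10.0), 4))
# With lam = 0 the loss needs no labels, so unlabeled images can be used.
print(round(kd_loss(student, teacher, None, lam=0.0, temperature=10.0), 4))
```

The original formulation of Hinton et al. additionally suggests scaling the soft term by T squared to balance gradient magnitudes; the text above does not mention such a factor, so the sketch leaves it out.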

Experimental Results and Analysis
In this section, to test the performance of our distillation framework, we conducted several experiments on four remote sensing scene classification datasets: the AID dataset [51] (Figure 3a), the UCMerced dataset [22] (Figure 3b), the NWPU-RESISC dataset [77] (Figure 3c) and the EuroSAT dataset [2] (Figure 3d). The general information of each dataset is listed in Table 2. For each dataset, we first trained a large deep network model and a small shallow one by direct training. Then we trained the small model with our proposed KD method. For comparison, other model compression methods, including ML, were also evaluated, and we analyzed the experimental results in detail. The implementation of the framework was based on Keras 2.1.1 [78] and TensorFlow 1.4.0 [79].

Experiments on AID Dataset
To evaluate the performance and robustness of our proposed KD framework for remote sensing image scene classification, we first designed and conducted several experiments on the AID dataset.

Dataset Description
The AID dataset is a large-scale public dataset for aerial scene classification, provided by [51]. It contains 10,000 manually labeled remote sensing scene images from around the world. All images in the AID dataset were collected from Google Earth (https://www.google.com/earth/). Each image is 600 × 600 pixels with three RGB spectral bands. The task in our experiments was to classify all scene images into thirty categories. The specific categories are: airport, bare land, baseball field, beach, bridge, center, church, commercial, dense residential, desert, farmland, forest, industrial, meadow, medium residential, mountain, park, parking, playground, pond, port, railway station, resort, river, school, sparse residential, square, stadium, storage tanks, and viaduct.
The dataset was divided into three parts for each category: training data (around 50%), validation data (around 25%) and test data (around 25%). To limit the computational cost, all images were re-sampled from 600 × 600 × 3 to 224 × 224 × 3 pixels. In total, there were 5000 images in the training dataset, 2507 in the validation dataset and 2493 in the test dataset.
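A per-category split of this kind can be sketched as below. The rounding rule, the random seed, the function name and the file names are assumptions made for illustration; the text only states the approximate 50%/25%/25% proportions, so this sketch is not expected to reproduce the exact 5000/2507/2493 counts.

```python
import random

def split_per_category(samples_by_category, train_frac=0.5, val_frac=0.25, seed=0):
    """Shuffle each category separately and cut it into train/val/test parts."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for category, samples in sorted(samples_by_category.items()):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        n_train = round(len(shuffled) * train_frac)
        n_val = round(len(shuffled) * val_frac)
        splits["train"] += [(s, category) for s in shuffled[:n_train]]
        splits["val"] += [(s, category) for s in shuffled[n_train:n_train + n_val]]
        # Whatever remains after the first two cuts becomes test data.
        splits["test"] += [(s, category) for s in shuffled[n_train + n_val:]]
    return splits

# Hypothetical file names and category sizes.
data = {
    "airport": [f"airport_{i}.jpg" for i in range(360)],
    "beach": [f"beach_{i}.jpg" for i in range(400)],
}
splits = split_per_category(data)
print({name: len(items) for name, items in splits.items()})
```

Splitting inside each category keeps the class proportions of the three parts close to those of the full dataset.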
In the experiments, overall accuracy (OA), kappa coefficient (K) [80], precision (P), recall (R) and F1-score (F1) on the test dataset were adopted as the accuracy assessment metrics. By introducing the confusion matrix, a table with two rows and two columns that reports the number of true positives (tp), false positives (fp), false negatives (fn), and true negatives (tn), we can define these assessment metrics as follows. Overall accuracy (OA) is defined as the number of correctly predicted images divided by the total number of predicted images:

$$OA = \frac{N_{true}}{N}$$

where $N_{true}$ denotes the number of correctly predicted images and N stands for the total number of predicted images. Precision (P) and recall (R) for one-class classification are then defined as:

$$P = \frac{tp}{tp + fp}, \qquad R = \frac{tp}{tp + fn}$$

and the F1-score is the harmonic mean of precision and recall, which can be calculated by:

$$F1_{oneclass} = \frac{2 \cdot P \cdot R}{P + R}$$

In the multi-class situation, we use the weighted F1-score as our metric:

$$F1 = \sum_{k} w(k) \, F1^{k}_{oneclass}$$

where the weight w(k) is determined by the number of samples of each category, and $F1^{k}_{oneclass}$ is calculated correspondingly for each category. The kappa coefficient (K) is another metric, which measures inter-rater agreement for categorical items. In the multi-class situation, K is defined as:

$$K = \frac{OA - P_e}{1 - P_e}$$

where $P_e$ is the hypothetical probability of chance agreement, calculated as:

$$P_e = \frac{1}{N^2} \sum_{k} n_{k1} n_{k2}$$

where the sum runs over the k categories and $n_{ki}$ is the number of times rater i predicted category k (here the two raters are the ground truth and the classifier).
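The metrics defined above can be computed from a multi-class confusion matrix as in the following sketch. It assumes integer class labels, treats the ground truth and the classifier as the two raters of the kappa coefficient, and takes the weight of each class to be its share of the true samples; the function names and the toy labels are hypothetical.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def scene_metrics(y_true, y_pred, n_classes):
    """Overall accuracy, kappa coefficient and weighted F1-score."""
    cm = confusion_matrix(y_true, y_pred, n_classes)
    n = len(y_true)
    true_counts = [sum(row) for row in cm]
    pred_counts = [sum(cm[r][c] for r in range(n_classes)) for c in range(n_classes)]

    oa = sum(cm[k][k] for k in range(n_classes)) / n

    weighted_f1 = 0.0
    for k in range(n_classes):
        tp = cm[k][k]
        fp = pred_counts[k] - tp
        fn = true_counts[k] - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        weighted_f1 += true_counts[k] / n * f1  # weight = share of true samples

    # Chance agreement between the ground truth and the predictions.
    p_e = sum(true_counts[k] * pred_counts[k] for k in range(n_classes)) / n ** 2
    kappa = (oa - p_e) / (1 - p_e)
    return {"OA": oa, "K": kappa, "F1": weighted_f1}

# Toy example with three classes and six test images.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(scene_metrics(y_true, y_pred, n_classes=3))
```

In this toy example four of the six images are classified correctly, so OA is 2/3, while the chance agreement of 1/3 brings the kappa coefficient down to 0.5.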

Structure of Networks and Direct Training
We chose the classic 101-layer deep residual network (ResNet-101) [48] as the teacher model (the structure of ResNet-101 can be found at https://github.com/KaimingHe/deep-residual-networks and http://ethereon.github.io/netscope/#/gist/b21e2aae116dc1ac7b50). For comparison, we designed a shallow and simple CNN with four convolutional layers and only one fully connected layer. To prevent over-fitting, we added a Dropout layer [81] between the convolutional layers and the fully connected layer. The specific structure of the student model is detailed in Appendix A.1.
Both models were trained by the common back-propagation (BP) algorithm [82] with a batch size of 24. We adopted Adadelta [83] as the weight-updating optimization method. Each model was trained for 100 epochs. All the experiments on the AID dataset were run on a desktop PC with an Intel Core i7 6700K (4C8T), 32 GB RAM and an Nvidia GeForce GTX1080 Ti (11,264 MB memory).
After each training epoch, the validation OA (VOA) was recorded, and the model that achieved the highest VOA was used to make the final accuracy assessment on the test dataset. In our experiments, we used data augmentation policies (random scaling, random rotation, and random vertical and horizontal flipping) for generalization purposes. The items in bold in each table indicate the optimum results. In addition to the accuracy assessment metrics, we also recorded the FPS (frames per second) values in Table 3.
There were three key findings, as shown in Table 3.
The student model learned more knowledge from the teacher model via a higher-temperature softmax output.
Unlike T, the effect of the λ parameter is not clear. When T is 1, 5, 50, or 100, the larger λ achieved the better OA. However, when T is set to 10 or 20, the smaller λ value performed better. To further analyze λ, we drew four subfigures in Figure 5 to demonstrate the relationship between our four metrics and the temperature T. The curves show that the trend of VOA is similar to that of the other metrics for both λ = 0.1 and λ = 1 on the AID dataset.
From a macro point of view, KD training could improve the performance of a network model in terms of OA, K and F1. In the test data evaluation, it even surpassed the deep teacher model, with 60% higher speed and only 2.4% of the model parameters.

KD Training on Small Dataset
If a student model can learn knowledge from a teacher model via KD training on the complete dataset, it should also be able to do so on a small part of the dataset. To verify this idea, we carried out extra experiments with KD training on a small part of the AID dataset. We took 20% of the training samples (1000 images) and 20% of the validation data (507 images) of the original AID dataset and evaluated accuracy on the complete test dataset.
The results are shown in Table 4. As shown, the small dataset led to a shorter training time (which includes validation time) but poorer training results. However, KD training was still better than direct training under such conditions. In addition, when we decreased λ, the weight of the first term in the loss function (Equation (8)), the generalizability of the student model was greatly enhanced.
From the results in Table 5, in the absence of "airport" samples, the KD training process still achieved a better F1-score than direct training on the complete dataset. From the point of view of the distilled model, it had never seen an "airport" before. Nevertheless, it obtained high precision (0.913) on this class, although the recall was very low (0.2333). When we continued to fine-tune the student model for only 10 epochs on the unlabeled data (2493 images with 90 airports; all labels were removed when fine-tuning) via KD, the F1-score of the airport class increased from 0.3717 to 0.7391 and the average F1-score increased by approximately 1.7%.

The Relationship between the Optimal Temperature and the Number of Categories
In the KD training process, the temperature (T) is a significant factor. When T = 1, the output probability vector tends to a one-hot vector. However, as T approaches infinity, the output probabilities of all categories tend to the same value, which makes it hard for the student model to learn from the teacher model. Intuitively, the more categories there are, the more information the probability vector in the model output contains. Therefore, we speculate that the optimal temperature for KD training is negatively related to the number of categories.
To evaluate the relationship between the optimal temperature and the number of categories, we conducted extra KD training experiments on three subsets of the AID dataset. These three AID subsets contain 25, 20, and 15 categories, respectively, and were constructed by removing 5, 10, and 15 categories from the original AID dataset. In these experiments, λ is a constant with a value of 0.1 and the range of the temperature T is [1, 100]. The student model and the teacher model are the same as before, except for their output softmax layers (the output of the last softmax layer has the same size as the number of categories).
The results of these experiments are shown in Table 6. According to the metrics (OA, K, and F1-score), the optimal temperature for KD training is 10 for the complete AID dataset, 50 for the 25-category subset, and 100 for the 20-category and 15-category subsets. The optimal T thus increased as the number of categories decreased on the AID dataset, which supports our speculation.
We detailed a series of experiments to evaluate the KD training framework on the AID dataset. As shown by the results, KD achieved a higher OA than direct training and ML training, and reached its optimum performance when T = 10 and λ = 0.1. We then ran the same experiments on only 20% of the training data in the AID dataset; KD training was still better than direct training. From these results, we infer that our framework can still distill knowledge of a certain category even when there is no sample from that category. In addition, the student model obtained better results through continued fine-tuning on unlabeled samples via unsupervised KD training. This further verifies that the knowledge of the probability distribution within classes can be effectively distilled. Moreover, we found that the optimal temperature for KD training is negatively related to the number of categories on the AID dataset.

Additional Experiments
To evaluate the effectiveness and applicability of KD framework, we conducted additional experiments on three different datasets of remote sensing scenes: UCMerced dataset, NWPU-RESISC dataset and EuroSAT dataset.

Experiments on UCMerced Dataset
The UCMerced dataset is a widely-used remotely sensed image scene dataset, which consists of a total of only 2100 image patches each of size 256 × 256 with a ground sample distance (GSD) of 0.3 m and covering 21 land-cover classes [22].Images in UCMerced dataset were extracted from the USGS National Map Urban Area Imagery collection for various urban areas around the country.We split UCMerced dataset into three parts: 1050 images for training, 525 images for validating and 525 images for testing, respectively.Due to a relatively small number of training samples, we adopt two four-convolutional-layer CNN models as the teacher model and the student model to avoid overfitting.The specific structures of two models are shown in Appendix A.2.The other training settings are the same as the AID experiments discussed in Section 3.1.
As we can see from Table 7, KD training appears to be effective on the UCMerced dataset, as almost all KD experiment results are better than the direct training result of the student model. According to all the metrics, KD training achieves its optimal performance when T = 100 and λ = 1. Similar to our experimental results on the small AID dataset, the KD method is still applicable to small datasets.
On the NWPU-RESISC dataset (Table 8), the performances of KD training with different settings of T and λ are close; it performs best when T = 5 and λ = 0.1. Unlike the datasets with fewer categories, NWPU-RESISC contains 45 classes, including land-use and land-cover classes, and is challenging because of its high within-class diversity and between-class similarity. The KD method proves to be effective on the NWPU-RESISC dataset with 45 categories.
In the EuroSAT experiments, we used only the red, green, and blue spectral bands as the input of the networks. The teacher model and the student model (Appendix A.1) are the same as in the experiments on the AID dataset, except for the output layers.
Table 9 shows the results of KD training experiments on EuroSAT dataset.All of these models easily achieve outstanding OA over 90%, and KD training model reaches optimum when T = 100 and λ = 1.Compared with datasets such as AID and NWPU-RESISC, KD methods still seem to work well on EuroSAT dataset that contains smaller-patch-size and lower-spatial-resolution images.

Discussions
By conducting additional experiments on three other remote sensing image datasets with different settings of T and λ, we evaluated the effectiveness of the KD method and found that it yields different results on different datasets. In general, the KD method is applicable to datasets with lower spatial resolution images, numerous categories, and fewer training samples.

Conclusions
In this work, we introduced a knowledge distillation framework to improve the performance of small neural networks in the field of remote sensing scene image classification. The core concept behind this framework is to let the small model learn the categorical probability distribution from the pre-trained, well-performing cumbersome model by matching the high-temperature softmax output. To evaluate this framework, we conducted several experiments on the AID dataset. The experimental results showed that the KD framework was effective and increased overall accuracy (3% in AID experiments, 5% in UCMerced experiments, 1% in NWPU-RESISC and EuroSAT experiments) for small models, and that knowledge could be well distilled via high-temperature softmax. In the experiments on the AID dataset, we also found that the KD training framework helped to train networks on small or unbalanced datasets. In addition, based on the experimental results on the AID dataset, we tentatively concluded that if the dataset contains fewer categories, the KD framework needs a larger temperature value T to achieve better results. Moreover, to test the effectiveness and applicability of our framework, we conducted experiments on three different datasets of remote sensing scenes: the UCMerced dataset, the NWPU-RESISC dataset and the EuroSAT dataset. The results of these additional experiments show that the KD method can improve the performance of small network models on datasets with lower spatial resolution images, numerous categories, and fewer training samples.
In the future, we plan to investigate how to integrate the KD framework with other model compression methods. Another interesting opportunity for future work is to apply the KD framework to other fields of remote sensing, such as semantic segmentation and object detection.

Appendix A. The Model Structures Used in This Paper
Appendix A.1. The Student Models in Experiments on AID, NWPU-RESISC, and EuroSAT Datasets
The structure of the student CNN model in the experiments on the AID dataset is shown in Table A1; it has 1,012,830 parameters. It contains four convolutional layers, max-pooling layers, a dropout layer and a fully-connected layer. The student models in the experiments on the NWPU-RESISC and EuroSAT datasets are the same, except for the last softmax layer. The output of the last softmax layer has the same size as the number of categories.
The structures of the teacher model and the student model in the experiments on the UCMerced dataset are shown in Tables A2 and A3, respectively. The teacher model flattens the 2D feature maps into 1D features by fully-connected layers, while the student model exploits the Global Average Pooling (GAP) policy [84] to save parameters.
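The parameter saving from Global Average Pooling can be illustrated with a simple count. The feature-map size below is hypothetical (the actual layer shapes are given in Tables A2 and A3), and the sketch compares only a single classification layer placed after flattening with the same layer placed after GAP; the 21 output classes correspond to the UCMerced dataset.

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus biases of one fully-connected layer."""
    return n_inputs * n_outputs + n_outputs

# Hypothetical final feature map: 7 x 7 spatial positions with 128 channels.
height, width, channels = 7, 7, 128
n_classes = 21  # number of UCMerced categories

# Flattening keeps every spatial position as a separate input feature.
flatten_head = dense_params(height * width * channels, n_classes)
# GAP averages each channel over all positions, leaving one feature per channel.
gap_head = dense_params(channels, n_classes)

print("flatten + dense:", flatten_head)
print("GAP + dense:    ", gap_head)
```

With these assumed shapes, the GAP head needs about 2% of the parameters of the flattened head, because the number of inputs to the classifier no longer grows with the spatial size of the feature map.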

Figure 1. An example of categorical probability distributions of high-temperature softmax output at different temperatures. C1∼C6 denote six categories. The input logit data are listed in the last row of Table 1. When T = 1 (normal softmax), the probability of C2 tends to 1 and the others tend to 0. As T increases, the categorical distribution becomes more uniform.

Figure 2. (a) Training the teacher model directly on the dataset; (b) The process of KD training. The student model has two output branches: the high-temperature softmax output distills knowledge from the teacher model, and the normal softmax output learns to match the ground truth label; (c) In prediction mode or in a production environment, the trained student model outputs only the normal softmax result.

Data augmentation policies:
• random scaling in the range [0.8, 1.2];
• random rotation by [−30, 30] degrees;
• random vertical and horizontal flipping.

KD Training and Results
To analyze the knowledge distillation (KD) methods discussed in Section 2, we conducted a series of experiments by training the student model via KD with different T (1, 5, 10, 20, 50 and 100) and λ (1 and 0.1) parameters. For comparative purposes, we also trained the student model on the training dataset by direct training and by matching logits (ML) training. The complete results are listed in Table 3.

Figure 4. The training curves of the student model in AID experiments. The training curves show that direct training (blue curves) leads to faster convergence but falls into local optima after 50 epochs, while KD training (orange curves) continues to reduce the loss. (a) Validation loss curve in a logarithmic coordinate system; (b) validation OA curve.

Figure 5. KD training results for different temperatures and λ on the AID dataset. The four subfigures demonstrate the relationship between the four metrics and the temperature T, according to Table 3. The curves show that the trend of VOA is similar to that of the other metrics for both λ = 0.1 and λ = 1 on the AID dataset.
KD training (T = 10, λ = 0.1) 0.7298 0.6554 0.6432 0.6567
KD training (T = 10, λ = 1.0) 0.7416 0.6221 0.6089 0.6234

Remove One Category
Instead of training on a small dataset, in this experiment we removed all image samples of the "airport" category from the training data and the validation data. The test results after 100 epochs of KD training (T = 10 and λ = 0.1) are shown in Table 5.

Table 1. An example of high-temperature softmax output at different temperatures.

Table 2. General information of datasets.

Table 3. Results of Knowledge Distillation training experiments on AID dataset.

Table 4. Results of training the student model on part of AID dataset.

Table 5. Results of Knowledge Distillation training experiments on AID dataset without Airports.

Table 6. Results of Knowledge Distillation training experiments with different numbers of categories and temperatures on AID dataset.

Table 7. Results of Knowledge Distillation training experiments on UCMerced dataset.

Experiments on NWPU-RESISC Dataset
NWPU-RESISC is a large-scale benchmark dataset for remote sensing scene classification, covering 45 scene classes with 700 images in each class. Images within each class have a size of 256 × 256 pixels in the red-green-blue (RGB) color space, while the spatial resolution of those images varies from about 30 to 0.2 m per pixel. The 31,500 images, extracted by experts from Google Earth, cover more than 100 countries and regions all around the world. Of these, 15,750 are training samples, 7875 are validation samples, and the remaining 7875 images are test samples. The teacher model and the student model are the same as in the experiments on the AID dataset, except for the output layers. Results of KD training experiments on the NWPU-RESISC dataset are recorded in Table 8.

Table 8. Results of Knowledge Distillation training experiments on NWPU-RESISC dataset.

Experiments on EuroSAT Dataset
The EuroSAT dataset is another widely-used scene classification dataset, based on medium-resolution satellite images covering 13 different spectral bands and consisting of 10 different land-use and land-cover classes. It contains 27,000 images; each image patch measures 64 × 64 pixels, with a ground sample distance (GSD) varying from 10 to 60 m. The data in the EuroSAT dataset are gathered from satellite images of cities in over 30 European countries. As in the previous experiments, the EuroSAT dataset was split into three parts: 16,200 images for training, 5400 images for validation and 5400 images for testing.

Table 9. Results of Knowledge Distillation training experiments on EuroSAT dataset.

Table A1. The structure of the student CNN model in experiments on AID dataset.

Appendix A.2. The Teacher Model and the Student Model in Experiments on UCMerced Dataset

Table A2. The structure of the teacher CNN model in experiments on UCMerced dataset.

Table A3. The structure of the student CNN model in experiments on UCMerced dataset.