Article

Training Small Networks for Scene Classification of Remote Sensing Images via Knowledge Distillation

State Key Laboratory of Information Engineering in Surveying Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2018, 10(5), 719; https://doi.org/10.3390/rs10050719
Received: 3 April 2018 / Revised: 29 April 2018 / Accepted: 3 May 2018 / Published: 7 May 2018

Abstract

Scene classification, which aims to identify the land-cover categories of remotely sensed image patches, is now a fundamental task in remote sensing image analysis. Deep-learning-based algorithms are widely applied in scene classification and achieve remarkable performance, but these high-level methods are computationally expensive and time-consuming. Consequently, in this paper we introduce a knowledge distillation framework, currently a mainstream model compression method, into remote sensing scene classification to improve the performance of smaller and shallower network models. Our knowledge distillation training method makes the high-temperature softmax output of a small and shallow student model match that of the large and deep teacher model. In our experiments, we evaluate the knowledge distillation training method for remote sensing scene classification on four public datasets: the AID, UCMerced, NWPU-RESISC, and EuroSAT datasets. Results show that our proposed training method was effective and increased overall accuracy (3% in AID experiments, 5% in UCMerced experiments, 1% in NWPU-RESISC and EuroSAT experiments) for small and shallow models. We further explored the performance of the student model on small and unbalanced datasets. Our findings indicate that knowledge distillation can improve the performance of small network models on datasets with lower spatial resolution images, numerous categories, as well as fewer training samples.
Keywords: knowledge distillation; scene classification; convolutional neural networks (CNNs); remote sensing; deep learning

1. Introduction

With the rapid development of remote sensing (RS) techniques, a large number of algorithms have been proposed to automatically process massive earth observation data. Scene classification is one of the fundamental procedures of RS image analysis and is important in many RS applications, such as land-use/land-cover (LULC) [1,2,3,4], agriculture [5,6,7,8], forestry [9,10], and hydrology [11].
The core task of scene classification is to identify the land-cover type of remotely sensed image patches automatically. Numerous supervised machine learning algorithms have been used in scene classification. These algorithms can be categorized into three types [12]: low-level, mid-level and high-level methods. Low-level methods first extract low-level hand-crafted features, including SIFT (scale invariant feature transform) [13,14], HOG (histogram of oriented gradients) [15], structural texture similarity [16], LBP (local binary patterns) [17], Gabor descriptors [18], etc. The extracted features are then used to train shallow classifiers such as Support Vector Machines (SVMs) [19] or K-Nearest Neighbor algorithms (KNNs) [20] to identify the categories of scene images. These scene classification methods based on low-level features are efficient on some structures and arrangements, but cannot easily depict the highly diverse and non-homogeneous spatial distributions in images [21].
Mid-level methods build scene representations by coding low-level local attributes. The bag-of-visual-words (BoVW) model is one of the most frequently used approaches [22]. To improve the classification accuracy, various low-level local descriptors are combined as complementary features in the standard BoVW model [23], Gaussian mixture model (GMM) [24], and other pyramid-based models [25,26,27] for scene classification. In addition, topic models are introduced to combine visual semantic information to encode higher order spatial information between low-level local visual words [28,29,30,31,32].
High-level methods are based on deep learning (DL) models. DL models achieve state-of-the-art results in image recognition, speech recognition, semantic segmentation, object detection, natural language processing [33,34,35,36], and RS image scene classification.
Many classic DL models in the field of computer vision (CV) have been shown to be effective in RS scene classification [12,37,38,39,40,41,42,43,44]. Most are based on deep convolutional neural network (CNN) models, such as AlexNet [45], CaffeNet [46], VGGNet [47], deep residual networks (ResNet) [48] and DenseNets [49]. Among these approaches, CNN-based high-level models achieve state-of-the-art results for scene classification tasks in remote sensing [50]. They can deal with more complex scenes and achieve higher overall accuracy by learning deep visual features from large training datasets, in contrast to shallow models and low-level methods that rely on manual feature extraction [51].
However, deep CNNs contain more parameters to train, and therefore require more computational resources and time for training and prediction. For example, a 102-convolutional-layer CNN model, which contains 42.4 M parameters, takes 14 ms to classify a 224 × 224 × 3 scene image, while a simple 4-convolutional-layer CNN model takes 8.77 ms and contains only 1 M parameters, as detailed in Section 3.1 of this paper. Such time and storage costs are unacceptable in special situations, such as on embedded devices [52,53,54] or during on-orbit processing [55]. In contrast, a small and shallow model is fast and uses little space, but will not yield accurate and precise results when trained directly on ground truth data [33].
Under these circumstances, model compression techniques become imperative. Generalized model compression improves the performance of a shallow and fast model by learning from a cumbersome but better-performing model, or by simplifying the structure of the cumbersome network. There are three mainstream types of model compression algorithms: network pruning, network quantization and Teacher-Student Training (TST). Network pruning reduces the size of a network by removing neurons or weights that are less important according to a certain criterion [52,56,57,58], while network quantization attempts to reduce the precision of weights or features [59,60,61]. In contrast, TST methods impart knowledge from a teacher model to a student model by learning the distributions or outputs of some specific layers [62,63,64,65,66,67,68,69,70].
TST is easily confused with transfer learning. In transfer learning, we first train a base model on a certain dataset and task, and then transfer the learned features to another target network to be trained on a different target dataset and task [71,72]. A common use of transfer learning in the field of remote sensing is to fine-tune an ImageNet-pretrained model on a remote sensing dataset [2,40,73]. In a TST process, however, the teacher and student models are trained on the same dataset.
Knowledge distillation (KD) is one kind of TST method, first defined in [63]. In that paper, the authors distill knowledge from an ensemble of models into a single smaller model via high-temperature softmax training. In this paper, we introduce KD into remote sensing scene classification for the first time to improve the performance of small and shallow network models. We conduct experiments on several public datasets to verify the effectiveness of KD and carry out a quantitative analysis to explore the optimum parameter settings. We also discuss the performance of KD on different types of datasets. In addition, we test whether knowledge can still be distilled when a certain type of training sample is absent or when training data are insufficient. For convenience, we simplify the KD training process by learning from only one cumbersome model. The rest of this paper is organized as follows. Section 2 describes the teacher-student training method and our knowledge distillation framework. In Section 3, results and analysis of experiments on several datasets are detailed. Our conclusion and future work are discussed in Section 4.

2. Method

Knowledge distillation (KD), a special case of teacher-student training (TST) in which the student imitates the high-temperature softmax output of the cumbersome teacher model, serves as our training framework. In this section, we first describe different TST methods and then introduce the KD training method adopted in our research.

2.1. Teacher-Student Training

TST is one of the mainstream model compression methods. In TST, a cumbersome pre-trained model is regarded as the teacher, and an untrained small and shallow model is the student. The student model not only learns the hard target from the ground truth data, but also matches its output to the output of the teacher model. This is because the output of a softmax layer (soft target) contains more information than a one-hot label (hard target).
A general TST process first trains the cumbersome model directly on the dataset and then trains the student model by minimizing the following loss function using the mini-batch stochastic gradient descent (MSGD) [74] method:
$$L_{TST}(X) = \lambda \cdot H_1(S(X), Y_{GT}) + H_2(S^*(X), T^*(X)) \tag{1}$$
where $X$ denotes a batch of input data, $S(X)$ is the softmax output of the student model, $Y_{GT}$ is the ground truth label corresponding to the input $X$, $S^*(X)$ and $T^*(X)$ are the outputs of a certain layer in the student model and the teacher model, respectively, and $\lambda$ is a non-negative constant. The first term in this loss function is a ground truth constraint. If $\lambda = 0$, the supervised information is provided only by the teacher model, instead of the ground truth data. As $\lambda$ increases, the ground truth term carries more weight; as $\lambda$ decreases, the output probability distribution of the trained student model becomes more like that of the teacher model. $H_1$ and $H_2$ in Equation (1) can be any common loss functions, such as mean squared error (MSE) or categorical cross-entropy (CE):
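$$MSE(x, y) = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{d} \left( x_{ik} - y_{ik} \right)^2 \tag{2}$$
$$CE(x, y) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{d} \left[ x_{ik} \log y_{ik} + (1 - x_{ik}) \log (1 - y_{ik}) \right] \tag{3}$$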
where $m$ denotes the batch size, $d$ denotes the size of the input vector $x$, and $x_{ik}$ indicates the $k$th element of the $i$th input sample.
The simplest, naive TST method matches the output probability distribution of the last softmax layer (MS) in the student model to that of the teacher model. In this case, $S^*(X) = S(X)$ and $T^*(X) = T(X)$ is the softmax output of the teacher model. Thus, the loss function of the MS mode can be defined as:
$$L_{MS}(X) = \lambda \cdot CE(S(X), Y_{GT}) + CE(S(X), T(X)) \tag{4}$$
To better impart knowledge from teacher to student, Bucilua, C., et al. [62] proposed the matching logits (ML) method. Logits are the inputs to the final softmax layer of a network. The discrepancy among categories is more significant in logit form than in probability form. Because a logit can be any real number, the CE in the second term of the loss function is replaced by MSE. The loss function in ML mode becomes:
$$L_{ML}(X) = \lambda \cdot CE(S(X), Y_{GT}) + MSE(S_{logits}(X), T_{logits}(X)) \tag{5}$$
where $S_{logits}(X)$ and $T_{logits}(X)$ are the logits of the student model and the teacher model, respectively. In that work, the authors conducted several experiments and verified that the ML mode could improve small models by learning from model ensembles in simple machine learning tasks.
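As a rough illustration of how the MS and ML losses differ, the following pure-Python sketch evaluates both on a single toy sample. The function names, the three-class logits, and the use of the standard categorical cross-entropy (with the ground truth or the teacher output as the target distribution) are assumptions made for illustration; this is not the authors' implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(pred, target, eps=1e-12):
    """Categorical cross-entropy -sum(target * log(pred)) for one sample."""
    return -sum(t * math.log(p + eps) for p, t in zip(pred, target))

def mse(a, b):
    """Sum of squared differences for one sample."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ms_loss(student_logits, teacher_logits, y_true, lam):
    """Matching softmax: lam * CE(student, truth) + CE(student, teacher)."""
    s = softmax(student_logits)
    t = softmax(teacher_logits)
    return lam * cross_entropy(s, y_true) + cross_entropy(s, t)

def ml_loss(student_logits, teacher_logits, y_true, lam):
    """Matching logits: lam * CE(student, truth) + MSE(student logits, teacher logits)."""
    s = softmax(student_logits)
    return lam * cross_entropy(s, y_true) + mse(student_logits, teacher_logits)

# Hypothetical three-class sample (values chosen only for illustration).
student_logits = [1.0, 2.0, 0.5]
teacher_logits = [0.5, 4.0, 0.0]
y_true = [0.0, 1.0, 0.0]

loss_ms = ms_loss(student_logits, teacher_logits, y_true, lam=1.0)
loss_ml = ml_loss(student_logits, teacher_logits, y_true, lam=1.0)
# With lam = 0 and identical logits, the ML loss vanishes.
loss_ml_match = ml_loss(teacher_logits, teacher_logits, y_true, lam=0.0)
print(loss_ms, loss_ml, loss_ml_match)
```

Because the second term of the ML loss is computed on unbounded logits, it can dominate the ground truth term, whereas both terms of the MS loss are computed on probabilities.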
In addition to the methods mentioned above, researchers have proposed other approaches to TST. Huang, Z., et al. [69] proposed matching the distributions of neuron selectivity patterns (NST) between two networks, designing a new loss function that minimizes the maximum mean discrepancy (MMD) between these distributions. In [70], the authors compressed wide and shallow networks into thin but deeper networks, the FitNet, by learning intermediate-level hints from the teacher’s hidden layers. Yim [68] transfers knowledge distilled from the flow between layers, which is computed as the inner product between features and represented as the FSP matrix. Different from the previous methods, Chen, T., et al. [64] accelerate the training of a larger network by transferring knowledge from an existing network to the new larger network. Lopez-Paz, D., et al. [66] combine knowledge distillation with privileged information [75], yielding generalized distillation.

2.2. Knowledge Distillation Framework

2.2.1. Knowledge from Probability Distribution

The last output layer of a neural network is the softmax layer, which transforms the logits $z_i$ into probabilities $p_i$ via:
$$p_i = softmax(z_i) = \frac{\exp(z_i)}{\sum_j \exp(z_j)} \tag{6}$$
However, the normal softmax output tends to approximate a one-hot vector. An example is shown in Table 1 and Figure 1: with the normal softmax (temperature = 1), the probability of class C2 tends to one while the others tend to zero, and the entropy of the output also tends to zero. In practice, a remote sensing scene image consists of several categories of pixels. Thus, information for non-maximum probability categories provides additional supervision for training.
To improve the discrimination ability and generalizability of a model, Hinton, G., et al. [63] introduced the high-temperature softmax function in place of the normal softmax or logits. The high-temperature softmax function was first used in the field of reinforcement learning [76], and is denoted as:
$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} \tag{7}$$
where $T$ is the temperature. The normal softmax is the special case $T = 1$.
Softmax with a high temperature increases the entropy of the categorical vector, which helps the student learn more knowledge from the probability distribution of a complex scene. An example of the categorical probability distributions of high-temperature softmax output at different temperatures is shown in Table 1 and Figure 1. C1∼C6 denote six categories, and the example input logits are listed in the last row of the table. When $T = 1$ (normal softmax), the probability of C2 tends to 1 and the others tend to 0. As $T$ increases, the entropy of the categorical distribution increases. It can be inferred from this example that if the student model learns the high-temperature softmax output of the teacher model, it will distill more knowledge of the categorical probability distribution. The experiments in Section 3 will verify this inference.
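The effect of temperature on entropy can be reproduced numerically. The sketch below applies the temperature softmax to six hypothetical logits (they are not the values of Table 1, and the function names are illustrative) and reports the largest probability and the entropy at several temperatures.

```python
import math

def softmax_t(logits, temperature=1.0):
    """Temperature softmax: p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical logits for six classes C1..C6; C2 has the largest logit.
logits = [2.0, 10.0, 1.0, 4.0, 0.5, 3.0]

entropies = {}
for t in (1, 5, 10, 20, 50, 100):
    probs = softmax_t(logits, t)
    entropies[t] = entropy(probs)
    print(f"T={t}: max p = {max(probs):.4f}, entropy = {entropies[t]:.4f}")
```

The class ranking is unchanged by the temperature; only the sharpness of the distribution changes, and the entropy stays below its upper bound of log 6 for any finite temperature.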

2.2.2. KD Training Process

By introducing the high-temperature softmax described in the previous subsection into our framework, we divide the whole training process of knowledge distillation into two procedures. First, the teacher model is trained directly on the dataset, as shown in Figure 2a. The target is to let $T(X)$, the softmax output of the teacher model, fit the ground truth $Y_{GT}$. Then, the knowledge is distilled via the high-temperature softmax, as shown in Figure 2b. The student model has two output branches: the high-temperature softmax output distills knowledge from the teacher model, and the normal softmax output learns to match the ground truth label. Thus, the total loss of the KD process $L_{KD}(X)$ is:
$$L_{KD}(X) = \lambda \cdot CE(S(X), Y_{GT}) + CE(S_T(X), T_T(X)) \tag{8}$$
where $S_T(X)$ and $T_T(X)$ denote the $T$-temperature softmax outputs of the student model and the teacher model, respectively. In extreme conditions, such as a lack of training samples, the teacher’s high-temperature output can even provide supervision to the student model without any ground truth data (by setting $\lambda = 0$). In a prediction or production environment, the trained student model only outputs the normal softmax result, as shown in Figure 2c.
As the higher-temperature softmax output from the teacher model contains different information than the ground truth dataset, our KD framework provides the student model with more categorical information in scene classification tasks.
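The two-branch KD loss can be sketched for a single sample as follows. This is a minimal pure-Python illustration under stated assumptions: the names `kd_loss` and `softmax_t`, the toy logits, and the choice of the standard categorical cross-entropy with the teacher's soft output as the target are all illustrative, and no gradient step is shown.

```python
import math

def softmax_t(logits, temperature=1.0):
    """Temperature softmax; temperature = 1 gives the normal softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(pred, target, eps=1e-12):
    """Categorical cross-entropy -sum(target * log(pred)) for one sample."""
    return -sum(t * math.log(p + eps) for p, t in zip(pred, target))

def kd_loss(student_logits, teacher_logits, y_true, lam, temperature):
    """lam * CE(normal student softmax, truth) + CE(soft student, soft teacher)."""
    s_normal = softmax_t(student_logits, 1.0)        # branch matched to labels
    s_soft = softmax_t(student_logits, temperature)  # branch matched to teacher
    t_soft = softmax_t(teacher_logits, temperature)
    hard_term = cross_entropy(s_normal, y_true)
    soft_term = cross_entropy(s_soft, t_soft)
    return lam * hard_term + soft_term

# Hypothetical three-class sample.
student_logits = [1.0, 2.0, 0.5]
teacher_logits = [0.5, 4.0, 0.0]
y_true = [0.0, 1.0, 0.0]

loss_supervised = kd_loss(student_logits, teacher_logits, y_true, lam=0.1, temperature=10.0)
# lam = 0: the label is ignored and only the teacher supervises the student.
loss_unsupervised = kd_loss(student_logits, teacher_logits, y_true, lam=0.0, temperature=10.0)
print(loss_supervised, loss_unsupervised)
```

With `lam=0.0` the ground truth label has no influence on the loss, which is the setting that allows training on unlabeled images.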

3. Experimental Results and Analysis

In this section, to test the performance of our distillation framework, we conducted several experiments on four remote sensing scene classification datasets: AID dataset [51] (Figure 3a), UCMerced dataset [22] (Figure 3b), NWPU-RESISC dataset [77] (Figure 3c) and EuroSAT dataset [2] (Figure 3d). The general information of each dataset is listed in Table 2.
For each dataset, we first trained a large deep network model and a small shallow one by direct training. Then we trained the small model with our proposed KD method. For comparison, other model compression methods, including ML, were also evaluated, and we analyzed the experimental results in detail. The implementation of the framework was based on Keras 2.1.1 [78] and TensorFlow 1.4.0 [79].

3.1. Experiments on AID Dataset

To evaluate the performance and robustness of our proposed KD framework for remote sensing image scene classification, we first designed and conducted several experiments on the AID dataset.

3.1.1. Dataset Description

The AID dataset is a large-scale public dataset for aerial scene classification, provided by [51]. It contains 10,000 manually labeled remote sensing scene images from around the world. All images in the AID dataset were collected from Google Earth (https://www.google.com/earth/). Each image is 600 × 600 pixels with three spectral bands (RGB). The task in our experiments was to classify all scene images into thirty categories. The specific categories are: airport, bare land, baseball field, beach, bridge, center, church, commercial, dense residential, desert, farmland, forest, industrial, meadow, medium residential, mountain, park, parking, playground, pond, port, railway station, resort, river, school, sparse residential, square, stadium, storage tanks, and viaduct.
The dataset was divided into three parts for each category: training data (around 50%), validation data (around 25%) and test data (around 25%). To reduce the computational cost, all images were re-sampled from 600 × 600 × 3 to 224 × 224 × 3 pixels. In total, there were 5000 images in the training dataset, 2507 in the validation dataset and 2493 in the test dataset.
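A per-category split of this kind can be sketched as below. The function name, the fixed random seed, the rounding rule and the toy file names are assumptions for illustration; the exact procedure used to split AID is not specified beyond the approximate proportions.

```python
import random

def split_per_category(samples_by_category, train_frac=0.5, val_frac=0.25, seed=0):
    """Shuffle each category separately and split it into train/val/test parts."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for category, samples in sorted(samples_by_category.items()):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        n_train = round(len(shuffled) * train_frac)
        n_val = round(len(shuffled) * val_frac)
        train += [(s, category) for s in shuffled[:n_train]]
        val += [(s, category) for s in shuffled[n_train:n_train + n_val]]
        test += [(s, category) for s in shuffled[n_train + n_val:]]
    return train, val, test

# Toy example with hypothetical file names.
toy = {
    "airport": [f"airport_{i}.jpg" for i in range(8)],
    "beach": [f"beach_{i}.jpg" for i in range(12)],
}
train, val, test = split_per_category(toy)
print(len(train), len(val), len(test))
```

Splitting within each category keeps the class proportions roughly equal across the three parts, which matters when some categories have fewer images than others.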
In the experiments, overall accuracy ($OA$), kappa coefficient ($K$) [80], precision ($P$), recall ($R$) and F1-score ($F1$) on the test dataset were adopted as the accuracy assessment metrics. By introducing the confusion matrix, a table with two rows and two columns that reports the number of true positives ($tp$), false positives ($fp$), false negatives ($fn$), and true negatives ($tn$), we can define these assessment metrics as follows. Overall accuracy ($OA$) is defined as the number of correctly predicted images divided by the total number of predicted images:
$$OA = \frac{N_{true}}{N} \tag{9}$$
where $N_{true}$ denotes the number of correctly predicted images, and $N$ stands for the total number of predicted images. Precision ($P$) and recall ($R$) for a single class are then defined as:
$$P = \frac{tp}{tp + fp}, \quad R = \frac{tp}{tp + fn} \tag{10}$$
and the F1-score is the harmonic mean of precision and recall, which can be calculated by:
$$F1_{oneclass} = \frac{2}{1/R + 1/P} = \frac{2 \cdot P \cdot R}{P + R} \tag{11}$$
In the multi-class situation, we use the weighted F1-score as our metric:
$$F1 = \sum_{k=1}^{K} w(k) \cdot F1_{oneclass}^{k}, \quad w(k) = \frac{n_k}{N} \tag{12}$$
where the weight $w(k)$ is determined by the number of samples $n_k$ in category $k$, and $F1_{oneclass}^{k}$ is the one-class F1-score calculated for category $k$. The kappa coefficient ($K$) is another metric that measures inter-rater agreement for categorical items. In the multi-class situation, $K$ is defined as:
$$K = \frac{OA - P_e}{1 - P_e} \tag{13}$$
where $P_e$ is the hypothetical probability of chance agreement, calculated as:
$$P_e = \frac{1}{N^2} \sum_{k} n_{k1} n_{k2} \tag{14}$$
where $k$ indexes the categories, and $n_{ki}$ is the number of times rater $i$ predicted category $k$.
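The metrics above can be computed from a multi-class confusion matrix, as in the following sketch. It assumes that the two "raters" in the kappa coefficient are the ground truth labels and the classifier predictions, and that the F1 weights use the true sample counts per class; the function name and the toy matrix are illustrative.

```python
def classification_metrics(confusion):
    """Return (OA, kappa, weighted F1).

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    true_counts = [sum(row) for row in confusion]  # rater 1: ground truth
    pred_counts = [sum(confusion[i][j] for i in range(n_classes))
                   for j in range(n_classes)]      # rater 2: classifier

    oa = sum(confusion[k][k] for k in range(n_classes)) / total
    pe = sum(t * p for t, p in zip(true_counts, pred_counts)) / total ** 2
    kappa = (oa - pe) / (1 - pe)

    weighted_f1 = 0.0
    for k in range(n_classes):
        tp = confusion[k][k]
        precision = tp / pred_counts[k] if pred_counts[k] else 0.0
        recall = tp / true_counts[k] if true_counts[k] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weighted_f1 += true_counts[k] / total * f1
    return oa, kappa, weighted_f1

# Toy two-class confusion matrix (hypothetical counts).
toy_confusion = [[8, 2],
                 [1, 9]]
oa, kappa, f1 = classification_metrics(toy_confusion)
print(oa, kappa, f1)
```

For this toy matrix, 17 of 20 samples are correct, so OA is 0.85; chance agreement is 0.5, which gives a kappa of 0.7.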

3.1.2. Structure of Networks and Direct Training

We chose the classic 101-layer deep residual network (ResNet-101) [48] as the teacher model (the structure of ResNet-101 can be found at https://github.com/KaimingHe/deep-residual-networks and http://ethereon.github.io/netscope/#/gist/b21e2aae116dc1ac7b50). For comparison, we designed a shallow and simple CNN with four convolutional layers and only one fully connected layer as the student model. To prevent over-fitting, we added a Dropout layer [81] between the convolutional layers and the fully connected layer. The specific structure of the student model is detailed in Appendix A.1.
Both models were trained by the common back-propagation (BP) algorithm [82] with a batch size of 24. We adopted Adadelta [83] as the weight-updating optimization method. Each model was trained for 100 epochs. All the experiments on the AID dataset were run on a desktop PC with an Intel Core i7 6700K (4C8T), 32GB RAM and an Nvidia GeForce GTX1080 Ti (11264MB memory). After each training epoch, the validation OA (VOA) was recorded, and the model that achieved the highest VOA was used for the final accuracy assessment on the test dataset. In our experiments, we used the following data augmentation policies for generalization purposes:
  • random scaling in the range [0.8, 1.2];
  • random rotation by [−30, 30] degrees;
  • random vertical and horizontal flipping.

3.1.3. KD Training and Results

To analyze the knowledge distillation (KD) methods discussed in Section 2, we conducted a series of experiments by training the student model via KD with different T (1, 5, 10, 20, 50 and 100) and λ (1 and 0.1) parameters. For comparative purposes, we also trained the student model on the training dataset by direct training and by matching logits (ML) training. The complete results are listed in Table 3 (items in bold in each table indicate the best results). In addition to accuracy assessment metrics, we also recorded the FPS (frames per second) values in Table 3.
There were three key findings as shown in Table 3:
  • KD training is effective. It increased OA by approximately 3% compared to direct training. However, ML training did not seem to help and even reduced the OA. For further analysis, we plotted the validation loss curves and VOA curves of these three training methods, which are shown in Figure 4a,b. The training curves show that direct training converges faster (within 50 epochs) but falls into local optima, while KD training continues to reduce the loss.
  • The student model learned more knowledge from the teacher model via higher temperature softmax output. Unlike that of T, the effect of the λ parameter is not clear. When T is 1, 5, 50, or 100, the larger λ achieved better OA. However, when T is set to 10 or 20, the smaller λ performed better. To further analyze λ, we drew four subfigures in Figure 5 to demonstrate the relationship between our four metrics and the temperature T. The curves show that the trend of VOA is similar to that of the other metrics for both λ = 0.1 and λ = 1 on the AID dataset.
  • From a macro point of view, KD training improved the performance of the network model in terms of OA, K and F1. In test data evaluation, the student model ran about 60% faster than the deep teacher model while using only 2.4% of its parameters.
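The speed and size figures in the last finding can be cross-checked against the per-image timings and parameter counts quoted in the Introduction (14 ms and 42.4 M parameters for the deep model, 8.77 ms and about 1 M parameters for the shallow one). The short calculation below is only an arithmetic check; the variable names are illustrative and the 1 M figure is a rounded value.

```python
# Figures quoted in the Introduction of this paper.
teacher_ms_per_image = 14.0
student_ms_per_image = 8.77
teacher_params_m = 42.4
student_params_m = 1.0  # rounded value as quoted

teacher_fps = 1000.0 / teacher_ms_per_image
student_fps = 1000.0 / student_ms_per_image

# Relative throughput gain of the student over the teacher, in percent.
speedup_percent = (student_fps / teacher_fps - 1.0) * 100.0
# Student parameter count as a percentage of the teacher's.
param_percent = student_params_m / teacher_params_m * 100.0

print(f"teacher: {teacher_fps:.1f} FPS, student: {student_fps:.1f} FPS")
print(f"speed gain: {speedup_percent:.1f}%, parameters: {param_percent:.1f}%")
```

Both derived values agree with the roughly 60% speed gain and 2.4% parameter ratio stated above.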

3.1.4. KD Training on Small Dataset

If a student model can learn knowledge from a teacher model via KD training on the complete dataset, it should also work on a small part of the dataset. To verify this idea, we ran extra experiments of KD training on a small part of the AID dataset. We took 20% of the training samples (1000 images) and 20% of the validation data (507 images) of the original AID dataset and evaluated accuracy on the complete test dataset.
The results are shown in Table 4. As shown, the small dataset led to a shorter training time (which includes validation time) but poorer training results. However, KD training was still better than direct training under such conditions. In addition, decreasing λ, the weight of the first term in the loss function (Equation (8)), greatly enhanced the generalizability of the student model.

3.1.5. Remove One Category

Instead of training on a small dataset, in this experiment we removed all image samples of the “airport” category from the training data and validation data. The test results after 100 epochs of KD training (T = 10 and λ = 0.1) are shown in Table 5.
The results show that, in the absence of “airport” samples, the KD training process still achieved a better F1-score than direct training on the complete dataset. The distilled model has never seen an “airport” sample before. Nevertheless, it achieved high precision (0.913) on this class, although the recall is very low (0.2333). When we continued to fine-tune the student model via KD for only 10 epochs on the unlabeled data (2493 images with 90 airports; all labels were removed when fine-tuning), the F1-score of the airport class increased from 0.3717 to 0.7391 and the average F1-score increased by approximately 1.7%.
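The airport F1-score before fine-tuning follows directly from the reported precision and recall through the harmonic mean. The check below uses only the rounded values quoted in the text, so agreement is expected only to about three decimal places.

```python
precision = 0.913   # reported airport precision (rounded)
recall = 0.2333     # reported airport recall (rounded)

# F1 is the harmonic mean of precision and recall.
f1_airport = 2 * precision * recall / (precision + recall)
print(f"airport F1 = {f1_airport:.4f}")
```

The harmonic mean is dominated by the smaller of the two values, which is why a precision above 0.9 still yields an F1-score below 0.4 when recall is low.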

3.1.6. The Relationship between the Optimal Temperature and the Number of Categories

In the KD training process, the temperature (T) is a significant factor. When T = 1, the output probability vector tends to a one-hot vector. However, as T approaches infinity, the output probabilities of all categories tend to the same value, which makes it hard for the student model to learn from the teacher model. Intuitively, the more categories there are, the more information the output probability vector of a model contains. Therefore, we speculate that the optimal temperature for KD training is negatively related to the number of categories.
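The high-temperature limit mentioned above can be made explicit. For a model with $n$ output categories and finite logits $z_j$ (the symbol $n$ is introduced here only for illustration), every exponent $z_j / T$ tends to zero as $T$ grows, so each exponential tends to one:

$$\lim_{T \to \infty} p_i = \lim_{T \to \infty} \frac{\exp(z_i / T)}{\sum_{j=1}^{n} \exp(z_j / T)} = \frac{1}{n}$$

In this limit the entropy of the output reaches its maximum, $\log n$, and the vector no longer depends on the logits, so it carries no information about which categories the teacher considers similar. A useful temperature therefore lies between the near one-hot regime and this uniform regime.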
To evaluate the relationship between the optimal temperature and the number of categories, we conducted extra KD training experiments on three subsets of the AID dataset. These three AID subsets contain 25, 20, and 15 categories, respectively, constructed by removing 5, 10, and 15 categories from the original AID dataset. In these experiments, λ is fixed at 0.1 and the temperature T ranges over [1, 100]. The student model and the teacher model are the same as before, except for their output softmax layers (the output of the last softmax layer has the same size as the number of categories).
Results of these experiments are shown in Table 6. According to the metrics (OA, K, and F1-score), the optimal temperature for KD training is 10 for the complete AID dataset, 50 for the 25-category subset, and 100 for the 20-category and 15-category subsets. The optimal T increased as the number of categories decreased on the AID dataset, which is consistent with our speculation.

3.1.7. Evaluating Our Proposed KD Method on AID Dataset

We conducted a series of experiments to evaluate the KD training framework on the AID dataset. The results show that KD achieved higher OA than direct training and ML training, with the optimum performance at T = 10 and λ = 0.1. We then repeated the experiments on only 20% of the training data in the AID dataset, where KD training was still better than direct training. From the experiment with one category removed, we infer that our framework can still distill knowledge of a certain category even when there is no sample from that category. In addition, the student model achieved better results after further fine-tuning on unlabeled samples via unsupervised KD training. This further verifies that the knowledge of the probability distribution within classes can be effectively distilled. Moreover, we found that the optimal temperature for KD training is negatively related to the number of categories on the AID dataset.

3.2. Additional Experiments

To evaluate the effectiveness and applicability of KD framework, we conducted additional experiments on three different datasets of remote sensing scenes: UCMerced dataset, NWPU-RESISC dataset and EuroSAT dataset.

3.2.1. Experiments on UCMerced Dataset

The UCMerced dataset is a widely used remote sensing scene dataset, which consists of only 2100 image patches, each of size 256 × 256 with a ground sample distance (GSD) of 0.3 m, covering 21 land-cover classes [22]. Images in the UCMerced dataset were extracted from the USGS National Map Urban Area Imagery collection for various urban areas around the country. We split the UCMerced dataset into three parts: 1050 images for training, 525 images for validation and 525 images for testing. Because of the relatively small number of training samples, we adopted two four-convolutional-layer CNN models as the teacher model and the student model to avoid overfitting. The specific structures of the two models are shown in Appendix A.2. The other training settings are the same as in the AID experiments discussed in Section 3.1.
As Table 7 shows, KD training is effective on the UCMerced dataset: almost all KD results are better than the direct training result of the student model. According to all the metrics, KD training achieves its optimal performance when T = 100 and λ = 1. As in our experiments on the small AID dataset, the KD method remains practicable for small datasets.

3.2.2. Experiments on NWPU-RESISC Dataset

NWPU-RESISC is a large-scale benchmark dataset for remote sensing scene classification, covering 45 scene classes with 700 images in each class. Each image is 256 × 256 pixels in the red-green-blue (RGB) color space, and the spatial resolution of the images varies from about 30 to 0.2 m per pixel. The 31,500 images, extracted by experts from Google Earth, cover more than 100 countries and regions around the world. Of these, 15,750 are training samples, 7875 are validation samples, and the remaining 7875 are test samples. The teacher model and the student model are the same as in the experiments on the AID dataset, except for the output layers.
Results of KD training experiments on the NWPU-RESISC dataset are recorded in Table 8. The performance of KD training with different settings of T and λ is similar on the NWPU-RESISC dataset, with the best results at T = 5 and λ = 0.1. Unlike datasets with fewer categories, NWPU-RESISC contains 45 classes, including land-use and land-cover classes, and is challenging because of its high within-class diversity and between-class similarity. The KD method proves to be effective on the NWPU-RESISC dataset with 45 categories.

3.2.3. Experiments on EuroSAT Dataset

The EuroSAT dataset is another widely used scene classification dataset based on medium-resolution satellite images covering 13 different spectral bands and consisting of 10 different land-use and land-cover classes. It contains 27,000 images; each image patch measures 64 × 64 pixels, with a ground sample distance (GSD) varying from 10 to 60 m. Data in the EuroSAT dataset were gathered from satellite images of cities in over 30 European countries. As in the previous experiments, the EuroSAT dataset was split into three parts: 16,200 images for training, 5400 images for validation and 5400 images for testing. In the experiments, we used only the red, green and blue spectral bands as the input of the networks. The teacher model and the student model (Appendix A.1) are the same as in the experiments on the AID dataset, except for the output layers.
Table 9 shows the results of KD training experiments on the EuroSAT dataset. All of these models easily achieve an OA over 90%, and the KD-trained model reaches its optimum when T = 100 and λ = 1. Compared with datasets such as AID and NWPU-RESISC, the KD method still works well on the EuroSAT dataset, which contains images with smaller patch size and lower spatial resolution.

3.2.4. Discussions

By conducting additional experiments on three other remote sensing datasets with different settings of T and λ, we evaluated the effectiveness of the KD method and found that the results differ across datasets. In general, the KD method is applicable to datasets with lower-spatial-resolution images, numerous categories, and fewer training samples.

4. Conclusions

In this work, we introduced the knowledge distillation framework to improve the performance of small neural networks for remote sensing scene classification. The core concept behind this framework is to let the small model learn the categorical probability distribution from a pre-trained, well-performing cumbersome model by matching its high-temperature softmax output. To evaluate this framework, we conducted several experiments on the AID dataset. The experimental results showed that the KD framework was effective and increased overall accuracy (3% in AID experiments, 5% in UCMerced experiments, 1% in NWPU-RESISC and EuroSAT experiments) for small models, and that knowledge could be well distilled via the high-temperature softmax. In the experiments on the AID dataset, we also found that the KD training framework helped to train networks on small or unbalanced datasets. In addition, based on the experimental results on the AID dataset, we tentatively concluded that if the dataset contains fewer categories, the KD framework needs a larger temperature T to achieve better results. Moreover, to test the effectiveness and applicability of our framework, we conducted experiments on three other remote sensing scene datasets: the UCMerced, NWPU-RESISC, and EuroSAT datasets. The results of these additional experiments show that the KD method can improve the performance of small network models on datasets with lower-spatial-resolution images, numerous categories, and fewer training samples.
In the future, we plan to investigate how to integrate the KD framework with other model compression methods. Another interesting opportunity for future work is to apply the KD framework to other remote sensing tasks, such as semantic segmentation and object detection.

Author Contributions

G.C. designed the whole framework and the experiments. X.Z. guided the algorithm design. X.T. and Y.C. helped organize the paper and performed the experimental analysis. F.D. and K.Z. helped write the Python scripts of our framework. Y.G. and Q.W. contributed to the discussion of the design. G.C. drafted the manuscript, which was revised by all authors. All authors read and approved the submitted manuscript.

Acknowledgments

This work is supported in part by the LIESMARS Special Research Funding and in part by the Fundamental Research Funds for the Central Universities. The authors would like to thank Gui-Song Xia from the State Key Laboratory for Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University, for providing the awesome remote sensing scene classification dataset AID. The authors would also like to thank the Keras and TensorFlow developer communities for their open source deep learning frameworks.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. The Model Structures Used in This Paper

Appendix A.1. The Student Models in Experiments on AID, NWPU-RESISC, and EuroSAT Datasets

The structure of the student CNN model in the experiments on the AID dataset is shown in Table A1; the model has 1,012,830 parameters. It contains four convolutional layers, each followed by a max-pooling layer, then a dropout layer and a fully-connected layer. The student models in the experiments on the NWPU-RESISC and EuroSAT datasets are identical except for the last softmax layer, whose output size equals the number of categories.
Table A1. The structure of the student CNN model in experiments on AID dataset.

| Layer Type | Attributes | Output Size | Parameters |
| --- | --- | --- | --- |
| Input | | 224, 224, 3 | 0 |
| Conv2D | filters: 64, kernel size: (3, 3), activation: ReLU | 224, 224, 64 | 1792 |
| MaxPooling2D | pool size: (2, 2) | 112, 112, 64 | 0 |
| Conv2D | filters: 64, kernel size: (3, 3), activation: ReLU | 112, 112, 64 | 36,928 |
| MaxPooling2D | pool size: (2, 2) | 56, 56, 64 | 0 |
| Conv2D | filters: 128, kernel size: (3, 3), activation: ReLU | 56, 56, 128 | 73,856 |
| MaxPooling2D | pool size: (2, 2) | 28, 28, 128 | 0 |
| Conv2D | filters: 128, kernel size: (3, 3), activation: ReLU | 28, 28, 128 | 147,584 |
| MaxPooling2D | pool size: (2, 2) | 14, 14, 128 | 0 |
| Flatten | | 25,088 | 0 |
| Dropout | drop rate: 0.3 | 25,088 | 0 |
| Dense | units: 30 | 30 | 752,670 |
| Softmax | | 30 | 0 |
| Total | | | 1,012,830 |
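The parameter counts in Table A1 can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper names are hypothetical, and it assumes that every convolution has one bias per filter and preserves the spatial size (as the output sizes in the table suggest), so that a 224 × 224 input reaches the dense layer as a 14 × 14 × 128 feature map.

```python
def conv2d_params(in_channels, filters, kernel=(3, 3)):
    """Kernel weights plus one bias per filter."""
    return (kernel[0] * kernel[1] * in_channels + 1) * filters


def dense_params(in_features, units):
    """Weight matrix plus one bias per output unit."""
    return (in_features + 1) * units


def student_param_count(num_classes, input_channels=3, feature_hw=14):
    """Total parameters of the four-conv student with a dense output layer."""
    channels = [input_channels, 64, 64, 128, 128]
    conv_total = sum(
        conv2d_params(c_in, c_out) for c_in, c_out in zip(channels, channels[1:])
    )
    flat_features = feature_hw * feature_hw * channels[-1]  # 14 * 14 * 128 = 25,088
    return conv_total + dense_params(flat_features, num_classes)


print(student_param_count(30))  # AID, 30 classes: 1012830
print(student_param_count(45))  # NWPU-RESISC, 45 classes: 1389165
print(student_param_count(10))  # EuroSAT, 10 classes: 511050
```

Under these assumptions the three totals agree with the student parameter counts reported in Tables 3, 8 and 9; only the dense layer changes with the number of categories.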

Appendix A.2. The Teacher Model and the Student Model in Experiments on UCMerced Dataset

The structures of the teacher model and the student model in the experiments on the UCMerced dataset are shown in Table A2 and Table A3, respectively. The teacher model flattens the 2D feature maps into a 1D feature vector that feeds a fully-connected layer, while the student model uses global average pooling (GAP) [84] to save parameters.
Table A2. The structure of the teacher CNN model in experiments on UCMerced dataset.

| Layer Type | Attributes | Output Size | Parameters |
| --- | --- | --- | --- |
| Input | | 224, 224, 3 | 0 |
| Conv2D | filters: 64, kernel size: (3, 3), activation: ReLU | 224, 224, 64 | 1792 |
| MaxPooling2D | pool size: (2, 2) | 112, 112, 64 | 0 |
| Conv2D | filters: 128, kernel size: (3, 3), activation: ReLU | 112, 112, 128 | 73,856 |
| MaxPooling2D | pool size: (2, 2) | 56, 56, 128 | 0 |
| Conv2D | filters: 256, kernel size: (3, 3), activation: ReLU | 56, 56, 256 | 295,168 |
| MaxPooling2D | pool size: (2, 2) | 28, 28, 256 | 0 |
| Conv2D | filters: 512, kernel size: (3, 3), activation: ReLU | 28, 28, 512 | 1,180,160 |
| MaxPooling2D | pool size: (2, 2) | 14, 14, 512 | 0 |
| Flatten | | 100,352 | 0 |
| Dropout | drop rate: 0.3 | 100,352 | 0 |
| Dense | units: 21 | 21 | 2,107,413 |
| Softmax | | 21 | 0 |
| Total | | | 3,658,389 |
Table A3. The structure of the student CNN model in experiments on UCMerced dataset.

| Layer Type | Attributes | Output Size | Parameters |
| --- | --- | --- | --- |
| Input | | 224, 224, 3 | 0 |
| Conv2D | filters: 32, kernel size: (3, 3), activation: ReLU | 224, 224, 32 | 896 |
| MaxPooling2D | pool size: (2, 2) | 112, 112, 32 | 0 |
| Conv2D | filters: 32, kernel size: (3, 3), activation: ReLU | 112, 112, 32 | 9248 |
| MaxPooling2D | pool size: (2, 2) | 56, 56, 32 | 0 |
| Conv2D | filters: 64, kernel size: (3, 3), activation: ReLU | 56, 56, 64 | 18,496 |
| MaxPooling2D | pool size: (2, 2) | 28, 28, 64 | 0 |
| Conv2D | filters: 64, kernel size: (3, 3), activation: ReLU | 28, 28, 64 | 36,928 |
| MaxPooling2D | pool size: (2, 2) | 14, 14, 64 | 0 |
| Dropout | drop rate: 0.3 | 14, 14, 64 | 0 |
| Conv2D | filters: 21, kernel size: (1, 1) | 14, 14, 21 | 1365 |
| GlobalAveragePooling2D | | 21 | 0 |
| Softmax | | 21 | 0 |
| Total | | | 66,933 |
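To make the parameter saving of the GAP head concrete, the following sketch recomputes the totals of Tables A2 and A3 and compares the two classification heads. It is an illustration only: the function and variable names are hypothetical, and it assumes one bias per filter or unit. Global average pooling itself has no trainable parameters, so the 1365 parameters of the student's head belong to the 1 × 1 convolution that precedes it.

```python
def conv2d_params(in_channels, filters, kernel_size=3):
    """Kernel weights plus one bias per filter."""
    return (kernel_size * kernel_size * in_channels + 1) * filters


def backbone_params(channels):
    """Sum of 3x3 convolution parameters along a chain of channel widths."""
    return sum(conv2d_params(a, b) for a, b in zip(channels, channels[1:]))


NUM_CLASSES = 21  # UCMerced

# Teacher: Flatten (14 * 14 * 512 features) followed by a Dense layer.
teacher_backbone = backbone_params([3, 64, 128, 256, 512])
teacher_head = (14 * 14 * 512 + 1) * NUM_CLASSES

# Student: 1x1 convolution to 21 channels, then parameter-free GAP.
student_backbone = backbone_params([3, 32, 32, 64, 64])
student_head = conv2d_params(64, NUM_CLASSES, kernel_size=1)

teacher_total = teacher_backbone + teacher_head
student_total = student_backbone + student_head

print(teacher_head, teacher_total)  # 2107413 3658389
print(student_head, student_total)  # 1365 66933
```

In the teacher, the dense head alone accounts for more than half of all parameters, whereas the student's head is negligible next to its convolutional layers.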

References

  1. Estoque, R.C.; Murayama, Y.; Akiyama, C.M. Pixel-based and object-based classifications using high- and medium-spatial-resolution imageries in the urban and suburban landscapes. Geocarto Int. 2015, 30, 1113–1129. [Google Scholar] [CrossRef]
  2. Helber, P.; Bischke, B.; Dengel, A.; Borth, D. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. arXiv, 2017; arXiv:1709.00029. [Google Scholar]
  3. Zhang, X.; Wang, Q.; Chen, G.; Dai, F.; Zhu, K.; Gong, Y.; Xie, Y. An object-based supervised classification framework for very-high-resolution remote sensing images using convolutional neural networks. Remote Sens. Lett. 2018, 9, 373–382. [Google Scholar] [CrossRef]
  4. Chen, G.; Zhang, X.; Wang, Q.; Dai, F.; Gong, Y.; Zhu, K. Symmetrical Dense-Shortcut Deep Fully Convolutional Networks for Semantic Segmentation of Very-High-Resolution Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 1633–1644. [Google Scholar] [CrossRef]
  5. Gualtieri, J.A.; Cromp, R.F. Support vector machines for hyperspectral remote sensing classification. In Proceedings of the 27th AIPR Workshop: Advances in Computer-Assisted Recognition, Washington, DC, USA, 14–16 October 1998; International Society for Optics and Photonics: Bellingham, WA, USA, 1999; Volume 3584, pp. 221–233. [Google Scholar]
  6. Duro, D.C.; Franklin, S.E.; Dubé, M.G. A comparison of pixel-based and object-based image analysis with selected machine learning algorithms for the classification of agricultural landscapes using SPOT-5 HRG imagery. Remote Sens. Environ. 2012, 118, 259–272. [Google Scholar] [CrossRef]
  7. Cheriyadat, A.M. Unsupervised Feature Learning for Aerial Scene Classification. IEEE Trans. Geosci. Remote Sens. 2014, 52, 439–451. [Google Scholar] [CrossRef]
  8. Peña, J.; Gutiérrez, P.; Hervás-Martínez, C.; Six, J.; Plant, R.; López-Granados, F. Object-Based Image Classification of Summer Crops with Machine Learning Methods. Remote Sens. 2014, 6, 5019–5041. [Google Scholar] [CrossRef]
  9. Lu, D.; Li, G.; Moran, E.; Kuang, W. A comparative analysis of approaches for successional vegetation classification in the Brazilian Amazon. GISci. Remote Sens. 2014, 51, 695–709. [Google Scholar] [CrossRef]
  10. De Chant, T.; Kelly, M. Individual object change detection for monitoring the impact of a forest pathogen on a hardwood forest. Photogramm. Eng. Remote Sens. 2009, 75, 1005–1013. [Google Scholar] [CrossRef]
  11. Dribault, Y.; Chokmani, K.; Bernier, M. Monitoring Seasonal Hydrological Dynamics of Minerotrophic Peatlands Using Multi-Date GeoEye-1 Very High Resolution Imagery and Object-Based Classification. Remote Sens. 2012, 4, 1887–1912. [Google Scholar] [CrossRef]
  12. Hu, F.; Xia, G.S.; Hu, J.; Zhang, L. Transferring Deep Convolutional Neural Networks for the Scene Classification of High-Resolution Remote Sensing Imagery. Remote Sens. 2015, 7, 14680–14707. [Google Scholar] [CrossRef]
  13. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  14. Yang, Y.; Newsam, S. Comparing SIFT descriptors and Gabor texture features for classification of remote sensed imagery. In Proceedings of the 15th IEEE International Conference on Image Processing (ICIP 2008), San Diego, CA, USA, 12–15 October 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 1852–1855. [Google Scholar]
  15. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; IEEE: Piscataway, NJ, USA, 2005; Volume 1, pp. 886–893. [Google Scholar]
  16. Risojević, V.; Babić, Z. Aerial image classification using structural texture similarity. In Proceedings of the 2011 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), Bilbao, Spain, 14–17 December 2011; pp. 190–195. [Google Scholar]
  17. Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987. [Google Scholar] [CrossRef]
  18. Risojević, V.; Momić, S.; Babić, Z. Gabor descriptors for aerial image classification. In International Conference on Adaptive and Natural Computing Algorithms; Springer: Berlin, Germany, 2011; pp. 51–60. [Google Scholar]
  19. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  20. Altman, N.S. An introduction to kernel and nearest-neighbor nonparametric regression. Am. Stat. 1992, 46, 175–185. [Google Scholar]
  21. Yang, Y.; Newsam, S. Geographic image retrieval using local invariant features. IEEE Trans. Geosci. Remote Sens. 2013, 51, 818–832. [Google Scholar] [CrossRef]
  22. Yang, Y.; Newsam, S. Bag-of-visual-words and Spatial Extensions for Land-use Classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems (GIS ’10), San Jose, CA, USA, 2–5 November 2010; ACM: New York, NY, USA, 2010; pp. 270–279. [Google Scholar]
  23. Chen, L.; Yang, W.; Xu, K.; Xu, T. Evaluation of local features for scene classification using VHR satellite images. In Proceedings of the 2011 Joint Urban Remote Sensing Event (JURSE), Munich, Germany, 11–13 April 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 385–388. [Google Scholar]
  24. Perronnin, F.; Dance, C. Fisher kernels on visual vocabularies for image categorization. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’07), Minneapolis, MN, USA, 17–22 June 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 1–8. [Google Scholar]
  25. Yang, Y.; Newsam, S. Spatial pyramid co-occurrence for image classification. In Proceedings of the 2011 IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 1465–1472. [Google Scholar]
  26. Lazebnik, S.; Schmid, C.; Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 17–22 June 2006; IEEE: Piscataway, NJ, USA, 2006; Volume 2, pp. 2169–2178. [Google Scholar]
  27. Chen, Y.; Zhao, X.; Jia, X. Spectral-Spatial Classification of Hyperspectral Data Based on Deep Belief Network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 2381–2392. [Google Scholar] [CrossRef]
  28. Bosch, A.; Zisserman, A.; Muñoz, X. Scene classification via pLSA. In Proceedings of the 9th European Conference on Computer Vision—ECCV 2006, Graz, Austria, 7–13 May 2006; pp. 517–530. [Google Scholar]
  29. Lienou, M.; Maitre, H.; Datcu, M. Semantic annotation of satellite images using latent Dirichlet allocation. IEEE Geosci. Remote Sens. Lett. 2010, 7, 28–32. [Google Scholar] [CrossRef]
  30. Kusumaningrum, R.; Wei, H.; Manurung, R.; Murni, A. Integrated visual vocabulary in latent Dirichlet allocation–based scene classification for IKONOS image. J. Appl. Remote Sens. 2014, 8, 083690. [Google Scholar] [CrossRef]
  31. Zhong, Y.; Cui, M.; Zhu, Q.; Zhang, L. Scene classification based on multifeature probabilistic latent semantic analysis for high spatial resolution remote sensing images. J. Appl. Remote Sens. 2015, 9, 095064. [Google Scholar] [CrossRef]
  32. Zhong, Y.; Zhu, Q.; Zhang, L. Scene classification based on the multifeature fusion probabilistic topic model for high spatial resolution remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2015, 53, 6207–6222. [Google Scholar] [CrossRef]
  33. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  34. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; Book in preparation for MIT Press.
  35. Gu, J.; Wang, Z.; Kuen, J.; Ma, L.; Shahroudy, A.; Shuai, B.; Liu, T.; Wang, X.; Wang, G.; Cai, J.; et al. Recent advances in convolutional neural networks. Pattern Recognit. 2017, 77, 354–377. [Google Scholar] [CrossRef]
  36. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117. [Google Scholar] [CrossRef] [PubMed]
  37. Mnih, V. Machine Learning for Aerial Image Labeling. Ph.D. Thesis, University of Toronto, Toronto, ON, Canada, 2013. [Google Scholar]
  38. Chen, Y.; Lin, Z.; Zhao, X.; Wang, G.; Gu, Y. Deep Learning-Based Classification of Hyperspectral Data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 2094–2107. [Google Scholar] [CrossRef]
  39. Marmanis, D.; Wegner, J.D.; Galliani, S.; Schindler, K.; Datcu, M.; Stilla, U. Semantic segmentation of aerial images with an ensemble of CNNs. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2016, 3, 473–480. [Google Scholar] [CrossRef]
  40. Nogueira, K.; Penatti, O.A.B.; Santos, J.A.D. Towards better exploiting convolutional neural networks for remote sensing scene classification. Pattern Recognit. 2017, 61, 539–556. [Google Scholar] [CrossRef]
  41. Zhang, X.; Chen, G.; Wang, W.; Wang, Q.; Dai, F. Object-Based Land-Cover Supervised Classification for Very-High-Resolution UAV Images Using Stacked Denoising Autoencoders. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 3373–3385. [Google Scholar] [CrossRef]
  42. Liu, Y.; Huang, C. Scene classification via triplet networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 11, 220–237. [Google Scholar] [CrossRef]
  43. Li, W.; Fu, H.; Yu, L.; Gong, P.; Feng, D.; Li, C.; Clinton, N. Stacked Autoencoder-based deep learning for remote-sensing image classification: a case study of African land-cover mapping. Int. J. Remote Sens. 2016, 37, 5632–5646. [Google Scholar] [CrossRef]
  44. Zhang, M.; Hu, X.; Zhao, L.; Lv, Y.; Luo, M.; Pang, S. Learning Dual Multi-Scale Manifold Ranking for Semantic Segmentation of High-Resolution Images. Remote Sens. 2017, 9, 500. [Google Scholar] [CrossRef]
  45. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems; MIT Press Ltd.: Cambridge, MA, USA, 2012; pp. 1097–1105. [Google Scholar]
  46. Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; ACM: New York, NY, USA, 2014; pp. 675–678. [Google Scholar]
  47. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv, 2014; arXiv:1409.1556. [Google Scholar]
  48. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas Valley, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar] [CrossRef]
  49. Huang, G.; Liu, Z.; Weinberger, K.Q.; van der Maaten, L. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; Volume 1, p. 3. [Google Scholar]
  50. Zhang, L.; Zhang, L.; Du, B. Deep Learning for Remote Sensing Data: A Technical Tutorial on the State of the Art. IEEE Geosci. Remote Sens. Mag. 2016, 4, 22–40. [Google Scholar] [CrossRef]
  51. Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981. [Google Scholar] [CrossRef]
  52. Chen, W.; Wilson, J.; Tyree, S.; Weinberger, K.; Chen, Y. Compressing neural networks with the hashing trick. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2285–2294. [Google Scholar]
  53. Zhao, W.; Fu, H.; Luk, W.; Yu, T.; Wang, S.; Feng, B.; Ma, Y.; Yang, G. F-CNN: An FPGA-based framework for training Convolutional Neural Networks. In Proceedings of the 2016 IEEE 27th International Conference on Application-specific Systems, Architectures and Processors (ASAP), London, UK, 6–8 July 2016; pp. 107–114. [Google Scholar] [CrossRef]
  54. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv, 2017; arXiv:1704.04861. [Google Scholar]
  55. Cao, C.; De Luccia, F.J.; Xiong, X.; Wolfe, R.; Weng, F. Early on-orbit performance of the visible infrared imaging radiometer suite onboard the Suomi National Polar-Orbiting Partnership (S-NPP) satellite. IEEE Trans. Geosci. Remote Sens. 2014, 52, 1142–1156. [Google Scholar] [CrossRef]
  56. He, Y.; Zhang, X.; Sun, J. Channel Pruning for Accelerating Very Deep Neural Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  57. Molchanov, P.; Tyree, S.; Karras, T.; Aila, T.; Kautz, J. Pruning convolutional neural networks for resource efficient transfer learning. arXiv, 2016; arXiv:1611.06440. [Google Scholar]
  58. Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv, 2015; arXiv:1510.00149. [Google Scholar]
  59. Zhou, S.; Wu, Y.; Ni, Z.; Zhou, X.; Wen, H.; Zou, Y. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv, 2016; arXiv:1606.06160. [Google Scholar]
  60. Courbariaux, M.; Hubara, I.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1. arXiv, 2016; arXiv:1602.02830. [Google Scholar]
  61. Rastegari, M.; Ordonez, V.; Redmon, J.; Farhadi, A. XNOR-Net: ImageNet classification using binary convolutional neural networks. In Proceedings of the European Conference on Computer Vision (ECCV’16), Amsterdam, The Netherlands, 8–16 October 2016; Springer: Berlin, Germany, 2016; pp. 525–542. [Google Scholar]
  62. Bucilua, C.; Caruana, R.; Niculescu-Mizil, A. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’06), Philadelphia, PA, USA, 20–23 August 2006; ACM: New York, NY, USA, 2006; pp. 535–541. [Google Scholar] [CrossRef]
  63. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv, 2015; arXiv:1503.02531. [Google Scholar]
  64. Chen, T.; Goodfellow, I.; Shlens, J. Net2net: Accelerating learning via knowledge transfer. arXiv, 2015; arXiv:1511.05641. [Google Scholar]
  65. Ba, J.; Caruana, R. Do Deep Nets Really Need to be Deep? In Advances in Neural Information Processing Systems 27; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014; pp. 2654–2662. [Google Scholar]
  66. Lopez-Paz, D.; Bottou, L.; Schölkopf, B.; Vapnik, V. Unifying distillation and privileged information. arXiv, 2015; arXiv:1511.03643. [Google Scholar]
  67. Hu, Z.; Ma, X.; Liu, Z.; Hovy, E.; Xing, E. Harnessing deep neural networks with logic rules. arXiv, 2016; arXiv:1603.06318. [Google Scholar]
  68. Yim, J.; Joo, D.; Bae, J.; Kim, J. A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4133–4141. [Google Scholar]
  69. Huang, Z.; Wang, N. Like What You Like: Knowledge Distill via Neuron Selectivity Transfer. arXiv, 2017; arXiv:1707.01219. [Google Scholar]
  70. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. Fitnets: Hints for thin deep nets. arXiv, 2014; arXiv:1412.6550. [Google Scholar]
  71. Pan, S.J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359. [Google Scholar] [CrossRef]
  72. Yosinski, J.; Clune, J.; Bengio, Y.; Lipson, H. How transferable are features in deep neural networks. In Proceedings of the Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 3320–3328. [Google Scholar]
  73. Marmanis, D.; Datcu, M.; Esch, T.; Stilla, U. Deep learning earth observation classification using ImageNet pretrained networks. IEEE Geosci. Remote Sens. Lett. 2016, 13, 105–109. [Google Scholar] [CrossRef]
  74. Li, M.; Zhang, T.; Chen, Y.; Smola, A.J. Efficient Mini-Batch Training for Stochastic Optimization; ACM Press: New York, NY, USA, 2014; pp. 661–670. [Google Scholar] [CrossRef]
  75. Vapnik, V.; Izmailov, R. Learning using privileged information: similarity control and knowledge transfer. J. Machine Learn. Res. 2015, 16, 55. [Google Scholar]
  76. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998; Volume 1. [Google Scholar]
  77. Cheng, G.; Han, J.; Lu, X. Remote Sensing Image Scene Classification: Benchmark and State of the Art. Proc. IEEE 2017, 105, 1865–1883. [Google Scholar] [CrossRef]
  78. Chollet, F. Keras; GitHub: San Francisco, CA, USA, 2015. [Google Scholar]
  79. Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M. TensorFlow: A system for large-scale machine learning. arXiv, 2016; arXiv:1605.08695. [Google Scholar]
  80. Thompson, W.D.; Walter, S.D. A reappraisal of the kappa coefficient. J. Clin. Epidemiol. 1988, 41, 949–958. [Google Scholar] [CrossRef]
  81. Srivastava, N.; Hinton, G.E.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  82. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  83. Zeiler, M.D. ADADELTA: An adaptive learning rate method. arXiv, 2012; arXiv:1212.5701. [Google Scholar]
  84. Lin, M.; Chen, Q.; Yan, S. Network in network. arXiv, 2013; arXiv:1312.4400. [Google Scholar]
Figure 1. An example of the categorical probability distributions produced by the high-temperature softmax at different temperatures. C1–C6 denote six categories. The input logits are listed in the last row of Table 1. When T = 1 (normal softmax), the probability of C2 tends to 1 and the others tend to 0. As T increases, the categorical distribution becomes softer and closer to uniform.
Figure 2. (a) The teacher model is trained directly on the dataset; (b) The process of KD training. The student model has two output branches: the high-temperature softmax output distills knowledge from the teacher model, and the normal softmax output learns to match the ground-truth label; (c) In prediction mode or in a production environment, the trained student model outputs only the normal softmax result.
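The two-branch training objective in panel (b) can be sketched in NumPy as follows. This is a minimal illustration and not the authors' implementation: it assumes that the total loss is the cross-entropy between the ground-truth label and the student's normal softmax, plus λ times the cross-entropy between the teacher's and the student's high-temperature softmax outputs. The function names, the toy logits, and this exact weighting are assumptions; some KD formulations additionally scale the soft term by T², which is omitted here.

```python
import numpy as np


def softmax(logits, T=1.0):
    """Row-wise softmax of logits / T, stabilized by subtracting the row maximum."""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def cross_entropy(target, pred, eps=1e-12):
    """Mean cross-entropy H(target, pred) over the batch."""
    return float(-(target * np.log(pred + eps)).sum(axis=-1).mean())


def kd_loss(student_logits, teacher_logits, labels_onehot, T=10.0, lam=0.1):
    """Hard-label loss plus lam times the soft-target loss (assumed form)."""
    hard = cross_entropy(labels_onehot, softmax(student_logits, 1.0))
    soft = cross_entropy(softmax(teacher_logits, T), softmax(student_logits, T))
    return hard + lam * soft


# Toy batch: 2 samples, 3 classes (made-up numbers for illustration).
student_logits = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])
teacher_logits = np.array([[4.0, 1.0, -2.0], [-0.5, 3.0, 0.5]])
labels_onehot = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])

print(kd_loss(student_logits, teacher_logits, labels_onehot, T=10.0, lam=0.1))
```

At prediction time only `softmax(student_logits, 1.0)` is needed, which corresponds to panel (c).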
Figure 3. Sample images in (a) AID dataset; (b) UCMerced dataset; (c) NWPU-RESISC dataset; and (d) EuroSAT dataset.
Figure 4. The training curves of the student model in the AID experiments. The curves show that direct training (blue curves) converges faster but falls into a local optimum after 50 epochs, while KD training (orange curves) keeps reducing the loss. (a) Validation loss curve on a logarithmic scale; (b) validation OA curve.
Figure 5. KD training results for different temperatures and values of λ on the AID dataset. The four subfigures show the relationship between four metrics and the temperature T, according to Table 3. The curves show that the trend of VOA is similar to that of the other metrics for both λ = 0.1 and λ = 1 on the AID dataset.
Table 1. An example of high-temperature softmax output with different temperature.

| Temperature | C1 | C2 | C3 | C4 | C5 | C6 | Entropy ¹ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 100 | 0.0523 | 0.8566 | 0.0022 | 0.0320 | 0.0160 | 0.0409 | 0.8764 |
| 200 | 0.1338 | 0.5416 | 0.0275 | 0.1046 | 0.0741 | 0.1183 | 1.9934 |
| 500 | 0.1687 | 0.2951 | 0.0896 | 0.1529 | 0.1331 | 0.1606 | 2.4898 |
| logits | −304.63 | −25.01 | −620.91 | −353.81 | −422.91 | −329.22 | |

¹ Log2-based entropy.
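The rows of Table 1 can be reproduced with a short NumPy sketch of the temperature softmax, p_i = exp(z_i / T) / Σ_j exp(z_j / T), and the log2-based entropy. The function names are hypothetical. The logits are taken with a negative sign, which is an assumption of this sketch: it is the sign under which the tabulated probabilities are reproduced (C2, with the smallest magnitude, receives the largest probability).

```python
import numpy as np


def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T, stabilized by subtracting the maximum."""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def entropy_bits(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    p = np.asarray(p, dtype=np.float64)
    nonzero = p > 0
    return float(-(p[nonzero] * np.log2(p[nonzero])).sum())


# Logits for C1..C6, taken with negative sign (assumption, see the text above).
LOGITS = [-304.63, -25.01, -620.91, -353.81, -422.91, -329.22]

for T in (1, 100, 200, 500):
    p = softmax_with_temperature(LOGITS, T)
    print(T, np.round(p, 4), round(entropy_bits(p), 4))
```

Raising T flattens the distribution and raises its entropy, which is what allows the student to see the relative similarities between the non-target classes.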
Table 2. General information of datasets.

| Dataset | Resolution | Size | Category Count | Training Samples | Validation Samples | Test Samples |
| --- | --- | --- | --- | --- | --- | --- |
| AID | 0.5–8 m | 600 × 600 | 30 | 5000 | 2507 | 2493 |
| UCMerced | 0.3 m | 256 × 256 | 21 | 1050 | 525 | 525 |
| NWPU-RESISC | 0.2–3 m | 256 × 256 | 45 | 15,750 | 7875 | 7875 |
| EuroSAT | 10–60 m | 64 × 64 | 10 | 16,200 | 5400 | 5400 |
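Table 3. Results of Knowledge Distillation training experiments on AID dataset.

| Model | Training Method | T | λ | VOA | OA | K | F1 | FPS | Parameters |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| student model | direct training | | | 0.8466 | 0.8295 | 0.8234 | 0.8299 | 113.96 | 1,012,830 |
| | ML training | | 1 | 0.7140 | 0.7084 | 0.6980 | 0.7098 | | |
| | | | 100 | 0.8161 | 0.8087 | 0.8018 | 0.8070 | | |
| | KD training | 1 | 1 | 0.8373 | 0.8448 | 0.8392 | 0.8438 | | |
| | | 1 | 0.1 | 0.8289 | 0.8287 | 0.8226 | 0.8282 | | |
| | | 5 | 1 | 0.8516 | 0.8480 | 0.8426 | 0.8484 | | |
| | | 5 | 0.1 | 0.8524 | 0.8363 | 0.8305 | 0.8348 | | |
| | | 10 | 1 | 0.8500 | 0.8367 | 0.8309 | 0.8352 | | |
| | | 10 | 0.1 | 0.8560 | 0.8580 | 0.8530 | 0.8593 | | |
| | | 20 | 1 | 0.8416 | 0.8355 | 0.8297 | 0.8347 | | |
| | | 20 | 0.1 | 0.8484 | 0.8383 | 0.8326 | 0.8353 | | |
| | | 50 | 1 | 0.8404 | 0.8335 | 0.8276 | 0.8336 | | |
| | | 50 | 0.1 | 0.8400 | 0.8307 | 0.8247 | 0.8308 | | |
| | | 100 | 1 | 0.8536 | 0.8484 | 0.8430 | 0.8478 | | |
| | | 100 | 0.1 | 0.8444 | 0.8311 | 0.8251 | 0.8304 | | |
| teacher model (ResNet-101) | | | | 0.8888 | 0.8524 | 0.8471 | 0.8542 | 71.31 | 42,437,278 |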
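As a worked reading of Table 3, the short sketch below derives the compression ratio, the inference speed-up, and the OA gain of the best KD setting (T = 10, λ = 0.1) over direct training. The figures are copied from the table; the variable names are hypothetical.

```python
# Values copied from Table 3 (AID dataset).
student = {"params": 1_012_830, "fps": 113.96, "oa_direct": 0.8295, "oa_kd": 0.8580}
teacher = {"params": 42_437_278, "fps": 71.31, "oa": 0.8524}

compression = teacher["params"] / student["params"]  # about 41.9x fewer parameters
speedup = student["fps"] / teacher["fps"]            # about 1.6x more frames per second
oa_gain = student["oa_kd"] - student["oa_direct"]    # gain of KD over direct training

print(f"compression: {compression:.1f}x")
print(f"speed-up:    {speedup:.2f}x")
print(f"OA gain:     {oa_gain * 100:.2f} percentage points")
```

The OA gain of roughly 3 percentage points matches the improvement quoted for the AID experiments, and with this setting the student's test OA slightly exceeds that of the teacher.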
Table 4. Results of training the student model on part of AID dataset.

| Dataset | Training Method | VOA | OA | K | F1 | Training Time per Epoch (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 100% | direct training | 0.8466 | 0.8295 | 0.8234 | 0.8299 | 128.1 |
| 20% | direct training | 0.7219 | 0.5872 | 0.5728 | 0.5929 | 32.0 |
| | KD training (T = 10, λ = 0.1) | 0.7298 | 0.6554 | 0.6432 | 0.6567 | |
| | KD training (T = 10, λ = 1.0) | 0.7416 | 0.6221 | 0.6089 | 0.6234 | |
Table 5. Results of Knowledge Distillation training experiments on AID dataset without Airports.

| Category Name | Direct Training P | Direct Training R | Direct Training F1 | KD Training without Airport P | KD Training without Airport R | KD Training without Airport F1 | Fine-Tuned on Unlabeled Data P | Fine-Tuned on Unlabeled Data R | Fine-Tuned on Unlabeled Data F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| airport | 0.7359 | 0.8667 | 0.7959 | 0.9130 | 0.2333 | 0.3717 | 0.7234 | 0.7556 | 0.7391 |
| average | 0.8421 | 0.8295 | 0.8299 | 0.8461 | 0.8363 | 0.8306 | 0.8588 | 0.8420 | 0.8438 |
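Table 6. Results of Knowledge Distillation training experiments with different numbers of category and temperatures on AID dataset.

| Student Model | T | OA (30 Categories) | K (30 Categories) | F1 (30 Categories) | OA (25 Categories) | K (25 Categories) | F1 (25 Categories) | OA (20 Categories) | K (20 Categories) | F1 (20 Categories) | OA (15 Categories) | K (15 Categories) | F1 (15 Categories) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| direct training | | 0.830 | 0.823 | 0.830 | 0.840 | 0.833 | 0.840 | 0.870 | 0.863 | 0.870 | 0.878 | 0.869 | 0.876 |
| KD training | 1 | 0.829 | 0.823 | 0.828 | 0.837 | 0.830 | 0.837 | 0.863 | 0.856 | 0.863 | 0.884 | 0.875 | 0.883 |
| | 5 | 0.836 | 0.831 | 0.835 | 0.837 | 0.830 | 0.833 | 0.872 | 0.865 | 0.872 | 0.869 | 0.860 | 0.869 |
| | 10 | 0.858 | 0.853 | 0.859 | 0.846 | 0.839 | 0.846 | 0.868 | 0.861 | 0.867 | 0.879 | 0.870 | 0.879 |
| | 20 | 0.838 | 0.833 | 0.835 | 0.841 | 0.835 | 0.841 | 0.871 | 0.864 | 0.870 | 0.880 | 0.871 | 0.880 |
| | 50 | 0.831 | 0.825 | 0.831 | 0.851 | 0.845 | 0.850 | 0.870 | 0.863 | 0.871 | 0.871 | 0.861 | 0.870 |
| | 100 | 0.831 | 0.825 | 0.830 | 0.839 | 0.832 | 0.839 | 0.873 | 0.866 | 0.873 | 0.890 | 0.882 | 0.890 |
| teacher model | | 0.852 | 0.847 | 0.854 | 0.870 | 0.865 | 0.871 | 0.880 | 0.874 | 0.880 | 0.894 | 0.886 | 0.893 |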
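Table 7. Results of Knowledge Distillation training experiments on UCMerced dataset.

| Model | Training Method | T | λ | OA | K | F1 | FPS | Parameters |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| student model | direct training | | | 0.6838 | 0.6680 | 0.6710 | 334.40 | 66,933 |
| | KD training | 1 | 1 | 0.6933 | 0.6780 | 0.6918 | | |
| | | 1 | 0.1 | 0.7105 | 0.6960 | 0.7079 | | |
| | | 5 | 1 | 0.7200 | 0.7060 | 0.7201 | | |
| | | 5 | 0.1 | 0.6933 | 0.6780 | 0.6820 | | |
| | | 10 | 1 | 0.7124 | 0.6980 | 0.7092 | | |
| | | 10 | 0.1 | 0.6990 | 0.6840 | 0.6871 | | |
| | | 20 | 1 | 0.6838 | 0.6680 | 0.6787 | | |
| | | 20 | 0.1 | 0.7067 | 0.6920 | 0.7063 | | |
| | | 50 | 1 | 0.7181 | 0.7040 | 0.7098 | | |
| | | 50 | 0.1 | 0.7067 | 0.6920 | 0.6993 | | |
| | | 100 | 1 | 0.7352 | 0.7220 | 0.7338 | | |
| | | 100 | 0.1 | 0.6724 | 0.6560 | 0.6691 | | |
| teacher model | | | | 0.8438 | 0.8360 | 0.8402 | 137.12 | 3,658,389 |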
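Table 8. Results of Knowledge Distillation training experiments on NWPU-RESISC dataset.

| Model | Training Method | T | λ | OA | K | F1 | FPS | Parameters |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| student model | direct training | | | 0.7896 | 0.7848 | 0.7894 | 303.67 | 1,389,165 |
| | KD training | 1 | 1 | 0.7911 | 0.7864 | 0.7904 | | |
| | | 1 | 0.1 | 0.7705 | 0.7653 | 0.7705 | | |
| | | 5 | 1 | 0.7945 | 0.7899 | 0.7923 | | |
| | | 5 | 0.1 | 0.8000 | 0.7955 | 0.7983 | | |
| | | 10 | 1 | 0.7945 | 0.7899 | 0.7940 | | |
| | | 10 | 0.1 | 0.7926 | 0.7879 | 0.7912 | | |
| | | 20 | 1 | 0.7939 | 0.7892 | 0.7933 | | |
| | | 20 | 0.1 | 0.7873 | 0.7825 | 0.7862 | | |
| | | 50 | 1 | 0.7907 | 0.7860 | 0.7891 | | |
| | | 50 | 0.1 | 0.7893 | 0.7845 | 0.7880 | | |
| | | 100 | 1 | 0.7915 | 0.7868 | 0.7907 | | |
| | | 100 | 0.1 | 0.7912 | 0.7865 | 0.7892 | | |
| teacher model (ResNet-101) | | | | 0.8703 | 0.8674 | 0.8692 | 88.90 | 42,468,013 |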
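Table 9. Results of Knowledge Distillation training experiments on EuroSAT dataset.

| Model | Training Method | T | λ | OA | K | F1 | FPS | Parameters |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| student model | direct training | | | 0.9333 | 0.9258 | 0.9336 | 440.00 | 511,050 |
| | KD training | 1 | 1 | 0.9363 | 0.9291 | 0.9360 | | |
| | | 1 | 0.1 | 0.9363 | 0.9291 | 0.9362 | | |
| | | 5 | 1 | 0.9376 | 0.9305 | 0.9376 | | |
| | | 5 | 0.1 | 0.9300 | 0.9221 | 0.9299 | | |
| | | 10 | 1 | 0.9352 | 0.9279 | 0.9352 | | |
| | | 10 | 0.1 | 0.9335 | 0.9260 | 0.9334 | | |
| | | 20 | 1 | 0.9389 | 0.9320 | 0.9388 | | |
| | | 20 | 0.1 | 0.9283 | 0.9203 | 0.9286 | | |
| | | 50 | 1 | 0.9320 | 0.9243 | 0.9319 | | |
| | | 50 | 0.1 | 0.9391 | 0.9322 | 0.9389 | | |
| | | 100 | 1 | 0.9430 | 0.9365 | 0.9429 | | |
| | | 100 | 0.1 | 0.9398 | 0.9330 | 0.9397 | | |
| teacher model (ResNet-101) | | | | 0.9474 | 0.9415 | 0.9471 | 98.93 | 42,396,298 |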