Remote Sensing Image Scene Classification via Label Augmentation and Intra-Class Constraint

Abstract: In recent years, many convolutional neural network (CNN)-based methods have been proposed to address the scene classification of remote sensing (RS) images. Since the number of training samples in RS datasets is generally small, data augmentation is often used to expand the training set. However, conventional data augmentation keeps the original label while changing the content of the image, which is not appropriate. In this study, label augmentation (LA) is presented to fully utilize the training set by assigning a joint label to each generated image, which considers the label and data augmentation at the same time. Moreover, the outputs of images obtained by different data augmentations are aggregated in the test process. However, the augmented samples increase the intra-class diversity of the training set, which makes the subsequent classification more challenging. To address this issue and further improve classification accuracy, Kullback–Leibler (KL) divergence is used to constrain the output distributions of two training samples with the same scene category to be consistent. Extensive experiments were conducted on the widely used UCM, AID and NWPU datasets. The proposed method can surpass most other state-of-the-art methods in terms of classification accuracy. For example, on the challenging NWPU dataset, a competitive overall accuracy of 91.05% is obtained with a 10% training ratio.


Introduction
With the advancement of imaging technology, remote sensing (RS) images have a higher resolution than before. At present, RS images have been used in many research domains, including object detection [1][2][3], image retrieval [4,5], change detection [6,7], land use classification [8] and environmental monitoring [9]. The RS image scene classification task, which tries to allocate a scene category to each RS image on the basis of its semantic information, has great significance in practical applications.
In earlier research, scene classification was accomplished by using low-level features, including color histograms (CH) [10], texture [11,12] and the scale-invariant feature transform (SIFT) [13]. However, these methods relied on engineering skills and expert experience to construct feature representations, which limits their ability to describe abundant scene information.
To overcome the limitations of low-level feature-based classification, many methods aggregate the extracted local low-level visual features into mid-level scene representations and achieve good performance on the scene classification task. As one of the most commonly used methods based on mid-level visual features, Bag-of-visual-words (BoVW) [14] uses k-means clustering to obtain a visual dictionary and then performs feature encoding to generate mid-level visual features. BoVW and its improved versions have been widely used for scene classification [15][16][17]. In addition, there are other classical methods based on mid-level features, such as spatial pyramid matching (SPM) [18], the improved Fisher kernel (IFK) [19] and vectors of locally aggregated descriptors (VLAD) [20].
However, the aforementioned methods rely on low-level and mid-level features extracted from RS images; these features are not deep representations and cannot represent the semantic information of images very well [21][22][23]. Recently, deep learning methods, especially CNNs, have shown excellent performance in computer vision tasks because of their strong feature extraction ability. Moreover, RS image scene classification is a high-level image processing task that is closely related to computer vision. In the early days, RS images had a low resolution, and the scenes to be classified were large-area land cover, which differs from the natural images used in computer vision that focus on small-scale objects. Therefore, it was difficult to introduce deep learning-based methods to RS image scene classification. Now, however, RS images have a high spatial resolution, and the difference between natural images and RS images has been reduced, which makes it feasible to introduce computer vision methods into remote sensing image processing.
Recently, many CNN-based methods have been introduced to complete scene classification tasks [24][25][26][27][28]. Instead of using low-level and mid-level features, CNN-based methods extract deep features from RS images by hierarchical feature extraction. In addition, most CNN-based methods use models pre-trained on ImageNet [29], such as AlexNet [30], VGG [31], ResNet [32] and DenseNet [33]. Hu et al. [24] verified the effectiveness of CNN models by using features extracted from convolutional layers. In [27], Li et al. proposed a novel filter bank to capture local and global features at the same time to improve classification performance. Moreover, the influence of different training strategies on classification performance has been studied. Three training strategies were considered: using pre-trained CNN models as feature extractors, fine-tuning pre-trained CNN models, and training models from scratch. The experimental results demonstrated that the fine-tuning strategy could obtain higher classification accuracy than the other two strategies [8,34].
To further improve the classification performance of CNN-based methods, some other advanced technologies, such as attention mechanism and feature fusion, have been introduced. The attention mechanism was introduced to enable deep models to focus on the important regions in RS images [35][36][37][38]. In [35], Bi et al. combined the attention pooling and dense blocks to extract features from different levels. In [36], Wang et al. proposed the recurrent attention structure to make models focus on important regions and high-level features. In addition, feature fusion was also an important method in scene classification [39][40][41][42][43]. In [40], Yu et al. combined the saliency coded and CNN model to complete the feature-level fusion. In [42], Lu et al. proposed a coding module and an advanced fusion strategy to make full use of intermediate features.
Although most existing CNN-based methods have achieved good performance on scene classification, the limited number of training images is still an important challenge. Deep learning methods rely on abundant labeled samples (e.g., ImageNet), but for remote sensing data, the acquisition of labeled samples is difficult. Compared with ImageNet, the number of images in RS datasets is small, which easily leads to overfitting. Even though the parameters of the pre-trained model are used for initialization, the overfitting problem still exists. In order to alleviate the overfitting, data augmentation has been widely used to expand the training set. However, when using data augmentation, most of the existing methods keep the original label while changing the content of the image, which is inappropriate. To address the above issue and fully use the training samples, we propose a method titled label augmentation that considers the label and data augmentation at the same time.
Moreover, label augmentation provides more training samples having accurate category information while also increasing intra-class diversity among the training set. It is necessary to impose a constraint on intra-class diversity to improve classification accuracy when using label augmentation. Therefore, we propose an intra-class constraint that utilizes KL divergence to constrain the output distribution of two RS images with the same scene category to reduce the intra-class diversity.
In this study, two methods are proposed to enhance the classification performance of RS images. First, label augmentation is proposed to make full use of training samples, and then we utilize KL divergence to reduce the intra-class diversity of the training set caused by label augmentation. The major contributions of this study are summarized as follows:
• In order to fully use RS images, we use label augmentation (LA) to obtain more accurate category information by assigning a joint label to each generated image.
• To solve the intra-class diversity of the training set caused by label augmentation, we use KL divergence to impose a constraint on the output distribution of two images with the same scene category.
• We combine the label augmentation and intra-class constraint to further improve the remote sensing image classification accuracy. The generalization ability of the proposed method is evaluated and discussed.
The remainder of this paper is organized as follows. Section 2 introduces the proposed methods, including label augmentation, intra-class constraint and their combination. The used datasets, experimental setup, and results are shown in Section 3. The experimental results are discussed and analyzed in Section 4. Finally, we conclude this paper in Section 5.

Methods
In this section, we describe the proposed methods in detail, including label augmentation, intra-class constraint and their combination.

Label Augmentation-Based RS Image Scene Classification
The label augmentation can be seen as an improvement of data augmentation. In the scene classification task, data augmentation can expand the training set to alleviate the overfitting problem. However, when an image is transformed by different data augmentation methods, the newly generated image keeps its original label, which imposes invariance to transformations on the classifier. To address the above issue, we propose the label augmentation, which considers the scene category and transformation of the remote sensing image at the same time and assigns a joint label to each generated image to effectively use training samples.
In Figure 1, we take the rotation transformations of a sample as an example to show the difference between data augmentation and label augmentation. We rotate the original image by 0°, 90°, 180° and 270° to generate four images. When we use data augmentation, all four images are labeled with the category airplane. When using label augmentation, the rotation information is added to the label. In this manner, a joint label for each generated image can be obtained [44], and the number of categories is expanded four-fold. The labels of the four images are (airplane, 0°), (airplane, 90°), (airplane, 180°) and (airplane, 270°). Label augmentation gives each remote sensing image more accurate category information than data augmentation.
With the increase in categories, the design of the classifier needs to be adjusted. We use x to represent a training sample with a label y ∈ {1, 2, ..., C}, where C is the number of categories. When we use data augmentation, the loss function is the cross-entropy loss L_CE between the output of the network f for x_i and the original label y, where x_i is obtained by applying the rotation transformation to the original training sample x and ω represents the parameters of network f. When applying label augmentation, the joint label is used to increase the category information, so the loss function is computed between the output of f for x_i and the joint label of x_i, over the N images obtained by label augmentation. In Figure 1, we use four rotation transformations to obtain four images, so N is equal to 4. The number of categories increases four-fold, so the output dimension of the fully connected (FC) layer is also expanded four-fold to C × N.
We show the training and test processes of label augmentation in Figure 2. During the training process, we train the model using the standard cross-entropy loss and update the parameters ω of the network f by backpropagation.
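The joint labeling described above can be illustrated with a minimal NumPy sketch. The encoding of a joint label as `rotation_index * num_classes + label`, the function names, and the choice to average the cross-entropy over the four rotated copies are assumptions made for illustration; the source only states that each rotated image receives a (category, rotation) label and that the FC layer has C × N outputs.

```python
import numpy as np

def rotate_with_joint_labels(image, label, num_classes):
    """Return the four 90-degree rotations of an H x W x channels image
    together with one joint label per rotation.

    Assumed encoding: joint = rotation_index * num_classes + label,
    which maps (label, rotation) pairs onto range(num_classes * 4).
    np.rot90 rotates counterclockwise.
    """
    rotations = [np.rot90(image, k=k, axes=(0, 1)) for k in range(4)]
    joint_labels = [k * num_classes + label for k in range(4)]
    return rotations, joint_labels

def cross_entropy(logits, target):
    """Cross-entropy of a single logit vector against an integer target."""
    shifted = logits - logits.max()                  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target]

def la_loss(logits_per_rotation, joint_labels):
    """Mean cross-entropy over the rotated copies (averaging is an assumption)."""
    losses = [cross_entropy(l, t) for l, t in zip(logits_per_rotation, joint_labels)]
    return float(np.mean(losses))

# Hypothetical example: a tiny 2 x 2 RGB "image" of class 0 in a 30-class dataset.
image = np.arange(12).reshape(2, 2, 3)
rotations, joint_labels = rotate_with_joint_labels(image, label=0, num_classes=30)
print(joint_labels)  # [0, 30, 60, 90]
```

With data augmentation, all four copies would instead keep the single label `0`, and the classifier would need only 30 outputs rather than 120.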
During the test process, we first aggregate the outputs by averaging the N output vectors and then calculate the softmax probability from the aggregated output. With label augmentation, a single CNN model identifies the scene category and the rotation transformation at the same time, which makes full use of the training set to enhance classification performance.
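One plausible reading of this aggregation step is sketched below. It assumes that the C × N outputs are laid out as N consecutive blocks of C scores (one block per rotation), and that for the i-th rotated copy only the block matching its own rotation is kept before averaging. Both the layout and the block selection are assumptions; the text specifies only that N output vectors are averaged and passed through a softmax.

```python
import numpy as np

def aggregate_predictions(logits, num_classes):
    """Aggregate the outputs of N rotated copies of one test image.

    logits: array of shape (N, N * num_classes); row i is the network
    output for the i-th rotated copy (assumed block layout by rotation).
    Returns a probability vector over the num_classes scene categories.
    """
    logits = np.asarray(logits, dtype=float)
    n = logits.shape[0]
    # per_rotation[i, k, c]: score of joint label (class c, rotation k) for copy i
    per_rotation = logits.reshape(n, n, num_classes)
    # For copy i keep the block that corresponds to its own rotation i.
    matched = per_rotation[np.arange(n), np.arange(n)]   # shape (N, C)
    averaged = matched.mean(axis=0)
    shifted = averaged - averaged.max()                  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical example with N = 4 rotations and C = 3 classes.
example = np.zeros((4, 12))
for i in range(4):
    example[i, i * 3 + 2] = 5.0      # every copy favors class 2 in its own block
probs = aggregate_predictions(example, num_classes=3)
```

Averaging before the softmax means that a single confident but wrong rotated copy has less influence than it would if the copies were classified independently.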

Intra-Class Constraint for RS Image Scene Classification
Although label augmentation can provide more training samples having accurate category information, it also increases the intra-class diversity of the training set. The intra-class diversity has an important influence on test accuracy. To address this issue, we utilize KL divergence as a regularization term to impose an intra-class constraint. In detail, we input two training samples with the same category into the network and obtain their output distributions. Since the input images have the same category, their output distributions should be similar. Therefore, we calculate the KL divergence of the two output distributions as the regularization term.
The framework of the intra-class constraint is shown in Figure 3. During the training process, in addition to the current training sample x_1, we input another randomly selected sample x_2 with the same category as x_1 into the network. The KL divergence is used to match the output distributions of the two training samples. The output distribution of an input image is the softmax of the network output softened by the distillation temperature T. The KL divergence between the output distributions of x_1 and x_2 then serves as the regularization term that imposes the intra-class constraint. The total loss combines the cross-entropy loss L_CE and the KL divergence regularization term, weighted by the coefficient α.
We set α to 1 so that the cross-entropy loss and the KL divergence regularization term have the same importance in our method. Moreover, we only perform backpropagation for the training sample x_1. The training sample x_2 is only used to calculate the value of the KL divergence, so backpropagation through it is not necessary.
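A minimal NumPy sketch of this regularized loss follows. The direction of the divergence, KL(p(x_2) || p(x_1)), the absence of any extra scaling factor, and the function names are assumptions; the text fixes only the temperature-softened softmax, the coefficient α = 1, and the fact that x_2 acts as a constant target. The default T = 2 is the value reported as best in the parameter sensitivity analysis.

```python
import numpy as np

def softened_distribution(logits, temperature):
    """Softmax of logits / temperature; a larger temperature gives a flatter output."""
    scaled = np.asarray(logits, dtype=float) / temperature
    shifted = scaled - scaled.max()                  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def intra_class_loss(logits_x1, logits_x2, label, temperature=2.0, alpha=1.0):
    """Cross-entropy on x1 plus alpha * KL between the softened outputs.

    logits_x2 comes from a second sample of the same category and is
    treated as a fixed target (no gradient would flow through it).
    """
    ce = -np.log(softened_distribution(logits_x1, 1.0)[label])
    p1 = softened_distribution(logits_x1, temperature)
    p2 = softened_distribution(logits_x2, temperature)
    return float(ce + alpha * kl_divergence(p2, p1))

# Hypothetical logits for two images of the same category (class 0).
logits_a = [2.0, 0.5, -1.0]
logits_b = [0.1, 1.5, 0.3]
loss_same = intra_class_loss(logits_a, logits_a, label=0)
loss_diff = intra_class_loss(logits_a, logits_b, label=0)
```

When the two outputs agree, the penalty vanishes and only the cross-entropy remains; the more the two same-class outputs disagree, the larger the added term.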

Combination of LA and Intra-Class Constraint for RS Image Scene Classification
In this sub-section, we combine the above two methods, i.e., LA and the intra-class constraint, to further enhance the classification performance. Figure 4 displays the overall framework of the proposed method. As with the intra-class constraint, we have a current input image x_1 and a randomly selected image x_2 with the same category as x_1. By applying label augmentation to samples x_1 and x_2, the numbers of images and categories increase N-fold. As shown in Figure 4, for each input image, we use four rotation transformations to obtain four images with joint labels. Then, we calculate the KL divergence of these four pairs of images. The total loss combines the loss L_LA and the KL divergence. Moreover, backpropagation is performed only for sample x_1 and its augmented samples, and the training sample x_2 is only used to calculate the value of the KL divergence.
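The combination can be sketched as follows, again under stated assumptions: the label augmentation loss and the KL term are each averaged over the four rotated copies, the divergence is taken as KL(p(x_2) || p(x_1)) on temperature-softened outputs, and the i-th rotation of x_1 is paired with the i-th rotation of x_2. The source states only that four pairs are compared and that the two terms are combined.

```python
import numpy as np

def softmax(z, temperature=1.0):
    """Row-wise softmax of z / temperature."""
    z = np.asarray(z, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)            # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def combined_loss(logits_x1, logits_x2, joint_labels,
                  temperature=2.0, alpha=1.0, eps=1e-12):
    """L_LA on the rotated copies of x1 plus alpha * mean KL over the pairs.

    logits_x1, logits_x2: arrays of shape (N, N * C), one row per rotated
    copy of x1 and x2; x2 is treated as a fixed target.
    """
    joint_labels = np.asarray(joint_labels)
    n = len(joint_labels)
    probs = softmax(logits_x1)
    loss_la = -np.log(probs[np.arange(n), joint_labels] + eps).mean()
    p1 = np.clip(softmax(logits_x1, temperature), eps, 1.0)
    p2 = np.clip(softmax(logits_x2, temperature), eps, 1.0)
    kl_per_pair = np.sum(p2 * (np.log(p2) - np.log(p1)), axis=-1)   # shape (N,)
    return float(loss_la + alpha * kl_per_pair.mean())

# Hypothetical outputs for N = 4 rotations and C = 3 classes (12 joint labels).
rng = np.random.default_rng(0)
out_x1 = rng.normal(size=(4, 12))
out_x2 = rng.normal(size=(4, 12))
labels = [0, 3, 6, 9]                # class 0 under each of the four rotations
total = combined_loss(out_x1, out_x2, labels)
```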
Algorithm 1 shows the workflow of the proposed method. The combination of label augmentation and intra-class constraint enables us to fully use the training set to enhance the classification performance.

Algorithm 1. Workflow of the proposed method.
Training:
    sample a batch (x, y) from the training set
    sample another batch with the same categories y from the training set
    apply LA to each image in the two batches to generate four new images and their joint labels
    calculate the cross-entropy loss
    obtain the output distributions of the two samples at the same position in the two batches
    calculate the KL divergence of the two output distributions
    update the parameters ω by minimizing the loss
Test:
    apply LA to each image in the test set
    aggregate the outputs and calculate the softmax probability
end

Datasets Description
In order to verify the effectiveness of the label augmentation and intra-class constraint, we carry out experiments on three public RS image datasets: the UC Merced (UCM) dataset [14], the AID dataset [45] and the NWPU-RESISC45 dataset [28].
The UCM dataset contains 2100 remote sensing images divided into 21 scene categories. There are 100 images in each category; each image has 256 × 256 pixels and a spatial resolution of 0.3 m. Figure 5 shows some selected images and their corresponding scene categories. The AID dataset consists of 10,000 remote sensing images, which are assigned to 30 scene categories. Each scene category contains between 220 and 420 images. Each image has 600 × 600 pixels, and the spatial resolution is between 8 m and 0.5 m. Some selected images and their corresponding scene categories are shown in Figure 6.

The NWPU-RESISC45 dataset is collected from Google Earth, and it contains 31,500 images, all of which have 256 × 256 pixels. These RS images belong to 45 scene categories. There are 700 images in each category, and the spatial resolution is between 30 m and 0.2 m. Some images and their corresponding scene categories are displayed in Figure 7.

Experimental Setup
For the UCM dataset, we select 80% of images to create a training set. For the AID dataset, we use 20% and 50% training ratios to create a training set. The training ratio of the NWPU-RESISC45 dataset is set to 10% and 20%. For the above three datasets, samples other than the training set constitute the test set.
The overall accuracy (OA) and confusion matrix are used to evaluate the performance of the proposed method. The OA reflects the classification performance of the CNN model. The confusion matrix is a way of accuracy evaluation. It is expressed in a matrix with n rows and n columns, where n represents the number of scene categories in each dataset. The confusion matrix reflects the relationship between the predicted result and label, and the number of correctly classified images is distributed diagonally in the confusion matrix.
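Both metrics can be computed with a few lines of NumPy. In this sketch the rows of the matrix index the true label and the columns the predicted label; that orientation and the function names are assumptions, since the text does not state which axis is which.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """n x n matrix whose entry (i, j) counts images of class i predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def overall_accuracy(cm):
    """Fraction of correctly classified images: the diagonal sum over the total."""
    return float(np.trace(cm) / cm.sum())

# Hypothetical predictions for six test images and three categories.
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
oa = overall_accuracy(cm)   # 4 of 6 images are correct
```

Dividing each row by its sum gives the per-class accuracies that are usually displayed on the diagonal of a normalized confusion matrix.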
We select the ResNet18 model [32], which is pretrained on ImageNet, as the backbone network. The last fully connected layer of the pretrained ResNet18 model is replaced with a new fully connected layer which is used to obtain the final classification results of the RS image classification task. In the experiments, two label augmentation (LA) methods are used, including color permutation and rotation. For the rotation operation, the rotation angles are set to 0, 90, 180 and 270 degrees. For the color permutation, there are three transformations, including RGB, GBR and BRG.
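The color permutation can be sketched as a reordering of the channel axis. The sketch assumes an H × W × 3 array stored in RGB order and reads "GBR" and "BRG" as the order in which the original channels are placed in the new image; the dictionary and function names are illustrative only.

```python
import numpy as np

# Assumed interpretation: each name lists which original channel fills
# positions 0, 1 and 2 of the permuted image.
CHANNEL_ORDERS = {"RGB": (0, 1, 2), "GBR": (1, 2, 0), "BRG": (2, 0, 1)}

def color_permutations(image):
    """Return a list of (name, permuted image) pairs for an H x W x 3 RGB array."""
    return [(name, image[..., list(order)]) for name, order in CHANNEL_ORDERS.items()]

# Hypothetical 1 x 1 image with R = 10, G = 20, B = 30.
pixel = np.array([[[10, 20, 30]]])
permuted = dict(color_permutations(pixel))
print(permuted["GBR"][0, 0].tolist())  # [20, 30, 10]
```

Under label augmentation each of the three permutations would receive its own joint label, so N = 3 and the classifier would have C × 3 outputs.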
In the training stage, we first resize all images to 224 × 224 and then feed them into the network. Then, the stochastic gradient descent (SGD) is used to update the model parameters with a batch size of 32. We set the training epochs and initial learning rate to 40 and 0.01, respectively. In addition, as the training progresses, the learning rate gradually decreases.
We carry out the experiments on a Windows 10 system with a 3.4 GHz i5-7500 CPU and an NVIDIA GeForce GTX 1070Ti GPU. The PyTorch [46] open-source framework and the Python programming language are used to implement the proposed methods.

Results of the AID Dataset
Since label augmentation can be considered as an improvement of the data augmentation, we first compared the label augmentation with data augmentation. We ran experiments five times with the random training set to obtain the final test results with the form of accuracy ± standard deviation. In Table 1, when using a 20% and 50% training ratio, the test results of data augmentation using color permutation were lower than fine-tuned ResNet18 by 1.65% and 0.75%, respectively. However, when applying label augmentation to color permutation, it had an improvement of 1.17% and 1.18% in terms of classification accuracy over the fine-tuned ResNet18, respectively. The above results indicated that the remote sensing images are sensitive to color permutation, and it is improper to directly assign the original label to the new image generated by the color permutation. The color transformation changed the content of the image, which increased the complexity of the classification task.
The results of label augmentation, including color permutation and rotation, were better than data augmentation, which indicates that it is reasonable to assign a joint label to each newly generated remote sensing image. The joint label can consider scene category and transformation at the same time and provide more accurate category information to improve the classification accuracy. In detail, the classification accuracy of LA with rotation transformation was higher than color permutation, which demonstrates that providing more direction and position information for the network can promote the improvement of classification accuracy. Therefore, in the following two datasets, UCM and NWPU-RESISC45, rotation transformation will be used as the way of label augmentation.
In addition, to address the intra-class diversity caused by label augmentation, we utilized KL divergence to constrain the output distribution of different samples with the same scene category. In addition to combining ResNet18 + LA with KL divergence, we conducted an ablation study to explore the effect of using KL divergence alone on ResNet18. The experimental results are shown in Table 2. With a 20% training ratio, applying KL divergence to ResNet18 and ResNet18 + LA improved the classification accuracy by 0.62% and 0.85%, respectively. With a 50% training ratio, the results show a similar trend, which demonstrates the effectiveness of using KL divergence to constrain the intra-class diversity of the training set.
Figure 8 shows the confusion matrices of ResNet18 + LA (rotation) + KL for the AID dataset. One can see that the correctly classified test samples are distributed along the diagonal of the two confusion matrices, and most scenes are classified with an accuracy of more than 90%.
Table 3 shows the comparison of our results with other advanced methods. One can see that our proposed method, ResNet18 + LA (rotation) + KL, achieved 94.98% and 96.52% classification accuracy under training ratios of 20% and 50%, respectively, which is higher than most previous methods. Rotation Invariance Regularization (RIR) [47] uses a deep Siamese CNN with rotation invariance regularization that combines a regularization constraint with the cross-entropy loss, and it obtains results similar to those of our method.

Results of UCM Dataset
According to Table 4, with a training ratio of 80%, combining KL divergence with ResNet18 and ResNet18 + LA improves the test accuracy by 1.16% and 0.24%, respectively. Compared with data augmentation, label augmentation improves the accuracy by 0.75%. Moreover, ResNet18 + LA (rotation) + KL achieves the highest classification accuracy of 99.21%. As can be seen from the confusion matrices in Figure 9, every scene can be easily distinguished from the others, and almost all categories are identified with a classification accuracy of 1. For some categories that are easy to confuse, such as Overpass and Freeway, applying the KL divergence improves the classification accuracy of the category Freeway by 5%. These results demonstrate that ResNet18 + LA + KL improves the overall accuracy by imposing KL divergence on the output distributions.
The overall classification comparison on the UCM dataset is displayed in Table 5. The proposed method, ResNet18 + LA (rotation) + KL, obtains a classification accuracy of 99.21%, which is higher than most previous methods. DCNN [46] also addresses the intra-class diversity of remote sensing images by metric learning, and RIR [48] uses the rotation transformation and a Siamese CNN to increase the robustness of remote sensing image scene classification. The combination of label augmentation and intra-class constraint is superior to these two methods. However, our classification performance is slightly lower than that of ACNet [37], which uses a dual-branch structure and an attention technique.
Table 5. The overall accuracy comparison of the UCM dataset. The best results are shown in bold.

Results of the NWPU Dataset
The results of the label augmentation and intra-class constraint are given in Table 6. Similar to the experimental results on the above two datasets, ResNet18 + LA (rotation) and ResNet18 + LA (rotation) + KL achieve higher classification accuracy than the baseline, which validates the effectiveness and robustness of our proposed methods. The correctly classified images are distributed along the diagonal of the confusion matrix in Figure 10. According to Table 7, by combining the label augmentation and KL divergence, we obtain 91.05% and 93.60% classification accuracy with 10% and 20% training ratios, respectively. Moreover, ResNet18 + LA (rotation) + KL shows better classification performance than most methods in Table 7.

Experimental Analysis
As can be seen from the results on the AID and NWPU datasets, the classification accuracy improves as the training ratio increases, indicating that the number of training samples has an important influence on the trained model. The label augmentation proposed in this paper assigns a joint label to each new image obtained by the input transformation, i.e., the rotation transformation. By applying label augmentation, we obtain more accurate category information and make the model use the training samples more effectively, which improves the classification accuracy. In addition, by using KL divergence to constrain the output distributions of two images with the same category, the classification performance can be further improved, indicating that KL divergence alleviates the intra-class diversity caused by label augmentation to some extent. We also compare our methods with other advanced methods. DCNN [46] uses metric learning to address the intra-class diversity of remote sensing images. RIR [48] uses the rotation transformation and a Siamese CNN to increase the robustness of the CNN model. ResNet18 + LA + KL achieves the highest classification results in most cases compared with these two methods, which validates the effectiveness of our method.

Parameter Sensitivity Analysis
The influence of the factor T on KL divergence is analyzed. The factor T is an important parameter, which decides the soft degree of the output distribution. We set the value of T as {1, 2, 3, 4} and conduct experiments on the AID dataset.
In Figure 11, when the value of T is set to 2, the ResNet18 + LA + KL obtains the highest test accuracy.

Analysis of Softmax Scores
The softmax scores reflect the probability of the image belonging to each category. We use a bar chart to display the softmax scores. Figure 12 shows the softmax scores of two test samples obtained by ResNet18 + LA and ResNet18 + LA + KL. The first test sample belongs to category 0 (airplane), but the prediction obtained by ResNet18 + LA is category 34 (runway). The reason is that the airplane area occupies a small proportion of the image, while the runway area occupies most of the image. Therefore, it is easy to confuse this image with the category runway. When applying KL divergence, we impose the constraint on the output distribution, and the prediction obtained by ResNet18 + LA + KL is category 0 (airplane), which is the correct result. The second test sample belongs to category 27 (palace), and it is similar to category 7 (church). We can see that ResNet18 + LA + KL again obtains the correct classification result.
The softmax scores of two test samples obtained by ResNet18 and ResNet18 + KL are shown in Figure 13. By using the KL divergence to impose constraint, the intra-class diversity is decreased, which results in the indirect decrease in between-class similarity. The first test sample in Figure 13 belongs to category 1 (bareland). One can see that the probabilities of bareland and desert obtained by ResNet18 are close, indicating that it is easy to confuse bareland and desert. When the KL divergence is applied to ResNet18, the distance between feature distributions of different categories is increased due to the intra-class constraints, and the prediction of ResNet18 + KL obtains the correct classification result. The result of the second test sample has the same trend. The results in Figures 12 and 13 validate the effectiveness of using the KL divergence.

Analysis of the Comparison Experiment
We use the same data augmentation strategy and the same model to give a fair comparison. In the comparison experiments, RIR [48] also uses the same rotation operation to expand the training set. Using the same model (ResNet50), we compare our results with RIR [48] on the NWPU dataset. The results are shown in Table 8. With ResNet50, we obtain 93.37% and 95.26% classification accuracy with 10% and 20% training ratios, respectively. Compared with RIR [48], our method improves the classification accuracy by 1.32% and 1.2%, which demonstrates the effectiveness of the proposed method.

Analysis of Running Time and Computational Complexity
In order to analyze the computational cost of the proposed method, we conduct the following four experiments on the AID dataset: ResNet18, ResNet18 + DA (rotation), ResNet18 + LA (rotation), and ResNet18 + LA (rotation) + KL. The experimental results are shown in Table 9. From Table 9, one can see that the application of data augmentation does not significantly increase the training time, since data augmentation does not change the total number of training samples. When we use label augmentation, there is a four-fold increase in training samples, and the dimension of the fully connected layer is also expanded four-fold. Therefore, the training time is longer than with data augmentation due to the increase in training samples and network parameters.
We calculate the number of floating-point operations (FLOPs) for different models. The number of FLOPs on ResNet18 and ResNet18 + DA (rotation) is 1820.90 M. However, when we use the label augmentation, the dimension of the fully connected layer is expanded four-fold. Therefore, the number of FLOPs on ResNet18 + LA (rotation) is increased to 1820.95 M. For the ResNet18 + LA (rotation) + KL, the network needs to calculate the output of two samples with the same category, so the number of FLOPs on ResNet18 + LA (rotation) + KL is 3641.90 M.
We also calculate the number of parameters in the model. For the ResNet18 and ResNet18 + DA (rotation), the last fully connected layer of the pretrained ResNet18 model is replaced with a new fully connected layer which is used to obtain the final classification results of the RS image classification task. The number of parameters in ResNet18 and ResNet18 + DA (rotation) is 11.19 M. For the ResNet18 + LA (rotation) and ResNet18 + LA (rotation) + KL, the dimension of the fully connected layer is expanded four-fold. Therefore, the number of parameters in ResNet18 + LA (rotation) and ResNet18 + LA (rotation) + KL is increased to 11.24 M.
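The reported parameter counts can be checked with simple arithmetic. The sketch below assumes the commonly cited total of 11,689,512 parameters for ResNet18 with its 1000-way ImageNet head (the torchvision figure), a 512-dimensional input to the final layer, and the 30 categories of the AID dataset; the constant and function names are illustrative.

```python
# Assumed reference figure: ResNet18 with the original 1000-class ImageNet head.
RESNET18_TOTAL_PARAMS = 11_689_512
FEATURE_DIM = 512          # input size of the final fully connected layer
IMAGENET_CLASSES = 1000

def fc_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

# Parameters of everything except the final fully connected layer.
BACKBONE_PARAMS = RESNET18_TOTAL_PARAMS - fc_params(FEATURE_DIM, IMAGENET_CLASSES)

def model_params(num_outputs):
    """Backbone plus a new fully connected layer with num_outputs units."""
    return BACKBONE_PARAMS + fc_params(FEATURE_DIM, num_outputs)

aid_classes, num_rotations = 30, 4
baseline = model_params(aid_classes)                    # ResNet18 and ResNet18 + DA
with_la = model_params(aid_classes * num_rotations)     # ResNet18 + LA (120 outputs)

# Extra multiply-accumulate operations in the enlarged layer.
extra_fc_macs = FEATURE_DIM * (aid_classes * num_rotations - aid_classes)

print(f"{baseline / 1e6:.2f} M")   # 11.19 M
print(f"{with_la / 1e6:.2f} M")    # 11.24 M
```

Under these assumptions the enlarged layer adds 46,080 multiply-accumulate operations, roughly 0.05 M, which is consistent with the reported change from 1820.90 M to 1820.95 M if one such operation is counted as one FLOP. The figure for ResNet18 + LA (rotation) + KL, 3641.90 M, is twice 1820.95 M because two samples pass through the network.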

Discussion of the Generalization Ability of the Proposed Method
Considering that the label augmentation uses four fixed rotation angles (i.e., 0°, 90°, 180° and 270°) to expand the training set, we add a small random angle α to the four fixed rotation angles to increase the generalization ability of the network, which is called improved label augmentation (ILA). In addition, we double the number of samples for each rotation angle (i.e., 0° ± α, 90° ± α, 180° ± α and 270° ± α) to further expand the training set. In the experiments, the small random angle α is set to 10°. The results on the NWPU dataset are shown in Table 10. From Table 10, when using the improved label augmentation, we obtain 91.81% and 93.91% classification accuracy with 10% and 20% training ratios, respectively. Compared with ResNet18 + LA (rotation) + KL, this is an improvement of 0.76% and 0.31% in classification accuracy. These results demonstrate that the improved label augmentation can enhance the generalization ability of the network and further improve the classification performance.

Visualization of Feature Embeddings Using T-SNE
The T-SNE algorithm maps features from a high-dimensional space to a low-dimensional space while retaining the characteristics of the features [51]. We extract the 512-dimensional feature vector from the penultimate layer of the network and then map it to a two-dimensional space. Figure 14 shows the T-SNE results of different methods on the AID dataset with a 20% training ratio. In Figure 14, each color represents a scene category, and one can see that the separability between categories is increased by using KL divergence.

Conclusions
In this study, how to efficiently use remote sensing images to address the scene classification task was explored, and then label augmentation and intra-class constraint were proposed to improve the classification performance. We selected ResNet18 as the backbone network to perform experiments on the three RS datasets. By applying label augmentation, we considered the label and data augmentation at the same time, which can obtain higher classification accuracy than data augmentation. Then, KL divergence was used to solve the intra-class diversity caused by label augmentation. The combination of label augmentation and intra-class constraint was superior to other excellent methods in classification accuracy. The experimental results in this paper demonstrated that great performance gains could be obtained by making full use of data even without complex algorithms. There is great prospect to study how to improve scene classification performance from the data level.