Early Labeled and Small Loss Selection Semi-Supervised Learning Method for Remote Sensing Image Scene Classification

Abstract: The classification of aerial scenes has been extensively studied as a basic task in remote sensing image processing and interpretation. However, the performance of remote sensing image scene classification based on deep neural networks is limited by the number of labeled samples. To alleviate the demand for massive labeled samples, various methods have been proposed that apply semi-supervised learning to train the classifier using both labeled and unlabeled samples. However, because of the complex contextual relationships and large spatial differences in remote sensing images, existing semi-supervised learning methods introduce varying numbers of incorrectly labeled samples when pseudo-labeling unlabeled data. When the number of labeled samples is small, this particularly harms the generalization performance of the model. In this article, we propose a novel semi-supervised learning method with early labeled and small loss selection. First, the model learns the characteristics of simple samples in the early stage, and multiple early models are used to screen out a small number of unlabeled samples for pseudo-labeling based on this characteristic. Then, the model is trained in a semi-supervised manner by combining labeled samples, pseudo-labeled samples, and unlabeled samples. During training, small loss selection is used to further eliminate some of the noisy labeled samples and improve the recognition accuracy of the model. Finally, to verify the effectiveness of the proposed method, it is compared with several state-of-the-art semi-supervised classification methods. The results show that when only a few labeled samples are available in remote sensing image scene classification, our method consistently outperforms previous methods.


Introduction
With advances in drone technology and high-resolution vision sensors, remote sensing plays a key role in acquiring data without on-site inspections [1]. Hundreds of remote sensing satellites are now in orbit, acquiring a vast amount of information about the Earth's surface every day. In this sense, remote sensing data processing may be considered a big data problem because of the volume of data to be processed, its diversity [2][3][4], and its generation speed. The recent emergence of cloud computing has expanded the possibilities of remote sensing. In the field of high-resolution remote sensing (HRRS) image processing, scene classification methods that can be used to solve practical problems, such as mapping, land-type monitoring, and urban planning, have become an active research topic [5][6][7].
During the past few years, deep learning models, especially convolutional neural networks (CNNs), have received extensive attention in the field of scene classification [8][9][10]. However, CNNs usually require a large number of high-quality labeled samples in the training phase. Unfortunately, collecting labeled training scene images is time- and energy-consuming [11]. In contrast, acquiring unlabeled images is much easier than building a dataset manually annotated by experts and engineers.
In this case, semi-supervised learning (SSL) methods have been introduced to jointly utilize labeled and unlabeled data in the context of HRRS images. For example, a semi-supervised generative framework is proposed in [12], which uses a residual network (ResNet) [13] and very deep CNNs (VGG) [14] as the feature extractors, uses a co-training-based self-labeled method to select and identify unlabeled data, and uses discriminative evaluation to enhance the classification of confusable classes with similar visual features. In [15], the authors propose an SSL method for HRRS classification based on CNNs and ensemble learning: ResNet is adopted to extract preliminary HRRS image features, ensemble learning is used to establish discriminative image representations by exploring the intrinsic information of all labeled and unlabeled data, and supervised learning is finally performed for scene classification. Although the above methods have made progress in semi-supervised scene classification, they rely on network ensembles and therefore train multiple networks instead of one. Moreover, these methods still require a certain number of labeled samples.
It is well known in the machine learning community that SSL methods based on consistency regularization [16,17] and mixing regularization [18,19] have proven to be simple yet effective, achieving a number of state-of-the-art results on natural images over the last few years. Consistency regularization encourages consistent predictions: two different augmentations of the same unlabeled image should lead to similar prediction probabilities. Mixing regularization is inspired by MixUp [20], which uses a blending factor drawn from the beta distribution to blend pairs of images and their corresponding ground truth labels. Interpolation Consistency Training (ICT) [18] applies MixUp to a pair of pseudo-labeled unlabeled images, whose class probabilities are predicted by the exponential moving average (EMA) of the training model, and uses consistency regularization to ensure that the predictions of the training model on the mixed images match those of the EMA model. MixMatch [19] works by estimating low-entropy labels for data-augmented unlabeled examples, mixing labeled and unlabeled data using MixUp, and then training an SSL classifier to output consistent predictions on the linear interpolation of the data.
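As an illustration of the MixUp blending step mentioned above, the following is a minimal NumPy sketch, not the authors' implementation. The function name `mixup`, the toy image shapes, and the Beta parameter `alpha=0.75` are assumptions made for this example only.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.75, rng=None):
    """Blend two samples and their label vectors with a Beta-distributed factor.

    alpha is an assumed value for illustration; the source does not state it.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)          # blending factor in [0, 1]
    x_mix = lam * x1 + (1.0 - lam) * x2   # interpolate the images
    y_mix = lam * y1 + (1.0 - lam) * y2   # interpolate the labels identically
    return x_mix, y_mix, lam

# Toy data: two random 4x4 RGB "images" with one-hot labels over 3 classes.
rng = np.random.default_rng(0)
img_a, img_b = rng.random((4, 4, 3)), rng.random((4, 4, 3))
lab_a, lab_b = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
x_mix, y_mix, lam = mixup(img_a, lab_a, img_b, lab_b, rng=rng)
```

The mixed label remains a valid probability vector because it is a convex combination of two probability vectors.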
Contrary to natural images, however, HRRS images contain complex contextual relationships and large differences in object scale and are often affected by the camera angle, the direction of objects, illumination, and atmospheric conditions, which can result in high intra-class variations and low inter-class variations [21,22]. Therefore, SSL techniques based on consistency regularization are unable to achieve good generalization performance on remote sensing images with few labeled samples (for example, only one or five samples per category). Moreover, as training proceeds, the neural network will memorize the unlabeled data together with the false pseudo-labels, which affects the recognition accuracy of the model [23,24].
To cope with this problem, in this paper, we propose an early labeled and small loss selection semi-supervised learning method for aerial scene classification, namely ELSLS-SSL. Inspired by early-learning regularization [23], early pre-trained models are used to label part of the unlabeled HRRS image samples. First, we initialize multiple independent ResNet networks with different parameters and train each of them independently with the MixMatch SSL method for only a few epochs; the resulting ResNet models, which have different parameters, are used to label unlabeled samples. Then, the pseudo-labeled unlabeled samples are divided into a low-noise labeled dataset and an unlabeled dataset through high-probability sample selection. Finally, SSL is carried out with the labeled data, pseudo-labeled data, and unlabeled data based on small loss selection combined with a pseudo-labeled loss function. The experiments on the AID and NWPU-RESISC45 datasets show that adding the pseudo-labeled data, filtered and labeled by multiple ResNet models trained in the early stage, to SSL allows the trained neural network to achieve higher classification accuracy.
The rest of this article is organized as follows. Section 2 introduces related work. The proposed method is described in Section 3. The experiments are described in Section 4, which is followed by discussions of the method with further experiments in Section 5. Finally, Section 6 concludes this article.

Related Work
In this section, we briefly review existing works on semi-supervised learning methods and learning using noisy labels.
Semi-supervised learning is a kind of weakly supervised learning. Its main idea is to optimize the model by combining a large amount of unlabeled data with a small amount of labeled data during training. Generally speaking, semi-supervised learning is a hybrid of supervised and unsupervised learning that combines the advantages of the two: it can use labeled data for supervised learning, and it can generate labels for unlabeled data during the training process to optimize the model. In recent years, semi-supervised learning has made great progress. Interested readers can consult the surveys and books in [25][26][27].
During the past few years, Google Research has published a series of papers on semi-supervised learning methods, including MixMatch [19], ReMixMatch [28], and FixMatch [29]. MixMatch combines consistency regularization with data augmentation, entropy minimization, and MixUp. Building on MixMatch, ReMixMatch [28] adopts distribution alignment and augmentation anchoring strategies. Distribution alignment encourages the distribution of a model's aggregated class predictions to match the marginal distribution of ground-truth class labels. With augmentation anchoring, for each given unlabeled input, it generates multiple strongly augmented versions and trains the model on them using the pseudo-label generated from the weakly augmented version. FixMatch [29] resembles ReMixMatch in that it uses the model's predictions on weakly augmented unlabeled images to generate pseudo-labels. However, a pseudo-label is retained only if the model produces a high-confidence prediction. The model is then trained to predict the pseudo-label when given a strongly augmented version of the same image.
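The confidence-based retention of pseudo-labels described for FixMatch can be sketched as follows. This is only an illustration: the function name `confident_pseudo_labels`, the toy probabilities, and the threshold of 0.95 are assumptions for the example and are not taken from this article.

```python
import numpy as np

def confident_pseudo_labels(weak_probs, threshold=0.95):
    """Return hard pseudo-labels and a mask of predictions confident enough to keep.

    weak_probs: (N, C) softmax outputs on weakly augmented unlabeled images.
    threshold:  assumed confidence cut-off, chosen here for illustration.
    """
    confidence = weak_probs.max(axis=1)   # highest class probability per sample
    labels = weak_probs.argmax(axis=1)    # class attaining that probability
    mask = confidence >= threshold        # True where the pseudo-label is retained
    return labels, mask

weak_probs = np.array([
    [0.97, 0.02, 0.01],   # confident: kept
    [0.40, 0.35, 0.25],   # uncertain: discarded
    [0.03, 0.96, 0.01],   # confident: kept
])
labels, mask = confident_pseudo_labels(weak_probs)
```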
These semi-supervised learning methods have the following characteristics. A small number of labeled samples and a large number of unlabeled samples are used in the training process, and as the number of labeled samples increases, the recognition performance is significantly improved. In this case, this article screens unlabeled samples, and selects some samples for pseudo-labeling, in the hope of adding to the labeled samples, so as to improve the recognition accuracy of the model.
How to screen samples is one of the main foci of this article. At the same time, in the process of semi-supervised learning, it is inevitable that some pseudo-labeled samples are mislabeled, which also affects the recognition accuracy of the model. The study of false labels in training data belongs to the problem of noisy label learning. Most existing methods for training CNNs with noisy labels seek to correct the loss function. The most popular type of method can be understood as relabeling, such as modeling with directed graphical models [30], using a knowledge graph [31], and improving the bootstrapping method by exploiting the dimensionality of the feature subspace [32]. The second type of method cleans and separates the training data and uses the clean samples after separation for model training [33][34][35].
Since our main task is semi-supervised learning, a large number of unlabeled samples are optimized through dynamic labeling during the training process, so the above two main methods are not suitable for our task. In the training process, our method adopts a small loss selection method to dynamically select pseudo-labeled samples to filter out the incorrect samples as much as possible and improve the accuracy of the model.

Methodology
SSL methods aim to improve the model's performance by leveraging unlabeled data. Current state-of-the-art SSL methods can be seen as learning from noisy pseudo-labeled data. When trained on noisy labels, deep neural networks have been observed to first fit the training data with clean labels during an early learning phase, before eventually memorizing the examples with false labels [23]. Inspired by this idea, we propose an SSL method that uses early training models to label data. An overview of the method is shown in Figure 1, where the training procedure includes three phases: unlabeled sample labeling with early training multi-models, high-probability sample selection, and retraining. Here, $f_{\theta_1}$, $f_{\theta_2}$, and $f_{\theta_3}$ are three networks pre-trained for a few epochs on the labeled and unlabeled datasets using the MixMatch method, and $f_{\theta}$ is the final network retrained on the labeled dataset, low-noise dataset, and sub-unlabeled dataset through small loss selection based on the MixMatch method.

Figure 1. Framework of the proposed multi-screening unlabeled sample semi-supervised learning method.

Early Training Multi-Models for Unlabeled Sample Labeling
For aerial scene classification, let $D_L = \{(x_i, y_i)\}_{i=1}^{N_L}$ denote the set of labeled training data, where $x_i$ is the i-th sample, $y_i \in \{0, 1\}^C$ is the one-hot label over $C$ classes, and $N_L$ is the total number of labeled samples. Similarly, the set of unlabeled data can be represented as $D_U = \{u_i\}_{i=1}^{N_U}$, where $u_i$ is the i-th unlabeled sample, and $N_U$ is the number of unlabeled samples. More formally, given a model with parameters $\theta$, based on MixMatch [19], the combined loss $L$ for SSL is computed as:

$$L = L_X + \lambda_U L_U,$$

$$L_X = \frac{1}{|X|} \sum_{(x, y) \in X} H(y, p(x; \theta)),$$

$$L_U = \frac{1}{C|U|} \sum_{(u, q) \in U} \| q - p(u; \theta) \|_2^2,$$

where $H(a, b)$ is the cross-entropy between distributions $a$ and $b$, $p(x; \theta)$ is the model's softmax output (the predicted probability for each class $c$), $q$ is the guessed label of the unlabeled sample $u$, $X$ and $U$ are transformed from a batch of labeled data and unlabeled data through MixUp [20], and $\lambda_U$ is a hyperparameter.
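The combined loss can be illustrated with a small NumPy sketch. It assumes the standard MixMatch form (cross-entropy on the labeled batch plus a weighted squared-error term on the unlabeled batch); the function names and the toy inputs are hypothetical, and the weight `lambda_u` is left to the caller because its value is not given in the text.

```python
import numpy as np

def cross_entropy(targets, probs, eps=1e-12):
    """Per-sample cross-entropy H(targets, probs) for (N, C) arrays."""
    return -np.sum(targets * np.log(probs + eps), axis=1)

def mixmatch_loss(y_x, p_x, q_u, p_u, lambda_u):
    """Assumed MixMatch-style loss: L = L_X + lambda_u * L_U.

    y_x, p_x: targets and model softmax outputs for the (mixed) labeled batch.
    q_u, p_u: guessed labels and model softmax outputs for the unlabeled batch.
    """
    loss_x = cross_entropy(y_x, p_x).mean()
    # Mean over samples and classes equals (1 / (C * |U|)) * sum of squared errors.
    loss_u = np.mean((q_u - p_u) ** 2)
    return loss_x + lambda_u * loss_u, loss_x, loss_u

y_x = np.array([[1.0, 0.0], [0.0, 1.0]])
p_x = np.array([[0.5, 0.5], [0.5, 0.5]])
q_u = np.array([[1.0, 0.0]])
p_u = np.array([[0.5, 0.5]])
total, loss_x, loss_u = mixmatch_loss(y_x, p_x, q_u, p_u, lambda_u=1.0)
```

With uniform predictions over two classes, the labeled term equals ln 2 and the unlabeled term equals 0.25, which makes the example easy to check by hand.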
Deep networks tend to learn clean samples faster than noisy samples [36], and we assume that this phenomenon also exists in SSL. Although SSL can improve the generalization performance of the model to a certain extent, when there is little labeled data, the model will be affected by the noisy data in the unlabeled data. Our goal is to find a model or a combination of multiple models $P = \{p(u; \theta_i)\}_{i=1}^{M}$ to label unlabeled samples. These models can not only learn clean data in SSL but also avoid overfitting noisy data during the early training process. Then, relatively clean samples are selected from the pseudo-labeled samples and converted into one-hot labeled samples. Finally, the selected one-hot labeled samples are added to the SSL training process as low-noise labeled data, which are treated differently from both labeled data and unlabeled data, to train a new model with better generalization performance.
In order to select low-noise labeled data of high quality, according to the early-learning phenomenon [23], we adopt the MixMatch SSL method, using labeled data and unlabeled data to train the model for a few epochs. However, using a single model to select low-noise samples, and then using those samples to train a new model, may cause confirmation bias. Intuitively, two or more networks can filter different types of errors introduced by noisy pseudo-labels since they have different learning abilities. Therefore, we use different initialization parameters and sample input orders to independently train multiple models and use the mean of these models' predictions for an unlabeled sample as its pseudo-label. For an unlabeled sample $u$ ($u \in D_U$), we set

$$\hat{y} = \frac{1}{M} \sum_{i=1}^{M} p(u; \theta_i),$$

where $\theta_i$ denotes the parameters of the i-th early training model, obtained by training for only a few epochs on all data with its own initialization. Finally, we obtain the pseudo-labeled dataset $D_{P_U} = \{(u_i, \hat{y}_i)\}_{i=1}^{N_U}$.

The early training multi-model process for unlabeled sample labeling is shown in Figure 2, where $f_{\theta_1}$, $f_{\theta_2}$, and $f_{\theta_3}$ are three networks pre-trained for a few epochs on the labeled and unlabeled datasets using the MixMatch method. These three pre-trained models are used to pseudo-label unlabeled samples.
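The averaging of early-model predictions can be sketched in a few lines of NumPy. The array layout (models, samples, classes), the function name `ensemble_pseudo_labels`, and the toy probabilities are assumptions for this illustration; in practice each slice would be the softmax output of one early-trained network.

```python
import numpy as np

def ensemble_pseudo_labels(model_probs):
    """Average the softmax outputs of M early models.

    model_probs: array of shape (M, N_U, C).
    Returns soft pseudo-labels of shape (N_U, C).
    """
    return np.mean(model_probs, axis=0)

# Toy predictions of M = 3 models on 2 unlabeled samples over C = 3 classes.
probs = np.array([
    [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]],
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    [[0.8, 0.1, 0.1], [0.4, 0.3, 0.3]],
])
y_hat = ensemble_pseudo_labels(probs)
```

Because each model outputs a probability vector, the averaged pseudo-label is again a probability vector.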


High-Probability Sample Selection
In this section, we propose a method for screening pseudo-labeled samples. The specific process is shown in Figure 3. It uses the early training multi-models to pseudo-label the unlabeled samples and, according to the pseudo-labeling results, sorts the samples of each category by predicted probability. Finally, the top-ranked samples are selected as low-label-noise samples to train the new model.

In Section 3.1, through preliminary training, we obtained multiple scene classification models with different parameters and used these models to pseudo-label unlabeled samples. Intuitively, the pseudo-labeled dataset $D_{P_U} = \{(u_i, \hat{y}_i)\}_{i=1}^{N_U}$ obtained in Section 3.1 contains a large amount of incorrectly labeled data and cannot be directly used to train the model. However, the performance of a CNN is better if the training data become less noisy. We therefore aim to select some low-noise data from the pseudo-labeled dataset $D_{P_U}$ to optimize the classification model. According to [35], CNNs tend to learn simple patterns first and then gradually memorize all samples. Based on this observation, if most models agree on the pseudo-label of an unlabeled sample, that pseudo-label is likely to be correct. We thus select the low-noise pseudo-labeled samples from $D_{P_U}$ class by class, where $N_s$ is the number of samples selected in each category. In other words, in the pseudo-labeled dataset $D_{P_U}$, we select the top $N_s$ samples with the highest predicted probability of $\hat{y}$ in each category to form the low-noise pseudo-labeled dataset $D_{P_s}$. Specifically, we convert each pseudo-label in the low-noise pseudo-labeled dataset $D_{P_s}$ into a one-hot label.
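The per-class selection described in this section can be sketched as follows. This is one plausible reading of the procedure, not the authors' code: the function name `select_low_noise` is hypothetical, samples are assigned to the class with the highest averaged probability, and ties are broken by original order.

```python
import numpy as np

def select_low_noise(y_hat, n_s):
    """Pick the n_s most confident pseudo-labeled samples of each predicted class.

    y_hat: (N_U, C) soft pseudo-labels (averaged model predictions).
    Returns the selected sample indices and their one-hot labels.
    """
    pred_class = y_hat.argmax(axis=1)   # predicted category of each sample
    confidence = y_hat.max(axis=1)      # predicted probability of that category
    num_classes = y_hat.shape[1]
    selected = []
    for c in range(num_classes):
        idx = np.flatnonzero(pred_class == c)
        # Rank this category's candidates by descending confidence, keep the top n_s.
        order = np.argsort(-confidence[idx], kind="stable")[:n_s]
        selected.extend(idx[order].tolist())
    selected = np.array(selected, dtype=int)
    one_hot = np.eye(num_classes)[pred_class[selected]]  # soft label -> one-hot
    return selected, one_hot

y_hat = np.array([[0.90, 0.10], [0.60, 0.40], [0.80, 0.20],
                  [0.30, 0.70], [0.45, 0.55]])
selected, one_hot = select_low_noise(y_hat, n_s=2)
```

In this toy case, sample 1 is dropped because it is the least confident of the three samples predicted as class 0.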

Retraining and Small Loss Selection
After obtaining the low-noise pseudo-labeled dataset $D_{P_s}$, we use the labeled dataset $D_L$, the low-noise pseudo-labeled dataset $D_{P_s}$, and the unlabeled dataset $D_{P_u}$ to train a new CNN model based on the MixMatch semi-supervised learning method. In order to make better use of the low-noise pseudo-labeled dataset, we rewrite the loss function as

$$L = L_X + \lambda_{P_s} L_{P_s} + \lambda_U L_U, \quad (6)$$

where $L_{P_s}$ is the loss on $U_{P_s}$, which is transformed from labeled data, low-noise pseudo-labeled data, and unlabeled data through MixUp [20], and $\lambda_{P_s}$ is a hyperparameter. In order to ensure that the three datasets are mixed using MixUp, in the actual training of the model, we use the same number of labeled data and low-noise pseudo-labeled data in each mini-batch. The number of unlabeled data is the sum of the number of labeled data and low-noise pseudo-labeled data.

However, the pseudo-labeled samples still contain a certain amount of incorrectly labeled data. Adding these mislabeled data to the training process harms the generalization performance of the model and reduces its accuracy. Moreover, semi-supervised learning gradually relabels unlabeled samples as training proceeds. As a result, a wrongly pseudo-labeled sample is not necessarily wrong throughout the training process, so wrongly labeled samples cannot be eliminated through a simple one-time screening. According to [35], small-loss samples are likely to be correctly labeled. Thus, in the training process of our semi-supervised learning model, if we train the model using only the small-loss pseudo-labeled samples in each batch, a certain number of incorrectly labeled samples will be eliminated.
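The mini-batch composition and the three-term loss described above can be written down as a small sketch. The function names and the numeric values in the example are hypothetical; the individual loss terms are assumed to have been computed elsewhere.

```python
def batch_sizes(n_labeled):
    """Mini-batch composition as described in the text.

    The pseudo-labeled part matches the labeled part in size, and the
    unlabeled part is as large as both together.
    """
    n_pseudo = n_labeled
    n_unlabeled = n_labeled + n_pseudo
    return n_labeled, n_pseudo, n_unlabeled

def combined_loss(loss_x, loss_ps, loss_u, lambda_ps, lambda_u):
    """Three-term loss: labeled + weighted low-noise pseudo-labeled + weighted unlabeled."""
    return loss_x + lambda_ps * loss_ps + lambda_u * loss_u

sizes = batch_sizes(16)
# Illustrative loss values and weights only; none are taken from the paper.
total = combined_loss(0.5, 0.2, 0.1, lambda_ps=1.0, lambda_u=10.0)
```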
In order to reduce the influence of incorrectly labeled samples on the model, we apply a small loss criterion to select relatively correct pseudo-labeled samples, as shown in Figure 4. In our algorithm, we use the training loss in Equation (6) to minimize the impact of incorrectly labeled samples on the model. The labeled samples contain no wrong labels, so during the entire training process, only the last two terms of Equation (6) involve wrong labels, and the last term ($L_U$) involves more wrongly labeled samples. Specifically, we conduct small-loss selection in each batch: for the low-noise pseudo-labeled term and the unlabeled term, only the samples with the smallest loss are kept, where $R(t_s)$ and $R(t_u)$ are thresholds that control the number of incorrectly labeled samples to be screened out. The threshold is scheduled as

$$R(t) = 1 - \min\left\{ \frac{t}{T_k} \tau, \tau \right\},$$

where $T_k$ is the total number of training epochs, $t$ is the current training epoch, and $\tau$ is a hyperparameter. At the beginning of training, we keep more small-loss data (with a large $R(t)$) in each batch since deep networks fit clean data first. After obtaining the small-loss instances, we calculate the average loss on these examples for further backpropagation.
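A minimal sketch of small-loss selection within a batch is given below. It assumes a keep-rate schedule of the form R(t) = 1 - min(t / T_k * tau, tau), which matches the symbols defined in the text; the function names, the per-sample losses, and `tau = 0.5` are hypothetical values chosen for illustration.

```python
import numpy as np

def keep_rate(t, total_epochs, tau):
    """Assumed schedule: fraction of the batch kept at epoch t."""
    return 1.0 - min(t / total_epochs * tau, tau)

def small_loss_select(losses, rate):
    """Keep the `rate` fraction of samples with the smallest loss.

    Returns the kept indices and the mean loss over them, which is the
    quantity that would be backpropagated.
    """
    n_keep = max(1, int(np.ceil(rate * len(losses))))
    keep_idx = np.argsort(losses, kind="stable")[:n_keep]
    return keep_idx, losses[keep_idx].mean()

# Per-sample losses of 8 pseudo-labeled samples in one batch (toy values).
losses = np.array([0.1, 2.3, 0.4, 1.7, 0.2, 0.9, 0.3, 3.1])
rate = keep_rate(t=60, total_epochs=120, tau=0.5)   # 1 - min(0.25, 0.5) = 0.75
keep_idx, mean_loss = small_loss_select(losses, rate)
```

Here the two largest losses (2.3 and 3.1) are dropped, and the remaining six samples contribute an average loss of 0.6.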

Experimental Results
In this section, we introduce the experimental setup, including the dataset, network architecture, training setup, and metrics. Then, we compare our proposed method with some state-of-the-art approaches by using the NWPU-RESISC45 and AID datasets.

Dataset
Two public aerial image datasets, NWPU-RESISC45 [37] and the Aerial Image Dataset (AID) [38], are used in the experimental section. NWPU-RESISC45 is a very large-scale benchmark for remote sensing scene classification that was created by Northwestern Polytechnical University (NWPU). AID contains samples of various resolutions from different sensors, which is extremely challenging and is one of the most commonly used datasets for evaluating scene classification algorithms.

Network Architecture
ResNet50 pre-trained on ImageNet was used as the backbone of our network architecture. The last 1000-dimensional fully connected (FC) layer of ResNet50 was replaced by a C-dimensional FC layer, where C is the number of classes in the training dataset.

Training Setup
For our experiments, we used a batch of 16 images and 200 batches as an epoch. The early training models were trained for only 10 epochs, and the final model was trained for 120 epochs using the labeled dataset, low-noise pseudo-labeled dataset, and unlabeled dataset. An Adam optimizer with a learning rate of $3 \times 10^{-5}$ was employed for all models. The number of selected samples per category was $N_s = 4$, and the number of early training models was $M = 3$. All the baseline SSL methods were trained with an Adam optimizer with a learning rate of $3 \times 10^{-5}$ for 120 epochs, and the learning rate remained constant during the training phase. Finally, we conducted three independent experiments on each dataset and reported the average accuracy over these experiments as the final recognition accuracy of each SSL method.
All experiments were carried out on a computer equipped with an Intel CPU i7 10700k, an NVIDIA GeForce RTX 2080Ti, and 16 GB DDR4 memory. The operating system was Ubuntu 18.04 and the running software was Python 3.7.

Metrics
Accuracy and the confusion matrix were used as evaluation metrics in our scene classification experiment. The accuracy was calculated as the number of correctly classified samples divided by the total number of samples. The advantage of the confusion matrix is that it can clearly show all the errors between different categories and the different degrees of confusion of the model to the samples.
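Both metrics can be computed with a few lines of NumPy. The sketch below uses hypothetical function names and toy labels; rows of the confusion matrix index the true class and columns the predicted class, which is an assumed convention.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Count matrix with true classes as rows and predicted classes as columns."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)   # accumulate one count per sample
    return cm

def accuracy(cm):
    """Correctly classified samples (the diagonal) divided by all samples."""
    return np.trace(cm) / cm.sum()

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
cm = confusion_matrix(y_true, y_pred, num_classes=3)
acc = accuracy(cm)
```

Off-diagonal entries show which categories are confused with each other; in the toy data, one class-0 sample is mistaken for class 1 and one class-2 sample for class 0.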

Experiment on NWPU-RESISC45 Dataset
Our proposed method was evaluated on two large-scale datasets. The first dataset was the NWPU-RESISC45 dataset, which contains 31,500 remote sensing images of 45 categories, extracted from Google Earth by experts in the field of remote sensing, with a spatial resolution ranging from approximately 30 m to 0.2 m per pixel. Each scene class in NWPU-RESISC45 contains 700 images of 256 × 256 pixels in the RGB color space. Figure 5 shows an example image of each class in the NWPU-RESISC45 dataset. For the NWPU-RESISC45 dataset, we first randomly selected 20% of the samples from each category as the test set. Then, in order to verify the effectiveness of our semi-supervised learning method with different numbers of labeled samples, we randomly selected 1, 2, 3, and 5 samples in each category as the labeled dataset, and used the remaining samples as the unlabeled dataset.

The proposed method was compared with other state-of-the-art methods, including label propagation [15], EL + LR [15], Mean-teacher [16], ICT [18], and MixMatch [19]. Label propagation is a typical graph-based semi-supervised method, which can be flexibly transferred to different tasks and performs well. EL + LR adopts ensemble learning (EL) to establish discriminative image representations by exploring the intrinsic information of all available data, and uses supervised learning to perform logistic regression (LR)-based scene classification. Mean-teacher [16] uses a moving average of the model parameters as the teacher model, generates proxy labels for each unlabeled sample, and calculates a consistency loss and a supervised loss. Building on Mean-teacher, ICT [18] and MixMatch [19] use MixUp-mixed data during semi-supervised training to improve the accuracy of the model. We evaluated all of the above methods on the same training samples. Figures 6 and 7 and Table 1 show all the experimental results.
As can be seen from Table 1, in the case of a small number of labeled samples, the recognition accuracy of all methods improves greatly as the number of labeled samples increases. The diversity of the NWPU-RESISC45 dataset and the variation within its classes pose a considerable challenge to model accuracy. Among all the compared methods, only our method achieves an accuracy of more than 90% when there are five labeled samples in each category. Figures 6 and 7 show the confusion matrices obtained on NWPU-RESISC45 by our method and the MixMatch method, respectively, with three labeled samples for each category. With 45 scene classes, the confusion matrix is large and misclassified samples occur more frequently. It can be seen from Figure 7 that the accuracy of the MixMatch method is very low for several categories (airport, church, freeway, medium residential, palace, tennis court, and wetland), and the accuracy for individual scenes is almost 0. However, with our proposed method, under the same training sample conditions, the recognition accuracy improves greatly for several of these categories: airport, church, freeway, medium residential, and tennis court. Among them, the improvement for medium residential and tennis court is particularly apparent. However, the model still has obvious deficiencies in identifying individual categories such as palace, and most of the palace samples are identified as commercial areas. We believe that this is due to the small number of labeled samples and the random selection of labeled samples for each category. In this case, the randomly selected samples for a single category may not be representative, resulting in low classification accuracy. To verify this conjecture, we randomly selected samples in the same way and retrained the model. Experiments showed that, under this sample selection condition, there are always some categories with low recognition accuracy. In future research, we will consider improving the classification results by improving the selection of labeled samples for each category. In general, the comparison between Figures 6 and 7 shows that our method achieves higher accuracy than the MixMatch method in almost all categories.
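The data split described for NWPU-RESISC45 (20% of each category for testing, a handful of labeled samples per category, the rest unlabeled) can be sketched as below. The function name `split_per_class`, the seed, and the use of integer class ids in place of real images are assumptions made for this example.

```python
import numpy as np

def split_per_class(labels, test_fraction, n_labeled, seed=0):
    """Per-category random split into test, labeled, and unlabeled index sets."""
    rng = np.random.default_rng(seed)
    test, labeled, unlabeled = [], [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))  # shuffle this category
        n_test = int(round(test_fraction * len(idx)))
        test.extend(idx[:n_test])
        labeled.extend(idx[n_test:n_test + n_labeled])
        unlabeled.extend(idx[n_test + n_labeled:])
    return np.array(test), np.array(labeled), np.array(unlabeled)

# 45 categories with 700 images each, as stated for NWPU-RESISC45.
labels = np.repeat(np.arange(45), 700)
test_idx, labeled_idx, unlabeled_idx = split_per_class(labels, 0.2, n_labeled=3)
```

With three labeled samples per category, this yields 6300 test images, 135 labeled images, and 25,065 unlabeled images.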

Experiment on AID Dataset
The second dataset was the Aerial Image Dataset (AID). It contains 10,000 images divided into 30 classes, collected from Google Earth imagery, with a pixel resolution ranging from approximately 0.5 m to 8 m and a fixed image size of 600 × 600 pixels. The number of images in each category varies from 220 to 420. Figure 8 shows one example image for each class in the AID dataset. Table 2 shows the number of images in each semantic class of the AID dataset. The samples in AID are collected from different remote sensing sensors, so the samples come from multiple sources. Each image has a fixed size while covering scene categories at different resolutions, which increases the difficulty of classifying the sample scenes. For the AID dataset, we first randomly selected 70 samples from each category, for a total of 2100 samples, as the test set. Then, in order to verify the effectiveness of our semi-supervised learning method with different numbers of labeled samples, as in the experiment on the NWPU-RESISC45 dataset, we randomly selected 1, 2, 3, and 5 samples in each category as the labeled dataset, and used the remaining samples as the unlabeled dataset.

The proposed method was compared with other state-of-the-art methods, including label propagation [15], EL + LR [15], Mean-teacher [16], ICT [18], and MixMatch [19]. The results are given in Table 3. As can be seen from Table 3, for the AID dataset, our method outperforms all compared methods for every number of labeled samples per category. In terms of recognition accuracy, our method achieves more than 90% with three and five labeled samples in each category. Among the compared methods, only the MixMatch method exceeds 90% accuracy, and only when there are five labeled samples in each category.
This shows the superior performance of our method in the case of a small number of labeled samples.

To further show the effectiveness of our proposed method, Figure 9 shows the confusion matrix obtained on AID by our method, with three labeled samples for each category. The per-category recognition accuracy of our method and MixMatch with three labeled samples for each category is compared in Figure 10, which details the improvement of our method over MixMatch in each category. As shown in Figures 9 and 10, our method improves the recognition accuracy of almost every category. However, as with the NWPU-RESISC45 dataset, the recognition performance of our algorithm is not satisfactory for individual categories. It can be seen from Figure 9 that the algorithm performs very poorly on the school category. From the comparison results in Figure 10, the recognition accuracy of our method for schools is slightly lower than that of MixMatch. In Figure 9, the model's recognition accuracy for farmland and forest is 96% and 100%, respectively. In contrast, the model has a very low recognition accuracy for schools, and most of the school samples are classified as medium residential by the model. From Figure 10, it can be found that the difference between farmland and forest is very large, while the difference between school and medium residential is very small. This shows that our model can fully learn and extract the characteristics of farmland and forest samples, and accurately classify them when the labeled samples are limited. However, for schools and medium residential areas, which differ little from each other, there are still shortcomings that make the classification results unsatisfactory. In follow-up research, we will explore ways to improve the model's recognition of samples from individual categories.

Discussion
In this section, we discuss the results of our method on the two datasets. Then, through sensitivity analysis, we analyze the effects of the number of training epochs for the early training of multiple models $N_E$, the number of early training models $M$, and the number of samples of each category $N_s$ selected from the pseudo-labeled dataset on the performance of the final model.

Discussion of Experimental Results
Experiments on two datasets show that our method can greatly improve the recognition performance of the semi-supervised learning model in scene classification in the case of a small number of labeled samples. It is undeniable that the recognition of individual categories remains poor in the two experiments. This is related to the nature of the remote sensing image scene classification task, in which some categories have high inter-class similarity and large intra-class differences. Figure 11 shows samples of two categories in the NWPU-RESISC45 dataset, commercial areas and palaces, and samples of two categories in the AID dataset, medium residential and schools, with four samples per category. It can be seen that the inter-class differences between these samples are extremely low. This causes the model to produce errors when pseudo-labeling unlabeled data, and the existing semi-supervised learning methods (including our method) fail to detect such pseudo-labeling errors in time, which affects the model's recognition accuracy for individual categories. Moreover, the experimental results on the two datasets show that the accuracy of the Mean-teacher method is very low. This is also strongly related to the small number of labeled samples. However, the MixMatch method and the ICT method improve significantly over the Mean-teacher method. An overall comparison of these three methods indicates that using MixUp to mix samples in the semi-supervised learning algorithm is highly beneficial for model accuracy. Through sample mixing, the labels of pseudo-labeled samples are smoothed, which reduces the impact of incorrectly labeled samples on the performance of the model to a certain extent.
The gap between these methods also indicates that incorrectly labeled samples affect the model, especially when the number of labeled samples is very small. This supports the view that our method improves the classification performance of the model by adding a small set of low-noise labeled samples and by screening out mislabeled samples with large losses.
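Screening out large-loss samples can be sketched as follows. This is a minimal illustration, assuming that per-sample losses are already available for a batch of pseudo-labeled samples. The function name `small_loss_selection` and the `keep_ratio` parameter are hypothetical, and the ratio actually used in the experiments is not stated in this section.

```python
def small_loss_selection(losses, keep_ratio=0.8):
    """Return the (sorted) indices of the samples with the smallest losses."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not losses:
        return []
    # Keep at least one sample so that a batch is never emptied completely.
    n_keep = max(1, round(len(losses) * keep_ratio))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:n_keep])


# Toy batch: samples 1 and 4 have large losses and are treated as likely noise.
losses = [0.10, 2.30, 0.25, 0.05, 1.70]
kept = small_loss_selection(losses, keep_ratio=0.6)
print(kept)
```

Only the kept indices would contribute to the pseudo-label loss for that batch. The underlying assumption is that networks fit correctly labeled samples first, so mislabeled samples tend to have larger losses.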

Influence of Parameters on Performance of Proposed Method
In this section, the AID dataset is used as an example to analyze the influence of three important parameters, N_E, M, and N_s, on the performance of the final model when two labeled samples are available for each category.
Table 4 shows how the accuracy changes as N_E (the number of epochs for the early training of the multiple models) varies over a wide range of values while the other two parameters are fixed at M = 3 and N_s = 4. The model performs best when N_E = 10. Recall that our method trains multiple models in the first stage, uses these models to partition the unlabeled samples, and selects a small subset of them to pseudo-label as low-noise labeled samples. The model is then retrained in a semi-supervised manner on the labeled samples, low-noise labeled samples, and unlabeled samples, and the model obtained in this second stage is the final classification model. N_E in Table 4 therefore refers only to the number of epochs of early multi-model training and has nothing to do with the number of epochs of second-stage training. Our results show that relatively small values of N_E give better recognition results for the final model. After a few epochs of simple training, the models have learned the simple samples, and they have done so for every category. As training continues with few labeled samples, the model tends to classify most unlabeled samples into a few easy-to-learn categories, which harms the generalization performance of the model. Therefore, the number of epochs N_E for the early training of the multiple models should not be too large.
For N_s (Table 5), selecting more low-noise samples with the early-trained models is not always better. When fewer than four samples are selected for each category, the performance of the model increases with the number of selected samples. When too many samples are selected for each category, the selected pseudo-labeled samples contain too many falsely labeled samples, and the resulting label noise degrades the classification performance of the model. Our method therefore achieves its best classification performance when four pseudo-labeled samples are selected for each category. When the number of selected samples is 0, the setting can be regarded as an ablation experiment that uses only the small loss selection method. In that case, the classification accuracy is 75.87%, which is better than the MixMatch method. This indicates that small loss selection is effective on its own.
For M (Table 6), four early-trained models give the best results, and three models give the second-best. Because the performance of three models does not differ much from that of four and three models require fewer computing resources, we used three models to label unlabeled data in the comparison experiments.
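The roles of the number of early models and the number of samples selected per category can be made concrete with a short sketch of the first-stage selection. This section does not spell out the selection rule, so the sketch rests on an assumption: a sample is a candidate only if all early models predict the same category, and for each category the candidates with the highest mean confidence are kept. The function name `select_pseudo_labels` and the toy probabilities are hypothetical.

```python
from collections import defaultdict


def select_pseudo_labels(model_probs, n_s=4):
    """Pick at most n_s pseudo-labeled samples per category.

    model_probs[m][i] is the class-probability list that early model m
    assigns to unlabeled sample i. Returns {category: [sample indices]}.
    """
    n_models = len(model_probs)
    n_samples = len(model_probs[0])
    candidates = defaultdict(list)
    for i in range(n_samples):
        # Predicted category of each early model for sample i.
        preds = [max(range(len(p[i])), key=p[i].__getitem__)
                 for p in model_probs]
        if len(set(preds)) != 1:
            continue  # assumed rule: drop samples the models disagree on
        cls = preds[0]
        mean_conf = sum(p[i][cls] for p in model_probs) / n_models
        candidates[cls].append((mean_conf, i))
    selected = {}
    for cls, scored in candidates.items():
        scored.sort(reverse=True)  # highest mean confidence first
        selected[cls] = [i for _, i in scored[:n_s]]
    return selected


# Toy example: M = 3 early models, 4 unlabeled samples, 2 categories.
probs = [
    [[0.90, 0.10], [0.6, 0.4], [0.2, 0.8], [0.70, 0.30]],
    [[0.80, 0.20], [0.4, 0.6], [0.1, 0.9], [0.60, 0.40]],
    [[0.95, 0.05], [0.7, 0.3], [0.3, 0.7], [0.55, 0.45]],
]
selected = select_pseudo_labels(probs, n_s=1)
print(selected)
```

In this toy run, sample 1 is dropped because the models disagree, and sample 3 loses to sample 0 on mean confidence. Raising `n_s` admits lower-confidence candidates, which is one way to picture why selecting too many samples per category increases label noise.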

Discussion of the Computational Complexity
In this section, we discuss the computational complexity of our method and the other semi-supervised learning methods. The main configuration of the computer used is described in the experimental section. In the same environment, we compared the computational cost empirically by measuring the time spent on training and validation for each method. The results are shown in Table 7. Our method is of the same order of magnitude as the other methods in terms of both time complexity and space complexity. Because our method adds early training and sample screening to the training process, its training time is longer, but the overall training time is not significantly greater than that of the other methods. In practical use, the model is mainly applied to image recognition, so the test time is the more relevant measure. All methods in this article are implemented with the ResNet50 model, so their test times are essentially the same, at around 4.60 ms per image on average.
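A per-image test time such as the one reported above can be measured with a simple wall-clock loop. The sketch below is an assumption about the measurement procedure, not a description of the benchmark used for Table 7. The names `average_time_ms` and `predict` are hypothetical, and a real measurement would call the trained network inside `predict` (and synchronize the GPU first, if one is used).

```python
import time


def average_time_ms(predict, images, warmup=1):
    """Average wall-clock time per image, in milliseconds."""
    # Warm-up calls are excluded from the timing (caches, lazy initialization).
    for image in images[:warmup]:
        predict(image)
    start = time.perf_counter()
    for image in images:
        predict(image)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(images)


calls = []


def predict(image):
    # Stand-in for a model forward pass; records that it was called.
    calls.append(image)
    return sum(image)


images = [[0.1, 0.2, 0.3]] * 100
avg_ms = average_time_ms(predict, images)
print(f"{avg_ms:.4f} ms per image")
```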

Conclusions
In this paper, we have presented an early labeled and small loss selection semi-supervised learning method to reduce the demand for labeled samples in remote sensing image scene classification. Early models that are trained for only a few epochs are used to select and pseudo-label a small number of unlabeled samples, and the selected pseudo-labeled data are combined with labeled data and unlabeled data to train a new classification model under small loss selection. This method can greatly improve the classification performance of the model. The experimental results on the AID and NWPU-RESISC45 datasets show the superior performance of our method. In the experiments, we also found that, when only a very small number of labeled samples are available, the existing semi-supervised learning methods cannot classify well, because the differences between some remote sensing image categories are not obvious. In future research, we hope to explore active learning, in which a small number of samples are selectively labeled, to improve the generalization performance of the model with very few labeled samples and to further improve its classification accuracy. We also plan to extend the proposed method to polarimetric SAR images.

Data Availability Statement: The experiment in this paper uses public datasets, so no data are reported in this work.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.