Prototype Calibration with Feature Generation for Few-Shot Remote Sensing Image Scene Classification

Abstract: Few-shot classification of remote sensing images has attracted attention due to its important applications in various fields. The major challenge in few-shot remote sensing image scene classification is that only limited labeled samples can be utilized for training. This may lead to deviation in the prototype feature expression, and thus the classification performance will be impacted. To solve these issues, prototype calibration with a feature-generating model is proposed for few-shot remote sensing image scene classification. In the proposed framework, a feature encoder with self-attention is developed to reduce the influence of irrelevant information. Then, the feature-generating module is utilized to expand the support set of the testing set based on prototypes of the training set, and prototype calibration is proposed to optimize the features of support images, which can enhance the representativeness of each category's features. Experiments on the NWPU-RESISC45 and WHU-RS19 datasets demonstrate that the proposed method can yield superior classification accuracies for few-shot remote sensing image scene classification.


Introduction
Remote sensing images captured by satellites contain rich information on land-cover targets and have been widely applied in various scenarios such as road detection [1], ecological monitoring [2], disaster prediction, and other fields [3,4]. Scene classification of remote sensing images aims to classify novel images into corresponding categories based on the captured information [5,6], and it has become an important research direction at present. Few-shot scene classification aims to classify images into corresponding categories with limited samples [7], which is of great importance since only limited labeled samples can be acquired in various applications. Few-shot classification of remote sensing images has great application prospects in environmental monitoring, biological protection, resource development, and so on, and it can greatly reduce the requirements of field research and manual annotation.
Deep learning has achieved admirable performance in traditional image classification [8,9], but it generally needs a large amount of labeled data for training. Deep neural networks perform well with complex network structures and sufficient prior knowledge [10,11]. However, if the labeled training set is inadequate, the deep model tends to overfit and its feature expression deviates. In various remote sensing applications, it is hard to acquire labeled samples, and collecting annotated data is quite time-consuming [12]. In addition, the training process of a deep neural network consumes a lot of computing resources, and the network needs to be re-trained if some hyper-parameters are not set properly.
Inspired by the way humans connect unknown things with prior knowledge, few-shot learning has been developed for tasks with limited supervised information, and it has the ability to recognize novel categories using only a few annotated samples [7]. Few-shot learning methods can be roughly divided into three categories: meta learning, metric learning [13], and transfer learning [14]. Meta learning, also known as learning to learn [15], plays the role of guiding the learning of new tasks. In metric learning methods, the similarity of two images is compared to estimate whether they belong to the same category; the higher the similarity, the greater the possibility of belonging to the same category. Different metric learning methods have been proposed [16], in which various distance measures are utilized for few-shot classification. Transfer learning applies prior knowledge from relevant tasks to novel tasks on the basis of pre-trained models.
The above-mentioned methods pay attention to training a well-performing classifier or a robust learning model for few-shot learning, ignoring the significance of feature expression. In scene classification of remote sensing images, the classifier may not perform well when the learned features cannot effectively represent the corresponding category [17]. Furthermore, in few-shot remote sensing image scene classification tasks, high similarity of images between categories and great differences among images within the same category are major issues [18]. Some examples can be found in Figure 1. It is observed from Figure 1a that these images reflect intra-class consistency and are well-suited to few-shot classification. In Figure 1b, images belonging to the beach category have different colors, and images in the palace category show different textures, which reflects the phenomenon of large intra-class variances. In Figure 1c, it is obvious that images of freeway and railway have similar texture features, and images of lake and golf course have almost consistent backgrounds, which indicates that there are high similarities between categories. Moreover, there may be diverse objects in a remote sensing image, which also influences the performance of scene classification. These issues seriously affect classification performance, and effective feature representation for remote sensing images should be considered.
Considering the characteristics and large scales of remote sensing images, some deep networks [19], such as VGG16 and AlexNet, were utilized in early research to extract features in order to obtain more semantic information. However, it is hard to train deep networks with insufficient annotated samples, especially in the few-shot situation. Most recently proposed methods pay more attention to enhancing features. In DLA-MatchNet [20], channel attention and spatial attention are introduced to reduce the influence of noise, and a learnable measurement is also used in order to improve the ability of the classifier. In RS-MetaNet [21], a meta-training strategy is developed to learn a generalized distribution for few-shot scene classification. Moreover, a feature pyramid network [22] has also been introduced to fuse feature maps extracted from different convolutional layers, which aims to improve the representation ability of sample features. These models have been proven to improve the accuracy of remote sensing image scene classification, but the improvement is not obvious when only one or a few labeled samples of each category can be provided. In this paper, prototype calibration with a feature-generating model is proposed for few-shot remote sensing image scene classification, which aims to enhance the representation of each category's features. In the proposed framework, a pre-training strategy with generalizing loss and fine tuning is utilized to train a robust feature extractor, and self-attention layers are constructed to reduce the influence of the background. Then, feature generation is utilized to expand the support set of the testing set based on prototypes of the training set, and prototype calibration is proposed to optimize the features of support images. Finally, the expanded support set with modified prototypes is imported to a logistic regression (LR) classifier, and the predicted results for images in the query set are obtained.
The major contributions of this paper can be generally summarized as follows:
• A prototype calibration with a feature-generating model is proposed for few-shot remote sensing image scene classification, which is able to make full use of prior knowledge to expand the support set and modify the prototype of each category. It enhances the expression ability of prototype features, which can overcome the issues of intra-class variances and inter-class similarity in remote sensing images.
• Self-attention layers are developed to enhance target information, which can reduce the influence of irrelevant information. They are developed to solve the problem of high background similarity between categories in remote sensing images.
• Experimental results on two public remote sensing image scene classification datasets demonstrate the efficacy of our proposed model, which outperforms other state-of-the-art few-shot classification methods.
The rest of this paper is organized as follows. Section 2 shows the related works. Section 3 introduces the proposed methodology in detail. Section 4 reports the experimental results and the analysis. Conclusions are finally summarized in Section 5.

Related Works
In this section, we review the works related to this paper. The classical and popular methods are highlighted.

Few-Shot Learning
Few-shot learning is a novel machine learning paradigm that aims to learn from limited examples [7] and can achieve great performance via knowledge transfer. In many applications, labeled data are usually difficult to collect, and sample labeling is time-consuming. In early studies, semi-supervised learning was developed, which needs only a small number of labeled samples while a large number of unlabeled samples can also be utilized during training [23,24]. The major difference between semi-supervised learning and few-shot learning is that unlabeled samples are also insufficient for few-shot learning [25]. Few-shot learning has recently become a hot direction in the field of deep learning, and it tends to be more suitable for various applications.
Currently, few-shot learning models are mainly supervised learning methods, which are able to learn classifiers with just a few labeled samples of each category [26]. Meta learning [22], also known as learning to learn, develops the concept of the episode to solve few-shot learning tasks, where the data is decomposed into different meta tasks in episodic training. Meta-learning-based few-shot methods can learn the generalization ability of the classifier in the case of category variations, and they are able to conduct classification without changing the existing model when facing new categories in the testing stage. Coskun et al. [27] proposed a meta-learning-based few-shot learning model, which can extract distinctive and domain-invariant features for action recognition. Gradient optimization [28] and metric learning [29] have also been developed for few-shot learning. In gradient-optimization-based few-shot learning models, the main idea is to learn a superior initialization of model parameters that can be optimized by a few gradient steps in novel tasks, in which long short-term memory (LSTM) [30] and recurrent neural networks (RNN) [31] are commonly utilized. In metric-learning-based few-shot learning models, the major idea is to learn a unified metric or matching function; e.g., a relation network [32] is proposed to measure features by neural networks. Dong et al. [33] developed a new metric loss function for few-shot learning, which can enlarge the distance between different categories and reduce the distance within the same category. Few-shot learning has been developed for many computer vision tasks, including image classification [34,35], object detection, segmentation [36], and so on.

Remote Sensing Image Scene Classification
Remote sensing image scene classification aims to categorize images into various land-cover and land-use classes, which is a fundamental task widely applied in many remote sensing applications. Recently, deep neural networks have been developed for scene classification of remote sensing images; in particular, convolutional neural network (CNN) based models have been proposed [37,38]. Cheng et al. [39] proposed a discriminative CNN (D-CNN) to address the diversity of target information in the metric learning framework. Zhang et al. [40] proposed a CNN-CapsNet model for remote sensing image scene classification, in which a CNN is utilized for feature extraction and CapsNet is designed for classification. Wang et al. [18] proposed an end-to-end attention recurrent CNN model for scene classification, which is able to obtain high-level features and discard noncritical information. Sun et al. [41] proposed a gated bidirectional network based on CNN for scene classification, which aggregates hierarchical features and reduces interference information. Rafael Pires et al. [42] investigated the scene classification performance of CNNs with transfer learning, which demonstrated the effectiveness of transfer learning from natural images to remote sensing images. Xie et al. [43] developed a remote sensing image scene classification model with label augmentation, in which Kullback-Leibler divergence is utilized as the intra-class constraint to restrict the distribution of training data. Shi et al. [44] proposed a lightweight CNN based on attention-oriented multi-branch feature fusion for remote sensing image scene classification.
For practical applications, labeled remote sensing images are quite limited, and thus the scale of data is still not sufficient from the perspective of deep learning. To address this problem, few-shot learning has been introduced in remote sensing image scene classification, which aims to train a model that can quickly adapt to novel categories using only a few labeled examples [45]. Moreover, there are issues of intra-class variances and inter-class similarity in remote sensing images. Thus, effective feature expression is of great importance for few-shot classification of remote sensing images.

Methodology
The proposed prototype calibration with the feature-generating model for few-shot remote sensing image scene classification is depicted in Figure 2.

Problem Formulation
Few-shot remote sensing image scene classification can be regarded as a series of C-way K-shot N-query tasks, which stands for classifying N unlabeled samples of C different categories using only K labeled samples per category. Here, K samples with labels and N samples without labels are selected from each category, and thus a support set is built from C × K labeled samples and a query set from C × N samples. Specifically, the number of support samples is equal to the number of categories when K is set to 1. The overall dataset for remote sensing image scene classification can be divided into three subsets, namely the training set, validation set, and testing set. The procedures are composed of the training process, the validation process, and the testing process, which are introduced as follows.
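The episode construction described above can be sketched as follows. This is a minimal illustration with NumPy; the function name and sampling details are ours, not taken from the paper:

```python
import numpy as np

def sample_episode(labels, c_way=5, k_shot=1, n_query=15, rng=None):
    """Sample one C-way K-shot N-query episode from a labeled pool.

    `labels` is a 1-D array of integer class labels; returns index
    arrays for the support set (C*K samples) and query set (C*N samples).
    """
    if rng is None:
        rng = np.random.default_rng()
    # Pick C distinct categories, then K support and N query samples each.
    classes = rng.choice(np.unique(labels), size=c_way, replace=False)
    support, query = [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        support.extend(idx[:k_shot])
        query.extend(idx[k_shot:k_shot + n_query])
    return np.array(support), np.array(query)
```

With `k_shot=1`, the support set contains exactly one sample per category, matching the one-shot setting described above.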
Different from traditional supervised learning methods, few-shot learning does not have enough labeled samples to learn comprehensive prior knowledge. Thus, knowledge transfer and learning the internal relationships of same-category samples become significant in few-shot remote sensing image scene classification. The support set S = {(x_i, y_i)} (i = 1, 2, ..., C × K) and query set Q = {(x_j, y_j)} (j = 1, 2, ..., C × N) are selected from the training set, where x_i denotes the ith sample and y_i denotes the corresponding label. In the training process, the network parameters are updated through gradient descent on the loss function, and a well-trained feature extractor can finally be learned.
The major purpose of the validation process is to verify the robustness of the deep model. Similar to the training process, the validation set is split into the support set and query set, where labels of the query set are predicted based on the support set. Only the forward process is conducted during the validation process, which means that no back propagation is calculated to optimize the deep model.
As for few-shot remote sensing image scene classification, all the categories that appear in the testing process are novel, which means the categories in the testing set are different from those of the training and validation sets. Labels of the testing set are predicted by the trained model with the support set, which has only K labeled samples of each category. The category with the highest predicted probability is selected as the sample label.

Pre-Training with Generalizing Loss
Our feature encoder is composed of several convolutional blocks. For each remote sensing image with the shape of T × W × H, where T stands for the channel of features, W represents the width, and H represents the height, a rotation operation (rotating 90 degrees at a time for each image) is conducted in order to extend the dataset. To optimize the feature encoder, a prediction loss function and a rotation loss function are utilized as the generalizing loss in this work. The prediction loss and rotation loss are defined as

L_p = f_L(f_c(f_E(x)), y),  L_r = f_L(f_c(f_E(x_r)), y_r),

where L_p stands for the prediction loss, L_r stands for the rotation loss, and f_L denotes the cross-entropy function [46]. x denotes the input data, f_E represents the feature encoder, f_c represents the fully connected layer for classification, and y is the one-hot label with respect to x. x_r stands for the rotated image, and y_r is the corresponding label. Therefore, the generalizing loss of the feature encoder can be summarized as

L_F = L_p + γ L_r,

where L_F is the overall generalizing loss, which is the weighted combination of L_p and L_r, and γ denotes the weight parameter.
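A minimal numerical sketch of the generalizing loss follows. The additive form L_p + γ·L_r is an assumption consistent with γ being described as a weight parameter, and the rotation head is modeled as a separate 4-way classifier (one class per 90-degree rotation):

```python
import numpy as np

def cross_entropy(logits, label):
    # f_L: softmax cross-entropy for a single sample
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def generalizing_loss(class_logits, class_label,
                      rot_logits, rot_label, gamma=0.4):
    """L_F = L_p + gamma * L_r (assumed additive combination).

    `class_logits` come from the classification head on the original
    image; `rot_logits` from a 4-way head predicting which rotation
    (0/90/180/270 degrees) was applied to the rotated copy."""
    l_p = cross_entropy(class_logits, class_label)
    l_r = cross_entropy(rot_logits, rot_label)
    return l_p + gamma * l_r
```

In practice both heads would share the feature encoder f_E; here the logits are taken as given so the loss itself is isolated.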

Fine Tuning with Sample Shuffle
After optimizing the feature encoder through numerous rounds, the deep model has the ability of feature extraction. Fine tuning is developed to improve the relevance between categories, where samples with corresponding labels are randomly shuffled. The shuffled samples, fused with the original samples, are utilized to fine-tune the feature extractor. The function to fuse the shuffled sample and the original sample is written as

x_c = λ x + (1 − λ) x_s,

where x_c denotes the fused sample, x stands for the original sample, x_s stands for the shuffled sample, and λ is the fusion parameter.
In order to enhance the robustness of the feature extractor, the fused samples x_c are imported to the deep model for fine tuning. The loss L can be calculated as follows:

L = λ f_L(y_p, y) + (1 − λ) f_L(y_p, y_s),

where f_L denotes the cross-entropy function [46], y_p stands for the predicted result from the network corresponding to the fused sample x_c, y represents the original label of x, and y_s means the shuffled label of x_s. Using this loss function, relations between y and y_s can be extracted, which aims to enhance the relevance of features between categories.
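The fusion and its loss can be sketched as below. The convex combination weighted by λ and the two-term cross-entropy (a mixup-style formulation) are assumptions consistent with λ being described as the fusion parameter:

```python
import numpy as np

def cross_entropy(logits, label):
    # f_L: softmax cross-entropy for a single sample
    z = logits - logits.max()
    return -(z - np.log(np.exp(z).sum()))[label]

def fuse_samples(x, x_shuffled, lam=0.4):
    """x_c = lam * x + (1 - lam) * x_s (assumed convex combination)."""
    return lam * x + (1.0 - lam) * x_shuffled

def shuffle_loss(pred_logits, y, y_s, lam=0.4):
    """Assumed mixup-style loss: the prediction on the fused sample is
    penalized against both the original label y and the shuffled label y_s."""
    return lam * cross_entropy(pred_logits, y) + \
           (1.0 - lam) * cross_entropy(pred_logits, y_s)
```

Penalizing one prediction against both labels is what ties the two categories together and encourages the relevance between them described above.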
The overall feature encoder is optimized based on generalizing loss and then fine tuned with sample shuffle. Through generalizing loss, the feature encoder is able to extract features from remote sensing images. Furthermore, fine tuning with sample shuffle can subsequently enhance the robustness of the feature encoder, in which relations between categories tend to be included in the features.

Self-Attention Layers
In remote sensing image scene classification, images of different categories may have consistent backgrounds, which seriously affects classification performance. Thus, background interference should be suppressed during feature learning. In this work, we introduce two ideas to strengthen the feature expression: one is to add self-attention layers, and the other is to deepen the network layers. Self-attention layers are developed to enhance the object information, which aims to reduce the impact of the background. At the same time, in order to deepen the feature encoder, skip connection is introduced to construct residual convolutional blocks.

Self-Attention
Remote sensing images are generally composed of foreground and background. For scene classification, the background information of remote sensing images may introduce interference, which influences the performance of few-shot scene classification. In order to address this problem, self-attention layers are developed to strengthen the foreground information and weaken the background information. The self-attention module consists of three convolutional layers, whose structure is shown in Figure 3. As shown in Figure 3, the input features with the shape of B × T × W × H are extracted from the feature extractor, where B means the batch size, T stands for the channel, W represents the width, and H represents the height. The features are reshaped to B × T × D by the transformation operation, where D is equal to W × H. The self-attention map is a three-dimensional vector, where the first two values denote the position of the pixel, and the third value stands for the attention weight. Following [47], each input has its own query q_i, key k_i, and value v_i, and thus the attention map is obtained by multiplying q_i and k_i and normalizing with a softmax function:

a_i = softmax(q_i k_i^T),

where a_i denotes the attention map. The self-attention features can then be obtained by multiplying the values with the attention map, which is defined as follows:

f_a_i = a_i v_i,

where f_a_i denotes the self-attention features, which make the target information more salient and reduce the impact of the background.
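The attention computation can be sketched as below. This drops the batch dimension and uses plain matrix products in place of the 1 × 1 convolutions, so it is an illustration of the mechanism rather than the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(feat, w_q, w_k, w_v):
    """feat: (P, T) map of P spatial positions by T channels (batch
    dropped for clarity); w_q, w_k, w_v stand in for the three 1x1
    convolutions producing query, key, and value."""
    q, k, v = feat @ w_q, feat @ w_k, feat @ w_v
    a = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # attention map a_i
    return a @ v                                  # weighted features f_a_i
```

Each row of the attention map sums to one, so every output position is a convex combination of the value vectors, letting salient (foreground) positions dominate.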

Residual Self-Attention
In few-shot classification, gradient explosion or gradient vanishing may appear as the network depth increases. To resolve this issue, skip connection is applied in the proposed network, whose core idea is to calculate the output as a linear superposition of the input data and a nonlinear transformation of it:

f(x) = x + f_E(x),

where x stands for the input image, and f_E(·) denotes the feature encoder, which is composed of several convolutional layers. It is obvious that features from different layers are directly added to the final output, where features of the higher layer can be regarded as fitting residual features of the earlier layers. Therefore, we construct the residual self-attention module by combining skip connection and self-attention, which is applied for feature learning. The overall formula is as follows:

R_f_a_i = f_a_i + f(x_i),

where R_f_a_i denotes the residual self-attention features, f_a_i means the self-attention features, and f(x_i) stands for the original features extracted with skip connection. The overall framework of our feature encoder is depicted in Figure 4.
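The skip connection itself can be illustrated with a tiny stand-in for the convolutional block (the two-layer transform here is purely illustrative, not the encoder's actual layers):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Skip connection: output = x + f_E(x), where f_E is modeled here
    as a small two-layer nonlinear transform standing in for the
    convolutional block."""
    return x + relu(x @ w1) @ w2
```

Because the input is added back directly, the identity path keeps gradients flowing even when the transform's contribution is small, which is what mitigates vanishing gradients in deeper encoders.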

Prototype Calibration with Feature Generation
In few-shot remote sensing image scene classification, the determination of the category prototype may deviate greatly, since only a quite limited number of labeled images can be utilized, especially when there is only one support image per category. Therefore, we consider making full use of prior knowledge (the prototypes and covariances of the training set as the benchmark) to expand the support samples, and then using the forward-predicted results to calibrate the prototype of each category.

Feature Generation
Having limited labeled samples leads to the absence of prior knowledge. In order to take full advantage of the information in labeled samples, feature generation is developed based on the characteristics of the training data. During testing, the mean and covariance of features for each category are calculated first, where the mean of features denotes the prototype of each category. The calculation [48] can be defined as follows:

μ_i = (1 / N_i) Σ_{j=1}^{N_i} R_f_a_j,  Σ_i = (1 / (N_i − 1)) Σ_{j=1}^{N_i} (R_f_a_j − μ_i)(R_f_a_j − μ_i)^T,

where μ_i denotes the prototype of the ith category, Σ_i stands for the covariance of the ith category, N_i represents the sample number of the ith category, and R_f_a_j stands for the extracted feature corresponding to x_j. Afterwards, to enhance the performance of the proposed deep model, feature generation is developed to yield a certain number of features corresponding to the support set. That is to say, the support set of the testing set is extended by feature generation, which is based on the prototypes and covariances of each category feature. For each category of the support set, we first calculate the prototype of each category on the support set and then select the closest prototypes of the training set based on the Euclidean distance. After that, the prototype and covariance of these selected categories are utilized as the benchmark for feature generation, which can be obtained as

μ̂_i = (1 / M) Σ_{j=1}^{M} μ_j,  Σ̂_i = (1 / M) Σ_{j=1}^{M} Σ_j + α,

where μ̂_i and Σ̂_i stand for the mean and covariance of the benchmark, respectively, μ_j is the jth closest prototype of the training set, and Σ_j is the corresponding covariance. α denotes the covariance parameter, and M stands for the number of the selected prototypes from the training set. After the benchmark mean and covariance are obtained, features of the testing set can be extended. Supposing each dimension of the features obeys a Gaussian distribution, a certain number of features can be generated to extend the support set of the testing set.
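The selection-and-sampling procedure can be sketched as follows. Averaging the M selected statistics and adding α to the covariance is our reading of the benchmark construction, so treat it as an assumption rather than the paper's exact formula:

```python
import numpy as np

def generate_features(support_mean, base_means, base_covs,
                      m=2, alpha=0.0, n_generate=100, rng=None):
    """Distribution-calibration-style feature generation (a sketch).

    Select the m training-class prototypes closest (Euclidean) to the
    support prototype, average their means and covariances as the
    benchmark (adding alpha to the covariance), then sample new
    features from the resulting Gaussian."""
    if rng is None:
        rng = np.random.default_rng(0)
    d = np.linalg.norm(base_means - support_mean, axis=1)
    idx = np.argsort(d)[:m]                      # m closest prototypes
    mu = base_means[idx].mean(axis=0)            # benchmark mean
    sigma = base_covs[idx].mean(axis=0) + alpha  # benchmark covariance
    return mu, sigma, rng.multivariate_normal(mu, sigma, size=n_generate)
```

The generated features are then appended to the support set before the classifier is re-trained, which is how the one-shot support set grows beyond its single real sample per category.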

Prototype Calibration
The prototype of each category reflects the consistent features of remote sensing images, which is of great significance for few-shot classification. At the same time, the prototypes of different categories are related, which means the prototype of one category can be transferred to other categories.
In order to strengthen the expression ability of prototypes, prototype calibration is proposed to make full use of the knowledge on the expanded support set. There is some prior knowledge on the expanded support set, which can be used to classify unlabeled samples. Considering the relevance of prototypes of different categories, we utilize the expanded support set to re-train the classifier, and the classifier is applied to predict the labels of the query set. Features corresponding to samples with high prediction probabilities are selected to calibrate the prototypes. The formula of prototype calibration can be defined as follows:

μ̃_i = β μ̂_i + (1 − β) (1 / U) Σ_{u=1}^{U} R_f_a_u,

where μ̃_i denotes the calibrated prototype, and μ̂_i denotes the previously obtained prototype. R_f_a_u stands for the features of the selected samples on the query set, U denotes the number of selected samples, and β is the balance factor.
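The calibration update can be sketched as below; the convex combination weighted by β is an assumed form consistent with β being described as a balance factor:

```python
import numpy as np

def calibrate_prototype(prototype, confident_query_feats, beta=0.5):
    """Assumed convex update: the prototype is pulled toward the mean
    of the U query features predicted with high confidence."""
    pseudo = np.mean(np.asarray(confident_query_feats), axis=0)
    return beta * np.asarray(prototype) + (1.0 - beta) * pseudo
```

With beta = 1 the prototype is unchanged, and with beta = 0 it is replaced entirely by the high-confidence query mean; beta = 0.5 (the paper's setting) balances the two sources evenly.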
After calibrating all the prototypes of the support set, the re-trained logistic regression (LR) classifier and the expanded support set are applied for few-shot classification, and labels of the query set can be predicted. Using prototype calibration with feature generation, prototypes can be modified from a global view, which contributes greatly to few-shot remote sensing image scene classification.

Results and Discussions
In this section, we evaluate our proposed method on two public datasets for few-shot remote sensing image scene classification, and experimental results are reported as follows.

Dataset
NWPU-RESISC45 dataset [49] is an available benchmark for remote sensing image scene classification, which is created by Northwestern Polytechnic University. The dataset contains 31,500 images with 256 × 256 pixels, including 45 scene categories and 700 images of each category. The categories include airplane, airport, baseball field, basketball court and other scenarios. The whole dataset is divided into three subsets: training set with 25 categories, validation set with 10 categories, and testing set with 10 categories. In order to fit our designed feature encoder for feature extraction, all the images are reshaped into 84 × 84.
The WHU-RS19 dataset [50] is also an available benchmark for remote sensing image scene classification, which is released by Wuhan University. It contains 19 categories of scene images. For each category, the sample number is greater than or equal to 50, and the total number of images in the dataset is 1005. The whole dataset is divided into three subsets: a training set with nine categories, a validation set with five categories, and a testing set with five categories. In the experiments, all the images are also reshaped to 84 × 84 to fit our designed feature encoder. Details of the two public datasets are reported in Table 1.

Parameter Setting
In the pre-training process, the weighted parameter γ is set to 0.4, and the parameter λ is also set to 0.4. In the self-attention layers, the kernel size of the convolutional layers is 1 × 1, and the residual network is utilized as our feature encoder. The structure and parameters of the feature encoder are reported in Figure 5. In the prototype calibration part, the number of selected prototypes M is set to 2, the number of selected samples on the query set U is set to 3, and the balance factor β is set to 0.5. In addition, logistic regression (LR) is utilized as the classifier, whose maximum number of iterations is set to 1000. Hyper-parameters are selected based on experience, following related research on scene classification based on deep neural networks [18]. We utilize the PyTorch framework to implement the proposed prototype calibration with the feature-generating model, which is run on an NVIDIA GeForce RTX 2080Ti GPU and an Intel(R) Xeon(R) Silver 4114 CPU.

Experimental Results on NWPU-RESISC45 Dataset
Few-shot scene classification results on the NWPU-RESISC45 data are reported in Table 2, in which accuracies are calculated by averaging the results of 600 episodes randomly generated on the testing set. In the experiments, seven other few-shot scene classification methods are utilized for comparison, where averaged results are reported. From the table, it is clearly seen that our proposed network performs best, with accuracies of 85.07% and 72.05% in the five-way five-shot and five-way one-shot cases, exceeding the accuracies of DLA-MatchNet by 3.44% and 3.25%, respectively. Compared with Relation Network, the proposed network achieves 6.45% and 5.70% improvements in the five-way one-shot and five-way five-shot cases, respectively. Moreover, our proposed method surpasses Meta-SGD by 9.25% in the five-way five-shot case and by 11.39% in the five-way one-shot case. In addition, our proposed method yields higher accuracies than the other compared approaches. Therefore, it can be demonstrated that the proposed approach makes full use of the limited information in few-shot data, which improves the performance of remote sensing image scene classification.
The proposed model is able to enhance the feature representation through prototype calibration, which overcomes the deviation of prototypes caused by a small number of labeled samples. In addition, self-attention is developed for feature extraction, which reduces the influence of background information.

Table 2. Classification accuracy of 5-way 1-shot and 5-way 5-shot on the NWPU-RESISC45 dataset.

Experimental Results on WHU-RS19 Dataset
Experimental comparisons are also conducted on the WHU-RS19 dataset, and the results are reported in Table 3. All the results are obtained over 600 episodes. It is observed that the proposed few-shot classification method yields accuracies of 72.41% and 85.26% in the five-way one-shot and five-way five-shot cases, respectively. From Table 3, we can find that the proposed deep network yields superior accuracies over DLA-MatchNet, with 4.14% improvement in the five-way one-shot case and 5.37% improvement in the five-way five-shot case.

Table 3. Classification accuracy of 5-way 1-shot and 5-way 5-shot on the WHU-RS19 dataset.

Ablation Study
To better understand our proposed model for few-shot remote sensing image scene classification, we conduct an ablation study to analyze the effect of each of its modules, which is reported as follows.

Effect of Pre-Training Strategy
In our pre-training strategy, fine tuning is developed to improve relevance between categories. To demonstrate the effect of fine tuning, experiments with and without fine tuning are compared and analyzed, where multiple cases with different numbers of shots are set for comparison. Results on the NWPU-RESISC45 dataset are depicted in Figure 6. It can be seen from Figure 6 that the accuracy of five-way five-shot is only 70.94% without fine tuning, which is lower than our model by about 14%. Similarly, in the five-way one-shot setting, the accuracy without fine tuning is 54.06%, which is lower than our model by about 18%. The proposed model with fine tuning yields higher accuracies across different numbers of shots. It is verified that the fine tuning utilized in our pre-training strategy contributes greatly to classification performance, as it trains a feature encoder with a strong ability to extract features.

Effect of Self-Attention Layers
In our proposed model, self-attention layers are developed to enhance the target information and reduce the influence of background. To verify their effect, the proposed network with and without self-attention layers is compared on the two datasets. The results are shown in Table 4. The proposed network with self-attention yields higher accuracies in the five-way one-shot case, with about a 2% improvement over the network without self-attention layers. This indicates that self-attention contributes to few-shot remote sensing image scene classification, since it reduces the influence of information irrelevant to classification.
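The attention mechanism described above can be sketched as scaled dot-product self-attention over the spatial positions of a feature map. The following is a minimal NumPy illustration, not the paper's exact layer: the projection matrices, residual connection, and tensor shapes are assumptions.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over spatial positions.

    x: (n, d) feature map flattened to n positions with d channels.
    w_q, w_k, w_v: (d, d) projection matrices (learned in practice).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(x.shape[1])           # (n, n) position affinities
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over positions
    return x + weights @ v                           # residual connection

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))                         # 4x4 feature map, 8 channels
w = [rng.normal(size=(8, 8)) * 0.1 for _ in range(3)]
out = self_attention(x, *w)
print(out.shape)  # (16, 8)
```

Each output position is the input position plus a weighted sum of all positions, so salient target regions can reinforce each other while diffuse background responses are averaged down.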

Discussions of Parameters for Feature Generation
Before prototype calibration, feature generation is developed to yield a certain number of features corresponding to the support set. Parameters for feature generation are discussed in this subsection. In feature generation, the closest prototypes of the training set are selected as the benchmark for feature generation, where the number of closest prototypes is a hyper-parameter. Accuracies with different numbers of selected prototypes are depicted in Figure 7, where the number ranges from 1 to 7. In the five-way one-shot case, accuracy is highest on both the NWPU-RESISC45 and WHU-RS19 datasets when the two closest prototypes are selected. In the five-way five-shot case, accuracy is also optimal when the number of selected prototypes is set to two. Therefore, the two closest prototypes in the training set are selected for feature generation in the experiments.
After prototype selection, the mean and covariance of the selected prototypes are calculated as the benchmark for feature generation, where a parameter α is utilized in the calculation. To analyze the influence of α, classification accuracies in the five-way one-shot case are tested on the WHU-RS19 and NWPU-RESISC45 datasets as α varies. Curves of classification accuracy against α are shown in Figure 8. The best results are achieved when α = 0. When α is greater than 0, the accuracy decreases as α increases, and the decrease is especially obvious from α = 0 to α = 0.1. This indicates that there is no need to modify the covariance with the parameter α; at the same time, it suggests that the covariances of the testing set and the selected prototypes of the training set are similar.
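Combining the two choices above (the two closest base-class prototypes and α = 0), the feature-generation step can be sketched as sampling from a Gaussian whose mean and covariance are borrowed from the selected prototypes. This is a sketch under stated assumptions: the function name, the averaging used for the mean, and the element-wise addition of α to the covariance are illustrative, not the paper's exact formulas.

```python
import numpy as np

def generate_features(support_feat, base_means, base_covs,
                      k=2, alpha=0.0, n_gen=500, seed=0):
    """Sample synthetic support features from a Gaussian calibrated by the
    k closest training-set (base-class) prototypes."""
    dist = np.linalg.norm(base_means - support_feat, axis=1)
    idx = np.argsort(dist)[:k]                        # k closest prototypes
    # Calibrated mean: average the support feature with the selected prototypes.
    mean = np.vstack([base_means[idx], support_feat[None]]).mean(axis=0)
    # Calibrated covariance: average of selected covariances, shifted by alpha.
    cov = base_covs[idx].mean(axis=0) + alpha
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_gen)

# Toy usage: 5 base classes with 3-D features and identity covariances.
rng = np.random.default_rng(1)
base_means = rng.normal(size=(5, 3))
base_covs = np.stack([np.eye(3)] * 5)
support_feat = rng.normal(size=3)
gen = generate_features(support_feat, base_means, base_covs, n_gen=500)
print(gen.shape)  # (500, 3)
```

With α = 0, as selected in the experiments, the generated features simply inherit the averaged covariance of the nearest base classes, consistent with the observation that the base and novel covariances are similar.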

Effect of Prototype Calibration
In our model, prototype calibration is proposed to enhance the representativeness of each category. To verify its effectiveness, few-shot classification comparisons with different prototype calibration strategies are performed. The results are shown in Figure 9, where U = 0 corresponds to the model without prototype calibration and U > 0 to the model with prototype calibration. As shown in Figure 9, the accuracy with U = 0 is the lowest, which demonstrates that the proposed prototype calibration improves classification performance. The accuracy increases with U and stabilizes when U is larger than 5. It can be concluded that prototype calibration with five images selected from the query set, namely the five images with the highest prediction probabilities, strengthens the expressive ability of the prototypes. Therefore, during prototype calibration, five images selected from the query set are utilized to adjust the prototypes, which not only enhances few-shot classification accuracy but also makes rational use of computing resources.
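The calibration step described above can be sketched as follows: the current prototypes classify the query features, and each prototype is then averaged with the U query features most confidently assigned to its class. This is a minimal sketch; the confidence measure (softmax over negative distances) and the plain averaging are assumptions, not the paper's exact rule.

```python
import numpy as np

def calibrate_prototypes(prototypes, query_feats, u=5):
    """Refine each class prototype with the u query features assigned to that
    class with the highest prediction probability."""
    # Nearest-prototype probabilities via softmax over negative distances.
    dist = np.linalg.norm(query_feats[:, None] - prototypes[None], axis=2)
    logits = -dist
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    labels = probs.argmax(axis=1)
    calibrated = prototypes.copy()
    for c in range(len(prototypes)):
        conf = np.where(labels == c, probs[:, c], -np.inf)
        top = np.argsort(conf)[::-1][:u]
        top = top[np.isfinite(conf[top])]             # keep only class-c queries
        if len(top):
            calibrated[c] = np.vstack([prototypes[c:c + 1],
                                       query_feats[top]]).mean(axis=0)
    return calibrated

rng = np.random.default_rng(2)
protos = rng.normal(size=(3, 4))                      # 3-way episode, 4-D features
queries = rng.normal(size=(20, 4))
new_protos = calibrate_prototypes(protos, queries, u=5)
print(new_protos.shape)  # (3, 4)
```

Pulling each prototype toward its most confidently predicted queries counteracts the deviation caused by having only one or a few labeled support samples per class.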

Effect of Feature Generation
To illustrate the effect of feature generation, comparisons with different feature generation strategies are performed in Figure 10, where accuracies with an increasing number of generated features are tested in the five-way one-shot case. When the number of generated features is 0, the model runs without feature generation; numbers greater than 0 correspond to the model with feature generation. The accuracy increases steadily with the number of generated samples and is lowest when the number is 0, which demonstrates that feature generation improves classification performance by alleviating the lack of prior information in few-shot remote sensing image scene classification. To better understand this effect, the embeddings of the support set, query set, and expanded support set are visualized in Figure 11; the feature distributions of the expanded support set and the query set are similar. Therefore, 500 features are generated in our model to improve few-shot classification performance.

Overview
Following the above ablation experiments, composite scenarios are compared to verify the effectiveness of the proposed model. Six composite cases are compared with our overall model in Table 5, where FG denotes the deep model with only feature generation, PC the model with only prototype calibration, SA the model with only self-attention, SA + PC the model with self-attention and prototype calibration, SA + FG the model with self-attention and feature generation, and PC + FG the model with prototype calibration and feature generation.
From the experimental results in Table 5, it is observed that the proposed model combining self-attention, prototype calibration, and feature generation achieves the highest classification accuracy. Moreover, the proposed model with prototype calibration and feature generation is slightly worse than our overall framework, which indicates that the proposed prototype calibration with the feature-generating module makes a great contribution to improving few-shot remote sensing image scene classification. The explanation for the results is that prototype calibration can effectively modify the deviation of prototypes when only a few samples are available, and feature generation can make full use of prior knowledge to expand the support set. In summary, the proposed prototype calibration with the feature generation model performs optimally in both five-way five-shot and five-way one-shot cases.

Conclusions
In this paper, a prototype calibration with a feature-generating model is proposed for few-shot remote sensing image scene classification. Experiments on two datasets demonstrate that the proposed method yields superior few-shot classification results. In the proposed framework, the prototype calibration module is shown to enhance the expressive ability of prototype features, and feature generation makes full use of prior knowledge to expand the support set. Moreover, self-attention layers are verified to reduce the influence of irrelevant information on few-shot classification, and the pre-training strategy is effective in training a robust feature encoder.
In few-shot remote sensing image scene classification, mislabeled samples have a great influence on classification performance. How to overcome the influence of mislabeled samples is an interesting research direction that will be explored in future work.