SAFFNet: Self-Attention-Based Feature Fusion Network for Remote Sensing Few-Shot Scene Classiﬁcation

Abstract: In real applications, it is often necessary to classify new, unseen classes that are not present in the training dataset. To solve this problem, few-shot learning methods are usually adopted to recognize new categories with only a few (out-of-bag) labeled samples together with the known classes available in the (large-scale) training dataset. Unlike common scene classification images obtained by CCD (Charge-Coupled Device) cameras, remote sensing scene classification datasets tend to have plentiful texture features rather than shape features. Therefore, it is important to extract more valuable texture semantic features from a limited number of labeled input images. In this paper, a multi-scale feature fusion network for few-shot remote sensing scene classification is proposed by integrating a novel self-attention feature selection module, denoted as SAFFNet. Unlike a pyramidal feature hierarchy for object detection, informative representations of the images with different receptive fields are automatically selected and re-weighted for feature fusion after a refining network and global pooling operation. Here, the feature weighting values can be fine-tuned by the support set in the few-shot learning task. The proposed model is evaluated on three publicly available datasets for few-shot remote sensing scene classification. Experimental results demonstrate that the proposed SAFFNet significantly improves few-shot classification accuracy compared with other few-shot methods and a typical multi-scale feature fusion network.

However, in real applications, it is difficult yet necessary to classify new scene categories that are not in the class list of a training dataset. To classify unseen classes, the traditional approach is to retrain classification models on newly collected training datasets that contain both the existing data and additional labeled samples of the new classes. However, this requires expensive computing resources and time to retrain the whole model. Moreover, it is time-consuming and expensive to label out-of-bag images for model retraining. Therefore, such methods are hard to implement in real applications. To solve this problem, few-shot learning is usually utilized to classify unseen classes without retraining whole models [29–31].
There have been two approaches to this problem in the recent decade, i.e., zero-shot learning and few-shot learning. For zero-shot learning, transfer learning or model-driven methods are usually applied. For instance, the zero-shot scene classification method for remote sensing attempts to train the model with data from other domains, and semantic information is then used to classify new unseen classes without labeled samples [32]. In natural language processing, the Skip-gram model is used to embed each class into a semantic vector [33], where semantic relationships between seen and unseen classes are built by constructing a directed graph over the classes. However, the features extracted by a zero-shot learning model rely on a user-designed textual description of the new classes, which may introduce biased knowledge about the unseen classes.
In contrast to zero-shot learning, few-shot learning applies data-driven methods that leverage very few labeled samples, rather than model-driven methods, to recognize unseen classes. Thus, few-shot learning (including one-shot learning) aims to recognize new unseen classes with very few labeled samples together with the known classes available in the large-scale training dataset. A simplified comparison between zero-shot and few-shot learning is given in Table 1.

Table 1. Comparison of traditional supervised, zero-shot and few-shot learning for classification tasks.

Classification Task               Unseen Classes   Auxiliary Materials
Traditional Supervised Learning   ×                None
Zero-Shot Learning                ✓                Textual description of unseen classes
Few-Shot Learning                 ✓                Few labeled samples of unseen classes

Few-shot learning has been successfully applied to computer vision [29–31,34], natural language processing [35,36], speech recognition [37], gesture recognition [38], medical image classification [39], image translation [40] and drug discovery [41]. Approaches for few-shot learning can be summarized in three ways. The first approach is based on meta learning (i.e., learning to learn) [42–44]: a Model-Agnostic Meta-Learning (MAML) model learns good initial parameters of a deep network over multiple tasks [42], while Meta-SGD (Stochastic Gradient Descent) [44] and Meta-LSTM (Long Short-Term Memory) [43] solve the problem by learning good learning rates and gradient computation functions, which succeed in fine-tuning the network with only a few unseen samples. The second approach is based on memory modules [45–48]: deep neural networks extract features of unseen samples, and external memories are then updated for few-shot classification. The last approach is distance metric learning [29–31,49]: CNN backbones, often pre-trained on a large-scale base training dataset, are first used as feature extractors, and classifiers based on a distance metric are then applied to compare unseen classes.
However, previous methods focus more on training strategies than on informative feature extraction. Since the major advantage of deep learning lies in end-to-end feature extraction, it is necessary to develop an optimized feature extractor for remote sensing few-shot scene classification. In particular, remote sensing images tend to have plentiful texture features rather than shape features, as shown in Figure 1. Accordingly, it is important to extract valuable texture semantic features from a limited number of input images.

Figure 1. Examples of images in a remote sensing scene classification dataset with rich texture features. Images in the first row are from a remote sensing scene classification dataset [50] and images in the second row are from the ImageNet dataset [51].
Due to their hierarchical feature representation, deep convolutional neural networks have been successfully applied to tasks such as image classification [52], object detection [53] and semantic segmentation [54]. Accordingly, it is beneficial to exploit multi-scale features from the limited support set to obtain better out-of-bag classification performance. In pyramidal feature models, such as FPN [55] and its extensions [56,57], a combination of multi-scale bottom-up and top-down features is exploited to handle objects at different scales: low-level features locate small objects more accurately, while top-down features provide richer semantic information. The top-down features at lower levels are interpolations of the highest-level features obtained through upsampling. When FPN is used for classification, the highest-level features are re-emphasized with larger weights; as a result, the features of fine-grained scenes tend to be weakened, leading to poor performance.
To solve the aforementioned problems, a deep feature fusion network for few-shot scene classification is proposed based on a multi-scale feature aggregation architecture. Multi-scale features with different receptive fields are extracted and kept as stacked feature maps for further fusion, rather than being propagated from the top layers down to the lower layers with lateral connections as in FPN. The multi-scale features at different resolutions are automatically selected for the final decision by a self-attention scheme, according to the importance of the features derived from the pyramidal feature hierarchy. Accordingly, the proposed model is called a Self-Attention-based Feature Fusion Network, denoted SAFFNet. In this way, the support set images can be exploited to "fine-tune" the importance of the different-scale features automatically. The modules for generating a multi-scale feature hierarchy in FPN and in the proposed method are shown in Figure 2. Experiments conducted on three benchmark datasets confirm the effectiveness of the proposed deep feature fusion network.

Few-Shot Scene Classification
Approaches for few-shot scene classification can be grouped into three categories. The first is based on meta learning (i.e., learning to learn) [42,43]. Since the goal of meta learning algorithms is to develop a model that can be applied to multiple similar tasks with limited data, there is a strong similarity in training strategy between meta learning and few-shot learning. In detail, a Model-Agnostic Meta-Learning (MAML) model learns good initial parameters of a deep network over multiple tasks [42], and Meta-LSTM [43] learns good learning rates and gradient computation functions, succeeding in fine-tuning the network with only a few unseen samples. The main characteristic of this approach is that the methods focus on the training strategy to solve the problem.
The second approach for few-shot learning is based on memory modules [45,48]. This approach uses deep neural networks to extract features of unseen samples, and external memories are then updated for few-shot classification. Another variant is graph-based: the work in [58] used a Graph Neural Network (GNN) to extract features and added a fully connected layer as the classification head to solve the few-shot problem.
The last approach is distance metric learning [29,30,59]. This approach uses a CNN backbone as the feature extractor, pre-trained on a large-scale base training dataset, and a classifier based on a distance metric to compare unseen classes. Since this approach only revises the classification head, training strategies from meta-learning methods can also be utilized. Recently, the distance metric of the few-shot classification head has also been replaced by a neural network: RelationNet [31] successfully substituted a neural network for traditional distance metrics and achieved outstanding performance. Beyond classification, few-shot learning has also been successfully applied to other computer vision tasks [60–62].

Multi-Scale Feature Fusion
Multi-scale feature fusion is nowadays an essential method for improving performance in object detection, instance segmentation and semantic segmentation. Most recent models with strong performance are based on this method [53,63]. The best-known family is the Feature Pyramid Network (FPN)-based methods [55]. FPN [55] fuses features from various scales and predicts bounding boxes from them directly and simultaneously. However, features from the initial stages of the backbone network are not deep enough, so FPN adds a top-down pathway with upsampling layers to address this problem. Features from the bottom-up stages are fused with the upsampled features of the top-down pathway at corresponding scales by element-wise addition. Finally, FPN predicts directly from multiple scales of feature maps.
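The top-down fusion that FPN performs can be sketched numerically. The snippet below is an illustrative stand-in, not FPN's implementation: random channel-mixing matrices replace the learned 1 × 1 lateral convolutions, and nearest-neighbor repetition replaces the learned upsampling.

```python
import numpy as np

def upsample2x(f):
    # Nearest-neighbor 2x upsampling of a (C, H, W) feature map.
    return f.repeat(2, axis=1).repeat(2, axis=2)

def fpn_topdown(features, out_channels=4, seed=0):
    """Fuse a bottom-up pyramid (finest first, coarsest last) top-down.

    Hypothetical sketch: the lateral 1x1 convolutions are modeled as
    random channel-mixing matrices; a real FPN learns them.
    """
    rng = np.random.default_rng(seed)
    laterals = [rng.standard_normal((out_channels, f.shape[0])) for f in features]
    # A 1x1 conv is a channel mix: (C_out, C_in) applied to (C_in, H, W).
    merged = np.einsum('oc,chw->ohw', laterals[-1], features[-1])
    outputs = [merged]
    for f, lat in zip(features[-2::-1], laterals[-2::-1]):
        lateral = np.einsum('oc,chw->ohw', lat, f)
        merged = lateral + upsample2x(merged)  # element-wise addition
        outputs.append(merged)
    return outputs[::-1]  # finest-resolution level first

# Toy pyramid over a 32x32 input: channels double, resolution halves per stage.
feats = [np.ones((8 * 2**i, 32 // 2**(i + 2), 32 // 2**(i + 2))) for i in range(4)]
fused = fpn_topdown(feats)
```

Each output level keeps its own spatial resolution while inheriting semantics from the coarser levels above it, which is the behavior the addition-based fusion described above provides.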
After FPN, a number of studies developed advanced FPNs. PANet [64] improved performance by attaching an additional bottom-up pyramid feature fusion pathway. In addition, with significant progress in Neural Architecture Search (NAS) [65], NAS-FPN [66] and BiFPN [67] were introduced for object detection; the structures of these feature pyramid networks are searched automatically by reinforcement learning-based algorithms. NAS-based methods show outstanding performance improvements, but they come with heavy storage and computational demands, so it is challenging to deploy them in a regular computational environment.
Another approach is to apply multi-scale feature fusion by revising the backbone network itself. For example, HRNet [68] is designed as a feature extraction network that maintains multiple resolutions of the input through up-sampling, while Res2Net [69] applies a multi-scale mechanism in a channel-wise manner. In detail, it divides each 3 × 3 convolutional layer in the ResNet [52] bottleneck block into four channel-wise scales. Through channel multi-scaling, Res2Net improves feature extraction performance without a considerable increase in computational cost.
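Res2Net's hierarchical channel splitting can be illustrated with a toy sketch. This is an assumption-laden stand-in: a fixed random channel-mixing matrix replaces the learned 3 × 3 convolution, but the group-to-group residual chaining matches the mechanism described above.

```python
import numpy as np

def res2net_split(x, scales=4, seed=0):
    """Res2Net-style hierarchical split of a (C, H, W) feature map.

    The channels are split into `scales` groups; each group after the
    first is processed together with the previous group's output, so
    the effective receptive field grows group by group. The 3x3 conv
    is stubbed here by a fixed channel-mixing matrix.
    """
    rng = np.random.default_rng(seed)
    groups = np.split(x, scales, axis=0)
    gc = groups[0].shape[0]
    w = rng.standard_normal((gc, gc))
    outs = [groups[0]]                      # first group passes through untouched
    prev = np.zeros_like(groups[0])
    for g in groups[1:]:
        prev = np.einsum('oc,chw->ohw', w, g + prev)  # conv(group + previous output)
        outs.append(prev)
    return np.concatenate(outs, axis=0)     # channel count is preserved

y = res2net_split(np.ones((16, 8, 8)))
```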
The last approach applies a multi-scale feature fusion mechanism at the detection stage [70]. For example, a multi-scale strategy with dynamic detection heads is adopted in [71]: by applying multiple kernel sizes at the detection head with a concatenation operation, average precision for object detection can be improved. Various other approaches in the computer vision community also show outstanding performance with multi-scale feature fusion, regardless of the task [72,73].

Image Attention
An attention network helps the model focus on important features by refining the input features. The attention mechanism was first adopted in natural language processing tasks and has since been successfully applied to image processing and analysis. Depending on the data properties, there are two popular feature refining strategies, i.e., channel-wise attention [74] and pixel-wise attention [75].
Since pixels carry plentiful spatial information about objects, pixel-wise attention is normally appropriate for shape-biased tasks such as object detection, instance segmentation and semantic segmentation [72]. Through re-weighting, representative pixels of an input image can be extracted and retained for the final prediction. The non-local network [75] directly applied the self-attention mechanism from natural language processing research. The deformable convolutional network [76] recalculates the position of every pixel during the convolutional operation rather than using a pre-defined anchored kernel. Empirical attention [77] applies an attention mechanism to each factor in object detection to improve detection results.
Channel-wise attention, in contrast, refines the deep features along the channel dimension. The Squeeze-and-Excitation network (SENet) [78] achieved state-of-the-art performance with channel-wise attention at little computational cost, showing that a performance gain for image classification can be obtained by applying only channel-wise attention rather than a complicated self-attention network. In addition, SENet can be attached to any CNN-based feature extractor. The Global Context Network (GCNet) [74] revises and simplifies the non-local network [75] in a channel-wise manner based on SENet [78], and shows that channel-wise attention offers a better tradeoff between performance gain and computational cost.
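The squeeze-and-excitation operation underlying SENet can be sketched as follows. The two fully connected layers are random stand-ins for learned weights; the structure (global pooling, bottleneck FC + ReLU, sigmoid gate, channel rescaling) follows the published design.

```python
import numpy as np

def squeeze_excite(x, reduction=4, seed=0):
    """SE-style channel attention on a (C, H, W) feature map.

    The weights w1, w2 are random stand-ins for the two learned FC layers.
    """
    rng = np.random.default_rng(seed)
    c = x.shape[0]
    w1 = rng.standard_normal((c // reduction, c))
    w2 = rng.standard_normal((c, c // reduction))
    z = x.mean(axis=(1, 2))                  # squeeze: global average pool -> (C,)
    s = np.maximum(w1 @ z, 0.0)              # excitation: bottleneck FC + ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ s)))   # sigmoid gate in (0, 1), one per channel
    return x * gate[:, None, None]           # rescale each channel

y = squeeze_excite(np.ones((8, 4, 4)))
```

Because the gate is computed from globally pooled statistics, the added cost is tiny compared with the convolutional backbone, which is the tradeoff argument made above.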
Combined attention denotes attention networks that apply both pixel-wise and channel-wise attention. Combined attention can achieve the best results but requires more computational resources. CBAM [79] and Dual Attention [80] successfully combine the two kinds of attention in one model.
Recently, the concept of the attention network has been adopted to select features from an extracted feature map [71], and even to recognize images with attention networks alone [81], as well as for object detection [82] and conventional remote sensing scene classification [73]. These methods achieve performance and computational cost comparable to traditional deep neural networks, such as convolutional and recurrent neural networks. Regardless of the task, attention networks have come into the spotlight in the research community due to their simplicity and effectiveness.

Self-Attention-Based Feature Fusion Network (SAFFNet)
In the scene classification task, different classes of remote sensing images usually contain rich texture semantic features. Accordingly, multi-scale features at different resolutions need to be adopted for further feature or decision fusion. FPN and its extensions were proposed for object detection and have also been applied to semantic segmentation; how to integrate the multi-scale features is an important but challenging problem in these tasks. As discussed in Section 2.2, FPN and its extensions perform feature or decision fusion using the features derived from all stages of the model.
The original purpose of FPN [55] is to predict bounding boxes of various sizes (small and large objects) for the object detection task, for which extracting pixel-wise (positional) features is crucial. FPN [55] applies bottom-up and top-down pathways with upsampling layers to reconstruct feature maps from the pooled feature maps; it is then natural to regress bounding box locations directly from the reconstructed feature maps (multiple decisions) for object detection. However, there is no need to regress bounding boxes of different sizes in remote sensing few-shot scene classification. In addition, the upsampling layers add considerable computational complexity. Therefore, instead of upsampling and predicting directly, our main idea is to let the neural network select and highlight the more important features from multiple feature scales to improve few-shot accuracy, rather than extracting upsampled pixel-wise feature maps for bounding box prediction.
Our goal is to extract the rich texture semantic features of remote sensing images, as shown in the second row of Figure 1. As global features of scene images can better capture contextual semantic information, channel-wise attention for feature selection is designed to "emphasize" the class-specific features at different resolutions derived from the few support images for few-shot scene classification. Through cascading the feature selection module, the important features are repeatedly re-activated while the trivial ones are suppressed. In this way, the features of a query image can be better matched with the self-attention-fused features of the class-specific support images. Accordingly, the proposed model is denoted SAFFNet (Self-Attention Feature Fusion Network). The overall architecture of SAFFNet is shown in Figure 3.

Multi-Scale Feature Generation
A multi-scale feature representation captures semantic information at different resolutions of remote sensing images using a convolutional neural network (CNN). The feature maps at the lower levels of a CNN contain the local salience of scene images and detailed contextual information, while the higher levels generate more discriminative features for pattern recognition. To better classify remote sensing images, different levels of texture features should be adopted by integrating local semantic-rich and global features from different receptive fields of a CNN.
In each stage, the feature maps are re-weighted by channel-wise attention based on GCNet [74] in SAFFNet to refine deep features, as shown in Figure 3. GCNet calculates the attention map by a 1 × 1 convolutional layer with softmax and then transforms features by a 1 × 1 convolution and layer normalization before aggregation. After refining, we adopt convolutional and global pooling layers to match the channel dimensions and sizes of the feature maps and to reduce computational cost by avoiding expensive fully connected (FC) layers in the decision stage of the network for few-shot learning. Furthermore, only channel-wise features are kept by global pooling to focus more on the plentiful texture features. Finally, the refined features, i.e., S1*, S2*, S3* and S4* from each stage, are concatenated as a channel-wise fused feature representation for further feature selection. Note that except for S1, which uses a 1 × 1 filter, a 3 × 3 convolutional filter is adopted for the MFG module according to the experimental results shown in Table 2. Note that all experimental results in bold text in this paper indicate the best accuracy within the comparison. The combination {1 × 1, 3 × 3, 3 × 3, 3 × 3} achieved the best accuracy regardless of the dataset. This suggests that emphasizing the features of the deeper stages (the last three) with a 3 × 3 kernel improves performance over using only 1 × 1 or only 3 × 3 kernels.

Table 2. Experimental results on the combination of 1 × 1 and 3 × 3 convolutional layers at each stage of the Multi-scale Feature Generation module with the AID dataset [50] and NWPU-RESISC45 dataset [20]. Number 1 denotes the 1 × 1 convolutional layer and 3 denotes the 3 × 3 convolutional layer. For precise measurement, the experiment was conducted without the SAFS module.
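The MFG pipeline described above can be sketched in a few lines. This is a minimal stand-in under stated assumptions: a random channel-mixing matrix replaces both the GCNet refinement and the per-stage 1 × 1/3 × 3 convolutions, which are learned in the real model.

```python
import numpy as np

def mfg(stages, out_channels=8, seed=0):
    """Sketch of Multi-scale Feature Generation: each stage output
    S_i of shape (C_i, H_i, W_i) is channel-mixed to a common width
    (standing in for the refining attention + conv), globally
    average-pooled, and the pooled vectors are concatenated."""
    rng = np.random.default_rng(seed)
    pooled = []
    for s in stages:
        w = rng.standard_normal((out_channels, s.shape[0]))
        refined = np.einsum('oc,chw->ohw', w, s)   # conv / refinement stub
        pooled.append(refined.mean(axis=(1, 2)))   # global average pooling
    return np.concatenate(pooled)                  # (num_stages * out_channels,)

# Toy backbone stages: channels double while resolution halves.
stages = [np.ones((16 * 2**i, 32 // 2**i, 32 // 2**i)) for i in range(4)]
fused = mfg(stages)
```

The pooled-and-concatenated vector plays the role of the channel-wise fused representation that the SAFS module then operates on.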

Self-Attention-Based Feature Selection
How to integrate different levels of feature representation is a key concern for a multi-scale feature fusion model. Unlike FPN and its extensions, we adopt the concept of self-attention to select the more informative features from the concatenated multi-scale features produced by the multi-scale feature generation module, as shown in Figure 3.
In detail, a 1 × 1 convolutional layer followed by the softmax function is applied to assign a feature importance value to each channel. Then, an additional 1 × 1 convolutional layer is utilized to self-emphasize the important features by re-activating the corresponding channels. This module performs feature selection by a self-attention strategy and is therefore denoted the Self-Attention Feature Selection (SAFS) module.
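A minimal numerical sketch of one SAFS block follows, under the assumption that the 1 × 1 convolutions act on the pooled channel vector (random matrices stand in for the learned layers; the concatenation with the input mirrors the cascaded design described below).

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def safs(f, seed=0):
    """Sketch of one Self-Attention Feature Selection block on a
    fused channel vector f: a 1x1 conv (random matrix stand-in)
    followed by softmax scores the channels; a second 1x1 conv
    re-activates the emphasized channels; the result is concatenated
    with the input."""
    rng = np.random.default_rng(seed)
    c = f.shape[0]
    w1 = rng.standard_normal((c, c))
    w2 = rng.standard_normal((c, c))
    scores = softmax(w1 @ f)          # per-channel importance, sums to 1
    selected = w2 @ (scores * f)      # re-activate the weighted channels
    return np.concatenate([f, selected])

out = safs(np.ones(16))
```

Because softmax normalizes the importance scores to a distribution, emphasizing some channels necessarily de-emphasizes others, which is the selection behavior the module is designed for.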
To pay further attention to the important features, the SAFS module is cascaded, generating better feature representations through the concatenation operation. Experimental results in Table 3 demonstrate that better performance is obtained by repeating the SAFS module three times. The weights of each convolutional layer in the cascaded SAFS modules are shared to give more weight to the selected features adaptively. In detail, each convolutional layer of a SAFS module shares its weights with the corresponding convolutional layer of the previous SAFS module, except for the first convolutional layer of the first SAFS module, whose input size is different. Furthermore, the performance decreased with four repetitions; we suppose that the shared weights of each convolutional layer overfit in that case.

Table 3. Experiments measuring performance according to the number of feature selection layers on the NWPU-RESISC45 dataset [20], which has the largest volume. n denotes the number of feature selection blocks.

In addition, to verify the effectiveness of the SAFS module architecture, experiments conducted on two datasets are shown in Table 4. The structure of the SAFS module denotes the functions between the two 1 × 1 convolutional layers. Since the softmax function can be regarded as both an activation and a normalization function, other combinations of activation and normalization functions were tested to verify whether the improvement of the proposed method derives from simple normalization or from feature selection. Batchnorm [83] (batch normalization) + ReLU [84] denotes applying the ReLU (Rectified Linear Unit) activation function after the batch normalization operation between the convolutional layers, and Batchnorm + Softmax denotes applying the softmax function after the batch normalization layer between the two convolutional layers. The combination of softmax with batch normalization showed the worst few-shot accuracy within our comparison.
This clearly indicates that normalization and softmax by themselves are not the critical source of the improvement. The combination of batch normalization with the ReLU function showed the best result on the AID dataset. However, its few-shot classification accuracies were relatively low on the other dataset (NWPU). Single batch normalization improved performance on the 1-shot task but showed relatively low accuracy on the 5-shot task. This suggests that normalization is more critical for a 1-shot (extremely limited sample) task than for a 5-shot task. In general, the combination of the softmax activation function with the batch normalization operation showed the best performance in the experiments conducted on the three datasets. This indicates that the effectiveness of the proposed feature selection method is not derived from normalization alone. Therefore, this combination is adopted in the proposed SAFS module.

Other than the combination of functions, an experiment was also conducted to compare the concatenation operation and the addition operation at the end of the SAFS module. The experimental results are shown in Table 5. Except for the 1-shot task on the AID dataset, all results for the concatenation operation outperform the addition operation. With concatenation, features can be automatically selected from the concatenated features of the previous layer, whereas with addition, features are simply passed on to the next layer. Consequently, this experiment provides the basis for adopting a concatenation operation at the end of the SAFS modules.

Table 5. Experiment comparing the concatenation operation and the addition operation at the end of the SAFS module with SAFFNet50.

Training and Prediction
The few-shot scene classification problem can be defined as an N-way K-shot problem, where N denotes the number of unseen classes with support set $S = \{s_1, \ldots, s_N\}$, and each class contains K labeled samples, i.e., for class n, $s_n = \{(x_{n1}, y_{n1}), \ldots, (x_{nK}, y_{nK})\}$. $Y \in \{1, \ldots, N\}$ is the set of corresponding labels in the support set. The goal of the few-shot classification task is to predict labels in Y for unlabeled unseen scene samples. Given a query scene image $\hat{x}$, its label $\hat{y}$ can be predicted from the support set S as follows:

$$\hat{y} = \arg\max_{y_i \in Y} P(y_i \mid \hat{x}, s_i). \qquad (1)$$

As for the training strategy, fine-tuning is widely and successfully exploited to refine the feature representation for a small target dataset. Since our model focuses on feature extraction rather than training strategies, we adopt the Baseline++ training strategy of [85] due to its simple implementation and effectiveness. In detail, SAFFNet is fine-tuned by the support set S. At the beginning, a pre-trained model $f_\theta$ is obtained from scratch on the large-volume in-bag training dataset. Then, for new classes, the pre-trained backbone model is frozen except for the classifier, which makes the final decision based on the softmax function. The overall architecture is shown in Figure 4. Following [85], a cosine similarity metric with a softmax function is adopted as the classifier for the final decision. Specifically, a cosine function is used to measure the similarity between the feature vector q of a query image and those of the support images.
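The N-way K-shot protocol defined above can be sketched with a generic episode sampler. This is a hypothetical helper illustrating the protocol, not the paper's exact sampling code; `n_query` is an assumed parameter for the number of held-out query samples per class.

```python
import random

def sample_episode(dataset, n_way=5, k_shot=1, n_query=15, seed=0):
    """Draw one N-way K-shot episode from {class_name: [samples]}:
    N novel classes, with K labeled support samples and n_query
    query samples per class, labeled 0..N-1 within the episode."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        picks = rng.sample(dataset[cls], k_shot + n_query)
        support += [(x, label) for x in picks[:k_shot]]
        query += [(x, label) for x in picks[k_shot:]]
    return support, query

# Toy dataset: 10 classes with 20 samples each.
data = {f'class_{i}': list(range(100 * i, 100 * i + 20)) for i in range(10)}
s, q = sample_episode(data, n_way=5, k_shot=1, n_query=15)
```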
In detail, the similarity between the feature vector $q_i$ of a query image and the feature vector $s_i^n$ of a support image belonging to class n can be calculated by

$$d(q_i, s_i^n) = \frac{q_i \cdot s_i^n}{\|q_i\|\,\|s_i^n\|}. \qquad (2)$$

To obtain a prediction result, the softmax function is applied to compute the posterior probability as follows:

$$P(y = n \mid q_i, s_i^n) = \frac{\exp\left(d(q_i, s_i^n)\right)}{\sum_{m=1}^{N} \exp\left(d(q_i, s_i^m)\right)}. \qquad (3)$$
Based on the similarity between the query image and the support images, the label of the query image is predicted by maximizing the posterior probability over the N classes:

$$\hat{y} = \arg\max_{n} P(y = n \mid q_i, s_i^n). \qquad (4)$$
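Equations (2)–(4) amount to a cosine-similarity classifier with a softmax posterior, which can be sketched as follows (one support embedding per class is assumed here for simplicity):

```python
import numpy as np

def cosine_sim(a, b):
    # Equation (2): cosine similarity between two embedding vectors.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def predict(q, class_feats):
    """Classify a query embedding q against one support embedding per
    class via cosine similarity + softmax, mirroring Eqs. (2)-(4)."""
    sims = np.array([cosine_sim(q, f) for f in class_feats])
    e = np.exp(sims - sims.max())
    posterior = e / e.sum()            # Equation (3): softmax over N classes
    return int(posterior.argmax()), posterior  # Equation (4): argmax label

class_feats = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
label, p = predict(np.array([0.9, 0.1]), class_feats)  # closest to class 0
```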

Datasets
To verify the effectiveness and performance of the proposed SAFFNet, three scene classification datasets are each divided into base training, validation and test sets. The UC Merced land-use dataset [86] contains 2100 land-use images in 21 classes, and each class has 100 scene images of 256 × 256 pixels in the RGB color space. This dataset was manually collected from urban-area imagery of various regions of the United States acquired by the United States Geological Survey (USGS). For the UC Merced dataset, the training, validation and test sets were randomly selected from the 21 scene classes with 11, 5 and 5 classes, respectively.
The AID dataset [50] is a large-scale aerial image database created by collecting sample images from Google Earth imagery, post-processed using RGB rendering from the original optical aerial images. The dataset comprises 10,000 images in 30 scene classes. These classes are randomly split into 15, 8 and 7 for training, validation and testing, respectively. The test set classes are beach, commercial, forest, mountain, pond, river and stadium. Furthermore, the images of each class in the AID dataset are carefully chosen from different countries and regions around the world to increase within-class diversity. The AID dataset is therefore more challenging for scene classification than the UC Merced dataset.
The NWPU-RESISC45 dataset [20] contains 31,500 remote sensing scene images in 45 scene classes, and each class consists of 700 aerial images, as shown in Figure 5. All images are 256 × 256 pixels in the RGB color space. The 45 scene classes are randomly split into 23, 11 and 11, respectively. The test set classes are basketball court, church, dense residential, golf course, intersection, medium residential, palace, rectangular farmland, sea ice, stadium and thermal power station. This is the largest publicly available dataset for the remote sensing scene classification task.

Figure 5. Example images from the NWPU-RESISC45 [20] dataset. We adopted basketball court, church, dense residential, golf course, intersection, medium residential, palace, rectangular farmland, sea ice, stadium and thermal power station as novel classes for few-shot classification.

Experimental Settings
SAFFNet is trained based on ResNet 101, 50, 34 and 18 from scratch. The batch size is set to 32, i.e., each batch contains 32 samples. Standard data augmentation, including random crop, left-right flip and color jitter, is applied. The total number of epochs at the base training stage is set to 600, and the learning rate is set to 0.001. The models are trained with the Adam optimizer [87], with β1 and β2, which control the exponential decay rates, set to 0.9 and 0.999, respectively. Input images are resized to 224 × 224 pixels. Due to the limited number of classes, only 5-way 1-shot and 5-way 5-shot tasks are adopted for comparison in the following experiments. Cross-entropy loss is adopted for both base training and few-shot training.

Experimental Results
First of all, the evaluation metric for our experiments is defined as follows. Given the embedding feature vectors of the query images $\{q_1, \ldots, q_{N_q}\}$, the most similar support feature $s_n$ among the N classes is retrieved based on the similarity function $d(\cdot, \cdot)$ in Equation (2), and a query sample is assumed to belong to the same class as its most similar support feature. Accordingly, the evaluation metric is defined by

$$\text{Accuracy} = \frac{1}{N_q} \sum_{i=1}^{N_q} \mathbb{1}\left[\hat{y}_i = y_i\right],$$

where $\hat{y}_i$ and $y_i$ are the predicted and ground-truth labels of the i-th query image. To validate the effectiveness of the proposed SAFFNet, other few-shot methods are used for comparison, including the prototypical network (denoted as PN) [30], matching network (denoted as MN) [29] and relation network (denoted as RN) [31]. The experiments are carried out on the NWPU, AID and UCM datasets. For a fair comparison, all models are trained from scratch and the backbone hyperparameters are set the same as for SAFFNet. Prediction results are shown in Figure 6. From Table 6, one can see that SAFFNet significantly improves classification accuracy in both 5-shot and 1-shot learning. This result indicates that feature extraction and selection are more important factors for improving remote sensing few-shot scene classification than the training strategy. In addition, we conduct experiments to compare SAFFNet with other multi-scale deep feature fusion methods, namely Res2Net [69] and FPN (Feature Pyramid Network) [55]. As shown in Table 7, SAFFNet obtains the best results among the deep feature fusion methods. Furthermore, this experimental result also confirms the effectiveness of letting the network select features intelligently rather than fusing features directly.

Table 6. Few-shot scene classification accuracies on three datasets provided by the proposed SAFFNet, ProtoNet [30], MatchingNet [29] and RelationNet [31] models on ResNet18, denoted as SAFFNet18, PN-ResNet18, MN-ResNet18 and RN-ResNet18, respectively.

Table 7. Comparison with other multi-scale feature fusion methods, i.e., Res2Net [69] and FPN [55].
Note that for FPN we concatenate features instead of making direct predictions, which yields better results in the classification task.
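The nearest-feature evaluation described above can be sketched as follows, assuming cosine similarity as the similarity function d(·, ·) (the paper's Equation (2) may differ):

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between each row of a and each row of b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def few_shot_accuracy(query_feats, query_labels, class_feats):
    """Assign each query to the class whose representative feature is most
    similar, then compute the fraction of correct assignments.

    query_feats:  (N_q, D) embedding vectors of the query images
    query_labels: (N_q,)   ground-truth class indices
    class_feats:  (N, D)   one representative feature per class
    """
    sims = cosine_sim(query_feats, class_feats)   # (N_q, N)
    preds = sims.argmax(axis=1)                   # most similar class per query
    return float((preds == query_labels).mean())

# Toy usage: 5 well-separated class features, queries close to their class.
class_feats = np.eye(5)
queries = np.eye(5) * 0.9 + 0.01
labels = np.arange(5)
acc = few_shot_accuracy(queries, labels, class_feats)  # -> 1.0
```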

To further validate the effectiveness of the proposed method, four different backbone networks were exploited for further comparison, as shown in Table 8. Here, the backbone networks are ResNet-101, -50, -34 and -18, respectively. The experimental results show that the proposed SAFFNet significantly increases both few-shot classification accuracy and robustness in both the 5-way 5-shot and 5-way 1-shot tasks on all datasets, compared with the backbone models without feature fusion.

Ablation Study
For the ablation study, we conducted experiments to validate the influence of the MFG and SAFS modules of SAFFNet. The experimental results are shown in Table 9 and prove that both modules improve the classification accuracies for the few-shot scene classification problem. In particular, the improvement from the MFG module is larger on the 5-shot problem, while the SAFS module yields a larger increase on the 1-shot problem. By aggregating the two modules, SAFFNet achieves the best results on both the 5-shot and 1-shot problems.

Table 9. Ablation study to verify the influence of the MFG and SAFS modules.

To measure the influence of the fused features at each stage, we conducted an experiment on different combinations of feature fusion, applying the proposed SAFFNet model based on the ResNet-50 network to the 5-way 5-shot and 5-way 1-shot tasks on the NWPU dataset, as shown in Table 10. The outputs from the first stage to the last stage of the base CNN model are denoted as S1, S2, S3 and S4, as shown in Figure 3. The feature vector S4 alone denotes the final feature map output without the multi-scale strategy, as used by the other networks, e.g., the PN, MN and RN models. As shown in Table 10, the proposed MFG module using the combination of the features S1, S2 and S3 with S4 showed the best classification accuracy on the NWPU dataset in both the 5-way 5-shot and 5-way 1-shot tasks.
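The stage-combination ablation can be illustrated with a minimal sketch: globally pool the selected stage outputs among S1–S4 and concatenate them into one fused vector. The channel widths below follow a ResNet-50-like backbone and are illustrative; the function name is hypothetical.

```python
import torch

def fuse_stage_features(stage_maps, use_stages=(0, 1, 2, 3)):
    """Globally average-pool the selected stage outputs and concatenate them.

    stage_maps: list of 4 tensors (B, C_i, H_i, W_i) from the backbone stages
                S1..S4; use_stages selects which combination to fuse
                (Table 10 in the paper ablates these combinations).
    """
    pooled = [stage_maps[i].mean(dim=(2, 3)) for i in use_stages]  # GAP per stage
    return torch.cat(pooled, dim=1)  # (B, sum of selected C_i)

# Example with ResNet-50-like channel widths and spatial sizes per stage.
maps = [torch.randn(2, c, s, s)
        for c, s in [(256, 56), (512, 28), (1024, 14), (2048, 7)]]
fused = fuse_stage_features(maps)            # S1+S2+S3+S4 -> (2, 3840)
baseline = fuse_stage_features(maps, (3,))   # S4 only, the single-scale baseline
```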

This empirically confirms the effectiveness of the proposed refined multi-scale feature fusion network in extracting better features from texture-biased remote sensing scene images. Moreover, for any combination of multi-scale features fused from the first stage to the last stage, the classification accuracy is significantly improved by the proposed MFG and SAFS modules in the remote sensing few-shot scene classification task. This further proves that the proposed model achieves better classification accuracy as well as robustness.

Table 10. Accuracies provided by the proposed SAFFNet on the NWPU dataset for different combinations of multi-scale feature fusion.

To further verify the influence of the number of shots on the few-shot scene classification results, 5-way K-shot (K = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}) tasks are experimented on the NWPU dataset. For a fair comparison with the proposed SAFFNet, all experiments were based on the ResNet-50 network. The "shot" denotes the number of new training instances of each unseen class. As shown in Figure 7, as the number of training shots K increases, the few-shot classification accuracies improve smoothly on the NWPU dataset for the proposed SAFFNet in the 5-way K-shot tasks. Furthermore, the influence of the number of "ways" on the classification results is also evaluated with N-way 5-shot and 1-shot (N = {3, 4, 5, 6, 7, 8, 9, 10}) tasks, conducted on the NWPU dataset with the proposed SAFFNet model. The "way" denotes the number of given classes in the test set. As shown in Figure 8, as the number N increases, the classification accuracies gradually decrease on the NWPU dataset.
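The N-way K-shot episode construction used in these experiments can be sketched as follows. This is a generic episodic sampler; the function and variable names are hypothetical, not taken from the paper.

```python
import random

def sample_episode(labels_to_indices, n_way=5, k_shot=5, n_query=15, rng=None):
    """Sample one N-way K-shot episode.

    labels_to_indices: dict mapping class label -> list of sample indices.
    Returns (support, query): lists of (sample_index, episode_class) pairs,
    with k_shot support and n_query query samples per sampled class.
    """
    rng = rng or random.Random()
    classes = rng.sample(sorted(labels_to_indices), n_way)  # pick N classes
    support, query = [], []
    for ep_label, cls in enumerate(classes):
        # Draw support and query samples without replacement within a class.
        chosen = rng.sample(labels_to_indices[cls], k_shot + n_query)
        support += [(i, ep_label) for i in chosen[:k_shot]]
        query += [(i, ep_label) for i in chosen[k_shot:]]
    return support, query

# Toy usage: 10 classes with 30 samples each; a 5-way 1-shot episode.
data = {c: list(range(c * 100, c * 100 + 30)) for c in range(10)}
support, query = sample_episode(data, n_way=5, k_shot=1, n_query=15,
                                rng=random.Random(0))
```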

In addition, the prediction errors in few-shot classification gradually decrease as the number of shots and ways increases. This indicates that the errors become smaller when more data samples are available for few-shot classification.
Lastly, the few-shot classification accuracy as a function of the depth of the backbone network is shown in Figure 9. The experiment confirms that a deeper backbone network does not always ensure better few-shot classification accuracy, in either the 5-shot or the 1-shot task. Furthermore, the experimental results for the 5-shot task show a more stable curve with respect to the depth of the network. This also suggests that additional parameters do not always ensure better few-shot accuracy when only an extremely small number of samples is available.

Conclusions
In this paper, a self-attention feature selection module was proposed for deep feature fusion in a multi-scale structure for few-shot remote sensing image classification. The method is denoted as SAFFNet, i.e., a self-attention-based feature fusion network. SAFFNet consists of two modules, i.e., a multi-scale feature generation (MFG) module and a self-attention feature selection (SAFS) module. In the MFG module, refined feature maps at different resolutions with richer semantic information are generated for a scene image by channel-wise attention based on GCNet [74]. In this way, the feature importance at different receptive fields can be computed automatically. After that, the refined features from different scales are concatenated as input to the SAFS module for further feature fusion. In the SAFS module, important features with larger coefficients are highlighted while trivial ones with smaller coefficients decay. Finally, the module is cascaded to concatenate the features generated from the previous module. Experiments conducted on three remote sensing scene image datasets confirm the effectiveness of the proposed SAFFNet, which significantly improves scene classification accuracies compared with existing few-shot scene classification models as well as multi-scale feature fusion methods.
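A minimal sketch of such attention-based feature re-weighting is shown below, using an SE-style gating layer as a stand-in; the actual SAFS module in the paper is more elaborate, and the class name and reduction ratio here are assumptions.

```python
import torch
from torch import nn

class FeatureSelect(nn.Module):
    """Sketch of self-attention feature selection over a fused feature vector:
    a small gating network produces per-channel weights in (0, 1), so
    important features are emphasised while trivial ones decay."""

    def __init__(self, dim, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),  # per-channel weight in (0, 1)
        )

    def forward(self, x):        # x: (B, dim) concatenated multi-scale features
        return x * self.gate(x)  # re-weighted features, same shape

feats = torch.randn(4, 256)          # e.g., fused multi-scale feature vectors
out = FeatureSelect(256)(feats)      # (4, 256) re-weighted output
```

Because the gate output lies in (0, 1), every channel is attenuated by a learned factor rather than replaced, which preserves the concatenated structure of the multi-scale features.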
Although SAFFNet has shown outstanding performance on the few-shot scene classification task for remote sensing images, it still requires a few labeled samples of unseen classes to achieve efficient and meaningful fine-tuning of the CNN backbone network. As future work, a generative adversarial network-based algorithm [88] or semi-supervised learning approaches [89,90] will be adopted to address this problem from a data-driven perspective. Additionally, other efficient CNN architectures will be searched for with a neural architecture search [65] strategy for few-shot remote sensing land-cover/land-use image classification tasks.

Conflicts of Interest:
The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.