Few-shot Classification of Aerial Scene Images via Meta-learning

Pei Zhang 1,†, Yunpeng Bai 2,†, Dong Wang 1, Bendu Bai 3 and Ying Li 1,*

1 School of Computer Science, National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, Shaanxi Provincial Key Laboratory of Speech & Image Information Processing, Northwestern Polytechnical University, Xi'an 710129, China; cszhangpei@mail.nwpu.edu.cn (P.Z.); dongwang@mail.nwpu.edu.cn (D.W.)
2 School of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia; yunpengb@student.unimelb.edu.au (Y.B.)
3 School of Communication and Information Engineering, Xi'an University of Posts and Telecommunications, Xi'an 710121, Shaanxi, China; baibendu@xupt.edu.cn (B.B.)
* Correspondence: lybyp@nwpu.edu.cn (Y.L.); Tel.: +86-029-8843-1532
† These authors contributed equally to this work.

Version November 16, 2020 submitted to Remote Sens.


Introduction
Aerial images, taken from the air and space, provide abundant detail about the Earth's surface, such as its landforms, vegetation, landscapes, buildings, and various other resources. Such rich information is a significant data source for earth observation [1], which opens the door to a broad range of essential applications spanning urban planning [2], land-use and land-cover (LULC) determination [3,4], mapping [5], environmental monitoring [6], and climate modeling. As a fundamental problem in the remote sensing community, aerial scene classification is crucial for these research fields. Xia et al. [7] defined aerial scene classification as automatically assigning a specific semantic label to each image according to its content.
Over the past few decades, aerial scene classification has enjoyed much attention from researchers, and many methods have been proposed. Existing approaches mostly fall into three categories: methods adopting low-level feature descriptors, methods using middle-level visual representations, and methods relying on deep learning networks.
Methods adopting low-level feature descriptors. Most early research [8][9][10] on aerial image classification falls into this category. These methods use hand-crafted, low-level visual features such as color, spectrum, texture, structure, or their combinations to distinguish aerial scene images. Among the hand-crafted features, the most representative descriptors include color histograms [10], texture features [9], and SIFT [8]. While this type of method performs well on certain aerial scenes with uniform structures and spatial arrangements, its performance is limited for aerial images containing complex semantic information.

Methods using middle-level visual representations. To overcome the insufficiency of low-level methods, many middle-level methods have been explored for aerial scene classification. Such methods mainly aim at combining the local visual attributes extracted by low-level feature methods into high-order statistical patterns to build a holistic representation of aerial scenes. Bag of Visual Words (BOVW) [11] and many of its variants have been widely used. Besides the BOVW model, typical middle-level methods include, but are not limited to, Spatial Pyramid Matching (SPM) [12], Vector of Locally Aggregated Descriptors (VLAD) [13], Locality-constrained Linear Coding (LLC) [14], Probabilistic Latent Semantic Analysis (pLSA) [15], and Latent Dirichlet Allocation (LDA) [16]. Compared with low-level methods, scene classification methods using middle-level visual representations obtain higher accuracy. However, middle-level methods will only go so far: they require hand-designed features, lack adaptability, and generalize poorly to complex scenes or massive data.
Methods relying on deep learning. With the emergence of deep learning, especially convolutional neural networks [17,18], image classification approaches have seen great success in both accuracy and efficiency, including in the remote sensing field. Methods relying on deep neural networks automatically learn global features from the input data and cast aerial scene classification as an end-to-end problem. More recently, deep CNN methods have become the new state-of-the-art solutions [19][20][21][22] for aerial scene classification, yet there are clear limitations. Specifically, the most notorious drawback of deep learning methods is that they typically require vast quantities of labeled data and suffer from poor sample efficiency, which excludes many applications where data are intrinsically rare or expensive [23]. In contrast, humans possess a remarkable ability to learn new abstract concepts from only a few examples and quickly generalize to new circumstances. For instance, Marcus [24] pointed out that even a 7-month-old baby can learn abstract language-like rules from a handful of unlabeled examples, in just two minutes.
Why do we need few-shot learning? In a world with unlimited data and computational resources, we would hardly need any technique other than deep learning. However, we live in a real world where data are never infinite, especially in the remote sensing community, owing to the high cost of collection. Moreover, almost all existing aerial scene datasets have several notable limitations.
On the one hand, classification accuracy is saturated: state-of-the-art methods can achieve nearly 100% accuracy on the most popular UC Merced dataset [11] and the WHU-RS19 dataset [25]. Yet, we argue, such a limited number of categories in these two datasets is critically insufficient for the real world. On the other hand, the number of scene categories and the number of images per class are limited, and the images lack scene variation and diversity. An intuitive way to tackle this issue is to construct a large-scale dataset for aerial scene classification, and several more challenging datasets, including the AID dataset [7], the PatternNet dataset [26], the NWPU-RESISC45 dataset [19], and the RSD46-WHU dataset [27,28], have been proposed.
Although aerial scene datasets are increasing in scale, most of them are still considered small from the perspective of deep learning. For similar situations in the machine learning community, few-shot learning [29] offers an alternative way to address the data-hungry issue from a different standpoint. Instead of expanding the dataset scale, few-shot learning aims to learn a model that can quickly generalize to new tasks from very few labeled examples. Arguably, few-shot learning is a human-like way of learning: it assumes a more realistic situation in which one does not rely on thousands or millions of supervised training examples. Namely, few-shot learning can help relieve the burden of collecting data, especially in domains where collecting labeled examples is time-consuming and laborious, such as the aerial scene field or drug discovery. Figure 1 demonstrates a specific 1-shot scenario in which it is possible to learn much about a new category from just one image. Seeing the potential of few-shot learning to alleviate the data-gathering effort, improve computing efficiency, and bridge the gap between artificial intelligence and human-like learning, we introduce the few-shot paradigm to the aerial scene classification problem. The goal of this work is to classify aerial scene images with only 1 or 5 labeled samples. More specifically, we adopt a meta-learning framework to address this problem. To the best of our knowledge, we are the first to provide a testbed for few-shot classification of aerial scene images. We re-implement three state-of-the-art few-shot learning algorithms, namely Prototypical Networks [29], MAML [30], and Relation Network [31], as well as a typical CNN-based method, D-CNN [21], for comparison.
The main contributions of this article are summarized as follows.
1. This is the first work to provide a testbed for comparing several different few-shot learning algorithms in the aerial scene field. Our experimental evaluation reveals that it is possible to learn much about a new category from just a few labeled images, which shows great potential for introducing the few-shot paradigm to the remote sensing community.
2. The proposed method includes a feature extraction module and a meta-learning module. First, ResNet-12 is used as a backbone to learn a representation $f_\theta$ of the input on the base set. Then, in the meta-training stage, we optimize the classifier by cosine distance with a learnable scale parameter in the feature space.
3. We conduct extensive experiments on two challenging datasets: NWPU-RESISC45 and RSD46-WHU. In addition, we build a mini dataset from RSD46-WHU to investigate how the scale of the dataset affects the performance. Finally, we analyze the performance as a function of the number of support shots. The experimental results demonstrate that our model is especially effective in few-shot settings.
The remainder of this paper is organized as follows. In Section 2, we discuss related work on CNN-based methods for aerial scene classification and various state-of-the-art few-shot classification approaches developed recently. In Section 3, we introduce some preliminaries of few-shot classification, as it may be new to some readers. The proposed meta-learning method is described in Section 4. We illustrate the datasets and discuss the experimental results in Section 5. Finally, Section 6 concludes the paper with a summary and an outlook.

CNN-based methods of Aerial Scene Classification
Aerial scene classification has been well studied over the last few decades owing to its broad applications. Since the emergence of AlexNet [17] in 2012, deep learning-based methods have made an enormous breakthrough, decisively outperforming the traditional methods based on low-level and middle-level features, and have become mainstream in the aerial scene classification task.
One strand of work attempts to use transfer learning to fine-tune pre-trained CNNs for aerial image classification. In [32], Yu et al. studied how to transfer the activations of CNNs pre-trained on the ImageNet dataset to high-resolution remote sensing classification. Cheng et al. [19] obtained better performance by fine-tuning AlexNet [17], VGGNet-16 [18], and GoogLeNet [33] on the NWPU-RESISC45 dataset. Similarly, Nogueira et al. [20] carried out three strategies, namely full training, fine-tuning, and using CNNs as feature extractors, for exploiting six common CNNs on three remote sensing datasets. Their experimental results demonstrate that fine-tuning is generally the best strategy across different situations.
Some further studies utilize pre-trained CNNs for feature extraction and combine the high-level semantic features with hand-crafted features. Zhao and Du [34] proposed a CNN framework to learn local spatial patterns at multiple scales. Wang et al. [35] presented an encoded mixed-resolution representation framework in which multilayer features are extracted from various convolutional layers. The study by Lu et al. [36] introduced an adaptive feature strategy that fuses the deep learning feature and the SIFT feature to handle scale and rotation variability, which is essential in remote sensing images but cannot be fully captured by CNN-based methods.
More recent research has begun to address the problem of within-class diversity and between-class similarity in aerial scene images. For example, to tackle this issue, Cheng et al. [21] trained a discriminative CNN model by optimizing a novel objective function: beyond a traditional cross-entropy loss, a metric learning regularization term and a weight decay term are added. Li et al. [22] constructed a feature fusion network that combines the original feature and an attention map feature; in addition, they adopted the center loss [37] to improve feature distinguishability.

Few-shot Classification via Meta-Learning
Deep learning-based approaches have achieved remarkable success in various fields, especially where vast quantities of data can be collected and substantial computing resources are available. However, deep learning often suffers from poor sample efficiency. Recently, few-shot learning has been proposed to tackle this problem and has been marked by exceptional progress. Few-shot learning aims to learn new concepts from only small amounts of samples and to quickly adapt to unforeseen tasks, which can be viewed as a special case of meta-learning. In the following, we introduce some representative few-shot classification literature, grouped into two main streams: optimization-based methods and metric-based methods.
Optimization-based methods. This line of work is best understood as learning to learn, tackling the few-shot classification problem by effectively optimizing model parameters for new tasks. Finn et al. proposed a model-agnostic algorithm named MAML [30], which aims to learn a good initialization of any standard neural network, so as to prepare that network for fast adaptation to any novel task through only one or a few gradient steps. The authors also presented a first-order approximation of MAML that ignores second-order derivatives to speed up computation. Reptile [38] expands on the results of MAML by performing a Taylor series expansion of the update and finding a point near all solution manifolds of the training tasks. Many variants [39][40][41] of MAML follow a similar idea: with a good initialization, one is just a few gradient steps away from a solution to a new task. These approaches face a critical challenge in that the outer optimization needs to solve for as many parameters as the inner optimization. Besides, there is a key debate over whether a single initial condition is sufficient to provide fast adaptation for a wide range of potential tasks, or whether it is restricted to relatively narrow task distributions.
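To make the inner/outer structure of this family concrete, the following is a minimal sketch of a first-order MAML-style meta-update in PyTorch, written from our reading of [30] rather than taken from its reference implementation; `model`, `loss_fn`, and the task tuples are placeholder assumptions.

```python
# A minimal first-order MAML sketch (our reading of [30]); not the authors' code.
import copy
import torch

def maml_first_order_step(model, tasks, loss_fn,
                          inner_lr=0.01, inner_steps=5, meta_lr=0.001):
    """One meta-update over a batch of tasks, ignoring second-order terms."""
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for support_x, support_y, query_x, query_y in tasks:
        # Inner loop: adapt a copy of the current initialization on the support set.
        fast = copy.deepcopy(model)
        opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            opt.zero_grad()
            loss_fn(fast(support_x), support_y).backward()
            opt.step()
        # Outer objective: query loss of the adapted parameters.
        query_loss = loss_fn(fast(query_x), query_y)
        grads = torch.autograd.grad(query_loss, tuple(fast.parameters()))
        meta_grads = [mg + g for mg, g in zip(meta_grads, grads)]
    # First-order update: apply the averaged query gradients to the initialization.
    with torch.no_grad():
        for p, g in zip(model.parameters(), meta_grads):
            p -= meta_lr * g / len(tasks)
```

The full MAML would instead differentiate the query loss through the inner-loop updates themselves; dropping those second-order terms is exactly the first-order approximation discussed above.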
Metric-based methods. Another family of approaches aims to address few-shot classification by learning to compare. The key insight is to learn a feature extractor that maps raw inputs into a representation suitable for prediction, such that, in this feature space, query and support samples are easy to compare (e.g., with Euclidean distance or cosine similarity). Matching Networks [42] map the support set to a function via an attention mechanism and then classify a query sample with a weighted nearest-neighbor classifier in an embedding space. Prototypical Networks [29] follow a similar idea, learning a metric-based prediction rule over embeddings: the prototype of each category is represented by the mean embedding of its samples, so that classification can be performed by computing distances to the nearest category mean. Beyond the usual embedding module, Relation Network [31] introduces an additional parameterized CNN-based 'relation module' for learnable metric comparison.
While meta-learning approaches have seen great success in few-shot classification, some pre-training-based methods have recently attained competitive performance [43,44]. Our work is more related to the second line of work in that it finds a suitable distance metric, while also taking advantage of pre-training by learning good feature embeddings.

Preliminary
Before introducing our overall framework in detail, we first look at some preliminaries of few-shot classification, as it may be new to some readers.
In standard supervised classification, we deal with a training set of labeled input-output pairs, denoted as $\mathcal{D}_{train} = \{(x_i, y_i)\}_{i=1}^{N}$, where $N$ is the number of training samples and $y_i \in \{1, \dots, C\}$ with $C$ the number of categories in $\mathcal{D}_{train}$. We are interested in learning a model $y = f(x)$ that predicts the label of an unseen input. In few-shot classification, we instead consider learning a model that can generalize effectively to categories unseen during training, given only a few samples, usually 1 or 5, in each new category. Following recent work [29,42], we formalize the few-shot classification paradigm as follows. Given a meta-set $\mathcal{D} = \mathcal{D}_{base} \cup \mathcal{D}_{novel}$ with disjoint category sets $\mathcal{C}_{base} \cap \mathcal{C}_{novel} = \emptyset$, where $\mathcal{C}$ denotes the categories of the corresponding set, the vision is to learn a model on $\mathcal{D}_{base}$ that can quickly adapt to the unseen categories in $\mathcal{D}_{novel}$ with only a few support samples. To this end, we train and evaluate the model on a set of tasks, or so-called episodes; here, we treat entire tasks as the training instances of conventional machine learning. Specifically, we adopt an N-way K-shot setting, in which each episode has a support set $\mathcal{S}$ and a query set $\mathcal{Q}$. The support set contains N unique categories with K labeled samples each. The query set holds the same N categories, each with Q unlabeled samples to classify. The difference between standard supervised classification and few-shot classification is illustrated in Figure 2. More details are described in Section 4.3.
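To illustrate the episode protocol, the snippet below sketches how one N-way K-shot episode can be drawn; the `index` dictionary mapping each category to its list of image paths is a hypothetical stand-in for a real dataset loader.

```python
# A minimal sketch of N-way K-shot episode construction; `index` is assumed
# to map each category label to a list of image paths.
import random

def sample_episode(index, n_way=5, k_shot=1, q_query=15):
    """Draw one episode: N categories, K support and Q query samples each."""
    categories = random.sample(list(index.keys()), n_way)
    support, query = [], []
    for label, cat in enumerate(categories):
        samples = random.sample(index[cat], k_shot + q_query)
        support += [(path, label) for path in samples[:k_shot]]
        query += [(path, label) for path in samples[k_shot:]]
    return support, query
```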

Overall Framework
In this work, we propose a meta-learning method for few-shot classification of aerial scene images. The framework consists of a feature extractor, a meta-training stage, and a meta-testing stage. Figure 3 illustrates the overall procedure of our method. First, a feature extractor is trained on the base set $\mathcal{D}_{base}$ to learn a representation of the inputs for later comparison in feature space. To achieve this, we train a typical classifier on all base categories by minimizing a standard cross-entropy loss and remove its last fully connected layer to obtain a 512-dimensional feature representation. Then, we train a meta-learning classifier over a set of episodes in the meta-training stage. Concretely, the objective is to optimize the classifier by minimizing the generalization error across episodes. Within a single episode, the query features are compared with the per-category means of the support features by cosine distance. Finally, in the meta-testing stage, the meta-learning classifier is evaluated on a set of episodes sampled from the novel set $\mathcal{D}_{novel}$, usually referred to as the meta-test set.

Feature extractor
We train a feature extractor $f_\theta$ with parameters $\theta$ on the base set $\mathcal{D}_{base}$ that encodes the input data into a 512-dimensional feature vector suitable for comparison. Here we employ ResNet-12 to learn a classifier on all base categories and remove the last fully connected layer to obtain $f_\theta$, as described below; other backbones could also be used. Before being fed to the network, all input images in $\mathcal{D}_{base}$ are resized to 80 × 80. The architectural setting of the ResNet-12 we use, illustrated in Figure 4, consists of four ResNet blocks. Each ResNet block comprises three convolutional layers with 3 × 3 kernels, each followed by batch normalization (BN) and Leaky ReLU.
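For concreteness, one block of such a backbone could be sketched in PyTorch as follows; the 1 × 1 projection shortcut, the Leaky ReLU slope of 0.1, and the trailing max-pooling layer are common ResNet-12 choices that we assume here rather than details confirmed by Figure 4.

```python
# A sketch of one ResNet-12 block: three 3x3 conv layers, each with BN and
# Leaky ReLU, plus an assumed 1x1 shortcut and 2x2 max-pool.
import torch.nn as nn

class ResNetBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        def conv_bn(ic, oc):
            return nn.Sequential(nn.Conv2d(ic, oc, 3, padding=1, bias=False),
                                 nn.BatchNorm2d(oc))
        self.convs = nn.ModuleList([conv_bn(in_ch, out_ch),
                                    conv_bn(out_ch, out_ch),
                                    conv_bn(out_ch, out_ch)])
        self.shortcut = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, bias=False),
                                      nn.BatchNorm2d(out_ch))
        self.relu = nn.LeakyReLU(0.1)  # slope assumed
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        out = x
        for i, conv in enumerate(self.convs):
            out = conv(out)
            if i < 2:                  # activation after the first two convs
                out = self.relu(out)
        out = self.relu(out + self.shortcut(x))  # residual connection
        return self.pool(out)
```

Stacking four such blocks and global-pooling the output would yield the 512-dimensional feature vector used for comparison.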

Meta-training stage
Section 3 briefly introduced the problem setting of few-shot learning. In this section, we begin by presenting the problem definition and the key notation more formally, which is useful for understanding the subsequent training procedure.
A dataset in few-shot learning is denoted as a meta-set $\mathcal{D}$, which is split into $\mathcal{D}_{base}$ and $\mathcal{D}_{novel}$, the base set and the novel set, respectively. The goal is to learn a model on the base set $\mathcal{D}_{base}$ that acquires meta-knowledge which can generalize to unseen categories in the novel set $\mathcal{D}_{novel}$ with only a few support samples. In this section, we focus on the learning procedure in the meta-training stage, which only processes data in the base set $\mathcal{D}_{base}$. To train the model effectively, we assume performance improves by learning from a set of tasks, denoted as $\mathcal{T} = \{\mathcal{T}_i\}$, also known as episodes. In effect, an episode $\mathcal{T}_i$ is treated as a data point in meta-learning. Following the standard few-shot classification paradigm, we employ the N-way K-shot setting to evaluate the model. Thus, we construct an episode with N randomly selected categories, each with K support samples and Q query samples. One intuitive way to predict the probability that a query sample $x$ belongs to category $c$ is to compare the distance between the feature embedding $f_\theta(x)$ and the centroid $\mu_c$ of category $c$, where $\mu_c = \frac{1}{K} \sum_{x_i \in \mathcal{S}_c} f_\theta(x_i)$ is the mean embedding of the support samples of category $c$. Two common distance metrics are Euclidean distance and cosine similarity; here we employ the cosine similarity with a learnable scale parameter $\tau$, and thus the prediction can be formalized as follows:

$$p(y = c \mid x) = \frac{\exp\left(\tau \cdot \cos\left(f_\theta(x), \mu_c\right)\right)}{\sum_{c'=1}^{N} \exp\left(\tau \cdot \cos\left(f_\theta(x), \mu_{c'}\right)\right)}.$$
The learned model is then adapted to predict unseen categories with new support sets drawn from $\mathcal{D}_{novel}$.
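A minimal sketch of this classifier is given below, assuming pre-computed support and query embeddings from $f_\theta$; the tensor shapes and the helper name are our own.

```python
# A sketch of the cosine classifier with a learnable scale, following the
# equation above; embeddings are assumed to come from the trained backbone.
import torch
import torch.nn.functional as F

def cosine_logits(support_feat, support_label, query_feat, n_way, scale):
    """support_feat: [N*K, D]; support_label: [N*K]; query_feat: [N*Q, D]."""
    # Centroid (prototype) of each category: mean of its support embeddings.
    protos = torch.stack([support_feat[support_label == c].mean(0)
                          for c in range(n_way)])                    # [N, D]
    # Cosine similarity between every query embedding and every centroid.
    sims = F.normalize(query_feat, dim=-1) @ F.normalize(protos, dim=-1).t()
    return scale * sims  # feed to cross-entropy; softmax gives p(y=c|x)
```

During meta-training, `scale` would be an `nn.Parameter` (the learnable $\tau$) updated jointly with the backbone by the episode loss.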

Experiments and Analysis
In this section, we first present some implementation details and a dataset description. Then, we compare our method with three state-of-the-art few-shot methods and one typical CNN-based method, D-CNN. In addition, we construct a new dataset, mini-RSD46-WHU, to investigate how the scale of the dataset impacts the results. Finally, we carry out experiments to evaluate the 5-way accuracy as a function of shots.

Implementation details
Following the few-shot experimental protocol proposed by Vinyals et al. [42], we carry out experiments of N-way classification with K shots, here N = 5 and K = 1 or 5. In the meta-training procedure, a few-shot training batch is composed of several episodes, where an episode is a selection of 5 randomly chosen categories drawn from $\mathcal{D}_{base}$. We set 4 episodes per batch to compute the average loss, i.e., the batch size is 4. The support set in each training episode is expected to match the number of shots used in the meta-test stage; for example, if we want to perform 5-way 1-shot classification at test time, then the training episodes should also be constituted as 5-way 1-shot tasks. Note that each category contains K query samples during the meta-training stage and 15 query samples during meta-testing.
We employ ResNet-12 as our backbone; after removing the fully connected layer, the network generates a 512-dimensional feature vector for each input image. For this step, we use the SGD optimizer with momentum 0.9; the learning rate is initialized to 0.1, and the decay factor is set to 0.1. The feature extractor was trained for 100 epochs with batch size 128 on 2 GPUs, and the weight decay for ResNet-12 is 0.0005. For ProtoNet, MAML, and RelationNet, we follow the original literature and adopt a four-layer convolutional backbone (Conv-4). In addition, we re-implement a typical CNN-based classification method, D-CNN [21], to evaluate its performance in the few-shot scenario; ResNet-12 and the same settings are used in the re-implementation. All our code was implemented in PyTorch and run on 2 NVIDIA RTX 2080 Ti GPUs.
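For reference, the stated optimizer settings correspond to roughly the following PyTorch setup; the text gives a decay factor of 0.1 but not the schedule, so the milestones below are an assumption, and `backbone` stands in for the ResNet-12 module.

```python
# A sketch of the stated optimizer configuration; milestones are assumed.
import torch
import torch.nn as nn

backbone = nn.Linear(512, 45)  # stand-in for the ResNet-12 feature extractor
optimizer = torch.optim.SGD(backbone.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[60, 80], gamma=0.1)
```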

Datasets Description
We evaluate our proposed method on two challenging datasets: NWPU-RESISC45 [19] and RSD46-WHU [27,28]. Besides, to answer the question of how dataset scale impacts performance, we construct a mini dataset from the RSD46-WHU dataset. The details of the considered datasets are described as follows.

The NWPU-RESISC45 dataset was proposed by Cheng et al. [19] in 2017 and has become a popular benchmark in remote sensing classification research. It involves 45 categories with 700 remote scene images per category, each with a size of 256 × 256 pixels. These aerial images were collected by experienced experts from Google Earth; the spatial resolution ranges from approximately 30 m to 0.2 m per pixel. Following the split setting proposed by Ravi et al. [46], we divide the 45 categories into 25, 8, and 12 for meta-training, meta-validation, and meta-testing, respectively. Note that the validation set was held out for hyper-parameter selection in the meta-training stage. The meta-training split consists of the same 25 categories of $\mathcal{D}_{base}$ and is further divided into three sets: meta_train_support, meta_train_val, and meta_train_query. The number of images in each category is shown in Table 1.

Table 1. Split of the NWPU-RESISC45 dataset.

Set     Split                 Categories   Images per category
base    meta_train_support    25           350
base    meta_train_val        25           175
base    meta_train_query      25           175
val     meta_validation       8            700
novel   meta_test (unseen)    12           700

The RSD46-WHU dataset contains 46 categories, each with 428 to 3000 images, for a total of about 117,000. Like many other remote sensing datasets, the images were collected by hand from Google Earth and Tianditu, with the ground resolution spanning from 0.5 m to 2 m. Similar to the NWPU-RESISC45 dataset, the 46 categories in the RSD46-WHU dataset are divided into 26, 8, and 12 for meta-training, meta-validation, and meta-testing, respectively. It is relevant to mention that we dropped about 1200 images in total because some images are not of size 256 × 256 pixels or contain incorrect content. The details of our modified dataset split are listed in Table 2.

We further construct a new dataset, mini-RSD46-WHU, to investigate how the scale of the dataset impacts the results. The mini-RSD46-WHU dataset is formed from the RSD46-WHU dataset by randomly selecting 500 images in each category, except for the category Sewage plant-type-two, which has only 428 images because that is all it holds in the original dataset. We follow the same division setting as the RSD46-WHU dataset; the only change is the number of images in each category. Table 3 shows the details.

Results and Comparisons
Following the most common settings in few-shot classification, namely 5-way 1-shot and 5-way 5-shot, we conduct experiments to evaluate our method's effectiveness. The proposed method is compared with three state-of-the-art few-shot learning algorithms and one conventional deep learning method. The three few-shot methods are ProtoNet, MAML, and RelationNet. In addition, the performance of a conventional classification algorithm, D-CNN, is analyzed in few-shot classification scenarios.
For the 5-way 1-shot experiment, one labeled support sample per category is randomly selected as the supervised sample at test time. Likewise, 5 support samples per category are provided in the 5-shot setting. Following the setting of [29], 15 query images per category are batched in each episode for evaluation. We compute the mean classification accuracy over 800 randomly generated episodes from the novel (meta-test) set.
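The numbers reported below take the form mean ± 95% confidence interval over these episodes; a sketch of this standard computation (the `episode_accs` array is hypothetical) might look like the following.

```python
# A sketch of mean accuracy with a 95% confidence interval over test episodes.
import numpy as np

def mean_ci95(episode_accs):
    """episode_accs: per-episode accuracies, e.g. 800 values in [0, 1]."""
    accs = np.asarray(episode_accs)
    mean = accs.mean()
    ci95 = 1.96 * accs.std() / np.sqrt(len(accs))  # normal approximation
    return 100 * mean, 100 * ci95
```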
On both datasets, the average 5-way accuracies (%) with 95% confidence intervals for 1-shot and 5-shot are reported in Table 4 and Table 5. As we can see, our method outperforms the other models under both the 5-way 1-shot and 5-way 5-shot settings. D-CNN shows inferior performance in both the 1-shot and 5-shot cases, which is reasonable because D-CNN is not designed specifically for few-shot classification. Typical CNN-based methods most likely overfit when given so few supervised samples, whereas the meta-based methods achieve considerable performance. Table 4. Few-shot classification results on the NWPU-RESISC45 dataset.

Method            Backbone    1-shot           5-shot
ProtoNet [29]     Conv-4      51.17% ± 0.79%   74.58% ± 0.56%
MAML [30]         Conv-4      53.52% ± 0.83%   71.69% ± 0.63%
RelationNet [31]  Conv-4      57.10% ± 0.82%   73.55% ± 0.56%
D-CNN [21]        ResNet-12   36.00% ± 6.31%   53.60% ± 5.34%
Ours              ResNet-12   69.46% ± 0.22%   84.66% ± 0.12%

A bar chart of few-shot classification results on both datasets is shown in Figure 6. We observe that our method outperforms the other four methods by a significant margin. Similar to our method, ProtoNet and RelationNet are both metric-based methods: ProtoNet uses Euclidean distance, while RelationNet compares the embeddings $f_\theta$ of support and query samples using an additional parameterized CNN-based 'relation module'. Our method computes the class centers in the same way as ProtoNet, yet we employ a cosine distance with a learnable scaling factor for classification, which contributes substantially to the better performance. For MAML, a representative model-initialization method, we adopt the first-order approximation version in our experiments. The original MAML paper reports that the performance of the first-order approximation is almost identical to that of the full version; we take the first-order version for its efficiency, and the performance of MAML might be narrowly improved by the full version. An interesting phenomenon we observed is shown in Figure 7. We plot the first 90 epochs of the generalization of our model on base and novel categories. Base generalization indicates the accuracy on unseen data from the base categories, and novel generalization means the test performance on data from the novel categories. As shown, while the model achieves better performance on unseen data in the base set, the novel generalization drops instead. Why does the test performance decrease? We suppose that the lack of supervised data causes an over-fitting problem, which leads to this phenomenon. This problem is discussed further in Section 5.4.

Figure 7. Generalization discrepancy in the meta-learning stage.

Effect of Dataset scale
To investigate how dataset scale impacts performance, we construct a variant of the RSD46-WHU dataset with only 500 images in each category, called mini-RSD46-WHU. The overall accuracies of 5-way 1-shot and 5-shot are reported in Table 6. We adopt the same backbone and training strategy on both datasets. As we can see, the performance improves as the scale of the dataset grows: the overall accuracies of 5-way 1-shot and 5-shot on the original dataset increase by 6.86% and 5.78%, respectively, compared to the mini dataset.

Effect of Shots
To further evaluate the 5-way accuracy as a function of shots, we conduct experiments providing our model with 1, 5, 10, 15, 20, and 25 labeled support samples on both datasets. The results are presented in Figure 8. As expected, the prediction accuracy improves greatly as the number of shots increases from 1 to 5. However, the performance does not benefit much as the shot count continues to grow. These findings confirm that our model is especially effective in very-low-shot settings. From the experiments in Section 5.3, we observe in Figure 7 that the model with the best accuracy often appears within the first 40 epochs. For a further analysis of the generalization discrepancy, we plot the generalization curves with different shots on both the NWPU-RESISC45 and RSD46-WHU datasets; see Figure 9 and Figure 10. As we can see, the same phenomenon appears again: as generalization improves on the unseen data of the base set, indicating that the model fits the training objective better, the test performance on the novel task worsens. In other words, the phenomenon persists as the number of labeled support instances increases, so over-fitting may not be the real reason for the drop in test performance. The generalization discrepancy may instead be caused by the objective difference between the novel set and the base set; that is, in the meta-training stage, our model learns representations too specific to the base set, which adversely affects the novel set. Our investigations suggest that the generalization discrepancy might be a potential challenge in few-shot learning.

Conclusion
The topic of few-shot learning has attracted much attention in recent years. In this paper, we bring few-shot learning to aerial scene classification and demonstrate that useful information can be learned from a few instances. To pursue this idea, we propose a meta-learning framework that aims to train a model that generalizes well to unseen categories when provided with a few samples. The proposed method first employs ResNet-12 to learn a representation on the base set, and then, in the meta-training stage, we optimize the classifier by cosine distance with a learnable scale parameter. Our experiments, conducted on two challenging datasets, are encouraging: our method achieves a classification accuracy of around 69% for a new category given just one instance, and approximately 84% given 5 support samples. Furthermore, we conducted several ablation experiments to investigate the effects of dataset scale and support shots. Finally, we observe an interesting phenomenon: there is potentially a generalization discrepancy in meta-learning. We suggest that further research into this phenomenon may be an opportunity to achieve better performance in the future.

Conflicts of Interest:
The authors declare no competing financial interests. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.