Task-Adaptive Embedding Learning with Dynamic Kernel Fusion for Few-Shot Remote Sensing Scene Classification

Abstract: The central goal of few-shot scene classification is to learn a model that can generalize well to a novel scene category (UNSEEN) from only one or a few labeled examples. Recent works in the Remote Sensing (RS) community tackle this challenge by developing algorithms in a meta-learning manner. However, most prior approaches have focused either on rapidly optimizing a meta-learner or on finding good similarity metrics, while overlooking the embedding power. Here we propose a novel Task-Adaptive Embedding Learning (TAEL) framework that complements the existing methods by giving full play to the dual roles of feature embedding in few-shot scene classification: representing images and constructing classifiers in the embedding space. First, we design a Dynamic Kernel Fusion Network (DKF-Net) that enriches the diversity and expressive capacity of embeddings by dynamically fusing information from multiple kernels. Second, we present a task-adaptive strategy that helps to generate more discriminative representations by transforming the universal embeddings into task-adaptive embeddings via a self-attention mechanism. We evaluate our model in the standard few-shot learning setting on two challenging datasets: NWPU-RESISC45 and RSD46-WHU. Experimental results demonstrate that, on all tasks, our method achieves state-of-the-art performance by a significant margin.


Introduction
Scene classification plays an essential role in the semantic understanding of Remote Sensing (RS) images by classifying each image into different categories according to its contents [1]. It provides valuable support to applications such as land use and land cover (LULC) determination [2,3], environmental monitoring [4], urban planning [5,6], and deforestation mapping [7].
In the past few years, deep learning-based approaches [8][9][10][11][12] have achieved human-level performance on certain RS scene classification benchmarks [1,13,14,15]. Despite these remarkable achievements, such methods are data-hungry because they must learn a massive number of parameters, and they often fail under the conditions that humans face in the real world, where data is not always plentiful. For instance, consider training a traditional classifier to identify a novel category that has never existed in the current RS scene datasets, e.g., a bicycle-sharing parking lot, a new scene that has recently emerged in China. One would first have to collect hundreds or thousands of relevant RS images taken from the air and space. The high cost of collecting and annotating hinders many downstream applications where data is inherently rare or expensive. Moreover, a trained deep learning model usually struggles when asked to solve a new task unless it re-executes the training process at high computational cost. In contrast, humans can learn new concepts quickly from just one or a handful of examples by drawing upon previous knowledge and experience [16].
These issues motivated research on Few-Shot Learning (FSL) [17][18][19], a learning paradigm that emulates the human ability to learn and adapt to new environments rapidly. Specifically, the contemporary FSL setting [17][18][19][20] is designed to mimic a low-data scenario. Focusing on few-shot classification tasks, we deal with two sets of categories, a base set (SEEN) and a novel set (UNSEEN), that are disjoint in the label space. A successful FSL learner needs to exploit transferable knowledge in the base set, which has sufficient labeled data, and leverage it to build a classifier that generalizes well to UNSEEN categories when provided with extremely few labeled instances per category, e.g., one or five images. Recent research generally addresses the FSL problem by following the idea of meta-learning, i.e., broadening the learner's scope to batches of related tasks/episodes rather than batches of data points, and gaining experience across the tasks. This episodic training scheme is also referred to as "learning-to-learn" because it leverages that experience to improve future learning performance.
The recent success of few-shot learning has captured attention in the remote sensing community. Rußwurm et al. [21] evaluate a well-known meta-learning algorithm, Model-Agnostic Meta-Learning (MAML) [19], for land cover few-shot classification problems. They observe that MAML outperforms the traditional transfer learning methods. The work [22] adopts deep few-shot learning to handle the small sample size problem in hyperspectral image classification. Most previous RS scene few-shot classification methods [23][24][25][26] fall under the umbrella of metric learning and are built upon Prototypical Networks (ProtoNet) [18]. RS-MetaNet [25] improves ProtoNet with a new balance loss that combines the maximum generalization loss and the cross-entropy loss. Zhang et al. [23] present a meta-learning framework based on ProtoNet and use cosine distance with a learnable scale parameter to achieve better performance. Later on, Discriminative Learning of Adaptive Match Network (DLA-MatchNet) [27] couples the attention technique with Relation Network [28], where the former aims to exploit discriminative regions while the latter learns the similarity scores between the images by an adaptive matcher. Zhang et al. [26] propose an approach named Self-Supervision Equipped with Knowledge Distillation (SSKD), which adopts a self-supervision strategy to drive the network to dig out the most discriminative category-specific region and boosts the performance with a round of self-knowledge distillation. While these methods have achieved significant progress in RS few-shot classification, we observe that they suffer from two distinct limitations.
One missing piece of the puzzle is that these metric-based algorithms mainly focus on identifying a suitable similarity measure or constructing a combined loss function to drive the parameter updates, while overlooking the importance of the embedding network. DLA-MatchNet [27] introduces an attention mechanism in the feature learning stage to capture attention features from channels and spatial dimensions. RS-SSKD [26] weaves self-supervision into a two-branch network to dig the base-set data fully by refining the pretraining embedding. Both methods aim at learning the most relevant regions to achieve better embeddings. On the other hand, we must pay attention to the inherent characteristics of remote sensing data. For example, as the RS scene images are taken from a top view, the ground objects vary from small sizes such as airplanes to large regions like a forest or meadow. Moreover, under a spatial resolution range from about 30 to 0.2 m per pixel (e.g., the NWPU-RESISC45 dataset [15]), irrelevant objects inevitably exist in the RS scene images (see Figure 1). These issues may drive the embeddings from the same category far apart in a given metric space. If we have sufficient training samples, this problem can be greatly alleviated by a deeper neural network. However, we are dealing with the low-data regime of the FSL setting, where the embedding network is either too shallow to provide sufficient expressive capacity or so deep that it overfits [29]. That is the reason why Conv-4 [17][18][19,28], ResNet-12, and ResNet-18 [30] are the most popular embedding networks in the FSL world.
The other concern is that the existing models generally project all instances from various tasks into a single common embedding space indiscriminately [23][24][25][26][27]. Such a strategy implies that the discovered knowledge, i.e., the embedded visual features learned on the SEEN categories, is equally useful for any downstream target classification task derived from UNSEEN categories. We argue that the issue of which features are the most discriminative for a specific target task has not received sufficient attention. For instance, consider two separate classification tasks: "freeway" vs. "forest" and "freeway" vs. "bridge". Intuitively, these two tasks rely on different sets of most discriminative features. Therefore, the ideal embedding model would first need to extract discriminative features for both tasks simultaneously, which is challenging. Since the current model does not know exactly what the "downstream" target tasks are, it may unexpectedly emphasize features that turn out to be unimportant later. Further, even if two sets of discriminative features are extracted, they do not necessarily lead to the best performance for a specific target task. For example, the most useful feature to distinguish "freeway" vs. "forest" may be irrelevant to the task of distinguishing "freeway" vs. "bridge". Naturally, we expect the embedding spaces to be separated, each of which is customized to the target task so that the extracted visual features are the most discriminative. Figure 2 schematically illustrates the difference between task-agnostic and task-adaptive embeddings.

[Figure 2: panels "Task Agnostic Embeddings" and "Task Adaptive Embeddings"]
To sum up, we suggest that the embedding module is crucial due to its dual roles: representing inputs and constructing classifiers in the embedding space. Several recent studies [31][32][33] have supported this assumption with a series of experiments and verified that better embeddings lead to better few-shot learning performance. The question is, how do we get a good embedding? We answer this question by solving two challenges: (1) design a lightweight embedding network that tackles the problems posed by the inherent characteristics of RS scene images; and (2) construct an embedding adaptation module that tailors the common embeddings into adaptive embeddings according to a specific target task. See Sections 3.3 and 3.4 for details.

Our main contributions in this paper are summarized as follows:
• We develop an efficacious meta-learning scheme that utilizes two insights to improve few-shot classification performance: a lightweight embedding network that captures multiscale information and a task-adaptive strategy that further refines the embeddings.
• We present a new embedding network, the Dynamic Kernel Fusion Network (DKF-Net), that dynamically fuses feature representations from multiple kernels while preserving comparably lightweight customization for few-shot learning.
• We propose a novel embedding adaptation module that transforms the universal embeddings obtained from the base set into task-adaptive embeddings via a self-attention architecture. This is achieved by a set-to-set function that contextualizes the instances over a set to ensure that each has strong co-adaptation.
• The experimental results on two remote sensing scene datasets demonstrate that our framework surpasses other state-of-the-art approaches. Furthermore, we offer extensive ablation studies to show how the choices in our training scheme impact the few-shot classification performance.
The rest of this paper is organized as follows. We review the related work in Section 2. The problem setting and the proposed framework are formally described in Section 3. We report the experimental results in Section 4 and discuss them with ablation studies in Section 5. Finally, Section 6 concludes the paper.

Related Work
Current few-shot learning has been primarily addressed in the meta-learning manner, where a model is optimized through batches of training episodes/tasks rather than batches of data points, which is referred to as episodic training [17]. We can roughly divide the existing works on FSL into two groups. (1) Optimization-based methods search for more transferable representations with sensitive parameters that can rapidly adapt to new tasks in the meta-test stage within a few gradient descent steps. MAML [19], Reptile [34], LEO [35], and MetaOptNet [36] are the most representative approaches in this family.
(2) Metric-based methods mainly learn to represent input data in an appropriate embedding space, where a query sample is easy to classify with a distance-based prediction rule. One can measure the distance in the embedding space by simple distance functions such as cosine similarity (e.g., Matching Network [17]) or Euclidean distance (e.g., Prototypical Networks [18]), or learn parameterized metrics via an auxiliary network (e.g., Relation Network [28]). Later, in DSN-MR [37], all samples and metrics are operated in affine subspaces. SAML [38] suggests that global embeddings are not discriminative enough because dominant objects may be located anywhere in an image. The authors tackle this problem with a "collect-and-select" strategy that aligns the relevant local regions between query and support images in a relation matrix. Given the local feature sets generated by two images, DeepEMD [39] addresses the same problem by employing the Earth Mover's Distance [40] to capture their structural similarity. Our work falls within the second group but differs from these methods in two ways.
First, like SAML and DeepEMD, Tian et al. [33] also suggest that the core of improving FSL lies in learning more discriminative embeddings. In response, contemporary approaches address this challenge by refining the pretraining strategy to exploit the base-set data fully [33], leveraging self-supervision to feed auxiliary versions of original images into the embedding network [41], or applying self-distillation to achieve an additional boost [26]. While these approaches effectively make the embedding more representative, they tend to concentrate too much on designing a complex loss function [26,41] or building networks to capture relevant local features at the cost of computing resources and time [38,39]. In contrast, our solution offers a lightweight embedding network that generates more discriminative representations while using fewer parameters than the most popular backbone in few-shot learning, i.e., ResNet-12 [30,36]. A fundamental property of neurons in the visual cortex is that they change their receptive fields (RF) in response to the stimulus [42]. This mechanism of adaptively adjusting receptive fields can be incorporated in neural networks by multiscale feature aggregation and selection, which benefits RS scene few-shot classification given that the ground objects vary largely in size. Inspired by Selective Kernel (SK) Networks [43], we introduce a nonlinear procedure for fusing features from multiple kernels in the same layer by a self-attention mechanism. We incorporate two-branch SK convolution into our embedding network and name it Dynamic Kernel Fusion Network (DKF-Net).
Second, the abovementioned methods assume all samples are embedded into a task-agnostic space, hoping the embeddings could sufficiently represent the support data such that the similarities predicted from simple nonparametric classifiers will generalize well to new tasks. We suggest that ideal embedding spaces for few-shot learning should be separated, where each of them is customized to the target task adaptively so that the extracted visual features are discriminative. Some recent works also pay attention to this assumption. TADAM [44] proposes to learn a task-dependent metric space by constructing a conditional learner on the task-level set and optimizing with an auxiliary task co-training procedure. TapNet [45] constructs a projection space for each episode/task and introduces additional reference vectors, in which the class prototypes and the reference vectors are closely aligned. Unfortunately, the task-dependent conditioning mechanism in TADAM requires learning extra fully connected networks, while the projection space in TapNet is solved through a singular value decomposition (SVD) step; both strategies significantly increase training time. Taking inspiration from the Transformer [46], we propose an embedding adaption module based on a self-attention mechanism that transforms "task-agnostic" embeddings into "task-adaptive" embeddings; see Section 3.4.

Methodology
We now present our approach for the few-shot classification of RS scenes, starting with preliminaries. Then, we present our few-shot learning workflow in Section 3.2, with the overall framework depicted in Figure 3. The proposed Dynamic Kernel Fusion Network (DKF-Net), which serves as the embedding backbone throughout our workflow, is described in Section 3.3. Finally, we elaborate on the embedding adaption module and discuss how it helps few-shot learning in Section 3.4.

Preliminaries
Problem setting. In a traditional classification setting, we are given a dataset $D = \{D_{train}, D_{test}\}$ with $C_{total}$ categories. $D_{train} = \{(x_i, y_i)\}_{i=1}^{N}$ is termed the training set, where $y_i \in \{1, \ldots, C_{total}\}$ and $(x_i, y_i)$ are the input image and corresponding label pairs. A predictive model is learned on $D_{train}$ at training time, and generalization is then evaluated on $D_{test}$, i.e., the test set. In few-shot learning (FSL), however, we are dealing with a dataset $D$ divided into three parts with respect to categories: $D_{base}$, $D_{val}$, and $D_{novel}$, i.e., training set, validation set, and test set. The category spaces of the three sets are disjoint from each other. The goal of FSL is to learn a general-purpose model on $D_{base}$ (SEEN) that can generalize well to UNSEEN categories in $D_{novel}$ with one or a few training instances per category. In addition, $D_{val}$ is held out to select the best model.
Episodic training. To mimic the low-data scenario during testing, most FSL methods [17][18][19,28,36,47] proceed in a meta-learning fashion. The intuition behind meta-learning is to improve the performance of a model by extracting transferable knowledge from a collection of sampled mini-batches called "episodes", also known as "tasks", and minimizing the generalization error over a task distribution. Formally, a set of $M$ tasks is denoted as $\{T_i\}_{i=1}^{M}$, sampled from a task distribution $p(T)$. Each task $T_i$ can be considered a compact dataset containing both training and test data, referred to as the support set $S_i$ and the query set $Q_i$.

Overall Framework
The outline of our method for RS scene few-shot classification is as follows: (1) we employ a pretraining stage to learn an embedding model $f_\phi(x)$ on the base set $D_{base}$; (2) in the meta-learning stage, we optimize the embedding model with the nearest-centroid classifier in an episodic meta-learning paradigm; and (3) at inference time, i.e., the meta-test stage, the model is fixed, and we sample tasks from the novel set $D_{novel}$ for evaluation and report the mean accuracy. The overview of our method is depicted in Figure 3. All the stages of our model are built upon the proposed DKF-Net backbone (see Section 3.3 and Figure 4). The details of these stages are as follows.
Pretraining stage. We train a base classifier on $D_{base}$ to learn a general feature embedding for the downstream meta-learner, which is helpful to yield robust few-shot classification. The predictive model $\hat{y} = f_\phi(x)$, parameterized by $\phi$, is trained to classify $N_b$ base categories (e.g., 25 categories in the NWPU-RESISC45 dataset) in $D_{base}$ with the standard cross-entropy (CE) loss, by solving:
$$\phi^{*} = \arg\min_{\phi} \sum_{(x_i, y_i) \in D_{base}} \mathcal{L}^{CE}\big(f_\phi(x_i), y_i\big) \qquad (1)$$
The performance of the pretrained model is evaluated after each epoch, based on its 1-shot classification accuracy on the validation set $D_{val}$. Specifically, assuming that there are $N_v$ categories in $D_{val}$, we randomly sample 200 1-shot $N_v$-way tasks from $D_{val}$ to assess the classification performance of the pretrained model and select the best one. The weights of the penultimate layer from the best pretrained model are utilized to initialize the embedding backbone and are optimized in the next meta-learning stage.
Meta-learning stage. In most few-shot learning setups, a model is evaluated on N-way K-shot tasks, where K is usually very small; K = 1 and K = 5 are the most common settings. Following prior work [17,18,20], an N-way K-shot task $T_i$ is constructed by randomly sampling N categories and K labeled instances per category as the support set $S_i = \{(x_n, y_n)\}_{n=1}^{N \times K}$, where $(x_n, y_n)$ is an image-label pair and $y_n \in \{1, \ldots, N\}$. We take a fraction of the remaining instances from the same N categories, Q per category, to form the query set $Q_i = \{(x_n, y_n)\}_{n=1}^{N \times Q}$, and the end goal becomes the classification of the $N \times Q$ unlabeled instances into N categories. Note that $S_i$ and $Q_i$ are disjoint, i.e., $S_i \cap Q_i = \emptyset$, while sharing the same label space. Since the pretrained model is trained only on the base set, it often falls into the over-fitting dilemma or is updated very little when facing the novel categories with a meager amount of support instances. Some recent approaches handle this problem by fixing the pretrained model and fine-tuning it on the novel set. We adopt the opposite strategy and use a meta-learning paradigm built upon ProtoNet [18] to optimize the pretrained model $f_\phi$, parameterized by $\phi$, directly without introducing any extra parameters.
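The task construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `sample_episode`, the toy label list, and the use of 0-based episode labels (the text uses 1 to N) are assumptions made for the example.

```python
import random
from collections import defaultdict

def sample_episode(labels, n_way=5, k_shot=1, n_query=15, rng=None):
    """Sample one N-way K-shot task as (support, query) lists of (index, episode_label)."""
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Pick N categories, then K support and Q query instances from each one.
    classes = rng.sample(sorted(by_class), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):  # labels are 0..N-1 here (assumption)
        picked = rng.sample(by_class[cls], k_shot + n_query)
        support += [(i, episode_label) for i in picked[:k_shot]]
        query += [(i, episode_label) for i in picked[k_shot:]]
    return support, query

# Toy stand-in for a base set: 25 categories with 700 images each.
labels = [c for c in range(25) for _ in range(700)]
support, query = sample_episode(labels, n_way=5, k_shot=1, n_query=15,
                                rng=random.Random(0))
```

Because support and query indices are drawn in one call without replacement, the two sets are disjoint by construction while sharing the same N episode labels.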
During the meta-learning stage, we sample a collection of N-way K-shot tasks $\{T_i = (S_i, Q_i)\}_{i=1}^{I}$ from $D_{base}$ to form the meta-training set $T_{train}$. Likewise, we obtain the meta-validation set $T_{val}$ and the meta-test set $T_{test}$ from $D_{val}$ and $D_{novel}$ in the same way. Given the meta-training set $T_{train}$, the meta-learning procedure minimizes the generalization error across tasks. Thus, the learning objective can be loosely defined as:
$$\phi^{*} = \arg\min_{\phi} \mathbb{E}_{T_i \in T_{train}} \big[\mathcal{L}(T_i; \phi)\big] \qquad (2)$$
For each N-way K-shot task $T_i = (S_i, Q_i)$, there are K images belonging to category $c$ in the support set, where $c \in \{1, \ldots, N\}$. We define the mean feature of these K images as the "prototype" $p_c$, i.e., the category center, corresponding to category $c$:
$$p_c = \frac{1}{K} \sum_{(x_k, y_k) \in S_i,\ y_k = c} f_\phi(x_k) \qquad (3)$$
where $f_\phi$ is an embedding function with learnable parameters $\phi$, mapping the input $x_k$ into the feature space. Then, we perform the nearest neighbor classification with the negative Euclidean distance to predict the probability of query instance $x_q$ belonging to category $c$ by the following expression:
$$p(y = c \mid x_q) = \frac{\exp\big(-d(f_\phi(x_q), p_c)\big)}{\sum_{c'=1}^{N} \exp\big(-d(f_\phi(x_q), p_{c'})\big)} \qquad (4)$$
where $d(\cdot, \cdot)$ denotes the Euclidean distance. Inspired by prior work [44], we apply a scale factor, $\gamma$, to adjust the similarity score; the above equation then becomes:
$$p(y = c \mid x_q) = \frac{\exp\big(-\gamma \, d(f_\phi(x_q), p_c)\big)}{\sum_{c'=1}^{N} \exp\big(-\gamma \, d(f_\phi(x_q), p_{c'})\big)} \qquad (5)$$
During the experiments, we tune the initial values of the scale factor empirically and find it affects the meta-learning when the model is optimized based on pretrained weights.
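To make the prototype and scaled-distance classification steps concrete, the following NumPy sketch computes category prototypes from support embeddings and turns negative Euclidean distances into class probabilities with a softmax. The function names and the toy 2-D embeddings are assumptions for illustration; in the actual method the embeddings would come from the backbone $f_\phi$.

```python
import numpy as np

def prototypes(support_emb, support_y, n_way):
    """Mean support embedding per category, shape (n_way, dim)."""
    return np.stack([support_emb[support_y == c].mean(axis=0) for c in range(n_way)])

def class_probabilities(query_emb, protos, gamma=1.0):
    """Softmax over scaled negative Euclidean distances to each prototype."""
    diff = query_emb[:, None, :] - protos[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))        # (n_query, n_way)
    logits = -gamma * dist
    logits -= logits.max(axis=1, keepdims=True)     # subtract row max for stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

# Toy 2-way 2-shot task with hand-picked 2-D embeddings (illustrative only).
support_emb = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
support_y = np.array([0, 0, 1, 1])
query_emb = np.array([[1.0, 1.0]])

protos = prototypes(support_emb, support_y, n_way=2)   # [[0, 1], [4, 1]]
probs = class_probabilities(query_emb, protos, gamma=1.0)
sharp = class_probabilities(query_emb, protos, gamma=64.0)
```

In this toy case the query lies at distance 1 from the first prototype and 3 from the second, so a larger scale factor pushes the probability of the first category closer to 1, which is the sharpening effect the scale factor is meant to control.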

Dynamic Kernel Fusion Network
We propose the Dynamic Kernel Fusion Network (DKF-Net), a simple yet effective embedding scheme for few-shot learning, to enrich the diversity and expressive capacity of typical backbones, e.g., Conv-4 [17][18][19], ResNet-12, and ResNet-18 [30]. DKF-Net aims to collect multiscale spatial information by dynamically adjusting the receptive field size of neurons with Selective Kernel (SK) convolutions [43]. The top of Figure 4 depicts an SK-Unit that consists of a sequence of {1 × 1 convolution, SK convolution, 1 × 1 convolution}, and the bottom shows the complete DKF-Net architecture.
The SK convolution performs dynamic fusion from multiple kernels via three operations: "Split", "Fuse", and "Select". Given a feature map $X \in \mathbb{R}^{H \times W \times C}$ with $C$ channels and spatial dimensions $H \times W$, as shown at the top of Figure 4, we start by constructing two branches built upon transformations $F_1$ and $F_2$, mapping $X$ to feature maps $U_1 \in \mathbb{R}^{H \times W \times C}$ and $U_2 \in \mathbb{R}^{H \times W \times C}$, separately. $F_1$ and $F_2$ refer to two convolutional operators with kernels 5 × 5 and 7 × 7, respectively, and are followed by Batch Normalization (BN) [48] and ReLU [49] in sequence. In practice, the $F_2$ with a 7 × 7 kernel is replaced with a dilated convolutional layer with a dilation rate of 2, which reduces the computational burden. This procedure is defined as "Split".
We expect the neural network to adjust the RF sizes adaptively according to the stimulus content. An intuitive idea is to regulate the information flows from the two branches by the second operation, "Fuse". First, the two branches are integrated via element-wise summation, which can be expressed as:
$$U = U_1 + U_2$$
where $U \in \mathbb{R}^{H \times W \times C}$ is the fused feature. Then, $U$ is passed through a global average pooling (GAP) layer, which produces the channel-wise statistic $s \in \mathbb{R}^{C}$ by shrinking the feature maps through their spatial dimensions, $H \times W$. Formally, if we let $s_c$ denote the $c$-th element of $s$, it is calculated by:
$$s_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$$
where $u_c(i, j)$ denotes the value at point $(i, j)$ of the $c$-th channel $U_c$. The vector $s$ represents the importance of each channel, and it is further compressed to a compact feature descriptor $z \in \mathbb{R}^{d}$ to save parameters and reduce dimensionality for better efficiency. Specifically, $z$ is obtained by simply applying a fully connected (FC) layer to $s$:
$$z = F_{FC}(s) = \delta\big(B(W s)\big)$$
where $F_{FC}(\cdot)$ represents the fully connected operation defined by weights $W \in \mathbb{R}^{d \times C}$, $B$ refers to the BN [48], and $\delta$ denotes the ReLU [49] function. Thus, the number of channels is reduced to $d = \max(C/r, L)$, where $r$ indicates the compression ratio and $L$ is the minimum value of $d$. Following previous work [43,50], we empirically set $r$ to 16 and $L$ to 32.
Finally, the last operation, "Select", guided by the compact feature descriptor $z$, is applied to fulfill a dynamic adjustment of multiscale spatial information. This is achieved by a control gate mechanism based on soft attention that assigns the importance of each branch across channels. Specifically, let $a, b \in \mathbb{R}^{C}$ be the soft attention vectors for $U_1$ and $U_2$; the channel-wise weights can be obtained by applying a softmax operator:
$$a_c = \frac{e^{A_c z}}{e^{A_c z} + e^{B_c z}}, \qquad b_c = \frac{e^{B_c z}}{e^{A_c z} + e^{B_c z}}$$
where $A, B \in \mathbb{R}^{C \times d}$, $A_c \in \mathbb{R}^{1 \times d}$ denotes the $c$-th row of $A$, and $a_c$ denotes the $c$-th element of $a$; $B_c$ and $b_c$ are defined likewise. It is noteworthy that $a_c + b_c = 1$, as there are only two branches in our case. We now obtain the refined feature map $\tilde{X}$ by applying the attention vectors $a$ and $b$ to each branch along the channel dimension:
$$\tilde{X}_c = a_c \cdot U_{1,c} + b_c \cdot U_{2,c}, \qquad a_c + b_c = 1$$
where $\tilde{X}_c$ refers to the $c$-th channel of $\tilde{X}$ and $\tilde{X}_c \in \mathbb{R}^{H \times W}$. The proposed DKF-Net contains four stages with a block of {SK-Unit, ReLU, SK-Unit, ReLU} in each, as illustrated at the bottom of Figure 4. We set the filters in each stage to 64, 160, 320, and 640, respectively, and add an 11 × 11 GAP layer after the last stage, which outputs 640-dimensional embeddings.
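The "Fuse" and "Select" operations can be illustrated with a small NumPy sketch that starts from two already-computed branch outputs. It is a simplified sketch rather than the DKF-Net implementation: the convolutional "Split" step and Batch Normalization are omitted, the weights are random placeholders, and the names `sk_fuse_select` and `descriptor_dim` are assumptions introduced here.

```python
import numpy as np

def descriptor_dim(c, r=16, l_min=32):
    """Size of the compact descriptor: d = max(C / r, L)."""
    return max(c // r, l_min)

def sk_fuse_select(u1, u2, w, a_mat, b_mat):
    """u1, u2: (H, W, C) branch outputs; w: (d, C); a_mat, b_mat: (C, d)."""
    u = u1 + u2                                   # Fuse: element-wise summation
    s = u.mean(axis=(0, 1))                       # global average pooling -> (C,)
    z = np.maximum(w @ s, 0.0)                    # FC + ReLU (BN omitted in this sketch)
    logit_a, logit_b = a_mat @ z, b_mat @ z       # per-channel branch scores, (C,)
    m = np.maximum(logit_a, logit_b)              # shift for numerical stability
    ea, eb = np.exp(logit_a - m), np.exp(logit_b - m)
    a, b = ea / (ea + eb), eb / (ea + eb)         # two-way softmax per channel
    return a * u1 + b * u2, a, b                  # Select: channel-wise weighted sum

rng = np.random.default_rng(0)
h, w_sp, c = 6, 6, 64
d = descriptor_dim(c)                             # max(64 / 16, 32) = 32
u1 = rng.normal(size=(h, w_sp, c))
u2 = rng.normal(size=(h, w_sp, c))
w = rng.normal(size=(d, c)) * 0.1
a_mat = rng.normal(size=(c, d)) * 0.1
b_mat = rng.normal(size=(c, d)) * 0.1
out, a, b = sk_fuse_select(u1, u2, w, a_mat, b_mat)
```

With r = 16 and L = 32, the floor L only matters for the narrower stages: 64, 160, and 320 channels all give d = 32, while 640 channels give d = 40.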

Embedding Adaption via Transformer
Up until now, the embedding function $f_\phi(\cdot)$, parameterized by $\phi$, has been assumed to be task-agnostic. We argue that such a setting is not ideal, since it implies that the knowledge, i.e., the discriminative visual features learned on the base set, is equally effective for any novel category. Here, we propose an embedding adaptation module that tailors the visual knowledge extracted from the base set, i.e., SEEN categories, into adaptive knowledge according to a specific task. We visualize this concept schematically in Figure 5.
[Figure 5: embedding adaption module; labels: query instance, soft nearest neighbor, classification scores]
Our embedding adaption module is achieved by contextualizing the instances over a set; thus, each of them has strong co-adaptation. Concretely, given a task-agnostic embedding function $f_\phi(x)$, let $T$ denote a set-to-set function that transforms $f_\phi(x)$ to a task-adaptive embedding function $f_\psi(x)$. We treat the instances as a bag, i.e., a set without order, requiring the set-to-set function $T$ to output an adaptive set of instance embeddings while remaining permutation-invariant. The transformation step can be formalized in the following way:
$$\{f_\psi(x);\ \forall x \in S_i\} = T\big(\{f_\phi(x);\ \forall x \in S_i\}\big) = T\big(\pi(\{f_\phi(x);\ \forall x \in S_i\})\big)$$
where $S_i$ is the support set of a target task, and $\pi(\cdot)$ is a permutation operator over a set, which ensures that the adapted embeddings will not change regardless of the order in which $T$ receives the input instances. Inspired by the Transformer networks [46], we utilize dot-product self-attention to implement the set-to-set function $T$. In the following, we use $\phi_x$ and $\psi_x$ instead of $f_\phi(x)$ and $f_\psi(x)$ for the sake of notational simplicity. Following the literature [46], we can describe the Transformer layer by defining the triplets $(Q, K, V)$ to indicate the set of the queries, keys, and values. Note that, in order to avoid the unfortunate double use of the term "query", we use italics to denote the "query" in the transformer layer to emphasize the difference from the "query set" in the few-shot tasks. Mathematically, for any instance $x_j$ belonging to $S_i$, we first obtain its query by $q_j = W_Q \phi_{x_j}$, $\forall x_j \in S_i$, where $W_Q$ is a linear projection matrix. Similarly, the "key-value" pairs $k_j$ and $v_j$ are generated with $W_K$ and $W_V$, respectively. For notational brevity, the bias in the linear projection is omitted here. Next, the similarity between an instance $x_j$ and the others in the support set can be measured by the scaled dot-product attention:
$$\alpha_{jk} = \frac{\exp\big(q_j^{\top} k_k / \sqrt{d}\big)}{\sum_{k'} \exp\big(q_j^{\top} k_{k'} / \sqrt{d}\big)}$$
where $d$ is the dimensionality of the queries and keys. This similarity score then serves as weights for the transformed embedding of $x_j$:
$$\hat{\phi}_{x_j} = \sum_{k} \alpha_{jk} v_k$$
where $v_k$ is the value of the $k$-th instance in $S_i$. Finally, the task-adaptive embedding is given by:
$$\psi_{x_j} = \tau\big(\phi_{x_j} + W_{FC}\, \hat{\phi}_{x_j}\big)$$
where $W_{FC}$ indicates the projection weights of a fully connected layer and $\tau$ represents a procedure that further transforms the embedding by performing dropout [51] and layer normalization [52]. The whole flow of our Transformer module is illustrated on the right side of Figure 5.
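A single-head version of this set-to-set adaptation can be sketched with NumPy as follows. It is a hedged illustration, not the authors' module: the residual connection before normalization is an assumption borrowed from standard Transformer layers, dropout is omitted (as at inference time), layer normalization has no learned scale or shift, and all weights are random placeholders.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each embedding to zero mean and unit variance (no learned affine)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adapt_embeddings(phi, w_q, w_k, w_v, w_fc):
    """phi: (n, dim) support embeddings -> (n, dim) task-adapted embeddings."""
    q, k, v = phi @ w_q.T, phi @ w_k.T, phi @ w_v.T    # queries, keys, values
    scores = q @ k.T / np.sqrt(q.shape[-1])            # scaled dot products, (n, n)
    scores -= scores.max(axis=1, keepdims=True)        # stabilize the softmax
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)          # attention weights per instance
    attended = alpha @ v                                # weighted sum of values
    # Residual connection (assumption) followed by layer normalization.
    return layer_norm(phi + attended @ w_fc.T)

rng = np.random.default_rng(0)
n, dim = 5, 8                                           # e.g., a 5-way 1-shot support set
phi = rng.normal(size=(n, dim))
w_q, w_k, w_v, w_fc = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(4))
adapted = adapt_embeddings(phi, w_q, w_k, w_v, w_fc)

# Reordering the support instances only reorders the adapted embeddings.
perm = rng.permutation(n)
adapted_perm = adapt_embeddings(phi[perm], w_q, w_k, w_v, w_fc)
```

The last two lines check the order-independence requirement in practice: each instance receives the same adapted embedding no matter where it appears in the input set.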

Experimental Results
We verify the effectiveness of our proposed method, Task-Adaptive Embedding Learning (TAEL), on two challenging datasets: NWPU-RESISC45 [15] and RSD46-WHU [53]. We first introduce the datasets in Section 4.1 and then provide the implementation details in Section 4.2. Finally, we summarize the main results in Section 4.3.

NWPU-RESISC45.
The NWPU-RESISC45 dataset is a collection of remote-sensing scene images extracted from Google Earth by experienced experts, proposed by Cheng et al. in 2017 [15]. It is composed of 45 categories, each containing 700 images with a size of 256 × 256. In order to compare fairly with state-of-the-art (SOTA) algorithms for few-shot classification, we rely on the split setting proposed by Ravi et al. [20] and used in the prior FSL works [23,24,26] on RS scenes, which includes 25 categories for meta-training, 8 for meta-validation, and the remaining 12 for meta-testing, as shown in Table 1. Specifically, the pretraining and meta-learning stages are performed on the 25 SEEN categories, and the best model is chosen based on the few-shot classification performance on the HELD-OUT meta-val split (UNSEEN). This serves as our final model, and it is evaluated on few-shot tasks sampled from the meta-test split (UNSEEN) without further fine-tuning. Following the most common setting in FSL [18][19][20], all images are first resized to 84 × 84 pixels.

Implementation Details
We use the proposed DKF-Net as the embedding backbone for both the pretraining stage and the meta-learning stage; the architecture of DKF-Net is described in Section 3.3.
Pretraining strategy. During the pretraining stage, the embedding backbone is trained as a typical classifier to classify all the categories in $D_{base}$, e.g., 25 categories in NWPU-RESISC45, with the CE loss. As suggested by MetaOptNet [36], the training is performed with data augmentation, i.e., random flip, crop, and color jittering, to increase the diversity of training data. After each epoch, we sample 200 $N_{val}$-way 1-shot ($N_{val} = 8$) episodes from the meta-validation set $D_{val}$. Then, the best pretrained model is selected based on the average accuracy of 8-way 1-shot classification over the 200 episodes. Later on, the pretrained weights of the best model are leveraged to initialize the embedding network and are further optimized during the meta-learning stage.
Optimization. We use stochastic gradient descent (SGD) for optimization in both the pretraining stage and the meta-learning stage. The parameters related to optimization are collected in Table 3. In the pretraining stage, the initial learning rate is 0.001, and we divide the learning rate by 10 at epochs 75, 150, and 300. In the meta-learning stage, the initial learning rate is set to 0.0001, and we decrease the learning rate every 20 epochs with a ratio of 0.2. We empirically tune the scale factor $\gamma$ in Equation (5) from the reciprocal of {0.1, 1, 10, 16, 32, 64} and find 64 is the best. Furthermore, for the Embedding Adaption Module, the dropout rate in the transformer is set to 0.5.
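The two step-decay schedules can be written as small Python functions. This is a sketch under stated assumptions: the text does not say whether a decay takes effect at or after the listed epoch, nor whether "a ratio of 0.2" means multiplying the rate by 0.2, so both readings below (decay applied from the milestone epoch onward, multiplicative ratio) are assumptions, as are the function names.

```python
def pretrain_lr(epoch, base_lr=1e-3, milestones=(75, 150, 300), factor=0.1):
    """Divide the learning rate by 10 once each milestone epoch is reached."""
    return base_lr * factor ** sum(epoch >= m for m in milestones)

def meta_lr(epoch, base_lr=1e-4, step=20, ratio=0.2):
    """Multiply the learning rate by `ratio` every `step` epochs (assumed reading)."""
    return base_lr * ratio ** (epoch // step)

# Learning rates at a few representative epochs.
pretrain_rates = [pretrain_lr(e) for e in (0, 75, 150, 300)]
meta_rates = [meta_lr(e) for e in (0, 19, 20, 40)]
```

Under these assumptions the pretraining rate steps through 1e-3, 1e-4, 1e-5, and 1e-6, while the meta-learning rate falls from 1e-4 to 2e-5 at epoch 20 and to 4e-6 at epoch 40.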

Main Results
We have evaluated our method on two challenging RS scene datasets, namely NWPU-RESISC45 [15] and RSD46-WHU [53]. Following [18,19,23], the standard evaluation protocols are used in all our experiments, exactly as in the corresponding compared works. All the experiments are constructed and evaluated on the most commonly used 5-way 1-shot and 5-way 5-shot classification settings. Of note, in keeping with the principle that training and testing conditions should be consistent, the task configuration for meta-training, meta-validation, and meta-testing is the same. For instance, consider the 5-way 1-shot scenario. A 5-way 1-shot task is composed of five randomly sampled categories, and each category includes one support instance and 15 unlabeled query instances, which are used for training and inference, respectively. We keep sampling 5-way 1-shot tasks from the base set during the meta-training phase and set 100 tasks as an epoch. Then, at the end of each epoch, we feed 600 tasks drawn from the HELD-OUT validation set to the model and record the 5-way 1-shot classification accuracy. We train the meta-learner for 200 epochs and select the best one based on the 5-way 1-shot classification performance on the validation set.
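The task construction described above (N randomly sampled categories, K support instances and 15 query instances per category) can be sketched as follows. The function name `sample_episode`, the toy category dictionary, and the image identifiers are hypothetical and only illustrate the sampling logic; they are not taken from the released implementation.

```python
import random


def sample_episode(images_by_class, n_way=5, k_shot=1, n_query=15, rng=None):
    """Draw one N-way K-shot task: K support and n_query query items per class."""
    rng = rng or random.Random()
    classes = rng.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Sample without replacement so support and query never overlap.
        picked = rng.sample(images_by_class[cls], k_shot + n_query)
        support += [(item, label) for item in picked[:k_shot]]
        query += [(item, label) for item in picked[k_shot:]]
    return support, query


# Toy stand-in for a base split: 25 categories with 700 image identifiers each.
toy_split = {f"class_{c}": [f"class_{c}/img_{i}" for i in range(700)] for c in range(25)}
support, query = sample_episode(toy_split, rng=random.Random(0))
print(len(support), len(query))  # 5 support items and 75 query items
```

Labels are re-indexed from 0 to N-1 inside each task, so the classifier is always built over the categories of the current episode only.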
As depicted in the left pane of Figure 6a, the best 5-way 1-shot model on NWPU-RESISC45 appeared at the 177th epoch, with a correspondingly low loss. In addition, we neither use any data from the meta-test set during training nor perform further fine-tuning during meta-testing. Once the meta-training procedure is done, the performance of the proposed method TAEL is finally evaluated by the mean accuracy over 10,000 5-way 1-shot tasks randomly sampled from the meta-testing split, with 95% confidence intervals, and the same goes for the 5-way 5-shot scenario. Note that most previous approaches [18,19,28,36,44] are evaluated with 600-2000 tasks sampled from the meta-testing split according to their original setup, which introduces high variance, as shown in Tables 4 and 5. We adhere to one key principle to avoid falsely embellishing the capabilities of our method by overfitting a specific dataset: in all experiments, whether 1-shot or 5-shot, as described in Section 4.2, we keep all the hyperparameters in the pretraining and meta-learning stages the same for both datasets. Figure 6b shows that the performance of TAEL is less steady on the meta-validation split of RSD46-WHU. This is probably because the RSD46-WHU dataset contains lower-quality images, which is extremely challenging in severe low-data scenarios.
The few-shot classification accuracies on NWPU-RESISC45 and RSD46-WHU for TAEL and other previous methods are summarized in Tables 4 and 5, respectively. Methods with * indicate that the original backbone has been replaced by ResNet-12, and the corresponding results are reported in [23]. As seen in Tables 4 and 5, the proposed TAEL is uniformly better than SOTA algorithms in both the 1-shot and 5-shot regimes for the NWPU-RESISC45 and RSD46-WHU datasets. By jointly leveraging the strengths of multiscale kernel fusion and task-adaptive embedding learning, TAEL improves over the RS scene few-shot classification baseline [23] across all
datasets by approximately 2-4% for both 1-shot and 5-shot scenarios. We can also observe from Table 4 that TAEL outperforms the current best results (RS-SSKD [26]) on NWPU-RESISC45 by 2.55% in the 1-shot task and by 0.94% in the 5-shot task. For the RSD46-WHU dataset, Table 5 shows that TAEL surpasses RS-SSKD by 1.54% and 1.84% for 1- and 5-shot, respectively.
Rows of Table 5 (RSD46-WHU, 5-way accuracy, %):
Method               Backbone     5-way 1-shot    5-way 5-shot
TADAM [44]           ResNet-12    65.84 ± 0.67    82.79 ± 0.58
MetaOptNet [36]      ResNet-12    62.05 ± 0.76    82.60 ± 0.46
DSN-MR [37]          ResNet-12    66.53 ± 0.70    82.74 ± 0.54
FEAT [47]            ResNet-12    71.04 ± 0.21    85.27 ± 0.13
Zhang et al. [23]    ResNet-12    69.08 ± 0.25    84.10 ± 0.15
RS-SSKD [26]         ResNet-12    71.73 ± 0.25    85.90 ± 0.15
TAEL (ours)          DKF-Net      73.27 ± 0.20    87.74 ± 0.12
To examine whether the embedding backbone impacts the performance of FSL algorithms, we plot bar charts in Figures 7 and 8. The bars with dots denote re-implementations in which the original backbone is replaced by ResNet-12 [30], with results provided by [23]. Surprisingly, Figures 7 and 8 show that the re-implementation of MAML [19] obtains notable improvements over the Conv-4 version in the 1-shot scenario of both datasets, whereas only a minor improvement is obtained in the 5-shot case on RSD46-WHU. For ProtoNet [18] equipped with ResNet-12, even bigger improvements are achieved on both datasets, especially in the 1-shot case of NWPU-RESISC45, where accuracy improves by 11.61%. On the contrary, RelationNet [28] becomes even worse in the 1-shot setting for both datasets, which may be due to its auxiliary comparison module leading to overfitting when using deeper networks. Generally speaking, the gap among different approaches drastically diminishes when the backbone goes deeper. MAML claims that using a deeper backbone rather than Conv-4 may cause overfitting; however, this issue is overcome by applying data augmentation such as random crop, horizontal flip, and color jitter, as suggested in MetaOptNet [36]. Such data
augmentation has become a standard operation in current methods [23,26,36,37,47]. The rest of the comparison methods in this work, including our own, are built upon ProtoNet. DSN-MR [37] achieves considerable performance but consumes many computational resources since the similarity measure is performed in subspaces. Zhang et al. [23], FEAT [47], and RS-SSKD [26] demonstrate that learning better embeddings can significantly improve performance. TADAM [44] attempts to retrieve task-specific embeddings for each target task based on an additional task-conditioning module. The authors of TADAM adopt an auxiliary cotraining strategy to alleviate the computational burden, yet extra parameters and additional complexity are introduced to the network. As with TADAM, the embedding adaption module in our method also provides task-adaptive embeddings that are discriminative and tailored to target tasks, while coming at only a modest increase in computational cost. We perform further analysis on the training time and computational cost of all comparative methods and ours in Section 5.3.
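The evaluation protocol above reports the mean accuracy over sampled test tasks with a 95% confidence interval. A minimal sketch of that computation is given below, assuming the usual normal approximation (half-width of 1.96 standard errors); the function name and the toy accuracy values are illustrative and are not results from this work.

```python
import math
import statistics


def mean_with_ci95(task_accuracies):
    """Return (mean, half-width) of a normal-approximation 95% confidence interval."""
    n = len(task_accuracies)
    mean = statistics.fmean(task_accuracies)
    half_width = 1.96 * statistics.stdev(task_accuracies) / math.sqrt(n)
    return mean, half_width


# Toy per-task accuracies that alternate between 0.6 and 0.8 (synthetic values).
for num_tasks in (600, 10_000):
    mean, half_width = mean_with_ci95([0.6, 0.8] * (num_tasks // 2))
    print(num_tasks, round(mean, 4), round(half_width, 4))
```

Because the half-width scales with the inverse square root of the number of tasks, evaluating on 10,000 tasks rather than 600 narrows the interval by roughly a factor of four for the same per-task spread, which is consistent with the tighter intervals reported for methods evaluated on more tasks.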

Effect of Different Embedding Networks
To verify the effectiveness of the proposed embedding network DKF-Net in our method, we perform an ablation study on the NWPU-RESISC45 and RSD46-WHU datasets by changing the embedding backbone to the most popular architectures in few-shot learning, i.e., the 4-layer convolution network (Conv-4) adopted in [18,19,28] and the 12-layer residual network (ResNet-12) adopted in [23,26,36,37,44,47]. We use Adam [54] and SGD to optimize the Conv-4 and ResNet-12 variants, respectively.
The Conv-4 network consists of four repeated blocks. Each block is a sequence of {3 × 3 convolution with k filters, batch normalization, ReLU, and max-pooling with size 2}. The number of filters in each block is set to 64; namely, the network architecture is 64-64-64-64, the same as in ProtoNet [18]. We apply a global max-pooling layer with size 5 after the last block to reduce the computational cost.
We employ the ResNet-12 structure suggested in [36], which contains four residual blocks, each of which repeats the following convolutional block three times: {3 × 3 convolution with k filters, batch normalization, Leaky ReLU(0.1)}. A 2 × 2 max-pooling layer with stride 2 is then applied at the end of each residual block. The number of filters k is set to 64, 160, 320, and 640 for the four blocks, respectively. Finally, we apply a 5 × 5 global average pooling (GAP) layer, which generates 640-dimensional embeddings.
The architecture of the proposed DKF-Net is described in Section 3.3 and illustrated in Figure 4. In addition, we experiment with a DKF-Net variant, denoted as DKF-Net†, by changing the number of filters in each stage to 64, 256, 512, and 1024, respectively, which yields 1024-dimensional embeddings.
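The pooling size of 5 used at the end of both baselines follows from the 84 × 84 input resolution. The short calculation below traces the feature-map side length through four blocks that each end in a stride-2 max-pooling layer. It assumes that the 3 × 3 convolutions are padded so that they preserve spatial size and that pooling rounds down; neither detail is stated explicitly in the text.

```python
def spatial_sizes(input_size=84, num_blocks=4, pool=2):
    """Side length of the feature map at the input and after each block."""
    sizes = [input_size]
    for _ in range(num_blocks):
        # Stride-2 pooling halves the side length, rounding down.
        sizes.append(sizes[-1] // pool)
    return sizes


print(spatial_sizes())  # [84, 42, 21, 10, 5]
```

Under these assumptions the last block outputs a 5 × 5 map, so a 5 × 5 global pooling layer collapses it to a single vector whose length equals the number of filters in the last block (64 for Conv-4, 640 for ResNet-12).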
Figure 9 shows the few-shot classification results of our model with different embedding networks, including Conv-4, ResNet-12, the proposed DKF-Net, and the DKF-Net variant. Results on the NWPU-RESISC45 dataset show a clear tendency that the performance gap among different backbones significantly reduces when the embedding architecture gets deeper. On the RSD46-WHU dataset, a similar trend can be observed. Moreover, our model using DKF-Net consistently outperforms the ablation using ResNet-12 on both datasets by a clear margin, which indicates that the proposed embedding backbone is effective. We attribute the success of DKF-Net for few-shot classification to two factors: its ability to dynamically weight the averaging of features from multiple kernels according to the receptive field size, and its high parameter efficiency, which is well suited to the low-data regime.
Table 6 reports the number of parameters and FLOPs [55] of each embedding network. Conv-4 has quite a low number of parameters and FLOPs, but it degrades the accuracy of our model considerably. We conjecture that the shallow architecture of Conv-4 is responsible for this drop, as it does not adequately use our model's expressive capacity and leads to underfitting. As illustrated in Table 6, the number of parameters and FLOPs of DKF-Net is slightly more than half that of ResNet-12 due to the grouped and dilated convolutions adopted in our architecture. For further comparison, we build a variant of DKF-Net, denoted as DKF-Net†, by increasing the number of filters in each stage to match the complexity of ResNet-12. Surprisingly, Figure 9 shows that DKF-Net†, i.e., the increased-complexity version, does not lead to better accuracy than the original DKF-Net, except in the 5-way 5-shot scenario of RSD46-WHU. A potential explanation is that the optimization of the meta-learner becomes more difficult with so few data points when the number of parameters and
the size of the backbone increase, so that the model tends to overfit. It is crucial to find a trade-off between the model's generalization capacity and parameter efficiency. Overall, we conclude that the advantage of DKF-Net can be attributed to the adaptive fusion mechanism of weighted multiscale information from different kernels and to its high parameter efficiency, which yield more diversity and better generalization ability for few-shot classification.

Effect of Embedding Adaption Module
To investigate whether the embedding adaption module is indeed effective, we perform analyses for our method TAEL and its ablated variants on both datasets: NWPU-RESISC45 and RSD46-WHU. The following experiments are based on the proposed embedding backbone DKF-Net.
We start by evaluating our method with and without the embedding adaption module. As stated in Section 3.4, if we train a model without embedding adaption, the embedding function is assumed to be task-agnostic, and we name this vanilla model "ours-agnostic". Then, we apply the embedding adaption procedure to the data in the support set to construct the classifier (see Figure 5). In this case, the extracted visual knowledge is transformed into task-adaptive knowledge according to a specific task and yields more discriminative embeddings; thus, we name this model "ours-adaptive", i.e., the proposed method TAEL. As seen in Table 7, the model using task-adaptive embeddings achieves better performance than the vanilla model, especially in the 1-shot scenario, where it gains approximately 2-3%. This confirms that the proposed embedding adaption module can efficiently tailor the common embeddings to task-adaptive embeddings, which are more discriminative for a specific target task. These experimental results support our hypothesis: embedding is one of the most crucial factors in few-shot learning, and we can expect that better embeddings lead to better FSL performance.
We further investigate the impact of different architectural choices of the Transformer in our embedding adaption module. In our current TAEL model, the embedding adaption is implemented by a set-to-set function with Transformer [46], in which we adopt a shallow architecture of only one attention head and one layer. We follow the common practice in [46] to build the Transformer with more complex structures, e.g., multiple heads and deeper stacked layers. First, we replace the single-head attention in our module with multihead attention by increasing the number of heads to two, four, and eight while fixing the number of layers to one. The performance of the one-head and multihead ablations on 5-way classification is summarized in Table 8. The results indicate that the multihead ablations provide minimal
benefits or even harm the performance while introducing extra computational costs. Fixing the number of attention heads to one, we next stack the layers in the Transformer to two and three. From Table 9, we see barely any improvement from this change, and in fact, the performance often drops with respect to the one-layer structure. Thus, we empirically speculate that, under an extremely low-data regime like few-shot learning, complex structures do not always improve performance, since the difficulty of optimization also increases and the model becomes harder to converge.
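To make the one-head, one-layer configuration discussed above concrete, the following pure-Python sketch applies a single self-attention step with a residual connection to a set of embeddings. It is an illustration under stated assumptions, not the module used in this work: the projection matrices `w_q`, `w_k`, and `w_v` are hypothetical square matrices that would be learned in practice, and the dropout, layer normalization, and feed-forward parts of a full Transformer layer are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def adapt_embeddings(embeddings, w_q, w_k, w_v):
    """One single-head self-attention step with a residual connection."""
    queries = [matvec(w_q, e) for e in embeddings]
    keys = [matvec(w_k, e) for e in embeddings]
    values = [matvec(w_v, e) for e in embeddings]
    scale = math.sqrt(len(keys[0]))
    adapted = []
    for e, q in zip(embeddings, queries):
        # Attention weights of this embedding over every embedding in the set.
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        mixed = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(e))]
        # Residual connection: adapted embedding = original + attended mixture.
        adapted.append([x + m for x, m in zip(e, mixed)])
    return adapted


identity = [[1.0, 0.0], [0.0, 1.0]]
toy_embeddings = [[1.0, 0.0], [0.0, 1.0]]  # two toy 2-dimensional embeddings
print(adapt_embeddings(toy_embeddings, identity, identity, identity))
```

Each output depends on all embeddings in the set, which is what makes the result specific to the task at hand. Adding heads or layers multiplies the number of projection matrices to be learned from the same few examples.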

Training Time Analysis
In this section, we compare the meta-training time of our model TAEL with the state-of-the-art methods. We report the 5-way 1-shot and 5-way 5-shot runtime on both datasets: NWPU-RESISC45 and RSD46-WHU. To ensure a fair comparison, the prior works and ours are run under the same experimental conditions, i.e., an AMD 2950X with 16 cores, 128 GB RAM, and a single GeForce RTX 3090 GPU. The only exception is the test for DSN-MR [37], which requires two RTX 3090 GPUs due to the high GPU memory consumption of its SVD step.
The Conv-4 networks adopted in ProtoNet [18], MAML [19], and RelationNet [28] differ in the number of filters per layer, which are 64-64-64-64, 32-32-32-32, and 64-96-128-256, respectively. The architectures of ResNet-12 and the proposed DKF-Net are described in Section 5.1. In practice, the meta-training time depends heavily on the number of epochs and on how many N-way K-shot episodes/tasks are in each epoch, both of which are set empirically by the authors of each method. Our tests of all methods follow their original settings. For example, early FSL methods such as ProtoNet, MAML, and RelationNet are trained for 600 epochs, each containing 100 tasks. MetaOptNet [36] suggests training with fewer epochs while setting more tasks per epoch, e.g., 1000 tasks/epoch and 60 epochs in total. Most current methods follow the latter setting, e.g., Zhang et al. [23] and RS-SSKD [26] set each epoch to 800 tasks and meta-train for 60 epochs. Our model follows the setting of ProtoNet, where each epoch contains 100 tasks, yet only 200 epochs are meta-trained, as we have an additional pretraining stage that enables the model to converge faster in the meta-training phase. Table 10 summarizes the running times of the discussed methods on both datasets with respect to the total number of meta-training iterations. From Table 10, we observe that the running times of ProtoNet and RelationNet increase only slightly when the backbone is changed to ResNet-12. The evaluation of MAML is performed on its first-order approximation version, which ignores second-order derivatives to speed up training. However, MAML using ResNet-12 as the backbone still more than doubles the running time of the original version using Conv-4. We notice that the metric-based methods built upon ProtoNet, e.g., FEAT [47], Zhang et al.
[23], and RS-SSKD [26], generally come with short training times. In comparison, while MetaOptNet [36] achieves competitive few-shot classification performance, its training time is significantly increased because it incorporates a differentiable quadratic programming solver to learn an end-to-end model with a linear SVM classifier. DSN-MR [37] constructs the classifier on a closed-form projection distance in subspaces. It is not surprising that DSN-MR trains very slowly, since its subspaces are obtained through a singular value decomposition (SVD) step, which is computationally expensive. Thanks to the embedding adaption module, our model converges quickly in the meta-training stage and needs fewer total training iterations (episodes/tasks) than other methods. As expected, the results show that our model is practical and offers absolute gains over the previous methods at a modest training time.
Additionally, we see an interesting phenomenon: almost all methods have virtually the same meta-training time on both datasets. The reason is simple: the running time per iteration is inherent to each method; thus, the meta-training time depends on the total number of training iterations, which is the same on both datasets. The only exception is TADAM [44], which utilizes an auxiliary cotraining scheme in the meta-training phase. This cotraining scheme comes with a high computational cost because it introduces an additional logits head, i.e., the traditional M-way classification on the base set, where M is the number of all categories. We can easily infer that the running time of TADAM differs on the two datasets because the burden of cotraining is heavier on RSD46-WHU, which is larger than NWPU-RESISC45.
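Since the per-iteration cost is fixed for a given method, the episode budgets quoted in this section already explain much of the runtime difference. The snippet below multiplies out epochs × tasks per epoch for the settings stated above; the dictionary keys are descriptive labels only, and the TAEL entry covers the meta-training stage without the pretraining stage.

```python
# (epochs, tasks per epoch) as quoted in this section.
settings = {
    "ProtoNet / MAML / RelationNet": (600, 100),
    "MetaOptNet-style": (60, 1000),
    "Zhang et al. / RS-SSKD": (60, 800),
    "TAEL (meta-training stage only)": (200, 100),
}

total_iterations = {name: epochs * tasks for name, (epochs, tasks) in settings.items()}
for name, total in total_iterations.items():
    print(f"{name}: {total} episodes")
```

Under these settings, TAEL meta-trains on 20,000 episodes, against 48,000 to 60,000 for the other configurations.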

Conclusions
This work suggests that embedding is critical to few-shot classification as it plays dual roles: representing images and building classifiers in the embedding space. To this end, we have proposed a framework for few-shot classification that complements the existing methods by refining the embeddings from two perspectives: a lightweight embedding network that fuses multiscale information and a task-adaptive strategy that further tailors the embeddings. The former enriches the diversity and expressive capacity of embeddings by dynamically weighting information from multiple kernels, while the latter learns discriminative representations by transforming the universal embeddings into

Figure 1. Examples of the inherent characteristics of remote sensing scene images, i.e., the ground objects vary in size and irrelevant objects exist.

Figure 2. An illustration of the difference between task-agnostic and task-adaptive embeddings.

Figure 3. Overall framework of the proposed method.

Figure 5. Illustration of the structure of our embedding adaption module, implemented with a set-to-set function based on Transformer. The right part shows the adaption step that transforms the embedding f_φ(x) to f_ψ(x). For notational simplicity, we use φ_x and ψ_x instead of f_φ(x) and f_ψ(x) in the figure, respectively.

Figure 9. Few-shot classification accuracy of our model using different embedding networks on the NWPU-RESISC45 and RSD46-WHU datasets.

Table 1. NWPU-RESISC45 dataset split.

The RSD46-WHU dataset is collected by hand from Google Earth and Tianditu and was released by Long et al. [53]. It includes 46 categories, with around 500-3000 RS scene images in each and 117,000 in total. Similar to NWPU-RESISC45, we partition it into 26, 8, and 12 categories for meta-training, meta-validation, and meta-testing, respectively; see Table 2 for details. Likewise, all images are resized to 84 × 84 pixels.

Table 4. Comparison to previous works on NWPU-RESISC45. Average 5-way few-shot classification accuracy (%) is reported with 95% confidence intervals. The symbol * denotes that the backbone of the original model is replaced with ResNet-12, and the results are reported in [23]. The best results in each column are marked in bold.

Table 5. Comparison to previous works on RSD46-WHU. Average 5-way few-shot classification accuracy (%) is reported with 95% confidence intervals. The symbol * denotes that the backbone of the original model is replaced with ResNet-12, and the results are reported in [23]. The best results in each column are marked in bold.

Table 6. Parameter efficiency of different embedding networks. #P stands for the number of parameters, and FLOPs denotes the number of multiply-adds, following the definition in [55]. DKF-Net† is a variant of the proposed embedding network.

Table 7. Ablation study of whether the embedding adaption module improves the performance of few-shot classification. Results are averaged over 10,000 test tasks with 95% confidence intervals.

Table 8. Ablation study on the number of attention heads in the proposed embedding adaption module. Results are averaged over 10,000 test tasks with 95% confidence intervals.

Table 9. Ablation study on the number of layers in the Transformer of the proposed embedding adaption module. Results are averaged over 10,000 test tasks with 95% confidence intervals.

Table 10. Meta-training runtime comparison on the NWPU-RESISC45 and RSD46-WHU datasets, under 5-way 1-shot and 5-way 5-shot classification scenarios. The symbol * denotes that the backbone of the original model is replaced with ResNet-12.