Task-Adaptive Multi-Source Representations for Few-Shot Image Recognition

Abstract: Conventional few-shot learning (FSL) mainly focuses on knowledge transfer from a single source dataset to a recognition scenario with only a few training samples available but still similar to the source domain. In this paper, we consider a more practical FSL setting where multiple semantically different datasets are available to address a wide range of FSL tasks, especially for recognition scenarios beyond natural images, such as remote sensing and medical imagery. This setting can be referred to as multi-source cross-domain FSL. To tackle the problem, we propose a two-stage learning scheme, termed learning and adapting multi-source representations (LAMR). In the first stage, we propose a multi-head network to obtain efficient multi-domain representations, where all source domains share the same backbone except for the last parallel projection layers for domain specialization. We train the representations in a multi-task setting where each in-domain classification task is handled by a cosine classifier. In the second stage, considering that instance discrimination and class discrimination are crucial for robust recognition, we propose two contrastive objectives for adapting the pre-trained representations to be task-specific on the few-shot data. Careful ablation studies verify that LAMR significantly improves representation transferability, showing consistent performance boosts. We also extend LAMR to single-source FSL by introducing a dataset-splitting strategy that equally splits one source dataset into sub-domains. The empirical results show that LAMR achieves SOTA performance on the BSCD-FSL benchmark and competitive performance on mini-ImageNet, highlighting its versatility and effectiveness for FSL on both natural and specialized imagery.


Introduction
Recent years have witnessed significant progress in computer vision applications thanks to the development of deep learning [1,2] with large-scale annotated data [3]. However, when the deployment domain is specialized, training data may be limited or the labeling cost prohibitively high, as labeling must be done by an expert, for example, a doctor in the medical field. To relax the demanding data requirements of deep learning, the emerging topic of few-shot learning (FSL) [4] has received considerable attention and developed into a fundamental research problem over the past few years. With only a few annotated samples per class available, few-shot image recognition aims to efficiently build a classification model for recognizing new classes in an unseen domain.
Directly training a deep recognition model [2] from scratch on such scarce data would intuitively lead to over-fitting collapse [5]. Recent few-shot image recognition is therefore typically addressed in an inductive transfer learning paradigm [6], which aims to improve learning with limited few-shot data (typically denoted as the support set S) using knowledge from a base set D_b containing abundant samples. Conventionally, the learning process is divided into two stages: (1) learning a transferable model from the base dataset D_b, and (2) adapting the pre-trained model to the unseen target few-shot task with S.
Prevailing approaches [7][8][9][10] to learning in the first stage are typically based on meta-learning [11]. A meta-model is learned by maximizing generalization accuracy across a variety of few-shot tasks drawn from the base set, with the goal of transferring meta-knowledge that improves generalization on the unseen domain. The meta-model has been shown to hold the promise of fast adaptation [8] and to avoid over-fitting [7]. Although meta-learning provides an elegant solution to FSL, recent studies also indicate that sophisticated meta-learning algorithms may be unnecessary [12][13][14][15][16]. Instead, simple representation learning based on a supervised cross-entropy loss over the entire dataset can transfer well and achieve even better performance. These findings underscore that the essence of few-shot transfer mainly lies in feature reuse [17] rather than fast adaptation. Other techniques, including self-supervised learning [18] and knowledge distillation [14], have also effectively improved feature transferability. Besides directly leveraging the frozen representation for the target FSL task, some efforts have also explored improvements based on task-specific adaptation [19], indicating that proper adaptation may still be necessary [20], especially for cross-domain FSL [21].
However, most existing FSL protocols and methods limit their source domain to a single dataset for pre-training, even though many datasets from semantically different domains are in fact available. Besides, a recent benchmark called Meta-Dataset [22] suggests using multiple source datasets for FSL, but its target datasets for evaluation are still natural images. In practice, FSL scenarios are more likely to come from specialized recognition domains, such as remote sensing [23] and medical imagery [24][25][26]. In this paper, we aim to address this practical few-shot setting, referred to as multi-source cross-domain few-shot learning. To promote FSL with knowledge from multiple source domains, some methods [27][28][29] are devoted to learning universal representations but still lack effective adaptation. Unlike most prior methods that focus on either representation learning or adaptation on the few-shot data, we address the problem from both aspects: how to effectively learn diverse generalizable features from multiple source domains, and how to use few-shot data to make an efficient adaptation (or deployment) for a wide range of cross-domain FSL scenarios. We therefore propose a novel two-stage learning scheme (illustrated in Figure 1), namely learning and adapting multi-source representations (LAMR). Concretely, in the first stage, we propose a parameter-efficient multi-head framework for training multi-source representations. Instead of learning a single domain-agnostic embedding, we represent diverse features by constructing separate sub-spaces, each corresponding to a specific domain. This is achieved by optimizing multiple in-domain classification tasks on the multi-head representation spaces with a shared backbone. In this way, our model preserves information about each domain in a compact network. The representations can then be universal enough to further support generalization to vastly different FSL tasks.
The pre-trained representations are expected to generalize well to unseen tasks similar to the source domains. However, this remains a challenge when a large domain shift exists between the source and target data, where pre-trained features are less transferable and proper task-specific adaptation on the limited target data becomes necessary. Moreover, we consider instance discrimination and class discrimination to be two crucial capabilities of a robust recognition model. To impose these two objectives, we propose two feature contrastive losses for improving model discrimination towards unseen classes on the few-shot training data. This enables effective task-specific adaptation, as the adapted features become more relevant to the target classes. Empirical results show that the adaptation yields significant performance boosts, especially when the recognition scenario suffers extreme domain shifts, as in the remote sensing and medical domains.
In summary, our contributions are as follows:

• We develop a novel two-stage learning scheme, namely learning and adapting multi-source representations (LAMR), for addressing a wide range of cross-domain few-shot learning tasks, especially recognition scenarios beyond natural images, including remote sensing and medical imagery.

• To achieve multi-source representations, we propose a parameter-efficient multi-head framework, which can further support simple but effective transfer to different downstream FSL tasks.

• To achieve task-specific transfer, we propose a few-shot adaptation method that improves model discrimination towards unseen classes by imposing instance discrimination and class discrimination at the feature level.

• LAMR achieves state-of-the-art results on cross-domain FSL benchmarks in the multi-source setting.
Compared to the preliminary conference version [30], this work additionally presents the following new content:

• We extend LAMR to single-source FSL by introducing dataset-splitting strategies that equally split one source dataset into sub-domains. The empirical results show that simple "random splitting" improves conventional cosine-similarity-based classifiers in FSL under a fixed single-source data budget. LAMR also achieves superior performance on the (single-source) BSCD-FSL benchmark and competitive results on mini-ImageNet.

• We conduct more careful ablation studies, which verify that the performance gains come not only from the good transferability of the proposed multi-source representations but also from each component of the few-shot adaptation objectives.

• We include discussions and comparisons of more related works, especially on few-shot learning with multi-source domains.

• We include more feature visualizations and analyses, and discuss limitations and future directions.
The rest of this paper is organized as follows. In Section 2, we briefly review related work. In Section 3, we formulate the task and present baseline methods. In Section 4, we elaborate on our proposed method. In Sections 5 and 6, we describe the benchmark datasets, implementation details, experimental results, and ablation studies. In Section 7, we draw conclusions, discuss limitations, and outline promising future directions.

Few-Shot Learning
Meta-learning [11] is a pioneering approach to few-shot learning [7][8][9][10]. Its training regime, known as episodic training, focuses on mimicking the target few-shot task style, i.e., the "N-way K-shot task". Concretely, this approach trains a meta-model on various "N-way K-shot" tasks (or episodes) sampled from the source dataset, with the goal of distilling meta-knowledge that generalizes well to the unseen domain. For instance, MAML [7] optimizes a model-agnostic meta-initialization that enables fast adaptation to a novel FSL task with only a few fine-tuning steps. Meta-LSTM [8] uses an LSTM module as the meta-learner to provide a task-specific update rule for the optimization. Prototypical Networks [9] and Matching Networks [10] seek to learn a good metric space capable of directly separating new, unknown classes. These methods are proven to avoid over-fitting and hold the promise of fast adaptation, as the meta-model is assumed to be an optimal initialization for different unseen few-shot tasks. Although meta-learning is an elegant solution to few-shot learning, recent research indicates that complex algorithms may not be necessary. Instead, simple representation learning [12][13][14][15][16][21][31] based on a supervised cross-entropy loss over the entire dataset can transfer well and achieve competitive or even better performance. These findings highlight that the essence of few-shot transfer mainly lies in source-feature reuse [17] rather than fast adaptation. Furthermore, to make feature representations more generalizable and transferable, other techniques, including self-supervised learning [18], knowledge distillation [14,32], saliency-guided attention [33], and contrastive learning [34], have also been proven to effectively improve performance and enhance model discrimination on novel categories.
The methods discussed above assume only one source dataset for pre-training, but many datasets collected from semantically different domains are available in the machine learning community. To promote few-shot learning with knowledge from multiple domains, learning universal representations [21,27,28,29] and feature selection [21,27,28] have been explored in the literature. Concretely, the simplest way [21] to obtain such representations is to train a separate feature extractor for each available domain. SUR [27] and URT [28] obtain multiple representations in a parameter-efficient backbone [35] where domain-specific FiLM [36] layers are inserted after each batch normalization layer. To address a given few-shot task, Guo et al. [21] propose a greedy selection algorithm that iteratively searches for the best subset of features across all layers of all pre-trained models; the selected features are concatenated for training a linear classifier. SUR [27] proposes a feature selection procedure that linearly combines the domain-specific representations with different weights. URT [28] further trains a universal representation transformer layer to weigh the features. Different from [21,27,28], which use multiple representations, URL [29] proposes distilling knowledge from the separate multi-domain networks into a single feature extractor.
Aside from directly using representations trained from one domain or multiple domains, some work has also looked into how to make effective few-shot task adaptations with limited data [19,20,37,38,39,40]. Concretely, TADAM [20] applies a task embedding network block, which takes the mean vector of few-shot features as input and produces element-wise scaling and shift vectors to adjust each batch normalization layer, thus making the feature extractor task-specific. FN [38] directly fine-tunes the scaling and shifting parameters of batch normalization on few-shot data to adapt the feature extractor. ConFeSS [39] proposes to learn a task-specific feature masking module that can produce refined features for fine-tuning a target classifier and the feature extractor. Associative alignment [19] first selects a set of task-relevant categories from source data and conducts feature alignment between the selected source data and target data for network adaptation. PDA [41] proposes a proxy-based domain adaptation scheme to optimize the pre-trained representation and a novel few-shot classifier simultaneously. Instead of adjusting the pre-trained network, some methods [37,40] choose to incrementally learn parametric modules for adaptation to novel tasks while leaving the pre-trained parameters frozen. For example, Implanting [37] adds and learns new convolutional filters within the existing CNN layers. TSA [40] attaches residual adapters to each module of a pre-trained model and optimizes them from scratch on the few-shot data. Unlike these methods, which perform adaptation by leveraging auxiliary parametric modules [20,37,39,40] or additional data [19], our method provides a more effective adaptation scheme that directly optimizes the pre-trained representations with the limited target data.
Beyond the widely investigated few-shot learning for regular image recognition, recent studies have also focused on other tasks, such as scene recognition [42], multi-label classification [43], and multi-modal learning [44]. In this paper, we target a practical FSL setting, namely multi-source cross-domain few-shot learning. Different from most existing methods that focus on either representation learning or adaptation on the few-shot data, we address the problem from both aspects: how to design a good multi-source representation network, and how to adapt the representations to address cross-domain FSL in a wide range of scenarios.

Domain Adaptation
Domain adaptation (DA) typically aims to transfer knowledge from a data-rich source domain to an unlabeled target domain. Most existing DA approaches learn invariant feature representations across the two domains by distribution alignment [45,46] or adversarial learning [47]. Beyond single-source DA, our method is more relevant to multi-source DA [48,49], which also leverages knowledge from multiple source domains. To learn domain-invariant feature representations, these methods typically align domains pairwise based on a domain-shared feature extractor, a learning framework similar to ours. Concretely, Xu et al. [49] leverage multiple domain discriminators to reduce domain shift via adversarial learning, while [48] matches moments of the feature distributions across all pairs of source and target domains. However, the task addressed in this paper differs intrinsically from both single-source and multi-source DA, where the source and target domains share the same classes (or label space). In contrast, we tackle few-shot learning, where the classes in the source and target domains do not overlap.

Contrastive Learning
Our few-shot adaptation strategy is highly inspired by self-supervised contrastive learning, which imposes instance discrimination [50][51][52][53], and by supervised contrastive learning [54]. All these methods aim to learn a good universal representation from a large-scale dataset, thus boosting transferability on a variety of computer vision tasks. Their basic idea is to contrast positive and negative pairs. For instance, NCE [50] proposes a non-parametric softmax classifier made up of instance features to achieve instance discrimination. MoCo [52] and SimCLR [53] construct different views of the same instance via a variety of data augmentations. SimCLR [53] learns the representation by minimizing the distance between features of these views while maximizing the distance to features of other instances. MoCo [52] minimizes a contrastive loss based on a dynamic feature dictionary and a momentum encoder. Supervised contrastive learning [54] minimizes the distance between features of same-category samples and maximizes the distance between features of different categories. Unlike these methods, which use contrastive learning for large-scale pre-training, we propose two contrastive objectives that impose both instance discrimination and class discrimination on the few-shot data to adapt the pre-trained feature representations to be task-specific.

Multi-Task Learning
Multi-task learning aims to learn multiple related tasks simultaneously [35,55,56]. The main idea is to build a compact network that can represent all domains by sharing most model parameters, except for minimal parameters for task specialization. Unlike multi-task learning, which aims to achieve optimal performance across multiple source tasks, transfer learning focuses on addressing a specific target task with insufficient training data using knowledge from one or more source domains. In this paper, we pursue efficient multi-source representation learning in a multi-task setting in order to support a broad range of downstream few-shot learning tasks.

Task Formulation
Few-shot image recognition aims to generalize base knowledge to categorize novel classes in previously unseen domains. It can be defined from an inductive transfer learning perspective, with two learning routines, i.e., the meta-training and meta-testing stages. In the conventional few-shot setting, there is only one source dataset for training, and the deployed recognition scenario is similar to the source domain; this is also regarded as in-domain few-shot learning. In contrast to the usual single-source setting, we study how to use the richer information in multiple source datasets sampled from different domains to support broad FSL tasks, especially visual recognition tasks beyond natural images, such as remote sensing and medical images.
Formally, let us assume we have B source datasets D = {D_b}_{b=1}^B in the meta-training stage, where each D_b = {(x, y)} ⊂ X_b × Y_b corresponds to a specific domain, and (x, y) denotes an image sample and its associated class label. Based on deep neural networks, few-shot learning algorithms aim to extract general and transferable knowledge from the large-scale data D. In the meta-testing stage, the pre-trained model is adapted to a novel few-shot task that provides a small support set S = {(x_j, y_j)}_{j=1}^{N×K} sampled from an unseen target domain D_n; a task with N classes and K labeled samples per class is referred to as an "N-way K-shot" recognition task. In particular, the source datasets D_b and the target dataset D_n have no common classes. After the pre-trained model has been adapted on the support set, a query set Q sampled from the same unseen classes is used to evaluate generalization performance.

Transfer Learning Baseline
We revisit a conventional transfer learning baseline, where a feature extractor F is first pre-trained on the source data and then frozen when adapting to the few-shot task. For the multi-source setting, we can simply merge the multiple source datasets into a joint dataset D_J with a joint label space Y_J, so that representation learning is conducted as one joint classification task. Associated with a classification layer C_base, the feature extractor can be trained in an end-to-end manner to recognize all joint classes by minimizing the expected empirical risk

min_{F, C_base} E_{(x, y_J) ∼ D_J} [ L(C_base(F(x)), y_J) ],

where L denotes a loss function (typically cross-entropy) that measures the agreement between the true class label and the corresponding prediction from the classifier. Note that y_J denotes the class label in the joint space Y_J rather than in the original source domain.

FT Baseline
Given a target few-shot task presented by S, a simple fine-tuning (FT) baseline freezes the pre-trained feature extractor and retrains a new classifier head C_novel on the features of the support set, i.e., by minimizing a cross-entropy loss over {(F(x), y) : (x, y) ∈ S}. The new recognition model composed of {F, C_novel} can then be used for the target task. This baseline has recently been proven effective when the pre-trained features are transferable and can be reused in the target domain.

NNC Baseline
Another natural approach to directly performing unseen-class categorization is the k-nearest neighbor (KNN) rule on the pre-trained representation. This method expects the deeply learned features to be discriminative and general enough to separate new classes, so that a query (test) sample can be well classified by its nearest neighbors. A more commonly used non-parametric method for multi-shot FSL is the nearest neighbor classifier (NNC) baseline [12,27], where the weights of the target classifier can be regarded as class prototypes [9]. Each prototype is computed as the average feature of the corresponding support class,

p_j = (1/K) Σ_{x ∈ S_j} F(x).

For a query image x_q ∈ Q, the NNC assigns it the label of the closest support class under a similarity metric sim(·,·) on the representation space,

ŷ_q = arg max_{j ∈ {1,...,N}} sim(F(x_q), p_j).
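As a minimal sketch of the NNC baseline above (a NumPy illustration; the function names are ours, not from the paper):

```python
import numpy as np

def l2n(x):
    """L2-normalize rows so that dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def nnc_predict(support_feats, support_labels, query_feats, n_way):
    """NNC baseline: class prototypes are the mean support features per
    class; each query takes the label of the most cosine-similar prototype."""
    support_labels = np.asarray(support_labels)
    protos = np.stack([support_feats[support_labels == j].mean(axis=0)
                       for j in range(n_way)])
    sims = l2n(query_feats) @ l2n(protos).T  # (Q, N) cosine similarities
    return sims.argmax(axis=1)
```

Given pre-extracted features, this classifier needs no training at all, which is why it serves as the reference point for the adaptation methods described later.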

Approach
In this section, we elaborate on our approach to multi-source cross-domain few-shot learning, which includes two learning stages: (1) learning multi-source representations, and (2) adapting them to the few-shot task. In particular, the objective of the adaptation procedure is to make the pre-trained representations task-specific and discriminative enough to identify novel classes.

Multi-Source Representation Learning
Given the multiple source datasets {D_b}_{b=1}^B, we first present our framework for training multi-source representations, aiming to effectively extract their diverse semantic information. A simple way to obtain multi-domain representations is to train a separate feature extractor for each source domain [57]. However, when the number of domains is large, adapting and deploying many models is impractical. The other baseline (presented in Section 3.2), training a single-task network on the merged source data, is parameter-efficient but suppresses feature diversity. Worse still, potential interference across different domains may impede regular training [58].
To reduce computational cost and model size, we achieve efficient domain-specific representations with a multi-head structure, where the multiple source datasets share a backbone network, under the assumption that low-level features generalize across different domains and tasks [5]. Concretely, the multi-head representation has multiple projection layers, each of which corresponds to a different domain and maps the shared features into that domain's space. The learning framework is depicted in Figure 2. Besides the original B domains, we also create a universal domain on the merged source data D_J presented in Section 3.2, and accordingly define the number of feature representations as D = B + 1.
Inspired by a previous study [59], we instantiate each projection layer with a low-rank bilinear pooling (LBP) structure, since it has been proven to improve feature discrimination in single-source FSL. Assuming the feature maps output by the shared CNN backbone are f_φ(x) ∈ R^{h×w×c}, we add parallel LBP layers [60,61] at the end of the shared backbone, as shown in Figure 2. Denoting the D domain-specific LBP layers as {P_{θ_i}}_{i=1}^D, we obtain a set of feature representations {F_i(·)}_{i=1}^D for the D domains:

F_i(x) = (1/(h·w)) Σ_l (P_{i,1}^T f_φ(x)_l) ⊙ (P_{i,2}^T f_φ(x)_l),

where P_{i,1} ∈ R^{c×d} and P_{i,2} ∈ R^{c×d} are the two projection matrices of P_{θ_i} for the i-th domain, ⊙ denotes the Hadamard product, and the subscript l indexes the h×w spatial locations. As shown in Figure 3, the detailed LBP architecture in our implementation consists of two parallel 1×1 convolutions with c input channels, followed by a Hadamard product and a global average pooling operation. The feature dimension can be manually set to d.
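The projection just described can be sketched in a few lines of NumPy (a single head, operating on one image's feature map; the function name and toy shapes are ours):

```python
import numpy as np

def lbp_head(feat_map, P1, P2):
    """One domain-specific low-rank bilinear pooling head (a sketch):
    two linear projections, a Hadamard product per spatial location,
    then global average pooling down to a d-dimensional feature.
    feat_map: (h, w, c) backbone output; P1, P2: (c, d) projections."""
    h, w, c = feat_map.shape
    x = feat_map.reshape(h * w, c)   # one row per spatial location l
    z = (x @ P1) * (x @ P2)          # elementwise (Hadamard) product
    return z.mean(axis=0)            # global average pooling -> (d,)
```

In the full model, D such heads with independent (P1, P2) pairs run in parallel on the same shared feature map, which is what keeps the framework parameter-efficient.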

We train the multi-source representations with regular supervised in-domain classification, performed on each representation head. Concretely, cosine classifiers [62,63] are used as the classification layers, denoted as {C_i(·; W_i)}_{i=1}^D, where W_i = [w_1, ..., w_{N_i}] are the d-dimensional classification weight vectors for the N_i classes in the i-th domain. The classifier C_i(·; W_i) produces the normalized classification score (probability) for the j-th class,

P(y = j | x) = exp(γ · sim(F_i(x), w_j)) / Σ_{k=1}^{N_i} exp(γ · sim(F_i(x), w_k)),

where the cosine similarity sim(x, y) = x^T y / (∥x∥_2 ∥y∥_2) is the dot product of the two ℓ2-normalized vectors, and γ is the usual scale factor. In summary, the pre-training procedure minimizes the multi-domain classification losses. For clarity, we re-denote an image example in the joint dataset as (x, y_O, y_J, y_D) ∼ D_J, where y_O, y_J, y_D are its original-domain class label, joint-domain class label, and domain index, respectively. The end-to-end training objective in a multi-task setting is

min Σ_{i=1}^{B} E_{(x, y_O, y_D) ∼ D_J} [ 1(y_D = i) · L_CE(C_i(F_i(x)), y_O) ] + E_{(x, y_J) ∼ D_J} [ L_CE(C_D(F_D(x)), y_J) ],

where L_CE is the cross-entropy function, and 1(·) ∈ {0, 1} is a domain indicator function that returns 1 if its argument is true and 0 otherwise. During network training with mini-batch SGD, the gradients accumulated from the multiple tasks on the shared parameters may be too large for stable end-to-end optimization. To stabilize training, we adopt a simple gradient scaling mechanism: when the losses are backpropagated to the shared features, the cumulative gradients from the multi-head branches are averaged. In this way, the gradient magnitudes for the domain-shared parameters (the CNN backbone) and the domain-specific parameters (the projection and classification heads) remain balanced for proper end-to-end training.
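The cosine classification score above can be sketched as follows (a NumPy illustration; the function name and the default value of `gamma` are ours, chosen only for the example):

```python
import numpy as np

def cosine_classifier(feat, W, gamma=10.0):
    """Cosine classifier head: softmax over scaled cosine similarities.
    feat: (d,) feature from one representation head.
    W: (d, N_i) classification weight vectors for the N_i in-domain classes.
    gamma: the scale factor from the text (value here is illustrative)."""
    f = feat / (np.linalg.norm(feat) + 1e-8)
    Wn = W / (np.linalg.norm(W, axis=0, keepdims=True) + 1e-8)
    logits = gamma * (f @ Wn)            # scaled cosine similarities
    e = np.exp(logits - logits.max())    # numerically stable softmax
    return e / e.sum()
```

Because both the feature and the class weights are ℓ2-normalized, the logit magnitude is bounded, which is why the scale factor γ is needed to produce peaked probabilities.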
In summary, our framework can be trained end-to-end and built upon any CNN backbone, and it is parameter-efficient and simple to implement. The joint training regime ensures that the shared low-level features remain general and that the multi-head projections are fully responsible for domain specialization. The produced representations can thus be universal enough to support further generalization to vastly different few-shot recognition tasks.

Adapting Representations on Few-Shot Data
After obtaining the set of feature representations {F_i(·)}_{i=1}^D, we further conduct model adaptation, aiming to generalize the pre-trained representations to the few-shot task, which only provides a small support set. To this end, we identify instance discrimination and class discrimination as two crucial factors for improving model generalization, and accordingly propose two contrastive learning objectives, applied on each domain-specific head. Different from the previous method [34], which uses contrastive learning on the source data to improve feature transferability in the first stage, our method conducts model adaptation by enhancing contrast across the few-shot data, thus directly making the pre-trained features more specific and discriminative for the target task. As the adaptation procedure is conducted on each representation head F(·; θ_i) independently, we omit the domain index in the following notation for clarity. The adaptation procedure is depicted in Figure 4.

Parametric Instance Discrimination
Unlike most self-supervised contrastive losses, which use complex data augmentation to construct positive pairs from the same instance to achieve instance discrimination, we propose a parametric module, namely the instance parametric proxy (IPP). It is functionally similar to a memory bank storing instance features for instance classification [50,52], but differs in that our IPP is learnable and updated by gradient descent. For an N-way K-shot task that provides a support set S = {S_i}_{i=1}^N with |S_i| = K, we denote the weights of the IPP as V = {v_i}_{i=1}^{N×K}, v_i ∈ R^d, each of which corresponds to a support instance and is initialized with the original support features. For each iteration of model adaptation, let i ∈ I ≡ {1, ..., N×K} be the index of an arbitrary transformed sample from the original support images, and let A(i) be the negative index set of sample x_i. We then perform contrastive learning by enforcing each instance x_i to be close to its proxy v_i and far from its negative samples indexed by A(i) in the feature space. Our parametric instance discrimination (PID) loss, modified from info-NCE [51], is

L_PID = -(1/|I|) Σ_{i ∈ I} log [ E(F(x_i), v_i) / ( E(F(x_i), v_i) + Σ_{a ∈ A(i)} E(F(x_i), v_a) ) ],

where E(w_1, w_2) = exp(sim(w_1, w_2)/τ) and τ is the usual temperature parameter. Unlike unsupervised contrastive learning [52,53], which makes an anchor instance discriminate against all other instances, the negative index set here is defined as A(i) ≡ {a ∈ I : y_a ≠ y_i}, meaning that negative pairs between same-class instances are filtered out by the category labels. As a result, |A(i)| = (N − 1) × K for an N-way K-shot task. This supervised objective is thus more effective in reducing instance variations, as it avoids negative contrast between same-class features. During adaptation, this contrastive loss is jointly minimized with respect to the IPP and the parameters of the feature representation by SGD.
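A minimal NumPy sketch of the PID objective (function names are ours, and treating the negatives as the proxies of other-class instances is our reading of the text):

```python
import numpy as np

def e_sim(a, b, tau=0.1):
    """E(w1, w2) = exp(sim(w1, w2) / tau) with cosine similarity."""
    a = a / (np.linalg.norm(a) + 1e-8)
    b = b / (np.linalg.norm(b) + 1e-8)
    return np.exp(np.dot(a, b) / tau)

def pid_loss(feats, proxies, labels, tau=0.1):
    """Parametric instance discrimination: pull each support feature toward
    its own proxy v_i, push it away from proxies of other-class instances
    A(i) = {a : y_a != y_i}."""
    n = len(feats)
    total = 0.0
    for i in range(n):
        pos = e_sim(feats[i], proxies[i], tau)
        neg = sum(e_sim(feats[i], proxies[a], tau)
                  for a in range(n) if labels[a] != labels[i])
        total += -np.log(pos / (pos + neg))
    return total / n
```

When each feature is already aligned with its proxy and orthogonal to other-class proxies, the loss approaches zero, which matches the intuition of the objective.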

Class Feature Discrimination
While the instance discrimination loss enhances instance invariance and thereby generalization, it only loosely ensures intra-class compactness [54], a key capability for clustering same-class features. To make the representations more discriminative for the target task, we enforce intra-class feature invariance while keeping between-class features separated. Given the arbitrary transformed support samples indexed by i ∈ I ≡ {1, ..., N×K}, let A(i) and P(i) be the negative and positive index sets of sample x_i, respectively. Here, P(i) ≡ {p ∈ I : y_p = y_i} \ {i} and |P(i)| = K − 1 for a K-shot task. Our class feature discrimination (CFD) objective then minimizes the supervised contrastive loss

L_CFD = Σ_{i ∈ I} (-1/|P(i)|) Σ_{p ∈ P(i)} log [ E(F(x_i), F(x_p)) / Σ_{a ∈ A(i) ∪ P(i)} E(F(x_i), F(x_a)) ].

With the complementary supervision of the PID and CFD losses, not only are the inter-class feature differences enlarged, but the intra-instance and intra-class feature variations are also reduced.
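The CFD objective can be sketched in the same style (a NumPy illustration following the supervised contrastive form of [54]; the denominator over all other samples is our reading of the loss):

```python
import numpy as np

def e_sim(a, b, tau=0.1):
    """E(w1, w2) = exp(sim(w1, w2) / tau) with cosine similarity."""
    a = a / (np.linalg.norm(a) + 1e-8)
    b = b / (np.linalg.norm(b) + 1e-8)
    return np.exp(np.dot(a, b) / tau)

def cfd_loss(feats, labels, tau=0.1):
    """Class feature discrimination: pull same-class features P(i) together
    and push different-class features A(i) apart, per anchor i."""
    n = len(feats)
    total = 0.0
    for i in range(n):
        pos_idx = [p for p in range(n) if p != i and labels[p] == labels[i]]
        others = [a for a in range(n) if a != i]     # A(i) U P(i)
        denom = sum(e_sim(feats[i], feats[a], tau) for a in others)
        total += -sum(np.log(e_sim(feats[i], feats[p], tau) / denom)
                      for p in pos_idx) / max(len(pos_idx), 1)
    return total / n
```

Note that, unlike PID, this loss contrasts support features against each other directly and involves no proxies.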

Prototypical Classification
The two contrastive objectives enhance model discrimination at the feature level, thus improving accuracy for the direct NNC baseline. However, previous findings [14,64] also indicate that training linear classifiers on a frozen feature extractor can outperform NNC, as they learn better class boundaries by exploiting all support examples instead of only computing class centers. A natural approach is to build a linear classifier from scratch, as in the FT baseline of Section 3.2. Here, we instead propose to conduct classification implicitly by repurposing the IPP, without building a new parametric layer, which also avoids over-parameterization in the low-data regime. In each iteration, we first compute the class prototypes by averaging the instance proxies belonging to the same class,

p_k = (1/K) Σ_{i : y_i = k} v_i.

For a support sample x_i, the posterior probability P_s of belonging to support class k is

P_s(k | x_i) = exp(sim(F(x_i), p_k)/τ) / Σ_{j=1}^{N} exp(sim(F(x_i), p_j)/τ).

The prototypical classification loss is then the cross-entropy between these predictions and the support labels,

L_PC = -(1/|I|) Σ_{i ∈ I} log P_s(y_i | x_i).

This regularization encourages the model to learn more comprehensive features by enforcing accurate predictions with the standard cross-entropy. It turns out to be particularly effective and leads to significant performance boosts, which we attribute to the improved quality of the adapted features and the more representative class prototypes induced by the IPP.
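A compact NumPy sketch of the prototypical classification loss (function name and the temperature default are ours; prototypes come from the per-class means of the IPP proxies, as described above):

```python
import numpy as np

def prototypical_loss(feats, proxies, labels, n_way, tau=0.1):
    """Prototypical classification: prototypes are per-class means of the
    IPP proxies; cross-entropy on the support predictions."""
    labels = np.asarray(labels)
    protos = np.stack([proxies[labels == k].mean(axis=0)
                       for k in range(n_way)])
    def norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    sims = norm(feats) @ norm(protos).T / tau     # (N*K, N) scaled cosines
    m = sims.max(axis=1, keepdims=True)           # stable log-softmax
    logp = sims - (m + np.log(np.exp(sims - m).sum(axis=1, keepdims=True)))
    return -logp[np.arange(len(feats)), labels].mean()
```

Because the prototypes are functions of the learnable proxies, minimizing this loss updates the IPP and the representation jointly, with no extra classifier parameters.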

Implementation of Total Adaptation Loss
Finally, the few-shot adaptation is conducted by minimizing the combination of the three losses on the support set:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{PID}} + \lambda_1 \mathcal{L}_{\mathrm{CFD}} + \lambda_2 \mathcal{L}_{\mathrm{PC}} \tag{18}$$

where $\lambda_1$ and $\lambda_2$ are two regular trade-off parameters.
We conduct two adaptation strategies: (1) LAMR: adapting the projection layers while leaving the backbone frozen. This adaptation is performed independently on each representation head with the shared features extracted once, which typically allows for rapid adaptation. (2) LAMR++: adapting both the projection layers and the shared backbone.
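The combined objective and the two freezing regimes can be sketched in a few lines. The function names and the `projection`/`backbone` name prefixes are hypothetical conveniences, not the paper's actual parameter names; both trade-off weights default to 1 as in the experiments.

```python
def total_adaptation_loss(l_pid, l_cfd, l_pc, lam1=1.0, lam2=1.0):
    """L_total = L_PID + lambda1 * L_CFD + lambda2 * L_PC (both lambdas are 1 in our runs)."""
    return l_pid + lam1 * l_cfd + lam2 * l_pc

def adaptable_parameters(param_names, adapt_backbone=False):
    """LAMR fine-tunes only the projection heads; LAMR++ also unfreezes the shared backbone."""
    heads = [n for n in param_names if n.startswith("projection")]
    if adapt_backbone:
        heads += [n for n in param_names if n.startswith("backbone")]
    return heads
```

In a PyTorch codebase, the selected names would be handed to the optimizer while all other parameters keep `requires_grad=False`.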

Query Prediction
With the proposed adaptation, we can transfer the pre-trained representations to task-specific ones, denoted as $\{F(\cdot; \hat{\theta}_i)\}_{i=1}^{D}$. We can build a nearest neighbor classifier (NNC) [9] from each adapted IPP. For the $i$-th domain, the prototypes of the NNC induced by Equation (15) are denoted as $\{\hat{p}^i_j\}_{j=1}^{N}$. For a query image $x_q \in Q$, the similarity to class $j$ on the $i$-th representation is computed as $\mathrm{sim}(F(x_q; \hat{\theta}_i), \hat{p}^i_j)$. The final prediction aggregates the class similarities across all the representation heads:

$$\hat{y}_q = \arg\max_{j} \sum_{i=1}^{D} \mathrm{sim}\big(F(x_q; \hat{\theta}_i), \hat{p}^i_j\big) \tag{19}$$
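Equation (19) amounts to summing per-head cosine similarities and taking the arg-max. A minimal NumPy sketch, assuming the per-head query features and prototypes have already been computed (the list-of-arrays layout is our own choice):

```python
import numpy as np

def predict_query(query_feats, prototypes):
    """Aggregate cosine similarity over all D adapted representation heads.

    query_feats: list of D arrays, each (Q, d) -- query features per head
    prototypes:  list of D arrays, each (N, d) -- class prototypes per head
    """
    scores = 0.0
    for f, p in zip(query_feats, prototypes):
        f = f / np.linalg.norm(f, axis=1, keepdims=True)
        p = p / np.linalg.norm(p, axis=1, keepdims=True)
        scores = scores + f @ p.T            # (Q, N) cosine similarities, summed over heads
    return scores.argmax(1)                  # class with the highest aggregated similarity
```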

Extension to Single-Source FSL
We can make our multi-source framework applicable to single-source FSL by dividing the source dataset into sub-domains, each containing its own unique classes. We propose the following two splitting methods.

1. Random splitting. The original classes are randomly split into equally sized sub-datasets.
2. Clustering splitting. A natural class-splitting choice would be K-means clustering on class prototypes computed over image features, with a representation pre-trained on the full set of classes. However, K-means may produce unbalanced partitions. Inspired by the previous method [65], we instead iteratively split each current dataset in half along the principal component computed over the class prototypes. After $N$ splitting iterations, the original dataset is divided into $2^N$ subsets, each of which can be regarded as a distinct domain composed of classes that are close to each other.
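The iterative principal-component halving in the clustering splitting can be sketched as follows. This is an illustrative NumPy version under our assumptions: class prototypes are given as row vectors, the split is along the first right-singular vector of the centered prototypes, and each group is halved by the median of the projections.

```python
import numpy as np

def split_classes(prototypes, n_iters):
    """Iteratively halve each class set along the first principal component.

    prototypes: (C, d) class prototypes; returns 2**n_iters index groups.
    """
    groups = [np.arange(len(prototypes))]
    for _ in range(n_iters):
        new_groups = []
        for idx in groups:
            p = prototypes[idx] - prototypes[idx].mean(0)
            # first principal component of the centered class prototypes
            _, _, vt = np.linalg.svd(p, full_matrices=False)
            proj = p @ vt[0]
            order = np.argsort(proj)
            half = len(idx) // 2
            new_groups += [idx[order[:half]], idx[order[half:]]]
        groups = new_groups
    return groups
```

Each resulting group keeps classes that lie close together along the dominant direction of prototype variation, and all groups stay balanced by construction.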
With the split sub-domains under a fixed training data budget, our framework can indeed encourage more diverse representations, which can be further adapted to the FSL task to produce an ensemble of NNC classifiers. Splitting typically worsens the average performance of the individual classifiers but makes the ensemble prediction more accurate. This modification turns out to be particularly effective when the number of partitions is appropriate, as illustrated in the experimental section. In the extreme case of only one class per domain, the representation cannot be trained without supervision. Therefore, we can expect that there exists an optimal number of partitions for a dataset, and we choose this hyperparameter based on performance on the validation set.

We mainly evaluate our approach on the recently proposed cross-domain benchmark BSCD-FSL [21], which provides few-shot evaluation protocols in both single- and multi-source settings. Figure 5 shows examples of source and target images. For the multi-source setting, the training datasets are mini-ImageNet [8], CUB [66], CIFAR100 [67], DTD [68], and Caltech256 [69]; all source domains consist of colored natural images. For the single-source setting, only mini-ImageNet is used. The testing domains cover a spectrum of image types: the CropDiseases [70], EuroSAT [23], ISIC2018 [24,25], and ChestX [26] datasets. Concretely, they contain images of plant diseases, remote sensing images, dermoscopy images of skin lesions, and chest X-ray images, each corresponding to a different level of similarity to natural images. Compared to previous benchmarks [13,22], this provides more diverse specialized recognition scenarios for evaluating cross-domain FSL.


Mini-ImageNet
For conventional (in-domain) few-shot learning, we evaluate on the most commonly used mini-ImageNet dataset [8], which is derived from the ImageNet dataset [3] and consists of 60,000 color natural images of size 84 × 84 belonging to 100 classes, each with 600 examples. Mini-ImageNet was first proposed in [10]. We use the common follow-up setting [8] in which the dataset is divided into 64 base classes, 16 validation classes, and 20 novel classes. To make this dataset applicable to our framework, the splitting methods proposed in Section 4.3 are performed on the 64 base classes, and the optimal hyperparameters are selected on the validation set.

Network Architecture
We use ResNet12 [20,64], a derivative of residual networks [2] designed particularly for few-shot learning, as the feature extraction backbone f ϕ (•) producing the shared features in all experiments. The detailed structure of ResNet12 is shown in Figure 6. ResNet12 has four residual blocks, and each block is made up of three convolutional layers and one 2 × 2 max-pooling layer with stride 2. Each convolutional layer has a 3 × 3 kernel, followed by batch normalization and leaky ReLU with slope 0.1. The four blocks output feature maps with 64/160/320/640 channels, respectively. ResNet12 has approximately 12.4M parameters. For a 3 × 84 × 84 input image, the output feature maps have size 640 × 5 × 5. The low-rank bilinear pooling (LBP) layer utilizes two parallel 1 × 1 convolutional layers. We set its feature dimension d to 640, equal to the output channels of the ResNet12 backbone. The LBP has approximately 0.8M parameters.
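The quoted parameter counts can be verified arithmetically. The sketch below assumes bias-free convolutions (batch normalization supplies the bias), ignores the small BN affine parameters, and assumes a 1 × 1 projection shortcut in every block; the exact ResNet12 variant may differ slightly.

```python
def conv_params(c_in, c_out, k):
    """Weights of a bias-free k x k convolution (BN follows each conv)."""
    return c_in * c_out * k * k

def resnet12_params(widths=(64, 160, 320, 640)):
    """Convolutional weight count of ResNet12 under the assumptions above."""
    total, c_in = 0, 3
    for c_out in widths:
        total += conv_params(c_in, c_out, 3)       # first 3x3 conv of the block
        total += 2 * conv_params(c_out, c_out, 3)  # second and third 3x3 convs
        total += conv_params(c_in, c_out, 1)       # 1x1 shortcut projection
        c_in = c_out
    return total

print(resnet12_params())              # -> 12414848, i.e. ~12.4M as stated
print(2 * conv_params(640, 640, 1))   # -> 819200, the ~0.8M of the two LBP 1x1 convs
```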

Training Details
Our codebase is developed on the few-shot learning framework in PyTorch from [21]. In the meta-training stage, we use the SGD optimizer with Nesterov momentum 0.9, and a weight decay of 1 × 10⁻⁴ is applied to all model parameters. We train for a total of 140 epochs in both single- and multi-source learning, with the learning rate initialized to 0.1 and dropped to 0.01 at the 100th epoch, similar to [12]. Conventional data augmentations, including random resize and crop, horizontal flip, and color jittering, are applied to the source training images. In the meta-testing stage, we conduct model adaptation on the few-shot data. For LAMR, only the domain-specific layers are fine-tuned. For LAMR++, both the pre-trained backbone and the domain-specific layers are fine-tuned. Concretely, we use an SGD optimizer for 100 epochs of fine-tuning on the few-shot data (support set). The trade-off parameters λ 1 and λ 2 are simply set to 1. The metric scalar γ (or temperature parameter τ) in the cosine similarity is set to 20 (or 0.05) in all equations.
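The meta-training schedule and optimizer settings above can be captured in a short sketch. The helper name `learning_rate` and the plain-dict configuration are illustrative conveniences; in the actual PyTorch codebase these values would be passed to `torch.optim.SGD`.

```python
def learning_rate(epoch, base_lr=0.1, drop_epoch=100):
    """Step schedule for meta-training: 0.1 initially, dropped to 0.01 from epoch 100 onward."""
    return base_lr if epoch < drop_epoch else base_lr * 0.1

# Optimizer settings used for meta-training (values stated in the paper).
sgd_config = dict(momentum=0.9, nesterov=True, weight_decay=1e-4)
```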

Evaluation Protocol
For the BSCD-FSL benchmark, we evaluate 5-way few-shot performance with the shot varying in {5, 20, 50} over 600 tasks, following the previous evaluation protocol [21]. For the mini-ImageNet benchmark, we evaluate 5-way 1-shot and 5-shot generalization performance over 2000 tasks on the novel set, as in [18,41]. Each few-shot task contains 15 queries per class. We report the average accuracy with a corresponding 95% confidence interval over all tasks in all experiments. In particular, we apply consistent sampling to make comparisons rigorously fair: the sampling of testing few-shot tasks follows a deterministic order generated by NumPy with a fixed seed. This makes our ablation studies and comparisons more convincing.
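Consistent sampling can be sketched as below: seeding a NumPy generator once makes the entire task sequence deterministic, so every compared method sees the same tasks. The function name, the 600-images-per-class pool (as in mini-ImageNet), and the dict layout are our own illustrative choices.

```python
import numpy as np

def sample_task(rng, n_classes, n_way=5, k_shot=5, n_query=15):
    """Deterministically sample one few-shot task from a seeded NumPy generator."""
    classes = rng.choice(n_classes, size=n_way, replace=False)
    support = {int(c): rng.choice(600, size=k_shot, replace=False) for c in classes}
    query = {int(c): rng.choice(600, size=n_query, replace=False) for c in classes}
    return classes, support, query

# Two runs seeded identically produce the exact same task sequence.
rng1, rng2 = np.random.default_rng(0), np.random.default_rng(0)
t1, t2 = sample_task(rng1, 20), sample_task(rng2, 20)
```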

Results on Multi-Source FSL
We present the results of our method in the multi-source FSL setting, where five semantically different datasets can be used for pre-training. Among the compared methods, TSA [40] is an adaptation method that attaches residual adapters to each convolutional layer of a pre-trained model (here, URL [29]) together with a pre-classifier feature mapping, and optimizes them from scratch with the few-shot data.
Table 1 reports the detailed results on the four target datasets. Figure 7 summarizes the comparison across different methods according to the average accuracy over all shot levels and datasets in the benchmark. The proposed LAMR shows clear promise to set a new state of the art in all experimental settings, as it consistently surpasses both the previous methods and the baseline methods, as shown in Table 1. Concretely, Ensemble, All-EBDs, and IMS-f use multi-domain features built on fully separated feature extractors, which is inefficient and impractical when deployed to target domains. In addition, FiLM-pf and SUR, although built on a parameter-efficient backbone, are also computationally expensive, as they still require multiple forward passes to obtain the multi-domain features. In contrast to these methods, our multi-head network is both parameter- and computation-efficient. Instead of using an ensemble of multiple feature representations, URL learns a single network by knowledge distillation from the ensemble of separate multi-domain networks. The distilled single network shows better generalization ability than Ensemble and Union-CC but still underperforms other methods when its features are used directly. However, when the URL is further adapted, results improve significantly. Concretely, URL+Ad simply employs a linear layer on top of the URL for feature adaptation, which yields an average improvement of 2.7%. TSA conducts a deeper adaptation, which provides an average improvement of 3.9% over the URL. Besides, our proposed adaptation method is a combination of fine-tuning losses and is thus orthogonal to methods like TSA that adapt the network by incrementally learning new parametric modules. Finally, our LAMR, which only fine-tunes the projection layers, consistently performs better than TSA in all settings shown in Table 1. With the backbone being further adapted, our method LAMR++ achieves consistent performance gains. Particularly on the ISIC dataset, LAMR++ produces improvements of 3.7%, 3.2%, and 3.4% over our shallower adaptation method (LAMR) in the {5/20/50}-shot settings, respectively. It is generally considered important to adapt both shallow and deep layers of a neural network to successfully address cross-domain few-shot learning. Overall, our methods (LAMR & LAMR++) show a clear advantage over the other methods. Concretely, LAMR++ achieves an average classification accuracy of 73.07% across all datasets and shot levels, outperforming TSA by 3.2%.

We first investigate the effects of different splitting methods and splitting numbers for single-source FSL. As discussed in Section 4.3, there may exist a trade-off partition number that achieves the optimal generalization performance. The evaluation is conducted on the validation set of mini-ImageNet, with 200 consistently sampled tasks. We evaluate the two proposed splitting methods, random splitting and clustering splitting, with the splitting number varying in {1, 2, 4, 8, 16}. For the random splitting method, we perform three trials with different random class splits and report the mean of the three runs. It is worth noting that the results across different trials have low variance, as the variation between any two random trials is within 1%. The plots of 5-way 1-shot and 5-shot validation accuracy are shown in Figure 8. The one-split setting (or root point) refers to the strong FSL baseline, namely CC [62,63], trained on the original dataset without splitting. We can observe that the best performance is achieved at 2 splits in both the 1-shot and 5-shot settings, which reflects the trade-off between the number of split domains and the classes (or data) per domain for a fixed data budget on the mini-ImageNet benchmark.

First, it is interesting that the random splitting method achieves better performance than the clustering splitting method. A possible explanation for this phenomenon is that randomly split sub-domains include more heterogeneous classes, which can yield more discriminative representations and better average performance in the ensemble. Second, we can observe that the best number of splits for both methods on mini-ImageNet is 2, which also indicates that our multi-source framework can improve single-source FSL under a fixed data budget. Third, we can find that when the number of sub-domains becomes too large (greater than 4), the accuracy decreases seriously, since each sub-domain then contains too few classes and data to enable meaningful representation learning.

Results on Mini-ImageNet
We further evaluate our LAMR trained on the "fake multi-domain" data partitioned by the optimal splitting strategy validated in Figure 8. We report the results on the mini-ImageNet test set and compare them with prior methods focused on learning or adapting a good representation in Table 2. We make the following comparisons. (1) First, we compare our method with the approaches [12,13,14,62,71] that rely directly on a good pre-trained representation. They all perform few-shot classification on the frozen representation by building a target classifier with the NNC [12,62,71] or FT [13,14] baselines, but differ in the way the feature extractor is learned. Concretely, CC [62] is our baseline model, whose deep representation is trained with a cosine classifier. Neg-Cosine [71] enhances the CC representation by employing a negative margin in the softmax loss. Other methods [12,13,14] instead use plain linear classifiers minimizing the cross-entropy loss to obtain the representation, and Meta-Baseline [12] further improves the pre-trained feature extractor with a subsequent meta-training stage. All the methods mentioned above are competitive with meta-learning-based methods [9,64] and benefit from a good embedding. However, our LAMR shows significant superiority over the CC-based methods [62,71] as well as the other methods [12,13,14]. For example, LAMR outperforms Embed-Distill [14] by 1.9% and 1.8% in the 5-shot and 1-shot settings, respectively. Besides, our method is also orthogonal to those methods [9,12,13,62,71], as their learning algorithms can also be used for pre-training our multi-head representation framework. (2) Second, we further compare our LAMR with the methods [19,20,27,37,72,73] that perform feature adaptation on target few-shot tasks. TADAM [20] employs a task embedding network (TEN) block that generates scaling and shift vectors for each batch normalization layer, adapting the network to be task-specific. However, learning an accurate auxiliary network can be challenging, especially when target data are limited and the domain shift is significant. Centroid-Align [19] first selects a set of task-relevant categories from the source data and conducts feature alignment between the selected source data and the target few-shot data for network adaptation. Free-Lunch [72] proposes to calibrate the distribution of the novel samples using the statistics of selected base classes considered task-relevant. H-OT [73] further develops a hierarchical optimal transport framework to achieve adaptive distribution calibration. Unlike methods [19,72,73] that perform adaptation by leveraging the base data [19] or their statistics [72,73], our method provides a more effective adaptation scheme that directly optimizes the pre-trained representations with the limited target data, without re-accessing the source data. Besides, our methods also achieve performance on par with or better than theirs. For example, our LAMR++ outperforms H-OT by 0.27% and 0.97% in the 1-shot and 5-shot settings, respectively. Instead of adjusting the pre-trained parameters, Implant [37] adds and learns new convolutional filters on top of the frozen CNN layers. Our LAMR++ is also orthogonal to it and performs significantly better, by 3.4% and 4.1% in the 1-shot and 5-shot settings, respectively.
(3) Third, we compare LAMR with an ensemble-based method, Robust-20 [57], which trains an ensemble of 20 ResNets promoted by cooperation and diversity regularization.Our approach is more efficient for building an ensemble of multiple representations and also significantly outperforms it, with notable absolute accuracy improvements of 2.0% and 2.3% for 1-shot and 5-shot settings, respectively.
We observe that adapting the backbone (by LAMR++) only slightly helps few-shot transfer on this benchmark dataset, whereas the improvement is more pronounced for cross-domain few-shot learning. This may indicate that deeper adaptation becomes more necessary as the domain distribution shift increases.

Results on BSCD-FSL
We further report the results of single-source cross-domain FSL on the four specific domains of BSCD-FSL and compare them with prior approaches in Table 3. Concretely, Linear and Mean-centroid denote the FT and NNC baselines, respectively, presented in Section 3.2. Ft-CC denotes fine-tuning a cosine classifier on the frozen feature extractor. The proposed approach clearly surpasses all three transfer learning baselines by a large margin. For instance, LAMR performs better than Linear by 5.4%, 4.6%, and 3.1% in the {5/20/50}-shot settings, respectively. Similar to the observation in multi-source FSL, LAMR++ also yields notable and consistent improvements over LAMR. Particularly on the ISIC dataset, LAMR++ produces improvements of 5.5%, 5.8%, and 5.2% over LAMR in the {5/20/50}-shot settings, respectively.
Besides, we also make comparisons with other state-of-the-art methods [32,38,39,74]. LDP-net [32] imposes local-global feature consistency on prototypical networks by knowledge distillation, which improves the cross-domain generalization of the learned features. However, due to the lack of feature adaptation, LDP-net is typically inferior to the other adaptation methods [38,39,74] and ours, especially in the 20/50-shot settings. For example, on the ISIC dataset, LAMR++ performs significantly better than LDP-net, by 6.0%, 9.8%, and 9.9% in the {5/20/50}-shot settings, respectively.
Other methods [38,39,74] conduct domain-specific feature adaptation to tackle the large domain shift. In particular, FN [38] adapts the feature extractor by fine-tuning the scaling and shifting parameters of its batch normalization layers on few-shot data. We observe that FN is inferior to the transfer learning baselines in several 5-shot settings. A possible interpretation of this phenomenon is that, with very limited data and extreme domain shift, optimizing the BN parameters accurately may be particularly hard. ConFeSS [39] proposes to learn a task-specific feature-masking module that produces refined features for further fine-tuning a target classifier and the feature extractor. NSAE [74] pre-trains and fine-tunes the network with an additional auto-encoder to improve model generalization, which implicitly augments the support data. Unlike ConFeSS [39] and NSAE [74], which leverage auxiliary modules for model adaptation, our approach is more efficient by directly optimizing the target model. Finally, our approach achieves the highest accuracy across all methods and experimental settings, except for one case, 5-way 5-shot EuroSAT, where ConFeSS slightly outperforms our LAMR++ by 0.03%. In 5-way 20-shot and 50-shot EuroSAT, however, our LAMR++ achieves significant performance gains of 2.6% and 3.1% over it, respectively. Table 3. The results of single-source few-shot learning on the BSCD-FSL benchmark. The best result in each setting is marked in bold. ¹ Results from [21]. ² Reproduced results based on the officially released code [32] and their trained model.

Ablation Study and Analysis
We conduct the ablation study on the BSCD-FSL benchmark, as its target domains cover a wide array of different specific recognition scenarios.

Effect of Multi-Domain Learning Framework
We first explore how our multi-source framework benefits feature transferability. We compare our framework with two baseline models: (1) Single-source: the feature representation is trained on one source dataset (that is, mini-ImageNet). (2) Merged-multi-sources: the feature representation is trained with one task that classifies the merged classes from all source datasets, as presented in Section 3. We validate the transferability of these representations with the NNC baseline directly, and the results are reported in Table 4. We observe that the representation trained with Merged-multi-sources does better than the representation trained with Single-source in most cases, because the multiple sources provide a substantially larger amount of training data. Compared to Merged-multi-sources, our multi-source framework achieves much higher accuracy in most settings. Only on ChestX does our framework slightly underperform, within 1%. We conjecture that because the distribution of this target domain does not match any of the training distributions, robust knowledge transfer cannot be ensured. However, the overall performance still demonstrates the benefit of using our multi-source representations rather than one representation learned on the combined source dataset.

To understand what enables good representation adaptation on few-shot data, we systematically study the effect of all components of our adaptation loss, i.e., PID, CFD, and PC with respect to Equations (11), (13), and (17). Table 5 shows the detailed results for all 24 settings, varying two source types, four target domains, and three shot levels. We can make the following observations: (1) Applying any one of PID, CFD, or PC leads to consistent performance gains in all 24 experimental settings. (2) With the combined supervision of PID and CFD, the results are better than using only PID or CFD in 16 out of 24 settings. (3) Incorporating all three components leads to the best result in 17 out of 24 settings. Particularly for the ISIC dataset with 50 examples per class available, the overall performance improves by up to 10.0% and 15.8% in the multi- and single-source settings, respectively. This also verifies that our adaptation strategy is advantageous for dealing with the domain shift problem in few-shot learning. (4) The overall performance gains are more significant when more data are available. For example, on the ChestX dataset, our adaptation method yields improvements of {1.4%, 4.0%, 6.6%} over the baseline in the {5/20/50}-shot settings, respectively. This indicates that when suffering from extreme domain bias, such as in the medical domain, recognition requires more data to ensure good adaptation. To better evaluate the effectiveness of each isolated component and of different combinations of the three components for cross-domain few-shot transfer, we further compute the average accuracy across different datasets and shot levels and rank them in Figure 9. The rank of the isolated performance gains of PID, CFD, and PC is {PC > CFD > PID} for both multi-source and single-source FSL, and the full adaptation achieves the best mean accuracy in both settings. Overall, the adaptation provides an average improvement of 4.5% and 6.4% for multi-source and single-source FSL, respectively.

Effect of Different Classifier Learning
We also compare some variants built on our pre-trained multi-source representations using different classification modules or fine-tuning regimes. The results are reported in Table 6. We can observe that: (1) fine-tuning only a classifier on the frozen representations obtains performance gains in all settings; besides, fine-tuned cosine classifiers (Ft-CC) always outperform linear-classifier-based ones (Ft-LC), which is consistent with previous findings in the literature [13,21]; (2) further adapting the representations together with the classifiers (Ft-MSR-LC & Ft-MSR-CC) leads to additional accuracy boosts.

Visualization Analysis
To further qualitatively understand how the adaptation leads to few-shot performance gains, we visualize the feature embeddings of the query images and the class prototypes with t-SNE [76] in Figure 10, computed on a 5-way 5-shot task sampled from the CropDisease dataset. Figure 10a-f show the multi-head representations, each with its feature embeddings before and after the adaptation. The benefits of our adaptation method can be attributed to two aspects: (1) The query features of the same class become more compact, and the class clusters are more separable from each other after the adaptation. This indicates that our LAMR encourages intra-class compactness and inter-class divergence, resulting in more discriminative features for classification. (2) The class prototypes induced from the adapted instance proxies are also more representative, so they classify the query features well. These observations also explain the significant performance boosts presented in the ablation study. We further analyze how the features change with the adaptation by visualizing class activation maps (CAMs) [77] on a 5-shot task sampled from CropDisease, which can also be regarded as a fine-grained recognition task on grape diseases. Comparisons of the regions that the deep CNN focuses on for discrimination before and after the adaptation are shown in Figure 11. We can make the following observations: (1) For the healthy grape leaf, the visual cues used for prediction do not change significantly before and after the adaptation, which indicates that the pre-trained features generalize well enough to recognize such common natural objects. Still, we can see that the adaptation makes the CNN features focus more on the skeleton and edges of the leaf and less on the background.
(2) For the two grape leaf diseases, our adaptation method helps the CNN concentrate on the regions most relevant to the two specific diseases. In contrast, the CNN features without the adaptation still focus on visual cues of the common object, such as the leaf edge. This change indicates that our method improves discrimination towards class-specific visual cues. This ability to steer task-relevant features underlies the superior performance of our LAMR.
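The t-SNE inspection described above can be reproduced with a short helper. This is a sketch using scikit-learn's `TSNE` (an assumption about tooling; the paper only cites t-SNE [76]): stack the query features and class prototypes, embed them into 2-D, then scatter-plot the result per class.

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_2d(features, seed=0):
    """Project features (e.g. 75 query features + 5 prototypes) to 2-D for inspection.

    perplexity is kept small since a single 5-way 5-shot task yields ~80 points.
    """
    return TSNE(n_components=2, perplexity=10, init="pca",
                random_state=seed).fit_transform(features)
```

Running it before and after adaptation on the same task makes the change in intra-class compactness and inter-class separation directly visible.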

Conclusions and Future Work
In this paper, we investigate a more practical FSL setting, namely multi-source cross-domain few-shot learning. To tackle the problem, we propose a simple yet effective multi-source representation framework for learning prior knowledge from multiple datasets, which enables generalization to a wide range of unseen domains. Further task-specific adaptation on few-shot data is performed to enhance instance discrimination and class discrimination by minimizing two contrastive losses on the multi-domain representations. We empirically demonstrate the superiority of our LAMR over many previous methods and strong baselines, achieving state-of-the-art results for cross-domain FSL. We extend LAMR to single-source FSL by introducing dataset-splitting strategies that equally split one source dataset into sub-domains. The empirical results show that applying simple "random splitting" can improve conventional cosine-similarity-based classifiers in FSL under a fixed single-source data budget. Extensive ablation studies and analyses illustrate that each component of our method effectively facilitates few-shot transfer. Our method also has some limitations, which point to promising future directions. First, we conduct adaptation by either fine-tuning or freezing the full backbone. It would be promising for future work to seek more flexible adaptation methods that select a subset of layers or parameters to adjust, conditioned on the given task. Second, a previous study [65] has shown that the choice of the source training dataset has a huge impact on the performance of downstream tasks. We also acknowledge that not every dataset in the multi-source domains contributes equally to the target task. Further improvements could be explored for more scalable transfer by considering the similarity between the source and target domains. It is worth noting that these two limitations may also apply to most other methods that focus on representation learning on source data or adaptation on few-shot data. Besides, to our knowledge, multi-source few-shot learning for other fundamental computer vision applications, such as segmentation and detection, has not been explored yet. Developing new benchmarks for those problems would also foster future progress in this field.

Figure 1 .
Figure 1. Illustration of our approach. First, we pre-train efficient multi-domain feature representations on the abundant data from semantically different domains. Then, given a few-shot task (such as remote sensing scene recognition), we perform adaptation on the pre-trained multi-domain representations by optimizing domain-specific parameters with the few-shot data (the support set).

Figure 2. Figure 3.
Figure 2. Multi-source representation learning from multiple source datasets. The structure contains a shared CNN backbone $f_\phi$ and the multi-head projection layers $\{P_{\theta_i}\}_{i=1}^{D}$, which produce a feature set $\{F_i(x)\}_{i=1}^{D}$. We train the representations by applying the cross-entropy loss on the multi-head classification tasks. The five input images correspond to the five source datasets of BSCD-FSL [21]. Best viewed in color.

Figure 4.
Figure 4. Adapting representations for recognizing previously unseen categories. The adaptation is performed on each representation head. Best viewed in color.

Figure 7 .
Figure 7. Overall performance comparison across different methods in the multi-source setting of the BSCD-FSL benchmark.

Figure 8.
Figure 8. Validation accuracy for different splitting strategies and numbers of partitioned sub-domains on mini-ImageNet. (a) 5-way 1-shot validation performance. (b) 5-way 5-shot validation performance. The best performance is achieved at 2 splits in both settings, reflecting the trade-off between the number of split domains and the classes (or data) per domain for a fixed data budget on the mini-ImageNet benchmark.

Figure 9 .
Figure 9. Evaluating the effectiveness of each isolated component and of different combinations of the three components (PID, CFD, and PC) for cross-domain few-shot transfer. (a) Mean accuracy across different datasets and shot levels for multi-source FSL. (b) Mean accuracy across different datasets and shot levels for single-source FSL. The full adaptation achieves the best mean accuracy in both settings.

• Fixed-MSR: directly leveraging the frozen multi-source representations with the NNC baseline.
• Ft-LC: fine-tuning a linear classification layer on each frozen representation head.
• Ft-CC: fine-tuning a cosine classification layer on each frozen representation head.
• Ft-MSR-LC: fine-tuning both the multi-source representations (projection layers) and the subsequent linear classification layers.
• Ft-MSR-CC: fine-tuning both the multi-source representations (projection layers) and the subsequent cosine classification layers.

Figure 10 .
Figure 10. The t-SNE visualization of the feature distributions on the multi-head representations before and after adaptation by our method. (a-f) show the six representation spaces learned on the BSCD-FSL benchmark. Best viewed in color with zoom-in.

Figure 11 .
Figure 11. Visualization using class activation maps (CAMs) showing the regions that the deep networks focus on before and after the adaptation. Image examples are from a 5-shot task in CropDisease. Best viewed in color.

Figure 5. Source domains (mini-ImageNet, CIFAR100, CUB, Caltech256, DTD) and target domains with unseen classes (CropDiseases, EuroSAT, ISIC, ChestX), ordered by decreasing similarity to the source.
The following multi-domain learning methods are compared: • Union-CC [62]: a baseline method that trains a single feature extractor on the union of all training data with the cosine classifier and tests it with the NNC classifier. • Ensemble: a baseline method that trains separate feature extractors on each dataset and tests with the average prediction of the NNC classifiers built on them.
• URL+Ad [29]: an adaptation method that attaches a pre-classifier feature mapping (a linear layer) to the URL and optimizes it with the few-shot data.

Table 1 .
The results of multi-source few-shot learning on the BSCD-FSL benchmark. The best results are marked in bold.

Table 2 .
Comparison with previous methods on mini-ImageNet. Our LAMR is trained following the optimal splitting strategy selected on the validation set, enabling a fair comparison with other methods under the same training data budget. All methods use ResNets as the feature backbone. The best result in each setting is marked in bold.

Table 4 .
Comparing the transferability of feature representations trained by different models. The best result in each setting is marked in bold.

Table 5 .
Ablation study on 5-way K-shot performance validating the three components of the adaptation objective: PID (parametric instance discrimination), CFD (class feature discrimination), and PC (prototypical classification). The best result in each setting is marked in bold.

Table 6 .
Quantitative analysis of different classifiers that are incorporated into our pre-trained multi-source representations during the meta-test stage.