Revisiting the Transferability of Few-Shot Image Classification: A Frequency Spectrum Perspective

Few-shot learning, especially few-shot image classification (FSIC), endeavors to recognize new categories using only a handful of labeled images by transferring knowledge from a model trained on base categories. Despite numerous efforts to address the challenge of deficient transferability caused by the distribution shift between the base and new classes, the fundamental principles remain a subject of debate. In this paper, we elucidate why a decline in performance occurs and what information is transferred during the testing phase, examining both from a frequency spectrum perspective. Specifically, we adopt a causal view of the frequency space for FSIC. Under our causal assumption, non-causal frequencies (e.g., background knowledge) act as confounders between causal frequencies (e.g., object information) and predictions. Our experimental results reveal that different frequency components represent distinct semantics, and non-causal frequencies adversely affect transferability, resulting in suboptimal performance. Subsequently, we suggest a straightforward but potent approach, namely the Frequency Spectrum Mask (FRSM), to weight the frequency components and mitigate the impact of non-causal frequencies. Extensive experiments demonstrate that the proposed FRSM method significantly enhances the transferability of the FSIC model across nine testing datasets.


Introduction
Few-shot image classification (FSIC) endeavors to identify unlabeled images within the query set belonging to novel classes, leveraging knowledge acquired from base classes with a few labeled images in the support set [1,2]. Recently, numerous methodologies have emerged to address this challenge, and they are mainly divided as follows. (1) Fine-tuning-based methods [3-5] tackle the problem by learning to transfer. They follow the standard machine learning or transfer learning [6-8] procedure to pretrain transferable knowledge and then test-tune that knowledge in FSIC episodes sampled from novel classes. (2) Metric-based methods [9,10] solve the problem by learning to compare. They calculate the similarity between the unlabeled query images and the labeled support images. Finally, (3) meta-based methods [11-13] address the problem by learning to learn. They learn a good model initialization for fast adaptation to novel classes.
When distribution shifts exist between the base classes (the training dataset) and novel classes (the testing dataset), it is common for the transferability of the model trained on the base classes to diminish. In Figure 1, we scrutinize the reasons behind the performance declines resulting from distribution shifts by employing T-SNE [14] visualization. This analysis is conducted using a ResNet12 trained on the miniImageNet-train dataset. In Figure 1a-d, we can observe that Figure 1a demonstrates superior class clustering performance, which can be attributed to the training dataset closely resembling the testing dataset. Recently, the majority of works have been dedicated to addressing the issue of distribution shift [15-17]. However, these works predominantly concentrated on enhancing transferability by introducing diverse regularizations or larger model architectures, often leading to increased memory and time consumption in exchange for marginal performance improvements [18,19]. The primary challenge of these works stems from a lack of clarity regarding why FSIC performance diminishes and what information is transferred during distribution shifts. In this paper, we seek to elucidate the underlying mechanism behind the remarkable transferability observed in the FSIC problem. We approach the problem by constructing a causal graph of FSIC from a frequency spectrum perspective, motivated by both theoretical and experimental evidence suggesting that the frequency space contains more distinguishable semantic information than the feature space, as established in Figure 2 and prior studies [20,21]. Figure 2 showcases the average amplitudes of the eigenfrequencies across four testing datasets using a pretrained model, vividly illustrating pronounced distinctions in the frequency space among the testing datasets. The causal graph of FSIC within the frequency space delineates the relationships among the causal frequency (e.g., object information), non-causal frequency (e.g., background knowledge), and prediction. As illustrated in Figure 3a, our causal assumption posits the non-causal frequency as a
confounder between the causal frequency and prediction [22]. For example, consider a scenario where the dog images in the training data feature grass backgrounds. This poses a challenge during the testing phase when encountering a dog image with a water background, as the inconsistent background information hampers recognition. Consequently, the presence of confounding information stands as a primary factor influencing the performance of FSIC under distribution shifts. Notably, the causal frequency, representing domain-invariant knowledge, should be effectively transferred from the training to the testing datasets.
We introduce a straightforward yet potent method called the Frequency Spectrum Mask (FRSM) to weight the influences of the causal frequency components. Through empirical validation, we benchmark the FRSM against state-of-the-art (SOTA) methods across nine FSIC datasets, demonstrating its efficacy. Our primary contributions can be outlined as follows:
• We clarify why there is a decline in performance and what information is transferred during the distribution shift in the few-shot image classification task from a frequency spectrum perspective.

• We adopt a causal perspective of few-shot image classification to demonstrate that non-causal frequencies impact transferability and introduce a straightforward yet efficient method, the FRSM, to weight frequencies.

Related Work

Few-Shot Image Classification
Many works have been proposed in recent years to solve the FSIC problem. These methods are broadly categorized into three groups. (1) Fine-tuning-based methods employ a non-episodic paradigm to pretrain the model on base classes and then test-tune the pretrained model on novel classes. For example, Baseline and Baseline++ [3] propose using a fully connected layer or cosine distance, respectively, to calculate the classification loss. SKD [23] further employs rotation-based self-supervised learning during pretraining to enhance the feature extraction process. (2) Metric-based methods aim to learn a discriminative embedding space by leveraging a learned distance metric. For example, ProtoNet [10] performs classification by measuring the Euclidean distance between the query and the prototype representation of each class. RelationNet [24] calculates the distance between the query and the prototype by leveraging a relational module. (3) Meta-based methods focus on learning how to optimize a model through bi-level optimization. In particular, MAML [11] quickly adapts to novel classes with only a few labeled images and a small number of gradient updates. LEO [12] learns a stochastic latent space from high-dimensional parameters. Our method tends toward the fine-tuning-based methods due to their simplicity and effectiveness.
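The metric-based recipe described above is simple enough to sketch directly. Below is a minimal, illustrative NumPy sketch of ProtoNet-style classification (class prototypes as support-set means, nearest-prototype assignment by Euclidean distance); the function names are ours and not from any of the cited works.

```python
import numpy as np

def prototypes(support_feats, support_labels, n_classes):
    """Mean embedding of each class's support features (the ProtoNet prototype)."""
    return np.stack([support_feats[support_labels == c].mean(axis=0)
                     for c in range(n_classes)])

def classify(query_feats, protos):
    """Assign each query to the class with the nearest prototype (Euclidean distance)."""
    # dists[i, c] = ||query_i - proto_c||
    dists = np.linalg.norm(query_feats[:, None, :] - protos[None, :, :], axis=-1)
    return dists.argmin(axis=1)
```

In a real few-shot pipeline the inputs would be CNN feature embeddings rather than raw vectors, but the comparison step is exactly this nearest-prototype rule.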

Frequency Spectrum Learning
In conventional image processing, frequency analysis has been widely studied for years. Recently, it has been incorporated into deep learning methods. The frequency spectrum has garnered considerable attention from researchers due to its utility in analyzing and interpreting the behavior of deep neural networks. For example, in [25], the authors observed that convolutional neural networks (CNNs) exhibit a pronounced bias toward recognizing textures rather than shapes. Other research efforts have aimed at enhancing the generalization capability of CNNs by leveraging insights from the frequency spectrum. Notably, FACT [26] proposes forcing the model to capture phase information under the assumption that phase information is more robust to distribution shifts. FSDR [27] improves generalization by randomizing images in the frequency space, preserving domain-invariant frequency components while randomizing only the domain-variant ones. Diverging from prior research, ours centers on analyzing the transferability of the FSIC problem across base and novel classes through frequency analysis.

Methodology
In this section, we begin by delineating the problem formulation of few-shot image classification. Following this, we examine the problem through the lens of causal analysis from a frequency spectrum perspective, which allows us to identify the frequency components that significantly influence classification outcomes, especially when limited data are available. Lastly, we introduce a straightforward yet potent technique, the frequency spectrum mask (FRSM), which strategically adjusts the weights of frequency components to enhance the model's capability to emphasize the most pertinent features for classification. By implementing the FRSM, we aim to mitigate the impact of irrelevant frequency noise and amplify the contribution of crucial frequencies, ultimately improving accuracy and robustness in few-shot learning scenarios.

Few-Shot Image Classification
Several approaches have been proposed to tackle the few-shot image classification (FSIC) problem. Among these, fine-tuning-based methods have emerged as particularly effective due to their simplicity and notable efficacy; therefore, we developed the FRSM based on fine-tuning techniques. These methods typically adhere to the standard transfer learning procedure, which comprises two phases: a pretraining phase, using the training set D_train, and a test-tuning phase, employing the testing set D_test. Additionally, a validation set D_val is utilized for model selection during the test-tuning phase.
During the pretraining phase, the entire training set D_train is employed to pretrain the feature extractor and classifier on the base classes C_base using the standard cross-entropy loss L_ce as follows:

L_ce = -(1/N) Σ_{i=1}^{N} y_i · log g_ω(f_θ(x_i)),    (1)

where f and g are the feature extractor and classifier, parameterized by θ and ω, respectively. The pairs {x_i, y_i} represent images and their corresponding labels from the training set, denoted as D_train = {x_i, y_i}_{i=1}^{N}, where x_i ∈ R^{3×H×W}, N is the total number of images, and H and W represent the height and width of the images, respectively, with the one-hot labeling vector y_i ∈ {0, 1}^{|C_base|}.
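As a rough illustration of this objective, the following NumPy sketch computes the softmax cross-entropy loss over classifier logits; the helper names are ours, and a real pipeline would of course run a CNN feature extractor and backpropagate through it.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch; `labels` are integer class indices."""
    probs = softmax(logits)
    n = logits.shape[0]
    return -np.log(probs[np.arange(n), labels]).mean()
```

When the logits strongly favor the correct class the loss is near zero, and it grows large when they favor a wrong class, which is what drives the pretraining of f and g.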
During the test-tuning phase, we sample a C-way K-shot episode from the testing set D_test containing the novel classes C_novel. Each episode T comprises a support set S and a query set Q, denoted as T = ⟨S, Q⟩, where S = {x_i, y_i}_{i=1}^{CK} represents the support samples and Q = {x_i, y_i}_{i=1}^{CM} denotes the query samples, with CK and CM denoting the numbers of support and query samples, respectively. A new C-class classifier is relearned from S in every episode. The pretrained embedding parameter θ is kept fixed to avoid overfitting, because there are limited labeled data in S. Once the novel classifier is learned, the labels of Q can be predicted.
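The episode construction described above can be sketched as follows; this is an illustrative NumPy implementation of C-way K-shot sampling, with function and variable names of our own choosing.

```python
import numpy as np

def sample_episode(features, labels, n_way=5, k_shot=5, m_query=15, rng=None):
    """Sample a C-way K-shot episode: K support and M query images per class.

    Classes are relabeled 0..n_way-1 within the episode, as a freshly
    relearned C-class classifier requires.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for new_label, c in enumerate(classes):
        idx = rng.permutation(np.flatnonzero(labels == c))
        support += [(features[i], new_label) for i in idx[:k_shot]]
        query += [(features[i], new_label) for i in idx[k_shot:k_shot + m_query]]
    return support, query
```

A 5-way 5-shot episode with 15 queries per class thus yields 25 support and 75 query samples, matching the CK and CM counts above.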

A Causal Graph of FSIC
We used a causal graph of FSIC to construct a structural causal model (SCM) in Figure 3a. It shows the causal relationships among five variables: the input data X, the non-causal frequency F_1, the causal frequency F_2, the frequency F, and the prediction (or label) Y, where a link from one variable to another indicates a cause-and-effect relationship (cause → effect). We explain the links of the proposed SCM as follows:
• X → F_1, X → F_2. The variable F_2 signifies the causal frequency that genuinely represents the inherent characteristics of the input data X, such as the object details in an image. Conversely, F_1 denotes the non-causal frequency, typically resulting from biases in the data or superficial patterns, such as the background details of an image. Given that F_2 and F_1 coexist within the input data X, these causal relationships are established.
• (F_1, F_2) → F. The variable F corresponds to the frequency representation of the input data X; that is, F = FFT(f_θ(X)), with FFT denoting the fast Fourier transform. To produce F, the traditional learning approach uses both the non-causal frequency F_1 and the causal frequency F_2 to extract discriminative features.
• F → Y. The primary aim of learning via the frequency spectrum is to ascertain the attributes of the input data X. The classifier g_ω determines the prediction Y based on the frequency F; specifically, Y = g_ω(iFFT(F)), where iFFT stands for the inverse fast Fourier transform.
When scrutinizing our proposed SCM, we acknowledge the role of the non-causal frequency F_1 as a confounding factor between F_2 and Y, stemming from the existence of a backdoor path between F_2 and Y; specifically, F_2 ← X → F_1 → F → Y. Even though F_2 lacks a direct connection to Y, the presence of this backdoor path can lead F_2 to exhibit a spurious correlation with Y, resulting in erroneous predictions based on the non-causal frequency F_1 instead of the causal frequency F_2. Hence, it is imperative to enhance the transferability of FSIC by adjusting the frequencies and mitigating the influence of the non-causal frequency. In this paper, we propose a straightforward yet effective method, the FRSM, for assigning weights to each frequency.

Frequency Spectrum Mask
In this section, we introduce our FRSM to weight each frequency. Figure 3b illustrates the overall learnable process of the FRSM. Note that the FRSM is used in the test-tuning phase with the novel classes C_novel, based on a frozen feature extractor f_θ pretrained on the base classes C_base. The reason for this is that the distribution shift between the base and novel classes prevents a feature extractor pretrained on the base classes from extracting features well suited to the novel classes; therefore, we need to weight the extracted features to meet the requirements of the novel classes. Following previous works [20,26], the FRSM is developed in the frequency space, as the frequency space carries more distinguishable semantic information than the feature space.
In the test-tuning phase, firstly, a C-way K-shot episode T = ⟨S, Q⟩ is sampled from the novel classes C_novel. Then, for each episode, the few labeled support images S = {x_i, y_i}_{i=1}^{CK} are used to learn a new classifier g_ω and our FRSM M_m, parameterized by m, on top of the frozen feature extractor f_θ. Finally, the unlabeled query images Q are used to evaluate the performance of the test-tuned classifier g_ω and FRSM M_m. Next, we give a detailed description of the test-tuning phase.
We used the frozen f_θ to extract the features Z_S of each image in S (i.e., Z_S = {z_i}_{i=1}^{CK} with z_i = f_θ(x_i) and z_i ∈ R^{C×H×W}, where C is the number of channels and H and W are the height and width of the feature representations, respectively). For each image feature, we transformed it into the frequency space to find the corresponding frequency representation z_i^F with a fast Fourier transform (FFT), which is formulated as follows:

z_i^F(h_F, w_F) = Σ_{h=1}^{H} Σ_{w=1}^{W} z_i(h, w) e^{-j2π(h·h_F/H + w·w_F/W)},    (2)

where (h, w) ∈ (H, W) is the height and width pair in the feature space and (h_F, w_F) ∈ (H_F, W_F) is the corresponding height and width pair in the frequency spectrum space. Following the literature [26], the frequency representation z_i^F includes the amplitude z_i^A and phase z_i^P components, which are calculated as follows:

z_i^A = sqrt( R(z_i^F)^2 + I(z_i^F)^2 ),    (3)
z_i^P = arctan( I(z_i^F) / R(z_i^F) ),    (4)

where R(z_i^F) and I(z_i^F) represent the real and imaginary parts of z_i^F(h_F, w_F), respectively, and arctan is the inverse tangent function. Since z_i^A represents the magnitude of each frequency, it is crucial to assign appropriate weights to it for accurate representation and analysis. To achieve this, we use the proposed FRSM, denoted as M_m, to weight z_i^A:

ẑ_i^A = M_m ⊙ z_i^A,    (5)

where M_m ∈ R^{C×H×W} is initialized to one and ⊙ is the element-wise product. The weighted amplitude ẑ_i^A is combined with the original phase z_i^P to form a new frequency representation ẑ_i^F:

ẑ_i^F = ẑ_i^A ⊙ e^{j·z_i^P}.    (6)

The new frequency representation ẑ_i^F can be transferred back to the original feature space via an inverse fast Fourier transform (iFFT):

ẑ_i = iFFT(ẑ_i^F).    (7)

The new feature representation ẑ_i replaces the previous feature representation z_i in Equation (1). Then, the cross-entropy loss L_ce is calculated to update the new classifier g_ω and our proposed FRSM M_m.
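The FFT → amplitude/phase split → amplitude masking → iFFT pipeline described above can be sketched in NumPy as follows. Here the mask is passed in as a plain array standing in for the learnable M_m, and numpy.fft stands in for the FFT/iFFT operations applied to a feature map.

```python
import numpy as np

def frsm_forward(feat, mask):
    """Weight the amplitude spectrum of a feature map with a mask.

    feat: real array of shape (C, H, W); mask: non-negative weights, same shape.
    """
    z_f = np.fft.fft2(feat, axes=(-2, -1))          # per-channel 2D FFT
    amp, phase = np.abs(z_f), np.angle(z_f)         # amplitude / phase decomposition
    amp_hat = mask * amp                            # element-wise re-weighting
    z_hat = amp_hat * np.exp(1j * phase)            # recombine with the original phase
    return np.fft.ifft2(z_hat, axes=(-2, -1)).real  # back to the feature space
```

With a mask of all ones the input feature is recovered exactly, which is a useful sanity check given that M_m is initialized to one: the method starts from an identity transform and then learns which frequencies to down-weight.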

Experiment
In this section, the following research questions guide our experiments. Q1: Why does transferability decrease as the distribution shift between the base and novel classes increases? Q2: What information should be transferred from the base classes to the novel classes? Q3: How effective is our FRSM for few-shot image classification across all testing datasets?

Experimental Set-Up
Datasets. In our experiments, for the training dataset D_train, we selected the train split of miniImageNet [9] to pretrain the feature extractor due to the diversity and complexity of its images. This training dataset is also named miniImageNet-train. For the testing dataset D_test, we chose nine benchmarks, including two subsets of ILSVRC-2012 [28] (i.e., the test split of miniImageNet, named miniImageNet-test, and tieredImageNet), all evaluation datasets of cross-domain few-shot learning (CDFSL) [29] (i.e., EuroSAT, CropDisease, ChestX, and ISIC), and the evaluation benchmarks proposed by Tseng et al. [15] (i.e., CUB, Cars, and Plantae). All images in the above datasets were resized to 84 × 84 pixels, and data augmentation was used. For more details, please refer to LibFSL [30] and Table 1.
Baseline. To evaluate the effectiveness of our FRSM method, we selected seven classic and effective methods for comparison: Baseline [3], Baseline++ [3], and SKD [23] from the fine-tuning-based methods; ProtoNet [10] and RelationNet [24] from the metric-based methods; and MAML [11] and LEO [12] from the meta-based methods, where SKD and LEO recognize the novel classes based on the model pretrained in the training phase.
Experimental details. Following the literature [30], we adopted two different feature extractors: Conv64F (see Table 2) and ResNet12 (see Figure 1). In the test-tuning phase, we used the pretrained feature extractor from [23]. We utilized the SGD optimizer with a momentum of 0.9 and a weight decay of 1e−3. The learning rates for the relearned classifier and our FRSM were initialized to 1e−2 and 3, respectively. For each testing dataset, we randomly sampled 600 episodes, and each episode contained 5 classes. Each class had 5 or 10 labeled images (support set) and an additional 15 unlabeled images (query set) for performance evaluation, formulating the 5-way 5-shot or 5-way 10-shot episode.
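For illustration, one step of the SGD-with-momentum update described above (momentum 0.9, weight decay 1e−3) can be written as follows. This is a standard sketch of the update rule rather than the exact optimizer implementation; the precise ordering of the momentum and weight-decay terms varies between deep learning frameworks.

```python
import numpy as np

def sgd_step(param, grad, velocity, lr, momentum=0.9, weight_decay=1e-3):
    """One SGD update with momentum and L2 weight decay.

    The L2 penalty is folded into the gradient, then the velocity
    accumulates an exponential average of past gradients.
    """
    grad = grad + weight_decay * param
    velocity = momentum * velocity + grad
    return param - lr * velocity, velocity
```

Works on scalars or NumPy arrays alike; in practice one would keep two parameter groups here, one for the classifier (lr 1e−2) and one for the FRSM mask (lr 3).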

Experimental Results
In this section, we run experiments to answer the following questions.
Q1. The transferability from base to novel classes. The experimental results in Figure 4 aim to answer Q1. In Figure 4, we show the similarity of the testing datasets to the training dataset according to the amplitude ratio. The testing datasets' similarity to the training dataset was ordered as follows: miniImageNet-test > Cars > Plantae > EuroSAT > ISIC > ChestX.
This similarity is known to affect how well the features learned on the training dataset transfer to the testing datasets. Overall, we found that dataset similarity and few-shot difficulty jointly led to performance degradation.
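A simplified NumPy sketch of the amplitude-ratio computation underlying this comparison (given batches of single-channel feature maps from two datasets) is shown below; the function names are our own, and the real analysis operates on features produced by the pretrained extractor.

```python
import numpy as np

def mean_amplitude(feats):
    """Average amplitude spectrum over a batch of (H, W) feature maps,
    shifted so low frequencies sit at the center of the spectrum."""
    amps = np.abs(np.fft.fft2(feats, axes=(-2, -1)))
    return np.fft.fftshift(amps.mean(axis=0))

def amplitude_ratio(test_feats, train_feats, eps=1e-8):
    """Per-frequency ratio of a testing dataset's mean amplitude to the
    training dataset's; values near 1 indicate similar spectra."""
    return mean_amplitude(test_feats) / (mean_amplitude(train_feats) + eps)
```

Averaging this ratio map per dataset gives a single similarity score, which is the quantity used to order the testing datasets above.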
Q2. The information from base to novel classes. We conducted experiments on the testing datasets, with the results shown in Figure 2. We found that the high-frequency components across different datasets were more similar than the low-frequency components. Therefore, compared with the low-frequency components, the high-frequency components contributed more to transferability.

Q3. The performance of the FRSM.
To answer Q3, we conducted experiments on eight testing datasets. From Table 2, which reports the average classification accuracy, we have the following findings. (1) Our proposed FRSM achieved the best performance in most settings. This is because the FRSM assigned large weights to the causal frequencies and low weights to the non-causal ones, thereby avoiding the influence of the confounder on transferability. (2) Surprisingly, our FRSM was weaker than the baseline on testing data relatively similar to the training data (e.g., tieredImageNet). A possible reason is that the confounder is generally simpler to learn. For example, learning the backgrounds of images is easier than learning the foreground objects, and shared backgrounds can aid transfer from similar base classes to novel classes. Our FRSM suppressed the confounder and thus led to a slight performance degradation in this case. Nevertheless, we should focus more on testing datasets with larger distribution shifts, because this case is more in line with real-world environments.

Conclusions
In this paper, we explored the underlying reasons for the decline in performance and transferability encountered when facing distribution shifts between base and novel classes. We approached this investigation from the perspective of frequency spectrum analysis. Furthermore, we clarified which information ought to be transferred from base to novel classes. Through a causal perspective on FSIC, we demonstrated that non-causal frequencies can profoundly influence transferability as confounding factors. Therefore, we proposed a straightforward yet effective method, the FRSM, to dynamically weight each frequency using a learnable paradigm. This approach mitigates the influence of confounders and enhances transferability. Extensive experiments demonstrated that our proposed FRSM method achieved new state-of-the-art results.

Figure 1. T-SNE [14] visualization of four datasets utilizing a model pretrained on the training set of miniImageNet (i.e., miniImageNet-train). Here, miniImageNet-test denotes the testing split of miniImageNet. From left to right, class clustering performance declines, which can be attributed to the decreasing similarity between the training and testing datasets.

Figure 2. The average amplitudes of the eigenfrequencies across four datasets, applying a model pretrained on the training set of miniImageNet (i.e., miniImageNet-train). Here, miniImageNet-test denotes the testing split of miniImageNet. Proximity to the center signifies a low-frequency component, while distance from the center indicates a high-frequency component.

Figure 3. (a) A causal look at FSIC from the frequency spectrum perspective. (b) The test-tuning process. The FRSM weights the frequencies learned from the novel classes in the testing (fine-tuning) phase, which uses a frozen feature extractor pretrained on the base classes.

Figure 4. The average amplitude ratio between six testing datasets and the training dataset (i.e., miniImageNet-train).
• The experimental results indicate that the proposed FRSM method achieves superior performance compared with representative state-of-the-art methods in the few-shot image classification task.

Table 1. Summary of the testing datasets used in this paper. For each dataset, we picked 5 classes and show illustrative images.

Table 2. Few-shot image classification average accuracy (%) on eight testing datasets under the 5-way 5-shot and 5-way 10-shot settings. The model was trained on miniImageNet. The best results are in bold.