Few-Shot Image Classification via Mutual Distillation

Abstract: Due to their compelling performance and appealing simplicity, metric-based meta-learning approaches are gaining increasing attention for few-shot image classification. However, many such methods employ intricate network architectures, which can potentially lead to overfitting when trained with limited samples. To tackle this concern, we propose using mutual distillation to enhance metric-based meta-learning, effectively bolstering model generalization. Specifically, our approach involves two individual metric-based networks, such as prototypical networks and relation networks, which mutually supply each other with a regularization term. This method seamlessly integrates with any metric-based meta-learning approach. We conduct comprehensive experiments on two prevalent few-shot classification benchmarks, namely miniImageNet and Caltech-UCSD Birds-200-2011 (CUB), to demonstrate the effectiveness of our proposed algorithm. The results show that our method efficiently enhances each metric-based model through mutual distillation.


Introduction
Over the past decade, significant strides have been made in the field of image classification, propelled by notable advancements in deep neural networks [1,2]. However, their performance is contingent upon access to extensive labeled datasets for training. Real-world applications often grapple with the challenge of obtaining large labeled datasets, especially in scenarios involving the identification of new species or rare diseases. In comparison, humans demonstrate the ability to generalize knowledge and swiftly adapt to new categories, often requiring only a limited set of instances. Few-shot learning, as a field, is directly influenced by this efficient learning paradigm. It aims to categorize previously unseen data instances (query samples) into novel categories using a restricted number of labeled instances within each class (support samples) [3].
Meta-learning emerges as a highly promising strategy to tackle the challenges in few-shot image classification. This approach entails extracting meta-knowledge from a sequence of similar tasks, empowering models to rapidly adapt to novel tasks even with a scarcity of labeled samples. These meta-learning approaches can be classified into three primary types based on the diverse forms of extracted meta-knowledge: (1) model-based methods [4,5], which process the support samples in sequence to alter the internal state of the base model for directly predicting the categories of the query samples; (2) optimization-based methods [6], which target an optimization process at the meta level for fine-tuning the base learner for swift and robust adaptation to the new task; (3) metric-based methods [7], which adhere to the concept of acquiring a proficient embedding space to represent the image data and depend on a similarity metric to differentiate images from diverse categories. All three types of meta-learning methods have collectively propelled the field of few-shot image classification forward.
In recent years, metric-based meta-learning techniques have demonstrated increasing effectiveness in tackling the challenges associated with few-shot image classification. These approaches provide an alluring combination of straightforwardness and exceptional efficiency [8]. The first method used in this field was Siamese networks [9], which use paired networks to represent support and query images individually, using a weighted L1 distance metric to measure their similarity. Following that, Vinyals et al. [10] presented matching networks, which integrate memory and attention mechanisms to extract unique characteristics for support and query samples. They additionally implemented a fully differentiable KNN classifier to generate predictions about the categories of query samples. Expanding upon this groundwork, Snell et al. [11] introduced prototypical networks, an extension of the matching networks framework. The underlying assumption was that samples belonging to the same class tend to cluster around a single prototype representation. Query samples were then classified according to their closeness to the nearest prototype. In contrast to the previously mentioned approaches, Sung et al. [12] presented relation networks, which simultaneously learn feature embeddings and a non-linear metric. These prominent models highlight the two essential elements of metric-based meta-learning: the feature embedding and the similarity metric. It is worth mentioning that recent progress in this domain has resulted in substantial enhancements to both [13-16].
The above approaches have greatly advanced few-shot image classification. However, the key issue with few-shot learning remains that the data in each class are insufficient to express a concept adequately. The training of end-to-end networks in these meta-learning approaches involves utilizing sampled mini-batches referred to as episodes. Within each episode, training is conducted with a limited number of labeled examples per class. Initially, these approaches utilized backbone networks with low complexity, such as four-layer convolutional networks, to mitigate the risk of overfitting. To improve classification performance, more recent work has relied on higher-complexity networks as the base learner to enhance feature representation. Networks such as ResNet12 and WRN28 are increasingly being employed as backbones in metric-based meta-learning techniques. As network complexity escalates, the search space for network parameters expands correspondingly, posing a heightened risk of overfitting. This issue becomes particularly pronounced for models that learn both a feature embedding and a non-linear metric, such as relation networks.
One effective way to address the above issue is regularization. This paper uses knowledge distillation [17] to achieve regularization for metric-based meta-learning methods. The applied technique is response-based knowledge distillation, wherein the neural response from the final output layer of the teacher model is leveraged. This approach has become prevalent in the literature, showcasing its effectiveness in refining model predictions. Recent literature recognizes soft targets, the widely adopted response-based knowledge for image classification, as an effective regularizer [18]. Typically, a vanilla knowledge distillation framework comprises one or more large pre-trained teacher models alongside a compact student model [19]. Due to the lack of a pre-trained teacher for metric-based models, this widely used framework is not directly applicable to them. Consequently, we propose Mutual Distillation of Metric-based Meta-learning (MDMM) to enhance their performance. More precisely, it is composed of two separate metric-based networks that supply each other with a regularization term. The objective of the regularization term is to minimize the Kullback-Leibler divergence between the predictions of the two networks. As illustrated in Figure 1, because the decision boundaries of the networks lie near the optimal boundary, the information exchange between the two networks drives each network in mutual learning to refine its decision boundary toward the optimal one. To sum up, our contributions are: (1) We use response-based knowledge distillation to regularize metric-based meta-learning methods. To the best of our knowledge, we are the first to integrate mutual distillation within the metric-based meta-learning framework for few-shot learning; (2) We implement mutual distillation between two distinct and well-established models with varying parameters, such as prototypical networks and relation networks. This mutual learning approach, implemented through distillation, effectively enhances the models' generalization under few-shot conditions; (3) Our approach achieves state-of-the-art performance across two benchmarks in few-shot learning research, underscoring that each metric-based model can achieve excellent performance when supported by mutual distillation.

Related Work
The task of few-shot image classification focuses on recognizing novel categories with a limited number of instances available for each class. Recent research has mainly addressed this challenge through meta-learning, which has strong generalization capabilities. In essence, meta-learning acquires meta-knowledge from analogous tasks to aid in handling previously unseen tasks. These methods can be broadly categorized into three groups depending on the nature of their meta-knowledge: (1) model-based techniques; (2) optimization-based techniques; and (3) metric-based techniques. Our study falls into the third category.
Metric-based meta-learning methods have a primary objective: to learn an embedding space where samples from the same class exhibit proximity, while samples belonging to distinct classes are positioned distantly, facilitating effective few-shot learning. Siamese networks [9] play a crucial role in this process by comparing the likeness between a query image and support images within the acquired embedding space. These networks undergo training using a triplet loss, which emphasizes the relative distances between the anchor (query image), positive, and negative samples. Matching networks [10] employ attention LSTM and bidirectional LSTM to extract features from both support and query sets. The classification task is then facilitated by comparing the extracted features using cosine similarity. Prototypical networks [11] improved upon matching networks by computing the dissimilarity between a query image and the prototype of each class with the Euclidean distance. Different from the above works, which used a pre-specified distance metric, relation networks [12] were proposed to learn a deep non-linear metric for comparing the connection between the query image and support images in the embedding space.
From the above well-known models, we can see that metric-based methods are characterized by two fundamental elements: a feature extractor and a metric module. A great deal of work has been carried out around these two key aspects. On one hand, ongoing research is directed toward enhancing the discriminative representations generated by the feature extractor. Wu et al. [20] introduced a deformable feature extractor to address the sensitivity of CNN-based networks to spatial location relationships between semantic objects in comparative images. This approach is complemented by a dual correlation attention mechanism designed to enhance local connectivity in the extracted features, contributing to improved discriminative power. Hou et al. [21] introduced a cross-attention module that produces cross-attention maps for each pair of class features and query sample features. This approach accentuates target object areas, thereby augmenting the discriminative capabilities of the extracted features. Li et al. [13] introduced a Category Traversal Module, a mechanism aimed at discerning features relevant to the task by considering both intra-class commonality and inter-class distinctiveness within the feature space. Simon et al. [22] introduced the deep subspace network (DSN), showcasing its effectiveness in generating expressive representations across a broad spectrum of supervised few-shot image classification tasks. Li et al. [8] introduced an adaptive margin principle for learning more discriminative embedding spaces with better generalization ability. Wu et al. [23] introduced a novel embedding structure that encodes relative spatial relationships between features, achieved through the application of a capsule network.
Conversely, the choice of metric plays a crucial role in metric-based methods. For instance, Li et al. [24] presented DN4, wherein the conventional measure based on image-level features in the final layer is replaced with a local descriptor-based image-to-class measure. This modification enhances the metric's sensitivity to specific local features, potentially leading to improved performance in capturing fine-grained details. As the traditional Euclidean distance used in prototypical networks is sensitive to the correlation of the features, Bateni et al. [25] used the Mahalanobis distance for classification. Nguyen et al. [26] proposed SEN, which modifies the Euclidean distance by combining it with a norm distance to mitigate the curse of dimensionality in high-dimensional spaces. Zhang et al. [27] utilized the Earth Mover's Distance (EMD) as a metric for computing structural distances among dense image representations. The EMD, along with a cross-reference mechanism, facilitates distance computation by taking into account the inherent structure of the data. This choice of metric proves particularly useful in scenarios with cluttered backgrounds and substantial intra-class appearance variations.
Prior works in few-shot image classification commonly rely on complex backbone networks, such as ResNet12, to improve feature representation. However, the utilization of such sophisticated networks can pose challenges, particularly in scenarios with extremely limited support samples per class, leading to potential overfitting issues. Therefore, we propose the Mutual Distillation of Metric-based Meta-learning (MDMM) to improve their generalization performance. The technique involves two individual metric-based networks, such as prototypical networks and relation networks. These networks mutually supply each other with a regularization term, introducing a form of knowledge distillation for mutual learning. Our approach is distinguished by its simplicity and effectiveness, making it easy to integrate with any metric-based meta-learning method.

Problem Definition
The training methodology of this study follows standard few-shot learning principles. Initially, a comprehensive labeled dataset D is partitioned into three exclusive subsets, D_train, D_val, and D_test, ensuring no overlap in labels. Adopting the widely acknowledged episodic paradigm, the study leverages its effectiveness in the existing literature to facilitate knowledge transfer. Each training episode comprises two key components: (1) the support set S_train is assembled by randomly sampling K examples from each of N classes in D_train; here, S_train = {S_n}_{n=1}^{N} denotes the N classes, each comprising K labeled support samples; (2) the query set Q_train supplements this process by encompassing the remaining images from the same N classes. The primary training objective is N-way K-shot classification, wherein the model classifies query samples into one of the N distinct classes using the K labeled support samples allocated for each class. The training process involves the systematic generation and execution of multiple episodes until model convergence. Hyperparameter optimization utilizes D_val, and the model's performance is evaluated on D_test through N-way K-shot episodes. This configuration, recognized as N-way K-shot classification under the episodic paradigm, stands as a well-established and effective approach in few-shot learning, facilitating generalization to new tasks with limited labeled examples.
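The episodic sampling described above can be sketched in a few lines of Python. This is a hypothetical helper for illustration only: the dataset layout (a dict mapping labels to image lists) and the function name are our own assumptions, not the paper's code.

```python
import random

def sample_episode(dataset, n_way=5, k_shot=1, m_query=15):
    """Sample one N-way K-shot episode from a {label: [images]} dict.

    Returns a support set of n_way * k_shot labeled samples and a
    query set of n_way * m_query labeled samples from the same classes.
    """
    classes = random.sample(sorted(dataset), n_way)
    support, query = [], []
    for label in classes:
        images = random.sample(dataset[label], k_shot + m_query)
        support += [(x, label) for x in images[:k_shot]]   # S_train
        query += [(x, label) for x in images[k_shot:]]     # Q_train
    return support, query

# toy dataset: 10 classes with 20 "images" (here just integer ids) each
data = {c: list(range(20)) for c in range(10)}
s, q = sample_episode(data, n_way=5, k_shot=1, m_query=15)
print(len(s), len(q))  # -> 5 75
```

Training then simply repeats this sampling step for each of the episodes until convergence.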

Preliminary: Metric-Based Meta-Learning
Metric-based methodologies in few-shot learning are designed with the primary aim of constructing a highly effective embedding space. This specialized space is crafted to discern images belonging to different categories, employing a specific metric for classification. These methodologies comprise two pivotal components, namely the Embedding Module f_φ and the Metric Module g_θ, which integrate seamlessly to form an architecture (Figure 2). The Embedding Module f_φ, often implemented using convolutional neural networks (CNNs), extracts feature representations from both query and support images. The Metric Module g_θ then applies a designated metric within this embedding space to effectively differentiate images from various categories. Within the procedure, a support image x_i, extracted from the i-th image in the n-th class, along with a query image x_j, sourced from the Q_train set, are fed into the Embedding Module. This operation produces the respective feature maps f_φ(x_i), f_φ(x_j) ∈ R^d, where d denotes the dimension of the feature space. After extracting features from query and support images, the assessment of similarity is a crucial step in few-shot image classification using nonparametric methods. Two prevalent approaches are commonly employed for this purpose: (1) the weighted-summation approach interprets similarity as a weighted summation of similarity scores between the query image x_j and the support images in each class S_n. Mathematically, the similarity score S_jn between the query input and the n-th support class is expressed as:

S_jn = Σ_{x_i ∈ S_n} a_ji · g_θ(f_φ(x_j), f_φ(x_i))    (1)

where g_θ is the metric assessing similarity, and a_ji is a weight proportional to the similarity score between the query input and the support images; (2) in contrast, the prototype-based approach characterizes a class by the mean, or prototype, of its support examples. The similarity S_jn between the query input and the prototype of each class is computed as:

S_jn = g_θ(f_φ(x_j), c_n),  where  c_n = (1/K) Σ_{x_i ∈ S_n} f_φ(x_i)    (2)
where g_θ is the metric assessing the similarity between the query input and the prototype c_n, computed as the mean of the support examples in S_n. The subsequent step applies the softmax function over the set of scores {S_jn} to derive the prediction for the query image:

P(y = n | x_j) = exp(S_jn) / Σ_{n'=1}^{N} exp(S_jn')    (3)

The loss function for each episodic training step is the cross-entropy over the query set:

L_CE = -(1/(N × M)) Σ_{j=1}^{N×M} log P(y = y_j | x_j)    (4)

where N × M denotes the total number of query examples in each training episode and y_j is the true label of x_j.
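As a minimal sketch, the softmax prediction and the episodic cross-entropy loss described above can be computed from precomputed similarity scores in pure Python. The function names are illustrative, and feature extraction and the metric itself are assumed to have already produced the score rows.

```python
import math

def predict(scores):
    """Softmax over the per-class similarity scores S_jn (Equation (3))."""
    m = max(scores)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def episode_loss(score_rows, labels):
    """Mean cross-entropy over the N x M query samples of one episode (Equation (4))."""
    total = -sum(math.log(predict(row)[y]) for row, y in zip(score_rows, labels))
    return total / len(labels)

# toy example: 2 query samples, 3 classes
rows = [[2.0, 0.5, 0.1], [0.1, 3.0, 0.2]]
print(round(episode_loss(rows, [0, 1]), 4))
```

In practice these operations run on tensors inside the training framework; the sketch only mirrors the per-query arithmetic.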

Mutual Distillation of Metric-Based Meta-Learning
To further enhance classification performance, we introduce mutual distillation for metric-based meta-learning methods. Our algorithm leverages a pair of metric-based networks and facilitates the exchange of information between them, improving the generalization capability of each. An overview of our framework for few-shot classification is depicted in Figure 3. Detailed explanations of our algorithm are provided as follows. In the mutual distillation framework, we start with a pair of metric-based models with distinct parameters, denoted as I_1 = {f_φ1, g_θ1} and I_2 = {f_φ2, g_θ2}. The mutual distillation of I_1 and I_2 occurs through episodic learning. In each training episode, both the query input x_j and the support set S_train = {S_n}_{n=1}^{N} are fed into I_1 and I_2 to generate their respective feature maps. The similarity score between the query input x_j and the n-th support class can be computed using either Equation (1) or Equation (2), as discussed in Section 3.2. Importantly, the two metric-based models in mutual distillation can use either different or the same type of similarity metric; we investigate both settings in our experiments. After the steps mentioned above, the predictions of the two individual networks, given by Equation (3) in Section 3.2, are denoted as P_1(y = n | x_j) and P_2(y = n | x_j).
As the two individual networks are trained to correctly predict the true labels of query images, the exchange of information between them is a crucial aspect. To quantify the information exchange, we utilize the Kullback-Leibler (KL) divergence of prediction probabilities between the two individual metric-based networks. The KL distance from P_1 to P_2 is calculated as follows [28]:

D_KL(P_1 ∥ P_2) = Σ_{n=1}^{N} P_1(y = n | x_j) log( P_1(y = n | x_j) / P_2(y = n | x_j) )    (5)

The KL distance from P_2 to P_1 is calculated analogously:

D_KL(P_2 ∥ P_1) = Σ_{n=1}^{N} P_2(y = n | x_j) log( P_2(y = n | x_j) / P_1(y = n | x_j) )    (6)

As metric-based methods increasingly employ more complex convolutional networks as the backbone feature extractor for few-shot learning, it becomes crucial to introduce regularization techniques to prevent overfitting. Since the decision boundaries of the paired individual networks are located around the optimal boundary, they can mutually provide each other with a regularization term. Therefore, the overall loss functions L(φ1, θ1) and L(φ2, θ2) for networks I_1 and I_2, respectively, are formulated as follows:

L(φ1, θ1) = L_CE1 + λ D_KL(P_2 ∥ P_1)    (7)

L(φ2, θ2) = L_CE2 + λ D_KL(P_1 ∥ P_2)    (8)

The hyper-parameter λ controls the impact of the regularization term in the loss functions. Gradient descent is employed to update the parameters of I_1 and I_2:

(φ1, θ1) ← (φ1, θ1) - γ ∂L(φ1, θ1)/∂(φ1, θ1)    (9)

(φ2, θ2) ← (φ2, θ2) - γ ∂L(φ2, θ2)/∂(φ2, θ2)    (10)

where γ is the learning rate.
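To make the loss combination concrete, here is a small pure-Python sketch of the KL regularization terms and the overall losses. The helper names are ours, per-query prediction distributions are assumed to be already available, and the pairing of KL directions follows the convention above (I_1 is regularized by the KL distance from P_2's predictions, and vice versa).

```python
import math

def kl(p, q, eps=1e-12):
    """Discrete KL divergence D_KL(p || q) between two prediction distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mutual_losses(ce1, ce2, preds1, preds2, lam=0.5):
    """Overall losses for the paired networks I_1 and I_2: each network's
    cross-entropy loss is regularized by the KL distance to its peer's
    predictions, averaged over queries and weighted by lambda."""
    m = len(preds1)
    reg1 = sum(kl(p2, p1) for p1, p2 in zip(preds1, preds2)) / m  # D_KL(P2 || P1)
    reg2 = sum(kl(p1, p2) for p1, p2 in zip(preds1, preds2)) / m  # D_KL(P1 || P2)
    return ce1 + lam * reg1, ce2 + lam * reg2
```

When the two networks agree exactly, both KL terms vanish and each network's loss reduces to its own cross-entropy, which is why the term acts purely as a regularizer.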

Few-Shot Evaluation
After the meta-training process, each individual metric-based model in mutual distillation is capable of conducting few-shot evaluation on the D_test dataset. To carry out this evaluation, numerous episodes are created by randomly sampling support images and query images per class from D_test, forming N-way K-shot tasks. In this context, the support set is denoted as S_test = {S_n}_{n=1}^{N}, where each S_n consists of K examples for the n-th class, represented as S_n = {(x_i, y_i)}_{i=1}^{K}. The query set is designated as Q_test = {x_j}_{j=1}^{N×M}. The objective of the few-shot evaluation is to predict the labels of the query set. We therefore freeze the network parameters φ and θ of I_1 and I_2, which were trained through mutual distillation, and input S_test and Q_test into either I_1 or I_2. Each query image x_j is processed sequentially, starting with feature extraction via the embedding module, followed by the calculation of image-to-class similarity using Equation (1) or Equation (2). Finally, the label of each query image is predicted using Equation (3).
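The prototype-based branch of this evaluation can be illustrated with a toy nearest-prototype classifier. This is a sketch under stated assumptions: the metric is fixed to the negative squared Euclidean distance (as in prototypical networks, rather than a learned g_θ), feature extraction is omitted, and the names are our own.

```python
def prototype_scores(query_feat, class_supports):
    """Score a query feature against each class by the negative squared
    Euclidean distance to the class prototype (the mean of the class's
    support features), i.e. Equation (2) with a fixed distance metric."""
    scores = []
    for feats in class_supports:
        proto = [sum(dims) / len(feats) for dims in zip(*feats)]
        scores.append(-sum((a - b) ** 2 for a, b in zip(query_feat, proto)))
    return scores

def predict_label(query_feat, class_supports):
    """Predict the class whose prototype is closest (highest score)."""
    scores = prototype_scores(query_feat, class_supports)
    return max(range(len(scores)), key=scores.__getitem__)

# 2-way, 2-shot toy example in a 2-D feature space
supports = [[(0.0, 0.0), (0.0, 2.0)], [(4.0, 4.0), (4.0, 6.0)]]
print(predict_label((1.0, 1.0), supports))  # -> 0 (closer to prototype (0, 1))
```

Since the softmax in Equation (3) is monotone in the scores, taking the argmax of the scores directly gives the same predicted label.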
The pseudo-code for our method is shown in Algorithm 1.

Algorithm 1: Mutual Distillation of Metric-based Meta-learning.
Require: training set with N-way K-shot episodes e_1, ..., e_L, where each e_l contains a support set S_train = {S_n}_{n=1}^{N} with S_n = {(x_i, y_i)}_{i=1}^{K} and a query set Q_train = {(x_j, y_j)}_{j=1}^{N×M}
Require: regularization parameter λ and learning rate γ
Initialize: two paired metric-based models I_1 and I_2
while not done do
    Compute the query-to-class similarity for I_1 and I_2 via Equation (1) or (2)
    Compute predictions for I_1 and I_2 via Equation (3)
    Compute the loss for I_1 via Equation (7) and update its parameters by (9)
    Compute the loss for I_2 via Equation (8) and update its parameters by (10)
end
Caltech-UCSD Birds-200-2011 (CUB). CUB, known for its significance in fine-grained visual categorization, serves as a benchmark dataset featuring 11,788 annotated images spanning 200 distinct bird species. Each image in the dataset is carefully annotated with details such as bounding boxes, part locations, and attribute labels. Following the methodology detailed in [31], the dataset is partitioned into three subsets: 100 classes for meta-training, 50 classes for meta-validation, and the remaining 50 classes for meta-testing.

Implementation Details
Experiments were conducted on the Ubuntu platform, utilizing the PyTorch library, and executed on a single consumer-level NVIDIA 3090 Ti GPU. The experiments focused on 5-way, 1-shot and 5-shot scenarios, aligning with the common evaluation setup in few-shot learning tasks. A 12-layer residual network (ResNet12) served as the backbone for feature extraction in the embedding function. A total of 120,000 episodes were randomly sampled from the training dataset D_train. Each episode consisted of five classes (5-way) with 15 query samples selected for each class (M = 15). The regularization factor λ was set to 0.5, and optimization was carried out using the Adam optimizer, initialized with a learning rate of 0.001. The learning rate was halved every 30,000 episodes for 5-shot tasks and every 15,000 episodes for 1-shot tasks. During the testing phase, 2000 episodes were randomly sampled from the test dataset D_test. Each test episode maintained a 5-way structure, consistent with the 15 query samples per class used in training. Accuracy was reported as the mean top-1 accuracy over all episodes, accompanied by a 95% confidence interval to facilitate statistical evaluation of the model's performance.
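The step schedule described above can be expressed as a one-line helper. This is an illustrative sketch of the halving schedule, not library code; in a PyTorch setup the same effect is typically obtained with a built-in step scheduler.

```python
def learning_rate(episode, base_lr=1e-3, half_every=30_000):
    """Step schedule from our setup: the learning rate starts at 1e-3 and is
    halved every `half_every` episodes (30,000 for 5-shot tasks, 15,000 for
    1-shot tasks)."""
    return base_lr * 0.5 ** (episode // half_every)

print(learning_rate(0), learning_rate(45_000), learning_rate(90_000))
# -> 0.001 0.0005 0.000125
```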

Ablation Study
In this section, a series of experiments is conducted to evaluate the effectiveness of our algorithm, which is based on the concept of mutual distillation. We compare the performance of metric-based meta-learning models before and after the application of mutual distillation. As delineated in Section 3.2, there exist two distinct forms of similarity metrics. The typical model of one form is the prototypical network (PN), and the typical model of the other form is the relation network (RN). To ensure a fair comparison, we re-implement PN and RN with the same settings and investigate the effectiveness of mutual distillation in the following two respects.
Individual networks with different similarity metrics. We examine the effects of mutual distillation between PN and RN, which employ distinct similarity metrics. In this configuration, the individual networks involved in mutual distillation are denoted as MDMM-PN and MDMM-RN. The results are presented in Table 1, where the following observations can be made: (1) Except for the 1-shot task on MiniImageNet, PN outperforms RN under all other conditions. This discrepancy suggests that the non-linear metric module with learnable parameters in RN may make it more susceptible to overfitting.
(2) After the mutual distillation of PN and RN, the accuracy of RN noticeably improves. For instance, on the CUB dataset, RN's accuracy increases from 72.69% to 74.19% for the 1-shot task and from 86.98% to 91.03% for the 5-shot task. Similar trends are observed on MiniImageNet, where RN's accuracy rises from 54.68% to 54.93% for the 1-shot task and from 71.61% to 73.58% for the 5-shot task. This analysis highlights that PN can effectively serve as a regularizer for RN, and that the process of mutual distillation substantially enhances the performance of metric-based meta-learning models.
Table 1. The results of mutual distillation among individual networks using various similarity metrics are presented, reporting mean accuracies along with 95% confidence intervals for the MiniImageNet and CUB datasets in the context of both 5-way, 1-shot and 5-shot tasks. The second column provides information on the backbone used for the embedding function.

Individual networks with the same similarity metrics. In this section, we delve into the impact of mutual distillation between individual networks that share the same form of similarity metric. Specifically, we explore mutual distillation between two prototypical networks and between two relation networks, denoted as MDMM-PN&PN and MDMM-RN&RN, respectively. The outcomes of the mutual distillation process are summarized in Table 2, and the following observations are made: (1) On MiniImageNet, MDMM-PN&PN achieves 56.93% accuracy for the 1-shot task and 77.33% for the 5-shot task. These results signify improvements of 1.54% and 2.06%, respectively, compared to using the prototypical network (PN) in isolation. Similarly, on the CUB dataset, MDMM-PN&PN demonstrates an accuracy of 76.50% for the 1-shot task and 91.29% for the 5-shot task, showcasing enhancements of 0.09% and 1.89%, respectively, compared to the performance of PN individually. This analysis highlights that individual prototypical networks in mutual distillation effectively supply each other with a regularization term.
(2) MDMM-RN&RN achieves 58.09% accuracy for 1-shot tasks and 72.99% for 5-shot tasks on MiniImageNet, showing respective improvements of 1.54% and 1.38% compared to RN. On CUB, MDMM-RN&RN performs at 74.64% for 1-shot tasks and 87.49% for 5-shot tasks, demonstrating improvements of 1.95% and 0.51% for the respective task types. This analysis highlights the effectiveness of mutual distillation between individual RN models, as they provide each other with valuable regularization terms. The ablation results further confirm the efficacy of our approach in enhancing the performance of each metric-based model within mutual distillation.
Table 2. The results of mutual distillation between individual networks with the same similarity metrics. Mean accuracies, along with 95% confidence intervals, are reported for the MiniImageNet and CUB datasets under both 5-way, 1-shot and 5-shot tasks. The second column denotes the backbone used for the embedding function.

t-SNE Visualization of Features
The primary reason for the effectiveness of our approach lies in the reciprocal supply of a regularization term for feature embedding learning among the individual networks in mutual distillation. To showcase the generalization and discriminative capabilities of our learned feature embedding, we visualize features for unseen class samples and compare them with those of PN and RN. For MiniImageNet, we randomly choose five classes, each containing 200 samples, from D_test. Similarly, for CUB, we randomly select five classes, with each class comprising 40 samples, from D_test. Employing t-SNE for visualization, the results are presented in Figures 4 and 5. As Figure 4 shows, on MiniImageNet our feature embedding separates the five classes more effectively than those of PN and RN. A similar outcome is observed in Figure 5.

Comparison with Other Methods
We perform a comprehensive comparative examination of our methodology against several contemporary state-of-the-art techniques. These approaches are stratified into three distinct categories of meta-learning: model-based methods, optimization-based methods, and metric-based methods. The detailed results for this comparative evaluation can be found in the respective original publications. A condensed summary of the comparative outcomes is delineated in Tables 3 and 4.
Table 3. Comparative analysis of accuracy with other methods is conducted under 5-way, 1-shot and 5-shot tasks on MiniImageNet. The symbol '†' denotes a result that has been re-implemented, while '-' indicates that the author did not provide the result.

The results presented in Table 3 reveal the following comparisons:

(1) In terms of model-based methodologies, our approach exhibits notable performance advantages, surpassing the Meta Network by over 8.88% in the 1-shot scenario and outperforming SNAIL by margins of 2.38% and 8.45% in the 1-shot and 5-shot scenarios, respectively.
(2) Comparisons with optimization-based techniques reveal substantial improvements over classical approaches (MAML, Meta-SGD, Reptile) by up to 10% and 14% in the 1-shot and 5-shot settings, respectively. Additionally, our method outperforms LLAMA and DKT by 8.69% and 8.36%, respectively, in the 1-shot setting, and demonstrates clear superiority over BOIL and OVE PG GP + Cosine (ML) by approximately 8% and 12% in the 1-shot and 5-shot settings, respectively.
(3) In the metric-based category, our approach showcases significant advancements, surpassing the matching network by 11.49% in the 1-shot task and 17.33% in the 5-shot task, while also demonstrating improvements over the prototypical network, the relation network, and various state-of-the-art methods (Cross Module, PARN, DN4, SalNet). Furthermore, our method outperforms TPN and MCGN with improvements ranging from 0.64% to 10.68%.
The outcomes presented in Table 4 for the fine-grained dataset CUB in few-shot image classification highlight the remarkable efficacy of the proposed approach.
(1) In the realm of optimization-based methods, specifically OVE PG GP + Cosine (ML), our approach outperforms counterparts by 12.52% and 12.75% in 1-shot and 5-shot accuracies, respectively, with a widening gap compared to antecedent methods.
(2) Within metric-based comparisons, our approach surpasses the prototypical network by marginal yet consequential increments of 0.09% and 1.89% in the 1-shot and 5-shot settings, respectively. Substantive advancements are evident against the relation network and DeepEMD, with gains ranging from 3.81% to 4.31%, and a notable advantage of 0.85% to 2.6% in the 1-shot and 5-shot settings, achieved without pre-training.
(3) In comparisons with formidable contenders like AFHN, K-tuplet, and MsSoSN, our method exhibits a distinctive 5.93% advantage in the 1-shot task and a discernible 3.75% lead in the 5-shot task, all without resorting to methodologies like GAN-based sample generation.
In summation, these findings underscore the robustness and efficacy of the proposed approach in few-shot image classification tasks on both the miniImageNet and CUB datasets, positioning it as a preeminent contender among state-of-the-art methods across diverse categories.

Conclusions and Future Work
In this study, we introduced a mutual distillation algorithm aimed at enhancing the performance of any metric-based meta-learning method for few-shot classification. The efficacy of the proposed approach stems from the exchange of information between pairs of metric-based models. Extensive experimentation, encompassing both different similarity metrics and shared similarity metrics, underscores the versatility and effectiveness of mutual distillation among individual models. The results on two widely used benchmark datasets affirm the efficiency of the proposed method. In future work, we plan to explore alternative forms of information exchange among the individual models and to extend mutual distillation from pairs of models to scenarios involving multiple networks.

Figure 1 .
Figure 1. Illustration of our motivation under the 5-way, 5-shot few-shot task. Class boundary 1 and class boundary 2 are, respectively, produced by the two metric-based models in mutual distillation. As shown, our method can push class boundary 1 and class boundary 2 toward the optimal class boundary.

Figure 2 .
Figure 2. Demonstration of the foundational structure of metric-based meta-learning for a few-shot learning task in a 5-way, 1-shot configuration. The structure comprises an embedding module f_φ, assigned to feature extraction, and a metric module g_θ, formulated to assess the similarity between a provided query image and each support class.

Figure 3 .
Figure 3. Overview of our proposed MDMM under the 5-way, 1-shot setting. As illustrated, our framework comprises a pair of metric-based models with distinct parameters. Specifically, model I_1 is trained with the cross-entropy loss L_CE1 and a regularization loss given by the KL divergence D_KL(P_2 ∥ P_1) from P_2 to P_1. Conversely, model I_2 is trained with the cross-entropy loss L_CE2 and a regularization loss given by the KL divergence D_KL(P_1 ∥ P_2) from P_1 to P_2.

Figure 4 .
Figure 4. t-SNE visualization of features from PN, RN, and our method on unseen samples from MiniImageNet. Each dot represents a query sample and is marked with a different color according to its true label.

Figure 5 .
Figure 5. t-SNE visualization of features from PN, RN, and our method on unseen samples from CUB. Each dot represents a query sample and is marked with a different color according to its true label.

Table 4. Comparative analysis of accuracy with other methods is conducted under 5-way, 1-shot and 5-shot tasks on CUB. The symbol '†' denotes a result that has been re-implemented, while '-' indicates that the author did not provide the result. The symbol '*' signifies that the result is reported in [47].