Article

Few-Shot Image Classification via Mutual Distillation

Tianshu Zhang, Wenwen Dai, Zhiyu Chen, Sai Yang, Fan Liu and Hao Zheng
1 College of Computer and Information, Hohai University, Nanjing 210024, China
2 School of Electrical Engineering, Nantong University, Nantong 226019, China
3 Key Laboratory of Intelligent Information Processing, Nanjing Xiaozhuang University, Nanjing 211171, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2023, 13(24), 13284; https://doi.org/10.3390/app132413284
Submission received: 6 November 2023 / Revised: 6 December 2023 / Accepted: 12 December 2023 / Published: 15 December 2023
(This article belongs to the Special Issue Recent Advances in Few-Shot Learning for Computer Vision Tasks)

Abstract
Due to their compelling performance and appealing simplicity, metric-based meta-learning approaches are gaining increasing attention for addressing the challenges of few-shot image classification. However, many similar methods employ intricate network architectures, which can potentially lead to overfitting when trained with limited samples. To tackle this concern, we propose using mutual distillation to enhance metric-based meta-learning, effectively bolstering model generalization. Specifically, our approach involves two individual metric-based networks, such as prototypical networks and relational networks, mutually supplying each other with a regularization term. This method seamlessly integrates with any metric-based meta-learning approach. We undertake comprehensive experiments on two prevalent few-shot classification benchmarks, namely miniImageNet and Caltech-UCSD Birds-200-2011 (CUB), to demonstrate the effectiveness of our proposed algorithm. The results demonstrate that our method efficiently enhances each metric-based model through mutual distillation.

1. Introduction

Over the past decade, significant strides have been made in the field of image classification, propelled by notable advancements in deep neural networks [1,2]. However, their performance is contingent upon access to extensive labeled datasets for training. Real-world applications often grapple with the challenge of obtaining large labeled datasets, especially in scenarios involving the identification of new species or rare diseases. In comparison, humans demonstrate the ability to generalize knowledge and swiftly adapt to new categories, often requiring only a limited set of instances. Few-shot learning, as a field, is directly influenced by this efficient learning paradigm. It aims to categorize previously unseen data instances (query samples) into novel categories using a restricted number of labeled instances within each class (support samples) [3].
Meta-learning emerges as a highly promising strategy to tackle the challenges in few-shot image classification. This approach extracts meta-knowledge from a sequence of similar tasks, empowering models to rapidly adapt to novel tasks even with a scarcity of labeled samples. Meta-learning approaches can be classified into three primary types based on the form of extracted meta-knowledge: (1) model-based methods [4,5], which process the support samples in sequence to alter the internal state of the base model for directly predicting the categories of the query samples; (2) optimization-based methods [6], which meta-learn an optimization process that fine-tunes the base learner for swift and robust adaptation to new tasks; (3) metric-based methods [7], which learn a proficient embedding space for representing image data and rely on a similarity metric to differentiate images from diverse categories. All three types of meta-learning methods have collectively propelled the field of few-shot image classification forward.
In recent years, metric-based meta-learning techniques have demonstrated increasing effectiveness in tackling the challenges associated with few-shot image classification, offering an alluring combination of simplicity and efficiency [8]. The first method in this line was Siamese networks [9], which use paired networks to represent support and query images individually and a weighted L1 distance to measure their similarity. Subsequently, Vinyals et al. [10] presented matching networks, which integrate memory and attention mechanisms to extract distinctive characteristics for support and query samples; they additionally implemented a fully differentiable KNN classifier to predict the categories of query samples. Expanding upon this groundwork, Snell et al. [11] introduced prototypical networks as an extension of the matching networks framework, based on the assumption that samples belonging to the same class cluster around a single prototype representation; query samples are then classified according to their proximity to the closest prototype. In contrast to the previously mentioned approaches, Sung et al. [12] presented relational networks, which simultaneously learn feature embeddings and a non-linear metric. These prominent models highlight the two essential elements of metric-based meta-learning: the feature embedding and the similarity metric. Notably, recent progress in this domain has yielded substantial enhancements in both [13,14,15,16].
Few-shot image classification has been greatly improved by the above approaches. However, the key issue in few-shot learning is that the few samples available per class cannot adequately express a concept. The end-to-end networks in these meta-learning approaches are trained on sampled mini-batches referred to as episodes, each containing only a limited number of labeled examples per class. Early approaches used backbone networks of low complexity, such as four-layer convolutional networks, to mitigate the risk of overfitting. To improve classification performance, more recent work relies on higher-complexity networks as the base learner to enhance feature representation; networks such as ResNet12 and WRN28 are increasingly employed as backbones in metric-based meta-learning techniques. As network complexity escalates, the search space of network parameters expands correspondingly, posing a heightened risk of overfitting. This issue is especially prominent for models that learn both a feature embedding and a non-linear metric, such as relational networks.
One of the effective ways to address the above issue is regularization. This paper uses knowledge distillation [17] to regularize metric-based meta-learning methods. The applied technique is response-based knowledge distillation, which leverages the neural response from the final output layer of the teacher model; this approach is prevalent in the literature and has proven effective at refining model predictions. Recent work recognizes soft targets, the widely adopted response-based knowledge for image classification, as an effective regularizer [18]. Typically, a vanilla knowledge distillation framework comprises one or more large pre-trained teacher models alongside a compact student model [19]. Because no pre-trained teacher is available for metric-based models, this vanilla framework is not directly applicable to them. Consequently, we propose Mutual Distillation of Metric-based Meta-learning (MDMM) to enhance their performance. More precisely, it is composed of two separate metric-based networks that supply each other with a regularization term. The objective of the regularization term is to minimize the Kullback–Leibler divergence between the predictions of the two networks. As illustrated in Figure 1, because the decision boundaries of the two networks lie near the optimal boundary, the information exchanged between them drives each network in mutual learning to refine its decision boundary toward the optimal one. To sum up, our contributions are:
(1) We use response-based knowledge distillation to regularize metric-based meta-learning methods. To the best of our knowledge, we are the first to integrate mutual distillation within the metric-based meta-learning framework for few-shot learning;
(2) We implement mutual distillation between two distinct and well-established models with varying parameters, such as prototypical networks and relational networks. This mutual learning approach, implemented through distillation, effectively enhances the models' generalization under few-shot conditions;
(3) Our approach achieves state-of-the-art performance across two benchmarks in few-shot learning research, underscoring that each metric-based model can achieve excellent performance when supported by mutual distillation.
Figure 1. Illustration of our motivation under a 5-way, 5-shot few-shot task. Class boundary 1 and class boundary 2 are, respectively, produced by the metric-based models in mutual distillation. As shown, our method can push class boundary 1 and boundary 2 toward the optimal class boundary.

2. Related Work

The task of few-shot image classification focuses on recognizing novel categories with a limited number of instances available for each class. Recent research has mainly addressed this challenge through meta-learning, which offers strong generalization capabilities. In essence, meta-learning acquires meta-knowledge from analogous tasks to aid in handling previously unseen tasks. These methods can be broadly categorized into three groups depending on the nature of their meta-knowledge: (1) model-based techniques; (2) optimization-based techniques; and (3) metric-based techniques. Our study falls into the third category.
Metric-based meta-learning methods have a primary objective: to learn an embedding space where samples from the same class exhibit proximity, while samples belonging to distinct classes are positioned distantly, facilitating effective few-shot learning. Siamese networks [9] play a crucial role in this process by comparing the likeness between a query image and support images within the acquired embedding space. These networks undergo training using a triplet loss, which emphasizes the relative distances between the anchor (query image), positive, and negative samples. Matching networks [10] employ attention LSTM and bidirectional LSTM to extract features from both support and query sets. The classification task is then facilitated by comparing the extracted features using cosine similarity. Prototypical networks [11] improved upon matching networks by computing the dissimilarities between a query image and the prototype of each class with Euclidean distance. Different from the above works that used a pre-specified distance metric, relational networks [12] were proposed to learn a deep non-linear metric for comparing the connection between the query image and support images in the embedding space.
From the above well-known models, we can see that metric-based methods are characterized by two fundamental elements: a feature extractor and a metric module. A great deal of work is being carried out around these two key aspects. On one hand, the ongoing research focus is directed toward enhancing the discriminative representations generated by the feature extractor. Wu et al. [20] introduced a deformable feature extractor to address the sensitivity of CNN-based networks to spatial location relationships between semantic objects in comparative images. This innovative approach is complemented by a dual correlation attention mechanism designed to enhance local connectivity in the extracted features, contributing to improved discriminative power. Hou et al. [21] introduced a cross-attention module that produces cross-attention maps for each pair of class features and query sample features. This approach accentuates target object areas, thereby augmenting the discriminative capabilities of the extracted features. Li et al. [13] introduced a Category Traversal Module, an advanced mechanism aimed at discerning features relevant to the task by considering both intra-class commonality and inter-class distinctiveness within the feature space. Simon et al. [22] introduced the deep subspace network (DSN), showcasing its effectiveness in generating expressive representations across a broad spectrum of supervised few-shot image classification tasks. Li et al. [8] introduced an adaptive margin principle for learning more discriminating embedding spaces with better generalization ability. Wu et al. [23] introduced a novel embedding structure that encodes relative spatial relationships between features, achieved through the application of a capsule network.
Conversely, the choice of metric holds a crucial role in metric-based methods. For instance, Li et al. [24] presented DN4, wherein the conventional measure based on image-level features in the final layer is substituted with a local descriptor-based image-to-class measure. This modification enhances the metric’s sensitivity to specific local features, potentially leading to improved performance in capturing fine-grained details. As the traditional Euclidean distance used in prototypical networks is sensitive to feature correlations, Bateni et al. [25] used the Mahalanobis distance for classification. Nguyen et al. [26] proposed SEN, which modifies the Euclidean distance by combining it with a norm distance to mitigate the curse of dimensionality in high-dimensional space. Zhang et al. [27] utilized the Earth Mover’s Distance (EMD) as a metric for computing structural distances among dense image representations. The EMD, along with a cross-reference mechanism, facilitates distance computation by taking into account the inherent structure of the data. This choice of metric proves particularly useful in scenarios with cluttered backgrounds and substantial intra-class appearance variations.
Prior works in few-shot image classification commonly rely on complex backbone networks, such as ResNet12, to improve feature representation. However, the utilization of such sophisticated networks can pose challenges, particularly in scenarios with extremely limited support samples per class, leading to potential overfitting issues. Therefore, we propose the Mutual Distillation of Metric-based Meta-learning (MDMM) to improve their generalization performance. The technique involves two individual metric-based networks, such as prototypical networks and relational networks. These networks mutually supply each other with a regularization term, introducing a form of knowledge distillation for mutual learning. Our approach is distinguished by its simplicity and effectiveness, making it easily integrated with any metric-based meta-learning method.

3. Our Method

3.1. Problem Definition

This study’s training methodology is systematically designed around few-shot learning principles. Initially, a comprehensive labeled dataset $D$ is partitioned into three label-disjoint subsets: $D_{\mathrm{train}}$, $D_{\mathrm{val}}$, and $D_{\mathrm{test}}$. The study adopts the widely acknowledged episodic paradigm, leveraging its demonstrated effectiveness in the existing literature for knowledge transfer. Each training episode comprises two key components: (1) the support set $S_{\mathrm{train}} = \{S_n\}_{n=1}^{N}$ is assembled by randomly drawing $K$ samples from each of $N$ classes within $D_{\mathrm{train}}$, so that each $S_n$ contains $K$ labeled support samples; (2) the query set $Q_{\mathrm{train}}$ supplements this process by encompassing the remaining images from the same $N$ classes. The primary training objective is N-way K-shot classification, wherein the model classifies query samples into one of the $N$ distinct classes using the $K$ labeled support samples allocated for each class. Multiple episodes are generated and executed until the model converges. Hyperparameters are tuned on $D_{\mathrm{val}}$, and the model’s performance is evaluated on $D_{\mathrm{test}}$ through N-way K-shot episodes. This configuration, recognized as N-way K-shot classification under the episodic paradigm, is a well-established and effective approach in few-shot learning, facilitating generalization to new tasks with limited labeled examples.
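To make the episodic setup concrete, the following minimal Python sketch shows how one N-way K-shot episode could be assembled. The `dataset` mapping and all function names are illustrative assumptions, not the authors' implementation.

```python
import random

def sample_episode(dataset, n_way=5, k_shot=5, m_query=15):
    """Assemble one N-way K-shot episode from a labeled dataset.
    `dataset` is assumed to map each class label to a list of images
    (a simplification of the real data pipeline). Returns support and
    query images with episode-local labels 0..N-1.
    """
    classes = random.sample(list(dataset), n_way)
    s_imgs, s_lbls, q_imgs, q_lbls = [], [], [], []
    for n, cls in enumerate(classes):
        picks = random.sample(dataset[cls], k_shot + m_query)
        s_imgs += picks[:k_shot]
        s_lbls += [n] * k_shot
        q_imgs += picks[k_shot:]
        q_lbls += [n] * m_query
    return s_imgs, s_lbls, q_imgs, q_lbls
```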

3.2. Preliminary: Metric-Based Meta-Learning

Metric-based methodologies in few-shot learning are designed with the primary aim of constructing a highly effective embedding space. This specialized space is crafted to discern images belonging to different categories, employing a specific metric for classification. These methodologies comprise two pivotal components, the Embedding Module $f_\varphi$ and the Metric Module $g_\theta$, which integrate to form the architecture shown in Figure 2. The Embedding Module $f_\varphi$, often implemented using convolutional neural networks (CNNs), extracts feature representations from both query and support images. The Metric Module $g_\theta$ then applies a designated metric within this embedding space to differentiate images from various categories. Within the procedure, a support image $x_i$, the i-th image in the n-th class, along with a query image $x_j$ sourced from the $Q_{\mathrm{train}}$ set, are fed into the Embedding Module. This operation produces the respective feature maps $f_\varphi(x_i) \in \mathbb{R}^d$ and $f_\varphi(x_j) \in \mathbb{R}^d$, where $d$ denotes the dimension of the feature space.
The assessment of similarity is a crucial step in few-shot image classification using nonparametric methods after extracting features from query and support images. Two prevalent approaches are commonly employed for this purpose: (1) the weighted-summation approach interprets similarity as a weighted sum of similarity scores between the query image $x_j$ and the support images of each class $S_n$. Mathematically, the similarity score $S_j^n$ between the query input and the n-th support class is expressed as:

$$S_j^n = \sum_{i \in S_n} a_{ji}\, g_\theta(x_j, x_i), \quad (1)$$

where $g_\theta$ is the metric assessing similarity, and $a_{ji}$ is a weight proportional to the similarity score between the query input and support images; (2) in contrast, the prototype-based approach characterizes a class by the mean, or prototype, of its support examples. The similarity $S_j^n$ between the query input and the prototype of each class is computed as:

$$S_j^n = g_\theta\!\left(x_j,\ \frac{1}{|S_n|} \sum_{i \in S_n} x_i\right), \quad (2)$$

where $g_\theta$ is the metric assessing similarity between the query input and the prototype, computed as the mean of the support examples in $S_n$.
The subsequent step applies the softmax function over the scores $S_j^n$ to derive the prediction for the query image, articulated as follows:

$$p(y = n \mid x_j) = \frac{\exp(S_j^n)}{\sum_{n'=1}^{N} \exp(S_j^{n'})}. \quad (3)$$
The mathematical formulation of the loss function for each training episode is given by:

$$L(\varphi, \theta) = -\sum_{j=1}^{N \times M} \sum_{n=1}^{N} \mathbb{1}(y_j = n) \log P(y = n \mid x_j), \quad (4)$$

where $N \times M$ denotes the total number of query examples in each training episode, $y_j$ is the true label of query $x_j$, and $\mathbb{1}(\cdot)$ is the indicator function.
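As a concrete illustration of Equations (2)–(4), the following PyTorch sketch implements the prototype-based branch with a negative squared Euclidean distance standing in for $g_\theta$. The class-ordered batch layout and all names are assumptions for illustration, not the authors' exact interface.

```python
import torch
import torch.nn.functional as F

def prototypical_episode_loss(f_emb, support, query, query_labels, n_way):
    """Prototype similarity (Eq. (2)) with negative squared Euclidean
    distance, softmax prediction (Eq. (3)), and cross-entropy loss (Eq. (4)).
    Assumes `support` is ordered class-by-class (K images per class).
    """
    z_s = f_emb(support)                                        # (N*K, d)
    z_q = f_emb(query)                                          # (N*M, d)
    prototypes = z_s.view(n_way, -1, z_s.size(-1)).mean(dim=1)  # (N, d)
    logits = -torch.cdist(z_q, prototypes) ** 2                 # scores S_j^n
    return F.cross_entropy(logits, query_labels)                # softmax + NLL
```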

3.3. Mutual Distillation of Metric-Based Meta-Learning

To enhance classification performance further, we introduce mutual distillation for metric-based meta-learning methods. Our algorithm leverages a pair of metric-based networks and facilitates the exchange of information between them, improving the generalization capabilities of both. An overview of our framework for few-shot classification is depicted in Figure 3. Detailed explanations of our algorithm are provided as follows.
In the mutual distillation framework, we start with a pair of metric-based models with distinct parameters, denoted as $I_1 = \{f_{\varphi_1}, g_{\theta_1}\}$ and $I_2 = \{f_{\varphi_2}, g_{\theta_2}\}$. The mutual distillation of $I_1$ and $I_2$ occurs through episodic learning. In each training episode, both the query input $x_j$ and the support set $S_{\mathrm{train}} = \{S_n\}_{n=1}^{N}$ are fed into $I_1$ and $I_2$ to generate their respective feature maps. The similarity score between the query input $x_j$ and the n-th support class can be computed using either Equation (1) or Equation (2), as discussed in Section 3.2. Importantly, the two metric-based models in mutual distillation can use either different or identical types of similarity metrics; we implement and investigate both settings in our experiments. After the steps mentioned above, the predictions of the two individual networks, given by Equation (3) in Section 3.2, are denoted as $P_1(y = n \mid x_j)$ and $P_2(y = n \mid x_j)$.
As the two individual networks are trained to correctly predict the true labels of query images, the exchange of information between them is a crucial aspect. To quantify this exchange, we use the Kullback–Leibler (KL) divergence between the prediction probabilities of the two metric-based networks. The KL distance from $P_1$ to $P_2$ is calculated as follows [28]:

$$D_{KL}(P_2 \,\|\, P_1) = \sum_{j=1}^{N \times M} \sum_{n=1}^{N} P_2(y = n \mid x_j) \log \frac{P_2(y = n \mid x_j)}{P_1(y = n \mid x_j)}. \quad (5)$$

The KL distance from $P_2$ to $P_1$ is calculated as follows:

$$D_{KL}(P_1 \,\|\, P_2) = \sum_{j=1}^{N \times M} \sum_{n=1}^{N} P_1(y = n \mid x_j) \log \frac{P_1(y = n \mid x_j)}{P_2(y = n \mid x_j)}. \quad (6)$$
As metric-based methods increasingly employ more complex convolutional networks as the backbone feature extractor for few-shot learning, it becomes crucial to introduce regularization techniques to prevent overfitting. Additionally, since the decision boundaries of the paired individual networks lie near the optimal boundary, they can mutually provide each other with a regularization term. Therefore, the overall loss functions $L(\varphi_1, \theta_1)$ and $L(\varphi_2, \theta_2)$ for networks $I_1$ and $I_2$, respectively, are formulated as follows:

$$L(\varphi_1, \theta_1) = -\sum_{j=1}^{N \times M} \sum_{n=1}^{N} \mathbb{1}(y_j = n) \log P_1(y = n \mid x_j) + \lambda D_{KL}(P_2 \,\|\, P_1),$$
$$L(\varphi_2, \theta_2) = -\sum_{j=1}^{N \times M} \sum_{n=1}^{N} \mathbb{1}(y_j = n) \log P_2(y = n \mid x_j) + \lambda D_{KL}(P_1 \,\|\, P_2). \quad (7)$$
The hyper-parameter $\lambda$ controls the impact of regularization in the loss functions. Gradient descent is employed to update the parameters of $I_1$ and $I_2$; the update process can be expressed as follows:

$$(\varphi_1, \theta_1) \leftarrow (\varphi_1, \theta_1) - \gamma \frac{\partial L(\varphi_1, \theta_1)}{\partial (\varphi_1, \theta_1)}, \qquad (\varphi_2, \theta_2) \leftarrow (\varphi_2, \theta_2) - \gamma \frac{\partial L(\varphi_2, \theta_2)}{\partial (\varphi_2, \theta_2)}. \quad (8)$$
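A minimal PyTorch sketch of the losses in Equation (7) is given below. Detaching the peer's prediction so each network treats it as a fixed target is a common choice in mutual-learning implementations and is an assumption here, as are all names.

```python
import torch.nn.functional as F

def mdmm_losses(logits1, logits2, labels, lam=0.5):
    """Cross-entropy plus the peer-supplied KL regularizer (Eq. (7)).
    `logits1`/`logits2` are the pre-softmax scores of the two models.
    """
    log_p1 = F.log_softmax(logits1, dim=1)
    log_p2 = F.log_softmax(logits2, dim=1)
    # F.kl_div(input, target) computes D_KL(target || input) when
    # `input` holds log-probabilities and `target` holds probabilities.
    kl_p2_p1 = F.kl_div(log_p1, log_p2.detach().exp(), reduction="batchmean")
    kl_p1_p2 = F.kl_div(log_p2, log_p1.detach().exp(), reduction="batchmean")
    loss1 = F.cross_entropy(logits1, labels) + lam * kl_p2_p1
    loss2 = F.cross_entropy(logits2, labels) + lam * kl_p1_p2
    return loss1, loss2
```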

3.4. Few-Shot Evaluation

After the meta-training process, either individual metric-based model in mutual distillation is capable of conducting few-shot evaluation on the $D_{\mathrm{test}}$ dataset. To carry out this evaluation, numerous episodes are created by randomly sampling support images and query images per class from $D_{\mathrm{test}}$, forming N-way K-shot tasks. In this context, the support set is denoted as $S_{\mathrm{test}} = \{S_n\}_{n=1}^{N}$, where each $S_n$ consists of $K$ examples for the n-th class, represented as $S_n = \{(x_i, y_i)\}_{i=1}^{K}$. The query set is designated as $Q_{\mathrm{test}} = \{x_j\}_{j=1}^{N \times M}$. The objective of the few-shot evaluation is to predict the labels of the query set. Therefore, we keep the network parameters $\varphi$ and $\theta$ of $I_1$ and $I_2$, trained through mutual distillation, fixed. We then input $S_{\mathrm{test}}$ and $Q_{\mathrm{test}}$ into either $I_1$ or $I_2$. Each query image $x_j$ is processed sequentially, starting with feature extraction via the embedding module, followed by the calculation of image-to-class similarity using Equation (1) or Equation (2). Finally, the label of each query image is predicted using Equation (3).
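For concreteness, the evaluation loop could look like the sketch below, reusing the episode sampler from Section 3.1; `model` mapping a (support, query) episode to class logits is an assumed, illustrative interface.

```python
import torch

@torch.no_grad()
def few_shot_evaluate(model, test_set, n_way=5, k_shot=5,
                      m_query=15, episodes=2000):
    """Mean episode accuracy of one trained metric-based model."""
    accs = []
    for _ in range(episodes):
        s_x, s_y, q_x, q_y = sample_episode(test_set, n_way, k_shot, m_query)
        logits = model(s_x, s_y, q_x)        # scores via Eq. (1) or (2)
        pred = logits.argmax(dim=1)
        accs.append((pred == torch.tensor(q_y)).float().mean().item())
    return sum(accs) / len(accs)
```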
The pseudo-code for our method is shown in Algorithm 1.
Algorithm 1: Mutual Distillation of Metric-based Meta-learning.
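Since the original pseudo-code figure is not reproduced here, the sketch below gives one possible rendering of Algorithm 1, combining the episode sampler and loss function sketched above. The optimizer handling and the model interface are assumptions.

```python
import torch

def train_mdmm(model1, model2, opt1, opt2, train_set,
               n_way=5, k_shot=5, m_query=15, episodes=120_000, lam=0.5):
    """Episodic mutual distillation between two metric-based models."""
    for _ in range(episodes):
        s_x, s_y, q_x, q_y = sample_episode(train_set, n_way, k_shot, m_query)
        q_y = torch.tensor(q_y)
        logits1 = model1(s_x, s_y, q_x)      # predictions P1
        logits2 = model2(s_x, s_y, q_x)      # predictions P2
        loss1, loss2 = mdmm_losses(logits1, logits2, q_y, lam)
        # Peer targets are detached inside mdmm_losses, so the two
        # parameter updates are independent of each other.
        opt1.zero_grad(); loss1.backward(); opt1.step()
        opt2.zero_grad(); loss2.backward(); opt2.step()
```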

4. Experiments

4.1. Dataset Description

In our few-shot classification experiments, the proposed method undergoes evaluation on two prominent benchmark datasets: miniImageNet and Caltech-UCSD Birds-200-2011 (CUB).
miniImageNet. Originating from the ILSVRC-12 dataset [29], miniImageNet is a pivotal benchmark for few-shot learning, notably introduced in [10]. Comprising 100 classes, each with 600 uniformly resized images (84 × 84 pixels), the dataset follows the partitioning strategy outlined in [30]: 64 classes for meta-training, 16 for meta-validation, and 20 for meta-testing.
Caltech-UCSD Birds-200-2011 (CUB). CUB, known for its significance in fine-grained visual categorization, serves as a benchmark dataset featuring 11,788 annotated images spanning 200 distinct bird species. Each image is carefully annotated with details such as bounding boxes, part locations, and attribute labels. Following the methodology detailed in [31], the dataset is partitioned into three subsets: 100 classes for meta-training, 50 classes for meta-validation, and the remaining 50 classes for meta-testing.

4.2. Implementation Details

Experiments were conducted on the Ubuntu platform, using the PyTorch library, on a single consumer-level NVIDIA 3090 Ti GPU. The experiments focused on 5-way, 1-shot and 5-shot scenarios, aligning with the common evaluation setup in few-shot learning tasks. A 12-layer residual network (ResNet12) served as the backbone for feature extraction in the embedding function. A total of 120,000 episodes were randomly sampled from the training dataset $D_{\mathrm{train}}$. Each episode consisted of five classes (5-way) with 15 query samples per class ($M = 15$). The regularization factor $\lambda$ was set to 0.5, and optimization used the Adam optimizer with an initial learning rate of 0.001. The learning rate was halved every 30,000 episodes for 5-shot tasks and every 15,000 episodes for 1-shot tasks. During the testing phase, 2000 episodes were randomly sampled from the test dataset $D_{\mathrm{test}}$. Each test episode maintained the 5-way structure with 15 query samples per class, consistent with the training setup. Accuracy is reported as the mean top-1 accuracy over all episodes, accompanied by a 95% confidence interval to facilitate statistical evaluation of the model's performance.
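As an illustration only, this setup might be configured as in the sketch below, with `model1` standing in for either network from Section 3.3 and the scheduler stepped once per episode so that `step_size` counts episodes.

```python
import torch

# Hypothetical wiring of the reported hyper-parameters (5-shot case).
opt1 = torch.optim.Adam(model1.parameters(), lr=1e-3)   # initial lr 0.001
# Halve the learning rate every 30,000 episodes (15,000 for 1-shot tasks).
sched1 = torch.optim.lr_scheduler.StepLR(opt1, step_size=30_000, gamma=0.5)
```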

4.3. Ablation Study

In this section, a series of experiments are conducted to evaluate the effectiveness of our algorithm, which is based on the concept of mutual distillation. Consequently, we compare the performance of metric-based meta-learning models before and after the application of mutual distillation. As delineated in Section 3.2, there exist two distinct forms of similarity metrics: the typical model of the first form is the prototypical network (PN), and the typical model of the second is the relation network (RN). To ensure a fair comparison, we re-implement PN and RN under the same settings and investigate the effectiveness of mutual distillation in the following two respects.
Individual networks with different similarity metrics. We examine the effects of mutual distillation between PN and RN, which employ distinct similarity metrics. In this configuration, the individual networks involved in mutual distillation are denoted as MDMM-PN and MDMM-RN. The results are presented in Table 1, where the following observations can be made:
(1) Except for the 1-shot task on MiniImageNet, PN outperforms RN under all other conditions. This discrepancy suggests that the non-linear metric module with learnable parameters in RN may make it more susceptible to overfitting.
(2) After the mutual distillation of PN and RN, the accuracy of RN noticeably improves. For instance, on the CUB dataset, RN’s accuracy increases from 72.69% to 74.19% for the 1-shot task and from 86.98% to 91.03% for the 5-shot task. Similar trends are observed on MiniImageNet, where RN’s accuracy rises from 54.68% to 54.93% for the 1-shot task and from 71.61% to 73.58% for the 5-shot task. This analysis highlights that PN can effectively serve as a regularizer for RN, and the process of mutual distillation substantially enhances the performance of metric-based meta-learning models.
Individual networks with the same similarity metrics. In this section, we delve into the impact of mutual distillation between individual networks that share the same form of similarity metric. Specifically, we explore mutual distillation between two prototypical networks and between two relation networks, denoted as MDMM-PN&PN and MDMM-RN&RN, respectively. The outcomes of the mutual distillation process are summarized in Table 2, and the following observations are made:
(1) On MiniImageNet, MDMM-PN&PN achieves 56.93% accuracy for the 1-shot task and 77.33% for the 5-shot task. These results signify improvements of 1.54% and 2.06%, respectively, compared to using the prototypical network (PN) in isolation. Similarly, on the CUB dataset, MDMM-PN&PN demonstrates an accuracy of 76.50% for the 1-shot task and 91.29% for the 5-shot task, showcasing enhancements of 0.09% and 1.89%, respectively, compared to the performance of PN individually. This analysis highlights that individual prototypical networks in mutual distillation effectively supply each other with a regularization term.
(2) MDMM-RN&RN achieves 58.09% accuracy for 1-shot tasks and 72.99% for 5-shot tasks on MiniImageNet, showing a respective improvement of 1.54% and 1.38% compared to RN. On CUB, MDMM-RN&RN performs at 74.64% for 1-shot tasks and 87.49% for 5-shot tasks, demonstrating improvements of 1.95% and 0.51% for the respective task types. This analysis highlights the effectiveness of mutual distillation between individual RN models, as they provide each other with valuable regularization terms. Ablation results further confirm the efficacy of our approach in enhancing the performance of each metric-based model within mutual distillation.

4.4. T-SNE Visualization of Features

The primary reason for the effectiveness of our approach lies in the reciprocal supply of a regularization term for feature embedding learning among the individual networks in mutual distillation. To showcase the generalization and discriminative capabilities of our learned feature embedding, we visualize features for unseen class samples and compare them with those of PN and RN. In the case of MiniImageNet, we randomly choose five classes, each containing 200 samples, from D t e s t . Similarly, for CUB, we randomly select five classes, with each class comprising 40 samples from D t e s t . Employing t-SNE for visualization, the results are presented in Figure 4 and Figure 5. In Figure 4, on MiniImageNet, our feature embedding effectively separates the five classes, surpassing the performance of PN and RN. A similar outcome is observed in Figure 5.
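The visualization step itself is straightforward; a sketch with scikit-learn is given below, where `features` (embedded test samples, one row per sample) and `labels` are assumed to be precomputed with the trained embedding module.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Project the learned embeddings of unseen-class samples to 2-D.
xy = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
for c in np.unique(labels):
    mask = labels == c
    plt.scatter(xy[mask, 0], xy[mask, 1], s=8, label=f"class {c}")
plt.legend()
plt.show()
```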

4.5. Comparison with Other Methods

We perform an exhaustive comparative examination of our methodology against several contemporary state-of-the-art techniques. These approaches are stratified into three distinct categories of meta-learning: model-based methods, optimization-based methods, and metric-based methods. The detailed results for this comparative evaluation can be found in the respective original publications. A condensed summary of the comparative outcomes is delineated in Table 3 and Table 4.
The results presented in Table 3 reveal the following comparisons:
(1) In terms of model-based methodologies, our approach exhibits notable performance advantages, surpassing the Meta Network by over 8.88% in the 1-shot scenario and outperforming SNAIL by margins of 2.38% and 8.45% in the 1-shot and 5-shot scenarios, respectively.
(2) Comparisons with optimization-based techniques reveal substantial improvements over classical approaches (MAML, Meta-SGD, Reptile) by up to 10% and 14% in the 1-shot and 5-shot settings, respectively. Additionally, our method outperforms LLAMA and DKT by 8.69% and 8.36%, respectively, in the 1-shot setting, and demonstrates clear superiority over BOIL and OVE PG GP + Cosine (ML) by approximately 8% and 12% in the 1-shot and 5-shot settings, respectively.
(3) In the metric-based category, our approach showcases significant advancements, surpassing the matching network by 11.49% in the 1-shot task and 17.33% in the 5-shot task, while also demonstrating improvements over the prototypical network, the relation network, and various state-of-the-art methods (Cross Module, PARN, DN4, SalNet). Furthermore, our method outperforms TPN and MCGN with improvements ranging from 0.64% to 10.68%.
The outcomes presented in Table 4 for the fine-grained dataset CUB in few-shot image classification highlight the remarkable efficacy of the proposed approach.
(1) In the realm of optimization-based methods, specifically OVE PG GP + Cosine (ML), our approach outperforms counterparts by 12.52% and 12.75% in 1-shot and 5-shot accuracies, respectively, with a widening gap compared to antecedent methods.
(2) Within metric-based comparisons, our approach surpasses the Prototypical Network by marginal yet consequential increments of 0.09% and 1.89% in the 1-shot and 5-shot settings, respectively. Substantive advancements are evident against the relation network and DeepEMD, with gains ranging from 3.81% to 4.31%, and a notable advantage of 0.85% to 2.6% in the 1-shot and 5-shot settings, achieved without pre-training.
(3) In comparisons with formidable contenders like AFHN, K-tuplet, and MsSoSN, our method exhibits a distinctive 5.93% advantage in the 1-shot task and a discernible 3.75% lead in the 5-shot task, all without resorting to methodologies like GAN-based sample generation.
In summation, these findings underscore the robustness and efficacy of the proposed approach in few-shot image classification tasks on both the miniImageNet and CUB datasets, positioning it as a preeminent contender among state-of-the-art methods across diverse categories.

5. Conclusions and Future Work

In this study, we introduced a mutual distillation algorithm aimed at enhancing the performance of any metric-based meta-learning method for few-shot classification. The efficacy of the proposed approach stems from the exchange of information between pairs of metric-based models. Extensive experimentation, encompassing various similarity metrics and scenarios with shared similarity metrics, underscores the versatility and effectiveness of mutual distillation among individual models. The results on two widely used benchmark datasets affirm the efficiency of the proposed method. In our future work, we plan to explore alternative forms of information exchange among the individual models and consider the extension of mutual distillation from pairs of models to scenarios involving multiple networks.

Author Contributions

Conceptualization, T.Z., S.Y. and F.L.; Methodology, T.Z., W.D., S.Y. and F.L.; Validation, H.Z.; Investigation, W.D. and Z.C.; Resources, H.Z.; Writing—original draft, T.Z., W.D. and Z.C.; Writing—review & editing, S.Y. and F.L.; Supervision, S.Y., F.L. and H.Z.; Funding acquisition, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by Joint Fund of Ministry of Education for Equipment Pre-research (8091B022123), Research Fund from Science and Technology on Underwater Vehicle Technology Laboratory (2021JCJQ-SYSJJ-LB06905), Key Laboratory of Information System Requirements, No: LHZZ 2021-M04, Water Science and Technology Project of Jiangsu Province under grant No. 2021063, Qinglan Project of Jiangsu Province.

Data Availability Statement

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105.
  2. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  3. Li, X.; Yu, L.; Fu, C.W.; Fang, M.; Heng, P.A. Revisiting metric learning for few-shot image classification. Neurocomputing 2020, 406, 49–58.
  4. Santoro, A.; Bartunov, S.; Botvinick, M.; Wierstra, D.; Lillicrap, T. One-shot learning with memory-augmented neural networks. arXiv 2016, arXiv:1605.06065.
  5. Munkhdalai, T.; Yu, H. Meta networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 2554–2563.
  6. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1126–1135.
  7. Wang, Z.; Zhao, Y.; Li, J.; Tian, Y. Cooperative Bi-path Metric for Few-shot Learning. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1524–1532.
  8. Li, A.; Huang, W.; Lan, X.; Feng, J.; Li, Z.; Wang, L. Boosting few-shot learning with adaptive margin loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12576–12584.
  9. Koch, G.; Zemel, R.; Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In Proceedings of the ICML Deep Learning Workshop, Lille, France, 6–11 July 2015; Volume 2.
  10. Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D. Matching networks for one shot learning. Adv. Neural Inf. Process. Syst. 2016, 29, 3630–3638.
  11. Snell, J.; Swersky, K.; Zemel, R.S. Prototypical networks for few-shot learning. arXiv 2017, arXiv:1703.05175.
  12. Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.; Hospedales, T.M. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1199–1208.
  13. Li, H.; Eigen, D.; Dodge, S.; Zeiler, M.; Wang, X. Finding task-relevant features for few-shot learning by category traversal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1–10.
  14. Wang, X.; Yu, F.; Wang, R.; Darrell, T.; Gonzalez, J.E. Tafe-net: Task-aware feature embeddings for low shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1831–1840.
  15. Ye, H.J.; Hu, H.; Zhan, D.C.; Sha, F. Learning embedding adaptation for few-shot learning. arXiv 2019, arXiv:1812.03664.
  16. Liu, Y.; Lee, J.; Park, M.; Kim, S.; Yang, E.; Hwang, S.J.; Yang, Y. Learning to propagate labels: Transductive propagation network for few-shot learning. arXiv 2018, arXiv:1805.10002.
  17. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
  18. Ding, Q.; Wu, S.; Sun, H.; Guo, J.; Xia, S.T. Adaptive regularization of labels. arXiv 2019, arXiv:1908.05474.
  19. Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge distillation: A survey. Int. J. Comput. Vis. 2021, 129, 1789–1819.
  20. Wu, Z.; Li, Y.; Guo, L.; Jia, K. PARN: Position-aware relation networks for few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6659–6667.
  21. Hou, R.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Cross attention network for few-shot classification. arXiv 2019, arXiv:1910.07677.
  22. Simon, C.; Koniusz, P.; Nock, R.; Harandi, M. Adaptive subspaces for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4136–4145.
  23. Wu, F.; Smith, J.S.; Lu, W.; Pang, C.; Zhang, B. Attentive prototype few-shot learning with capsule network-based embedding. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 237–253.
  24. Li, W.; Wang, L.; Xu, J.; Huo, J.; Gao, Y.; Luo, J. Revisiting local descriptor based image-to-class measure for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–19 June 2019; pp. 7260–7268.
  25. Bateni, P.; Goyal, R.; Masrani, V.; Wood, F.; Sigal, L. Improved few-shot visual classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14493–14502.
  26. Nguyen, V.N.; Løkse, S.; Wickstrøm, K.; Kampffmeyer, M.; Roverso, D.; Jenssen, R. SEN: A Novel Feature Normalization Dissimilarity Measure for Prototypical Few-Shot Learning Networks. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXIII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 118–134.
  27. Zhang, C.; Cai, Y.; Lin, G.; Shen, C. DeepEMD: Few-Shot Image Classification with Differentiable Earth Mover’s Distance and Structured Classifiers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12203–12213.
  28. Zhang, Y.; Xiang, T.; Hospedales, T.M.; Lu, H. Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4320–4328.
  29. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
  30. Mishra, N.; Rohaninejad, M.; Chen, X.; Abbeel, P. A simple neural attentive meta-learner. arXiv 2017, arXiv:1707.03141.
  31. Hilliard, N.; Phillips, L.; Howland, S.; Yankov, A.; Corley, C.D.; Hodas, N.O. Few-shot learning with metric-agnostic conditional embeddings. arXiv 2018, arXiv:1802.04376.
  32. Li, Z.; Zhou, F.; Chen, F.; Li, H. Meta-sgd: Learning to learn quickly for few-shot learning. arXiv 2017, arXiv:1707.09835.
  33. Nichol, A.; Achiam, J.; Schulman, J. On first-order meta-learning algorithms. arXiv 2018, arXiv:1803.02999.
  34. Franceschi, L.; Frasconi, P.; Salzo, S.; Grazzi, R.; Pontil, M. Bilevel programming for hyperparameter optimization and meta-learning. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1568–1577.
  35. Munkhdalai, T.; Yuan, X.; Mehri, S.; Trischler, A. Rapid adaptation with conditionally shifted neurons. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 3664–3673.
  36. Grant, E.; Finn, C.; Levine, S.; Darrell, T.; Griffiths, T. Recasting gradient-based meta-learning as hierarchical bayes. arXiv 2018, arXiv:1801.08930.
  37. Patacchiola, M.; Turner, J.; Crowley, E.J.; O’Boyle, M.; Storkey, A. Bayesian meta-learning for the few-shot setting via deep kernels. Adv. Neural Inf. Process. Syst. 2020, 33, 16108–16118.
  38. Oh, J.; Yoo, H.; Kim, C.; Yun, S.Y. Boil: Towards representation change for few-shot learning. arXiv 2020, arXiv:2008.08882.
  39. Snell, J.; Zemel, R. Bayesian Few-Shot Classification with One-vs-Each Pólya-Gamma Augmented Gaussian Processes. arXiv 2020, arXiv:2007.10417.
  40. Prol, H.; Dumoulin, V.; Herranz, L. Cross-modulation networks for few-shot learning. arXiv 2018, arXiv:1812.00273.
  41. Zhang, H.; Zhang, J.; Koniusz, P. Few-shot learning via saliency-guided hallucination of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–19 June 2019; pp. 2770–2779.
  42. Hui, B.; Zhu, P.; Hu, Q.; Wang, Q. Self-attention relation network for few-shot learning. In Proceedings of the 2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Shanghai, China, 8–12 July 2019; pp. 198–203.
  43. Hao, F.; He, F.; Cheng, J.; Wang, L.; Cao, J.; Tao, D. Collect and select: Semantic alignment metric learning for few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8460–8469.
  44. Song, H.; Torres, M.T.; Özcan, E.; Triguero, I. L2AE-D: Learning to aggregate embeddings for few-shot learning with meta-level dropout. Neurocomputing 2021, 442, 200–208.
  45. Li, W.; Xu, J.; Huo, J.; Wang, L.; Gao, Y.; Luo, J. Distribution consistency based covariance metric networks for few-shot learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8642–8649.
  46. Tang, S.; Chen, D.; Bai, L.; Liu, K.; Ge, Y.; Ouyang, W. Mutual CRF-GNN for Few-Shot Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2329–2339.
  47. Li, K.; Zhang, Y.; Li, K.; Fu, Y. Adversarial feature hallucination networks for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13470–13479.
  48. Ravi, S.; Larochelle, H. Optimization as a model for few-shot learning. In Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016.
  49. Zhang, H.; Torr, P.H.; Koniusz, P. Few-shot Learning with Multi-scale Self-supervision. arXiv 2020, arXiv:2001.01600.
Figure 2. Demonstration of the foundational structure of metric-based meta-learning for a few-shot learning task in a 5-way, 1-shot configuration. The structure is delineated by an embedding module $f_\varphi$, assigned to feature extraction, and a metric module $g_\theta$, formulated to assess the similarity between a provided query image and each support class.
Figure 3. Overview of our proposed MDMM under the 5-way, 1-shot setting. As illustrated, our framework comprises a pair of metric-based models with distinct parameters. Specifically, Model $I_1$ undergoes training with the cross-entropy loss $L_{CE1}$ and a regularization loss of KL divergence $D_{KL}(P_2 \,\|\, P_1)$ from distribution $P_1$ to $P_2$. On the other hand, Model $I_2$ is trained with the cross-entropy loss $L_{CE2}$ and a regularization loss of KL divergence $D_{KL}(P_1 \,\|\, P_2)$ from distribution $P_2$ to $P_1$.
Figure 4. T-SNE visualization of features in PN, RN, and our method on unseen samples from MiniImageNet. Each dot represents a query sample and is marked with a different color according to its real label.
Figure 5. T-SNE visualization of features in PN, RN, and our method on unseen samples from CUB. Each dot represents a query sample and is marked with a different color according to its real label.
Table 1. The results of mutual distillation among individual networks using various similarity metrics. Mean 5-way accuracies (%), along with 95% confidence intervals, are reported for the MiniImageNet and CUB datasets under both 1-shot and 5-shot tasks. The second column denotes the backbone used for the embedding function.

| Method | Backbone | MiniImageNet 1-Shot | MiniImageNet 5-Shot | CUB 1-Shot | CUB 5-Shot |
|---|---|---|---|---|---|
| RN | ResNet12 | 54.68 ± 0.29 | 71.61 ± 0.24 | 72.69 ± 0.49 | 86.98 ± 0.28 |
| PN | ResNet12 | 55.39 ± 0.29 | 75.27 ± 0.23 | 76.41 ± 0.49 | 89.40 ± 0.24 |
| MDMM-RN | ResNet12 | 54.93 ± 0.30 | 73.58 ± 0.23 | 74.19 ± 0.50 | 91.03 ± 0.22 |
| MDMM-PN | ResNet12 | 57.73 ± 0.30 | 75.32 ± 0.22 | 75.36 ± 0.50 | 89.00 ± 0.27 |
Table 2. The results of mutual distillation between individual networks with the same similarity metrics. Mean 5-way accuracies (%), along with 95% confidence intervals, are reported for the MiniImageNet and CUB datasets under both 1-shot and 5-shot tasks. The second column denotes the backbone used for the embedding function.

| Method | Backbone | MiniImageNet 1-Shot | MiniImageNet 5-Shot | CUB 1-Shot | CUB 5-Shot |
|---|---|---|---|---|---|
| RN | ResNet12 | 56.68 ± 0.29 | 71.61 ± 0.24 | 72.69 ± 0.49 | 86.98 ± 0.28 |
| PN | ResNet12 | 55.39 ± 0.29 | 75.27 ± 0.23 | 76.41 ± 0.49 | 89.40 ± 0.24 |
| MDMM-RN&RN | ResNet12 | 58.09 ± 0.30 | 72.99 ± 0.23 | 74.64 ± 0.50 | 87.49 ± 0.27 |
| MDMM-PN&PN | ResNet12 | 56.93 ± 0.30 | 77.33 ± 0.22 | 76.50 ± 0.48 | 91.29 ± 0.22 |
Table 3. Comparative analysis of 5-way accuracy (%) with other methods under 1-shot and 5-shot tasks on MiniImageNet. The symbol ‘†’ denotes a result that has been re-implemented, while ‘-’ indicates that the authors did not provide the result.

| Method | Backbone | Type | Reference | 1-Shot | 5-Shot |
|---|---|---|---|---|---|
| Meta Network [5] | Conv | Model | ICML’17 | 49.21 ± 0.96 | - |
| SNAIL [30] | ResNet | Model | ICLR’18 | 55.71 ± 0.99 | 68.88 ± 0.92 |
| MAML [6] | Conv4 | Optimization | ICML’17 | 48.70 ± 1.75 | 63.11 ± 0.92 |
| Meta-SGD [32] | Conv4 | Optimization | Arxiv’17 | 50.47 ± 1.87 | 64.03 ± 0.94 |
| Reptile [33] | Conv4 | Optimization | Arxiv’18 | 49.97 ± 0.32 | 65.99 ± 0.58 |
| Bilevel Programming [34] | ResNet12 | Optimization | ICML’18 | 50.54 ± 0.85 | 64.53 ± 0.68 |
| adaResNet [35] | ResNet12 | Optimization | ICML’18 | 56.88 ± 0.62 | 71.94 ± 0.57 |
| LLAMA [36] | Conv4 | Optimization | ICLR’18 | 49.40 ± 1.83 | - |
| DKT [37] | Conv4 | Optimization | NIPS’20 | 49.73 ± 0.07 | - |
| BOIL [38] | Conv4 | Optimization | ICLR’21 | 49.61 ± 0.16 | 66.45 ± 0.37 |
| OVE PG GP + Cosine (ML) [39] | Conv4 | Optimization | ICLR’21 | 50.02 ± 0.35 | 64.58 ± 0.31 |
| Matching Network [10] | Conv4 | Metric | NIPS’16 | 46.6 | 60.0 |
| Prototypical Network [11] † | ResNet12 | Metric | NIPS’17 | 55.39 ± 0.29 | 75.27 ± 0.23 |
| Relation Network [12] † | ResNet12 | Metric | CVPR’18 | 56.68 ± 0.29 | 71.61 ± 0.24 |
| Cross Module [40] | Conv4 | Metric | NIPS’18 | 50.94 ± 0.61 | 66.65 ± 0.67 |
| PARN [20] | Conv4 | Metric | ICCV’19 | 55.22 ± 0.84 | 71.55 ± 0.66 |
| DN4 [24] | Conv4 | Metric | CVPR’19 | 51.24 ± 0.74 | 71.02 ± 0.64 |
| SalNet [41] | Conv4 | Metric | CVPR’19 | 57.45 ± 0.80 | 72.01 ± 0.75 |
| TPN [16] | ResNet8 | Metric | ICLR’19 | 55.51 ± 0.86 | 69.86 ± 0.65 |
| SARN [42] | Conv | Metric | ICMEW’19 | 51.62 ± 0.31 | 66.16 ± 0.51 |
| SAML [43] | Conv | Metric | ICCV’19 | 57.69 ± 0.20 | 73.03 ± 0.16 |
| L2AE-D [44] | Conv4 | Metric | Arxiv’19 | 54.26 ± 0.87 | 70.76 ± 0.67 |
| CovaMNet [45] | Conv4 | Metric | AAAI’19 | 51.19 ± 0.76 | 67.65 ± 0.63 |
| MCGN [46] | 64-96-128-256 | Metric | CVPR’21 | 57.89 ± 0.87 | 73.58 ± 0.87 |
| Our Method | ResNet12 | Metric | - | 58.09 ± 0.30 | 77.33 ± 0.22 |
Table 4. Comparative analysis of 5-way accuracy (%) with other methods under 1-shot and 5-shot tasks on CUB. The symbol ‘†’ denotes a result that has been re-implemented, while ‘-’ indicates that the authors did not provide the result. The symbol ‘*’ signifies that the result is reported in [47].

| Method | Backbone | Type | Reference | 1-Shot | 5-Shot |
|---|---|---|---|---|---|
| MAML [6] * | Conv4 | Optimization | ICML’17 | 38.43 | 59.15 |
| Meta-SGD [32] * | Conv4 | Optimization | Arxiv’17 | 66.90 | 77.10 |
| MACO [31] * | Conv4 | Optimization | Arxiv’18 | 60.76 | 74.96 |
| Meta-LSTM [48] * | Conv4 | Optimization | ICLR’17 | 40.43 | 49.65 |
| DKT [37] | Conv4 | Optimization | NIPS’20 | 63.37 ± 0.19 | - |
| OVE PG GP + Cosine (ML) [39] | Conv4 | Optimization | ICLR’21 | 63.98 ± 0.43 | 77.44 ± 0.18 |
| Matching Network [10] * | Conv4 | Metric | NIPS’16 | 49.34 | 59.31 |
| Prototypical Network [11] † | ResNet12 | Metric | NIPS’17 | 76.41 ± 0.49 | 89.40 ± 0.24 |
| Relation Network [12] † | ResNet12 | Metric | CVPR’18 | 72.69 ± 0.49 | 86.98 ± 0.28 |
| DN4 [24] | Conv4 | Metric | CVPR’19 | 53.15 ± 0.84 | 81.90 ± 0.60 |
| DeepEMD [27] | ResNet12 | Metric | CVPR’20 | 75.65 ± 0.83 | 88.69 ± 0.50 |
| SAML [43] | Conv | Metric | ICCV’19 | 69.33 ± 0.22 | 81.56 ± 0.15 |
| CovaMNet [45] | Conv4 | Metric | AAAI’19 | 52.42 ± 0.76 | 63.76 ± 0.64 |
| MsSoSN [49] | Conv4 | Metric | Arxiv’20 | 52.82 | 68.37 |
| K-tuplet [3] | Conv4 | Metric | Neurocomputing’20 | 40.16 ± 0.68 | 56.96 ± 0.65 |
| AFHN [47] | ResNet12 | Metric | CVPR’20 | 70.53 ± 1.01 | 83.95 ± 0.63 |
| Our Method | ResNet12 | Metric | - | 76.50 ± 0.48 | 91.29 ± 0.22 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
