Applied Sciences
  • Article
  • Open Access

26 May 2023

MDPN: Multilevel Distribution Propagation Network for Few-Shot Image Classification

1 School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
2 School of Information Science and Engineering, Zhejiang Sci-Tech University, Hangzhou 310018, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Application of Artificial Intelligence in Engineering

Abstract

Due to a shortage of labeled examples, few-shot image classification frequently experiences noise interference and insufficient feature extraction. In this paper, we present a two-stage framework based on the distribution propagation graph neural network (DPGN) called the multilevel distribution propagation network (MDPN). An instance-segmentation-based object localization (ISOL) module and a graph-based multilevel distribution propagation (GMDP) module are both included in the MDPN. To create a clear and full object zone, the ISOL module generates a mask that eliminates background and pseudo-object noises. The GMDP module enriches the level of features. We carried out comprehensive experiments on the few-shot dataset CUB-200-2011 to show the usefulness of MDPN. The results demonstrate that MDPN indeed outperforms DPGN in terms of few-shot image classification accuracy. Under 5-way 1-shot and 5-way 5-shot settings, the classification accuracy of MDPN exceeds the baseline by 8.17% and 1.24%, respectively. MDPN also outperforms the majority of the existing few-shot classification methods in the same setting.

1. Introduction

Traditional deep-learning techniques need a significant amount of labeled data. However, sometimes very few samples are available because of the security, ethical, resource, and cost concerns associated with data collection. Few-shot learning (FSL) has been proposed to address this issue. FSL seeks to develop models that perform well when trained on small-scale data. Additionally, FSL can significantly lower the cost of manual annotation and has a broad range of potential applications in data-scarce areas such as rare disease data and human-computer interaction.
However, because the quantity of labeled data in FSL is limited, the backbone can only be set up as a lightweight network with shallow depth and narrow width, such as ConvNet4, ResNet12 [1], ResNet18, or WRN-28-10, to reduce the risk of overfitting. Lightweight backbones can typically perform only simple feature extraction; therefore, additional post-processing is required. Several studies have demonstrated that deep networks [2,3] integrate low-, intermediate-, and high-level image features [4]. As network layers are added, the “level” of features steadily gets richer. Higher layers of the network pay more attention to the semantic information in the image, whereas lower layers concentrate more on detail. This is because the receptive fields in lower layers are typically smaller, and their overlapping regions are smaller, than those in higher layers. As a result, the lower layers of the network can capture more precise information. The receptive fields and the overlap regions gradually expand with increasing downsampling, so one pixel in the feature map corresponds to the information of a whole region in the original image and therefore carries more abstract, semantic information. In this paper, we present a multilevel feature extraction module that obtains different levels of features by stacking graph neural networks (GNNs) in sequence, with the goal of improving the feature extraction capability of the lightweight backbone.
The impact of noise on prediction accuracy is particularly clear when the backbone of FSL has limited feature extraction capability. The dataset used in our research, CUB-200-2011, is a dataset of birds in natural scenes. The backgrounds of the birds in CUB-200-2011 primarily consist of the sky, branches, flowers, stones, and water, as seen in Figure 1. These backgrounds are intricate, and occasionally the birds even blend in with them. Eliminating pseudo-object interference and precisely extracting the target object from the image is also a big challenge for the model in FSL. As shown in Figure 2, several types of bird images contain a particular sort of pseudo-object, such as human hands, tree stumps, cages, water dispensers, oranges, etc. In FSL terminology, an episode is a task, the support set is its training set, and the query set is its testing set. When a specific type of pseudo-object is included in both the support set and the query set of an episode, the model may mistakenly treat the pseudo-object as the target object and misclassify two images of distinct bird species as belonging to the same category. In this study, we present an instance-segmentation-based object localization module that creates a new image containing only the target object by precisely segmenting along the object’s edge, in order to remove the interference of the background and pseudo-objects.
Figure 1. Complex backgrounds in CUB-200-2011.
Figure 2. Variable pseudo-objects in CUB-200-2011.
In this paper, we present MDPN for few-shot image classification, which combines an instance-segmentation-based object localization (ISOL) module with a graph-based multilevel distribution propagation (GMDP) module to address these issues. The instance segmentation adopted by the ISOL module is based on prior knowledge. Using the previously known segmentation of the images, ISOL divides the raw images into segments based on the outer contour of the target object and masks the non-object portions. The final outputs of ISOL are the regions included in each object’s minimum bounding rectangle (MBR). The GMDP module, which consists of three graph networks connected in series, is used to post-process the features. The outputs of the GMDP module are three levels of distributions with increasing abstraction. We then use these three distributions to update the original features that were supplied to the GMDP module, and we send the revised features back to the module to be used in the following iteration. The purpose of iterating is to make the final output features contain the information of the entire graph by repeatedly computing the distributions.
The steps for training the MDPN are as follows: Images are first supplied to the ISOL module. The ISOL module crops the images in accordance with the object’s MBR. After that, the cropped images are sent to the backbone to extract features. The GMDP module is then used to extract the three levels of distribution features from the object features, which are subsequently utilized to update the original object features. Following a number of iterations in the GMDP module, the cross-entropy loss between the output features of the GMDP module and the ground truth labels is determined.
Our main contributions are as follows:
  • As far as we are aware, MDPN is the first model to stack graph networks in series to extract several levels of distribution in few-shot image classification. Our ablation experiments confirm its usefulness.
  • In order to increase the precision and completeness of the localization, we incorporated prior knowledge into the target object region localization by utilizing segmentation information. The accuracy of the prior-knowledge-based method is significantly higher than that of the supervised-learning-based target area localization method.
  • We perform in-depth analyses on the CUB-200-2011 dataset. In both 5-way 1-shot and 5-way 5-shot settings, the classification accuracy of MDPN outperforms the baseline by 8.17% and 1.24%, respectively. In the same setting, it outperforms most of the existing few-shot classification methods.

3. Proposed Method

We propose the MDPN framework, which contains an ISOL module and a GMDP module. Unlike the attention mechanism methods mentioned in Section 2.2, which do not eliminate noise completely, the ISOL module completely eliminates the background and pseudo-objects by generating masks so as to obtain a clean and complete target area. Unlike the GNN methods mentioned in Section 2.3, which all use GNNs alone or stacked in parallel, the GMDP module obtains richer features with deeper abstraction levels layer by layer through series-stacked GNNs. As shown in Figure 3, MDPN first sends the input images to the ISOL module, which is based on instance segmentation. For each input image, the ISOL module creates a new image that contains only the object region. Then, the new images are sent to the backbone to obtain the object features. After that, three levels of distribution features of the object features are extracted by the GMDP module and then used to update the object features. The updated object features go on to the next iteration. We calculate the class prediction using the updated object features at the end of each iteration.
Figure 3. The abstract flow of multilevel distribution propagation network (MDPN).

3.1. Problem Definition

In standard few-shot classification, there are two datasets: the training set $D_{base} = \{(x_i, y_i) \mid y_i \in C_{base}\}$ and the test set $D_{novel} = \{(x_i, y_i) \mid y_i \in C_{novel}\}$, where $C_{base}$ and $C_{novel}$ denote the base classes in the training set and the novel classes in the test set, and they do not overlap ($C_{base} \cap C_{novel} = \emptyset$). Training and testing for few-shot image classification consist of a number of episodes. Each episode is constructed by randomly sampling $K$ categories, each of which contributes $N$ labeled images and $U$ unlabeled images, i.e., the $K$-way $N$-shot setting. The labeled images are called the support set $S = \{(x_i, y_i)\}_{i=1}^{K \times N}$, and the unlabeled images are called the query set $Q = \{(x_i, y_i)\}_{i=1}^{K \times U}$, where they do not overlap ($S \cap Q = \emptyset$).
Take the 5-way 1-shot (that is, $K = 5$, $N = 1$) episode setting as an instance:
(1) Divide the dataset into a training set $D_{base}$ and a test set $D_{novel}$ by category.
(2) Sample five categories from $D_{base}$. Then, sample 1 image from each category as a labeled image to form the support set $S$, and $U$ images from each category as unlabeled images to form the query set $Q$ ($U$ can be set according to your needs, such as 1, 15, etc.). In our experiment, $U = 1$ and $S \cap Q = \emptyset$. The model learns the features of the images in $S$ according to the specific algorithm and predicts the labels of the images in $Q$. Calculate the loss between predicted labels and ground truth labels, then backpropagate the loss.
(3) Repeat step (2) until the preset number of episodes is reached.
(4) Sample five categories from $D_{novel}$. Then, sample 1 image from each category as a labeled image to form the support set $S$, and $U$ images from each category as unlabeled images to form the query set $Q'$ ($U$ can be set according to your needs, such as 1, 15, etc.). In our experiment, $U = 1$ and $S \cap Q' = \emptyset$.
(5) Repeat step (4) until the preset number of episodes is reached.
(6) Repeat steps (2) to (5) until the preset number of epochs is reached.
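The episode construction in steps (2) and (4) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `sample_episode`, its parameters, and the toy image identifiers are assumptions introduced here.

```python
import random


def sample_episode(images_by_class, k_way=5, n_shot=1, n_query=1, rng=None):
    """Sample one K-way N-shot episode (hypothetical helper).

    images_by_class maps a class label to a list of image identifiers.
    Returns (support, query), each a list of (image, label) pairs.
    """
    rng = rng or random.Random()
    classes = rng.sample(sorted(images_by_class), k_way)
    support, query = [], []
    for label in classes:
        # Draw N + U distinct images so that support and query never overlap.
        picked = rng.sample(images_by_class[label], n_shot + n_query)
        support += [(img, label) for img in picked[:n_shot]]
        query += [(img, label) for img in picked[n_shot:]]
    return support, query


# Toy data: 8 classes with 4 images each (made-up identifiers).
toy = {c: [f"class{c}_img{i}" for i in range(4)] for c in range(8)}
support, query = sample_episode(toy, k_way=5, n_shot=1, n_query=1,
                                rng=random.Random(0))
```

With `k_way=5`, `n_shot=1`, and `n_query=1` this yields five support pairs and five query pairs, matching the 5-way 1-shot setting with $U = 1$.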
The $K$-way $N$-shot setting aims to train a classifier that can accurately map a query sample to its label based only on a small support set. The 5-way setting is chosen in our experiments instead of settings with more categories, such as 20-way, 36-way, or 50-way [30,31,32]. Compared with settings of more than five ways, the 5-way setting requires less data, which reduces the risk of model overfitting, and is more suitable for training with a small amount of data or limited time. Therefore, we use a 5-way setting in our study.

3.2. Instance-Segmentation-Based Object Localization

A CNN typically extracts features from the entire image, regardless of whether a region belongs to the background or the target object. Images of natural scenes, however, contain complex backgrounds and some pseudo-objects. Consequently, a CNN may extract irrelevant or even interfering features, which is particularly detrimental to subsequent class predictions. Therefore, before extracting features, it is crucial to locate the target object. Based on prior knowledge of segmentation images, we propose an object localization method in our ISOL module. Prior knowledge is knowledge that existed before an experience. When a person learns something new, their brain naturally refers to previously learned information. The brain quickly assimilates new information if it can find parallels or connected ideas. Similarly, the model does not need to learn localization from scratch when we inject prior knowledge of segmentation-based object localization into it. Instead, the model gains localization capability right away.
The localization approach based on prior knowledge of segmentation images has the ability to achieve accurate object localization and good region wrapping, in contrast to the localization method based on the supervised-learning-based attention mechanism. As shown in Figure 4, the ISOL module works as follows: Firstly, the region of the object is precisely segmented based on its outside shape. Secondly, the non-object areas are covered with a mask. Finally, the output images of the ISOL module are the region inside the MBR of the object region.
Figure 4. Instance-segmentation-based object localization (ISOL) module.
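The three ISOL steps (segment, mask, crop to the MBR) can be illustrated on a tiny grid. The sketch below assumes that a binary segmentation mask is already available as prior knowledge and that masked pixels are filled with zero; the function name `mask_and_crop` and the fill value are assumptions, not details given in the paper.

```python
def mask_and_crop(image, mask, fill=0):
    """Mask non-object pixels and crop to the minimum bounding rectangle.

    image and mask are 2D lists of the same shape; mask holds 1 for object
    pixels and 0 elsewhere. The fill value for masked pixels is an assumption.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("mask contains no object pixels")
    top, bottom, left, right = rows[0], rows[-1], cols[0], cols[-1]
    return [
        [image[r][c] if mask[r][c] else fill for c in range(left, right + 1)]
        for r in range(top, bottom + 1)
    ]


image = [
    [9, 9, 9, 9],
    [9, 5, 6, 9],
    [9, 9, 7, 9],
    [9, 9, 9, 9],
]
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
cropped = mask_and_crop(image, mask)  # [[5, 6], [0, 7]]
```

The background value 9 inside the bounding rectangle is replaced by the fill value, so only object pixels survive in the cropped output.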

3.3. Graph-Based Multilevel Distribution Propagation

In this section, we introduce the GMDP module in detail. As shown in Figure 5, the GMDP module runs for a number of iterations, and each iteration consists of an object graph $G_l^{object} = (V_l^{object}, E_l^{object})$, a first-level distribution graph $G_l^{dist_1} = (V_l^{dist_1}, E_l^{dist_1})$, a second-level distribution graph $G_l^{dist_2} = (V_l^{dist_2}, E_l^{dist_2})$, and a third-level distribution graph $G_l^{dist_3} = (V_l^{dist_3}, E_l^{dist_3})$, where the subscript $l$ denotes the $l$-th iteration. Firstly, the initial object features are extracted from the images by the backbone:
$$v_{0,i}^{object} = f_{emb}(x_i), \tag{1}$$
where $f_{emb}(\cdot)$ is the backbone. As the nodes of $G_l^{object}$, $V_l^{object}$ is used to calculate the edge features $E_l^{object}$. Second, we fuse $E_l^{object}$ and $V_{l-1}^{dist_1}$, the nodes of $G_{l-1}^{dist_1}$, to get $V_l^{dist_1}$, the nodes of $G_l^{dist_1}$. $V_l^{dist_1}$ is initialized in the order of $G_l^{object}$ and used to calculate the edge features $E_l^{dist_1}$, which represent the distribution of $V_l^{dist_1}$, i.e., the first-level distribution; $E_l^{dist_1}$ is then directly used to construct the nodes $V_l^{dist_2}$ in $G_l^{dist_2}$. $V_l^{dist_2}$ is initialized in the order of $G_l^{dist_1}$ and used to calculate $E_l^{dist_2}$, which represents the distribution of $V_l^{dist_2}$, i.e., the second-level distribution. Similarly, $E_l^{dist_2}$ is directly used to construct the nodes $V_l^{dist_3}$ in $G_l^{dist_3}$. $V_l^{dist_3}$ is initialized in the order of $G_l^{dist_2}$ and used to calculate $E_l^{dist_3}$, which represents the distribution of $V_l^{dist_3}$, i.e., the third-level distribution. Finally, these three levels of distribution, $E_l^{dist_1}$, $E_l^{dist_2}$, and $E_l^{dist_3}$, are used to update the object features $V_l^{object}$ to produce the new nodes $V_{l+1}^{object}$ of the object graph $G_{l+1}^{object}$. This is one complete iteration.
Figure 5. Details about MDPN. A 2-way 1-shot task is presented as an example. MLP-1 is the FC-ReLU block mentioned in O2D Aggregation and MLP-2 is the Conv-BN-ReLU blocks mentioned in D2O Aggregation.
$V_l^{object}$, $E_l^{object}$, $V_l^{dist_1}$, $E_l^{dist_1}$, $V_l^{dist_2}$, $E_l^{dist_2}$, $V_l^{dist_3}$, and $E_l^{dist_3}$ are defined as follows: $V_l^{object} = \{v_{l,i}^{object}\}$, $E_l^{object} = \{e_{l,ij}^{object}\}$, $V_l^{dist_1} = \{v_{l,i}^{dist_1}\}$, $E_l^{dist_1} = \{e_{l,ij}^{dist_1}\}$, $V_l^{dist_2} = \{v_{l,i}^{dist_2}\}$, $E_l^{dist_2} = \{e_{l,ij}^{dist_2}\}$, $V_l^{dist_3} = \{v_{l,i}^{dist_3}\}$, and $E_l^{dist_3} = \{e_{l,ij}^{dist_3}\}$, where $i, j = 1, \ldots, T$. $T = N \times K + \bar{T}$ denotes the total number of samples in a training episode, $N \times K$ denotes the number of support samples, and $\bar{T}$ denotes the number of query samples.
Object Propagation. Each edge in the object graph stands for the object similarity. The edge $e_{l,ij}^{object}$ in the object graph represents the distribution of the object features and is updated as follows:
$$e_{l,ij}^{object} = \begin{cases} f_{e_0^{object}}\left(\left(v_{0,i}^{object} - v_{0,j}^{object}\right)^2\right), & l = 0, \\ f_{e_l^{object}}\left(\left(v_{l,i}^{object} - v_{l,j}^{object}\right)^2\right) \cdot e_{l-1,ij}^{object}, & l > 0, \end{cases} \tag{2}$$
where $e_{0,ij}^{object} \in \mathbb{R}$. $f_{e_0^{object}}: \mathbb{R}^m \to \mathbb{R}$ is an encoding network that transforms a distribution into a metric. $f_{e_0^{object}}$ is made of two Conv-BN-ReLU [33,34] blocks and a sigmoid layer. Finally, a normalization operation is conducted on $e_{l,ij}^{object}$.
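The recursive edge update for the object graph can be sketched as follows. This is a rough illustration under strong assumptions: the learned encoding network is replaced by a fixed stand-in (`encode`, an exponential of the negative mean squared difference), and the unspecified normalization is taken to be row normalization. Neither choice comes from the paper.

```python
import math


def encode(sq_diff):
    """Hypothetical stand-in for the learned encoder: maps a vector of
    squared differences to a similarity in (0, 1]."""
    return math.exp(-sum(sq_diff) / len(sq_diff))


def update_edges(nodes, prev_edges=None):
    """One edge update: encode the squared node difference and, after the
    first iteration, multiply by the previous edge. Rows are then
    normalized to sum to 1 (an assumed form of the normalization)."""
    t = len(nodes)
    edges = [[0.0] * t for _ in range(t)]
    for i in range(t):
        for j in range(t):
            sq_diff = [(a - b) ** 2 for a, b in zip(nodes[i], nodes[j])]
            e = encode(sq_diff)
            if prev_edges is not None:
                e *= prev_edges[i][j]
            edges[i][j] = e
    return [[e / sum(row) for e in row] for row in edges]


nodes = [[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]]
e0 = update_edges(nodes)       # case l = 0
e1 = update_edges(nodes, e0)   # case l > 0, reuses the previous edges
```

Because each new edge is multiplied by the previous one, pairs that were already dissimilar stay weakly connected across iterations, which is the effect the recursion is meant to have.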
O2D Aggregation. The object-distribution aggregation integrates $E_l^{object}$ and $V_{l-1}^{dist_1}$ to get $V_l^{dist_1}$. $v_{0,i}^{dist_1}$ is initialized as follows:
$$v_{0,i}^{dist_1} = \begin{cases} \big\Vert_{j=1}^{N_{sup}} \delta(y_i, y_j), & \text{if } x_i \text{ is labeled}, \\ \left[\frac{1}{N_{sup}}, \ldots, \frac{1}{N_{sup}}\right], & \text{otherwise}, \end{cases} \tag{3}$$
where $v_{0,i}^{dist_1} \in \mathbb{R}^{N_{sup}}$, $\Vert$ is the concatenation operator, and $y_i$ and $y_j$ are labels. $\delta(\cdot)$ is the Kronecker delta function, which outputs 1 when $y_i = y_j$ and 0 otherwise.
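As a small worked example of this initialization, the sketch below builds the first-level distribution nodes for a 2-way 1-shot episode with one query sample. The function name `init_dist_nodes` and the use of `None` to mark unlabeled samples are conventions assumed here for illustration.

```python
def init_dist_nodes(labels, support_labels):
    """Initialize first-level distribution nodes.

    labels holds the label of every sample in the episode, with None for
    unlabeled (query) samples; support_labels holds the support labels.
    """
    n_sup = len(support_labels)
    nodes = []
    for y_i in labels:
        if y_i is not None:
            # Labeled sample: concatenate Kronecker deltas against each support label.
            nodes.append([1.0 if y_i == y_j else 0.0 for y_j in support_labels])
        else:
            # Unlabeled sample: uniform distribution over the support samples.
            nodes.append([1.0 / n_sup] * n_sup)
    return nodes


support_labels = [0, 1]
dist_nodes = init_dist_nodes([0, 1, None], support_labels)
# [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
```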
When the iteration number $l > 0$, $v_{l,i}^{dist_1}$ is updated as follows:
$$v_{l,i}^{dist_1} = \mathrm{O2D}\left(\big\Vert_{j=1}^{N_{sup}} e_{l,ij}^{object},\; v_{l-1,i}^{dist_1}\right), \tag{4}$$
where $\mathrm{O2D}: (\mathbb{R}^{N_{sup}}, \mathbb{R}^{N_{sup}}) \to \mathbb{R}^{N_{sup}}$ is the object-distribution aggregation network. O2D first concatenates the two features and then applies a transformation $\mathbb{R}^{2N_{sup}} \to \mathbb{R}^{N_{sup}}$, which consists of a fully connected (FC) layer and a ReLU.
Distribution Propagation. The first-level distribution $e_{l,ij}^{dist_1}$ is updated as follows:
$$e_{l,ij}^{dist_1} = \begin{cases} f_{e_0^{dist_1}}\left(\left(v_{0,i}^{dist_1} - v_{0,j}^{dist_1}\right)^2\right), & l = 0, \\ f_{e_l^{dist_1}}\left(\left(v_{l,i}^{dist_1} - v_{l,j}^{dist_1}\right)^2\right) \cdot e_{l-1,ij}^{dist_1}, & l > 0, \end{cases} \tag{5}$$
where $e_{0,ij}^{dist_1} \in \mathbb{R}$. The encoding network $f_{e_0^{dist_1}}: \mathbb{R}^{NK} \to \mathbb{R}^{NK}$ is composed of two Conv-BN-ReLU blocks. Finally, a normalization operation is conducted on $e_{l,ij}^{dist_1}$. The edge of the first-level distribution $e_{0,ij}^{dist_1}$ is directly used as a node of the second-level distribution: $e_{0,ij}^{dist_1} = v_{0,i}^{dist_2}$. Similarly, in the third-level graph, $e_{0,ij}^{dist_2} = v_{0,i}^{dist_3}$. The second-level distribution $e_{l,ij}^{dist_2}$ and the third-level distribution $e_{l,ij}^{dist_3}$ are updated as follows:
$$e_{l,ij}^{dist_2} = \begin{cases} f_{e_l^{dist_2}}\left(\left(v_{l,i}^{dist_2} - v_{l,j}^{dist_2}\right)^2\right), & l = 0, \\ f_{e_l^{dist_2}}\left(\left(v_{l,i}^{dist_2} - v_{l,j}^{dist_2}\right)^2\right) \cdot e_{l-1,ij}^{dist_2}, & l > 0, \end{cases} \tag{6}$$
$$e_{l,ij}^{dist_3} = \begin{cases} f_{e_l^{dist_3}}\left(\left(v_{l,i}^{dist_3} - v_{l,j}^{dist_3}\right)^2\right), & l = 0, \\ f_{e_l^{dist_3}}\left(\left(v_{l,i}^{dist_3} - v_{l,j}^{dist_3}\right)^2\right) \cdot e_{l-1,ij}^{dist_3}, & l > 0. \end{cases} \tag{7}$$
D2O Aggregation. In the D2O module, the three levels of distributions $e_{l,ij}^{dist_1}$, $e_{l,ij}^{dist_2}$, and $e_{l,ij}^{dist_3}$ are used to update $v_{l,i}^{object}$ into $v_{l+1,i}^{object}$ as follows:
$$v_l^{fuse_x} = \sum_{j=1}^{T} e_{l,ij}^{dist_x} \cdot v_{l,j}^{object}, \quad x = 1, 2, 3, \tag{8}$$
$$v_l^{cat} = v_{l,i}^{object} \,\Vert\, v_l^{fuse_1} \,\Vert\, v_l^{fuse_2} \,\Vert\, v_l^{fuse_3}, \tag{9}$$
$$v_{l+1,i}^{object} = \mathrm{D2O}\left(v_l^{cat}\right), \tag{10}$$
where $v_{l,i}^{object} \in \mathbb{R}^m$ and $\mathrm{D2O}: (\mathbb{R}^m, \mathbb{R}^m) \to \mathbb{R}^m$ is the distribution-object aggregation network. D2O consists of two Conv-BN-ReLU blocks. Benefiting from the aggregation, $v_{l+1,i}^{object}$ contains the multilevel distribution information.
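The fusion and concatenation steps that precede the learned D2O network can be sketched with plain lists. The learned Conv-BN-ReLU blocks are deliberately left out, and the names `fuse` and `concat_for_d2o` as well as the toy edge matrices are assumptions made for illustration only.

```python
def fuse(edges, nodes):
    """Edge-weighted sum of object features: out[i] = sum_j edges[i][j] * nodes[j]."""
    dim = len(nodes[0])
    count = len(nodes)
    return [
        [sum(edges[i][j] * nodes[j][d] for j in range(count)) for d in range(dim)]
        for i in range(count)
    ]


def concat_for_d2o(nodes, edge_levels):
    """Concatenate each object feature with its fused features from every
    distribution level. The learned D2O network would consume this vector."""
    fused = [fuse(edges, nodes) for edges in edge_levels]
    return [
        nodes[i] + [x for level in fused for x in level[i]]
        for i in range(len(nodes))
    ]


nodes = [[1.0, 0.0], [0.0, 1.0]]
level1 = [[1.0, 0.0], [0.0, 1.0]]   # toy first-level edges
level2 = [[0.0, 1.0], [1.0, 0.0]]   # toy second-level edges
level3 = [[0.5, 0.5], [0.5, 0.5]]   # toy third-level edges
cat = concat_for_d2o(nodes, [level1, level2, level3])
# cat[0] == [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.5, 0.5]
```

Each concatenated vector has four times the dimension of a single object feature, one part for the original feature and one for each distribution level.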

3.4. Objective

The class prediction of each node in the object graph and in the three distribution graphs is calculated as follows:
$$P^{object}(y_j \mid x_i) = \mathrm{Softmax}\left(\sum_{j=1}^{NK} e_{l,ij}^{object} \cdot \text{one-hot}(y_j)\right), \tag{11}$$
$$P^{dist_x}(y_j \mid x_i) = \mathrm{Softmax}\left(\sum_{j=1}^{NK} e_{l,ij}^{dist_x} \cdot \text{one-hot}(y_j)\right), \quad x = 1, 2, 3, \tag{12}$$
where $x_i$ is the query sample and $y_j$ is the label of the $j$-th support sample. $P(y_j \mid x_i)$ is the probability distribution over classes given sample $x_i$. $e_{l,ij}^{object}$ and $e_{l,ij}^{dist_x}$ are the edges connecting node $x_i$ and node $x_j$ in the $l$-th iteration (where $x_i$ is the query sample and $x_j$ is the support sample). $\text{one-hot}(y_j)$ is the one-hot encoding of $y_j$.
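The prediction rule amounts to summing, per class, the edge weights between the query node and the support nodes, then applying a softmax. A minimal sketch follows; the function names and the toy edge values are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(query_edges, support_labels, num_classes):
    """Class probabilities for one query node.

    query_edges[j] is the edge between the query node and the j-th support
    node; support_labels[j] is that support node's class index.
    """
    scores = [0.0] * num_classes
    for edge, label in zip(query_edges, support_labels):
        # edge * one_hot(label) adds the edge weight to the class of support j.
        scores[label] += edge
    return softmax(scores)


probs = predict([0.7, 0.2, 0.1], support_labels=[0, 1, 2], num_classes=3)
predicted_class = probs.index(max(probs))
```

In the N-shot case several support nodes share a class, so their edge weights accumulate in the same score entry before the softmax.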
Object Loss. The object loss is inherited from the baseline (DPGN [26]):
$$L_l^{object} = L_{CE}\left(P^{object}(y_j \mid x_i),\; y_j\right), \tag{13}$$
where $L_{CE}$ is the cross-entropy loss function and $P^{object}(y_j \mid x_i)$ is calculated by (11).
Distribution Loss. The distribution loss in the $l$-th iteration is calculated as follows:
$$L_l^{dist_x} = L_{CE}\left(P^{dist_x}(\hat{y}_i \mid x_i),\; y_i\right), \quad x = 1, 2, 3, \tag{14}$$
$$L_l^{dist} = \frac{\lambda_{dist_1} L_l^{dist_1} + \lambda_{dist_2} L_l^{dist_2} + \lambda_{dist_3} L_l^{dist_3}}{N_\lambda}, \quad N_\lambda = \sum_{x \in \{1, 2, 3\}} \mathbb{1}\left(\lambda_{dist_x} \neq 0\right), \tag{15}$$
where $P^{dist_x}(y_j \mid x_i)$ is calculated by (12), and $N_\lambda$ is the number of non-zero weight factors. $L_l^{dist_1}$, $L_l^{dist_2}$, and $L_l^{dist_3}$ are the three distribution losses. $\lambda_{dist_1}$, $\lambda_{dist_2}$, and $\lambda_{dist_3}$ are the weight factors of the corresponding distribution losses. $(\lambda_{dist_1}, \lambda_{dist_2}, \lambda_{dist_3}) = (0.0, 0.0, 1.0)$ in the 5-way 1-shot setting and $(\lambda_{dist_1}, \lambda_{dist_2}, \lambda_{dist_3}) = (1.0, 1.0, 1.0)$ in the 5-way 5-shot setting. The reason for these weight factor settings is explained in Section 5.3.
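To make the role of the weight factors concrete, the sketch below combines three per-level loss values with the two weight settings quoted above and divides by the number of non-zero weights. The loss values are made up, the function name is hypothetical, and returning 0.0 when all weights are zero is an assumption not covered by the text.

```python
def combine_distribution_losses(losses, weights):
    """Weighted sum of the per-level distribution losses, divided by the
    number of non-zero weight factors."""
    n_nonzero = sum(1 for w in weights if w != 0)
    if n_nonzero == 0:
        return 0.0  # assumption: no active level contributes no loss
    return sum(w * loss for w, loss in zip(weights, losses)) / n_nonzero


toy_losses = [0.9, 0.6, 0.3]  # made-up loss values for levels 1, 2, 3
one_shot = combine_distribution_losses(toy_losses, (0.0, 0.0, 1.0))
five_shot = combine_distribution_losses(toy_losses, (1.0, 1.0, 1.0))
```

With the 1-shot weights only the third-level loss is used, while with the 5-shot weights the result is the plain average of the three losses.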
The objective function is made up of these two losses:
$$L = \sum_{l=1}^{\hat{l}} \left(\lambda_{object} L_l^{object} + \lambda_{dist} L_l^{dist}\right), \tag{16}$$
where $\hat{l}$ is the total number of iterations in the GMDP module. The two weight factors $\lambda_{object}$ and $\lambda_{dist}$ are used to measure the importance of the two losses. We follow the setting of the baseline: $(\lambda_{object}, \lambda_{dist}) = (1.0, 0.1)$.

4. Experiment

4.1. Dataset

We assess MDPN on a common benchmark for few-shot learning, CUB-200-2011 [35]. CUB-200-2011 contains 11,788 images of 200 bird species and is split into 100 base classes, 50 validation classes, and 50 novel classes.

4.2. Experiment Setups

Our experiment setting is consistent with the baseline to guarantee that our model is comparable to it.
Network Architecture. We use ResNet12 as the backbone. It mainly has four blocks, each of which includes one residual block. The last feature of the backbone is processed by global average pooling and then goes through an FC layer with batch normalization (BN) to obtain a 128-dimensional object embedding as the final output of the backbone network.
Implementation Details. We perform data augmentation prior to training, e.g., horizontal flipping, random cropping, and color jitter (brightness, contrast, and saturation), as mentioned in [36,37]. We set the number of episodes to 28 in each epoch. In our experiments, we use the Adam optimizer with an initial learning rate of $10^{-3}$. The learning rate is decayed by a factor of 0.1 every 15,000 iterations, and the weight decay is set to $10^{-5}$.
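Read as a step schedule, the learning-rate setting can be written as a small function. This sketch assumes an initial rate of 1e-3 and a multiplicative decay of 0.1 applied once every 15,000 iterations; the function name and the step-wise interpretation are assumptions.

```python
def learning_rate(iteration, base_lr=1e-3, gamma=0.1, step=15000):
    """Step decay: multiply the base rate by gamma once per completed step."""
    return base_lr * gamma ** (iteration // step)


lr_start = learning_rate(0)        # initial rate
lr_mid = learning_rate(15000)      # after the first decay
lr_late = learning_rate(30000)     # after the second decay
```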
Evaluation Metrics. We evaluated MDPN in 5-way 1-shot and 5-way 5-shot settings on CUB-200-2011. Following the evaluation process of previous methods [25,37,38], we randomly sampled 10,000 tasks to calculate the mean accuracy with 95% confidence intervals as the final result. Our experiments are implemented in PyTorch.
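A mean accuracy with a 95% confidence interval over sampled tasks is commonly computed with the normal approximation, as sketched below. The paper does not state the exact formula used, so the 1.96 factor, the sample standard deviation, and the function name are assumptions.

```python
import math
import statistics


def mean_with_ci95(accuracies):
    """Mean accuracy and half-width of a normal-approximation 95% interval."""
    n = len(accuracies)
    mean = statistics.mean(accuracies)
    half_width = 1.96 * statistics.stdev(accuracies) / math.sqrt(n)
    return mean, half_width


# Made-up per-task accuracies for 100 tasks, for illustration only.
toy_accuracies = [0.0, 1.0] * 50
mean_acc, ci = mean_with_ci95(toy_accuracies)
```

The half-width shrinks with the square root of the number of tasks, which is why a large number of sampled tasks gives a tight interval.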
The number of iterations in the GMDP module is set to 6 as a trade-off between convergence time and accuracy.

4.3. Experiment Results

Main Results. We contrast MDPN’s performance with that of the well-known ProtoNet and the current top models, such as DeepEMD [39], FEAT [40], and FRN [41]. We tested the approach with the same dataset, CUB-200-2011, and backbone, ResNet12, for an accurate comparison. Table 1 demonstrates that MDPN outperforms the baseline (DPGN [26]) as well as the majority of the existing methods.
Table 1. The accuracy (%) of 5-way 1-shot and 5-way 5-shot settings on CUB-200-2011. The best outcomes are highlighted in bold.

5. Discussion

5.1. The Impact of ISOL

Localization via a supervised-learning-based attention mechanism is imprecise: it focuses on only part of the object region and loses other information, and it performs substantially worse than our prior-knowledge-based ISOL. The accuracy and completeness of the object region localization directly affect feature extraction and are also reflected in classification accuracy. We conducted a number of ablation experiments on CUB-200-2011 in 5-way 1-shot and 5-way 5-shot settings to demonstrate that our ISOL outperforms both the non-localization approach and the attention mechanism. The data in Table 2 show that, compared with the other two methods, MDPN with the instance-segmentation-based ISOL module has the highest accuracy. We chose CBAM to evaluate the impact of attention mechanisms for two reasons: (1) it is a plug-and-play module, and (2) it combines CAM and SAM [47]. CAM concentrates on the channel dimension, while SAM concentrates on the spatial dimension. CBAM can produce better results than CAM, Grad-CAM, and SENet [48], which exclusively concentrate on channels.
Table 2. The accuracy (%) of different localization methods in 5-way 1-shot and 5-way 5-shot settings on CUB-200-2011. The best outcomes are highlighted in bold. A0 denotes the MDPN without ISOL module. A1 denotes the MDPN, with CBAM replacing ISOL module. A2 denotes the MDPN, which contains ISOL module.

5.2. The Impact of GMDP

We perform a series of ablation experiments with different numbers of stacked distribution propagation levels on CUB-200-2011 in 5-way 1-shot and 5-way 5-shot settings to confirm the effectiveness of GMDP. Table 3 shows that as the number of distribution propagation levels rises, the classification accuracy of MDPN steadily improves and peaks at three levels.
Table 3. The accuracy (%) of distribution propagation with various numbers of levels in 5-way 1-shot and 5-way 5-shot settings on CUB-200-2011. The best outcomes are highlighted in bold. B0 denotes one level of distribution propagation. B1 denotes two levels of distribution propagation. B2 denotes the MDPN, which contains three levels of distribution propagation. B3 denotes four levels of distribution propagation. B4 denotes five levels of distribution propagation.

5.3. The Impact of Weight Factors

As mentioned in Section 3.4, we set the weight factors $\lambda_{dist_1}$, $\lambda_{dist_2}$, and $\lambda_{dist_3}$ of the distribution loss, which enters the objective (16), as follows: $(\lambda_{dist_1}, \lambda_{dist_2}, \lambda_{dist_3}) = (0.0, 0.0, 1.0)$ in the 5-way 1-shot setting, and $(\lambda_{dist_1}, \lambda_{dist_2}, \lambda_{dist_3}) = (1.0, 1.0, 1.0)$ in the 5-way 5-shot setting. To verify that these weight factor settings improve the classification accuracy, we conducted a series of experiments, as shown in Table 4.
Table 4. The accuracy (%) of different settings of the three weight factors in the loss function in 5-way 1-shot and 5-way 5-shot settings on CUB-200-2011. The best results are shown in bold. C0 denotes $(\lambda_{dist_1}, \lambda_{dist_2}, \lambda_{dist_3}) = (0.0, 0.0, 1.0)$. C1 denotes $(\lambda_{dist_1}, \lambda_{dist_2}, \lambda_{dist_3}) = (1.0, 1.0, 0.0)$. C2 denotes $(\lambda_{dist_1}, \lambda_{dist_2}, \lambda_{dist_3}) = (1.0, 1.0, 1.0)$.

6. Conclusions

The MDPN proposed in this paper is optimized for the problems of noise interference and inadequate feature extraction in few-shot classification. The ISOL module dramatically reduces background and pseudo-object noise effects, and multilevel distributions are generated by the GMDP module. Benefiting from these, MDPN performs well on the CUB-200-2011: MDPN exceeds the baseline by 8.17% under the 5-way 1-shot setting and 1.24% under the 5-way 5-shot setting. For future work, we aim to validate our model on more small datasets.

Author Contributions

Conceptualization, J.W., H.Z. (Haixinag Zhang) and J.F.; methodology, J.W., H.Z. (Haixinag Zhang), J.F. and M.S.; software, J.W.; validation, J.W.; formal analysis, J.W., H.Z. (Haixinag Zhang), J.F., H.Z. (Huaxiong Zhang) and M.S.; investigation, J.W. and H.Z. (Haixinag Zhang); resources, H.Z. (Haixinag Zhang), J.F. and H.M.; data curation, J.W.; writing—original draft preparation, J.W.; writing—review and editing, J.W. and H.Z. (Haixinag Zhang), J.F., H.Z. (Huaxiong Zhang) and M.S.; visualization, J.W.; supervision, H.Z. (Haixinag Zhang), J.F., H.M., H.Z. (Huaxiong Zhang) and M.J.; project administration, H.Z. (Haixinag Zhang), J.F., H.M. and M.J.; funding acquisition, H.Z. (Haixinag Zhang), J.F., H.M. and M.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant numbers 61672466 and 62011530130; the Joint Fund of the Zhejiang Provincial Natural Science Foundation, grant number LSZ19F010001; and the Key Research and Development Program of Zhejiang Province, grant number 2020C03060.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data are unavailable due to privacy.

Acknowledgments

Thanks to my teachers and friends for their support in my research.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  2. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  3. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; DHoward, R.; Hubbard, W.; Jackel, L. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989, 1, 541–551. [Google Scholar] [CrossRef]
  4. Jiang, Z.; Kang, B.; Zhou, K.; Feng, J. Few-shot Classification via Adaptive Attention. arXiv 2020, arXiv:2008.02465.
  5. Santoro, A.; Bartunov, S.; Botvinick, M.; Wierstra, D.; Lillicrap, T. Meta-learning with memory-augmented neural networks. In Proceedings of the International Conference on Machine Learning PMLR, New York, NY, USA, 19–24 June 2016; pp. 1842–1850.
  6. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems 28, Proceedings of the Annual Conference on Neural Information Processing Systems 2015, Montreal, QC, Canada, 7–12 December 2015; Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R., Eds.; NeurIPS: San Diego, CA, USA, 2015; p. 28.
  7. Munkhdalai, T.; Yu, H. Meta networks. In Proceedings of the International Conference on Machine Learning PMLR, Sydney, NSW, Australia, 6–11 August 2017; pp. 2554–2563.
  8. Cai, Q.; Pan, Y.; Yao, T.; Yan, C.; Mei, T. Memory matching networks for one-shot image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4080–4088.
  9. Kaiser, Ł.; Nachum, O.; Roy, A.; Bengio, S. Learning to remember rare events. arXiv 2017, arXiv:1703.03129.
  10. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning PMLR, Sydney, NSW, Australia, 6–11 August 2017; pp. 1126–1135.
  11. Li, Z.; Zhou, F.; Chen, F.; Li, H. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv 2017, arXiv:1707.09835.
  12. Nichol, A.; Achiam, J.; Schulman, J. On first-order meta-learning algorithms. arXiv 2018, arXiv:1803.02999.
  13. Lee, K.; Maji, S.; Ravichandran, A.; Soatto, S. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10657–10665.
  14. Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; Wierstra, D. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; NeurIPS: San Diego, CA, USA, 2016; p. 29.
  15. Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; NeurIPS: San Diego, CA, USA, 2017; p. 30.
  16. Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.S.; Hospedales, T.M. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1199–1208.
  17. Zhu, Y.; Liu, C.; Jiang, S. Multi-attention Meta Learning for Few-shot Fine-grained Image Recognition. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, Yokohama, Japan, 7–15 January 2021; pp. 1090–1096.
  18. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929.
  19. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
  20. Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 839–847.
  21. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
  22. Gao, T.; Han, X.; Liu, Z.; Sun, M. Hybrid attention-based prototypical networks for noisy few-shot relation classification. Proc. AAAI Conf. Artif. Intell. 2019, 33, 6407–6414.
  23. Garcia, V.; Bruna, J. Few-shot learning with graph neural networks. arXiv 2017, arXiv:1711.04043.
  24. Liu, Y.; Lee, J.; Park, M.; Kim, S.; Yang, E.; Hwang, S.; Yang, Y. Learning to propagate labels: Transductive propagation network for few-shot learning. arXiv 2018, arXiv:1805.10002.
  25. Kim, J.; Kim, T.; Kim, S.; Chang, D.Y. Edge-labeling graph neural network for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11–20.
  26. Yang, L.; Li, L.; Zhang, Z.; Zhou, X.; Zhou, E.; Liu, Y. DPGN: Distribution propagation graph network for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13390–13399.
  27. Gidaris, S.; Komodakis, N. Generating classification weights with GNN denoising autoencoders for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 21–30.
  28. Zhang, Y.; Li, W.; Zhang, M.; Wang, S.; Tao, R.; Du, Q. Graph information aggregation cross-domain few-shot learning for hyperspectral image classification. IEEE Trans. Neural Netw. Learn. Syst. 2022, 1–14.
  29. Zhong, X.; Gu, C.; Ye, M.; Huang, W.; Lin, C.W. Graph complemented latent representation for few-shot image classification. IEEE Trans. Multimed. 2022, 1.
  30. Shalam, D.; Korman, S. The self-optimal-transport feature transform. arXiv 2022, arXiv:2204.03065.
  31. Hu, Y.; Pateux, S.; Gripon, V. Squeezing backbone feature distributions to the max for efficient few-shot learning. Algorithms 2022, 15, 147.
  32. Zhang, H.; Cao, Z.; Yan, Z.; Zhang, C. Sill-Net: Feature augmentation with separated illumination representation. arXiv 2021, arXiv:2102.03539.
  33. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning PMLR, Lille, France, 7–9 July 2015; pp. 448–456.
  34. Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323.
  35. Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset; California Institute of Technology: Pasadena, CA, USA, 2011.
  36. Gidaris, S.; Komodakis, N. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4367–4375.
  37. Ye, H.J.; Hu, H.; Zhan, D.C.; Sha, F. Learning embedding adaptation for few-shot learning. arXiv 2018, arXiv:1812.03664.
  38. Rusu, A.A.; Rao, D.; Sygnowski, J.; Vinyals, O.; Pascanu, R.; Osindero, S.; Hadsell, R. Meta-learning with latent embedding optimization. arXiv 2018, arXiv:1807.05960.
  39. Zhang, C.; Cai, Y.; Lin, G.; Shen, C. DeepEMD: Few-shot image classification with differentiable earth mover’s distance and structured classifiers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12203–12213.
  40. Ye, H.J.; Hu, H.; Zhan, D.C.; Sha, F. Few-shot learning via embedding adaptation with set-to-set functions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8808–8817.
  41. Wertheimer, D.; Tang, L.; Hariharan, B. Few-shot classification with feature map reconstruction networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8012–8021.
  42. Chen, W.Y.; Liu, Y.C.; Kira, Z.; Wang, Y.C.F.; Huang, J.B. A closer look at few-shot classification. arXiv 2019, arXiv:1904.04232.
  43. Liu, Y.; Zheng, T.; Song, J.; Cai, D.; He, X. DMN4: Few-shot learning via discriminative mutual nearest neighbor neural network. Proc. AAAI Conf. Artif. Intell. 2022, 36, 1828–1836.
  44. Kang, D.; Kwon, H.; Min, J.; Cho, M. Relational embedding for few-shot classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 8822–8833.
  45. Rodríguez, P.; Laradji, I.; Drouin, A.; Lacoste, A. Embedding propagation: Smoother manifold for few-shot classification. In Computer Vision–ECCV 2020, Proceedings of the 16th European Conference, Part XXVI, Glasgow, UK, 23–28 August 2020; Springer International Publishing: Cham, Switzerland, 2020; pp. 121–138.
  46. Chen, C.; Yang, X.; Xu, C.; Huang, X.; Ma, Z. ECKPN: Explicit class knowledge propagation network for transductive few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6596–6605.
  47. Laskar, Z.; Kannala, J. Context aware query image representation for particular object retrieval. In Proceedings of the Image Analysis: 20th Scandinavian Conference, SCIA 2017, Tromsø, Norway, 12–14 June 2017; Springer International Publishing: Cham, Switzerland, 2017; pp. 88–99.
  48. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
