Prototype-Based Self-Adaptive Distribution Calibration for Few-Shot Image Classification

Abstract: Deep learning has flourished in large-scale supervised tasks. However, in many practical settings, abundant labeled data are a luxury. Thus, few-shot learning (FSL), which learns new classes from only a few labeled samples, has recently received growing interest and achieved significant progress. The advanced distribution calibration approach estimates the ground-truth distribution of few-shot classes by reusing the statistics of auxiliary data. However, there is still a significant discrepancy between the estimated distributions and the ground-truth distributions, and manually set hyperparameters cannot adapt to different application scenarios (i.e., datasets). This paper proposes a prototype-based self-adaptive distribution calibration framework for accurately estimating the ground-truth distribution and self-adaptively optimizing hyperparameters for different application scenarios. Specifically, the proposed method consists of two components. The prototype-based representative mechanism obtains and exploits more global information about few-shot classes to improve classification performance. The self-adaptive hyperparameter optimization algorithm searches for robust hyperparameters for the distribution calibration of different application scenarios. Ablation studies verify the effectiveness of each component of the proposed framework. Extensive experiments are conducted on three standard benchmarks: miniImageNet, CUB-200-2011, and CIFAR-FS. The competitive results and compelling visualizations indicate that the proposed framework achieves state-of-the-art performance.


Introduction
Humans have a remarkable ability to recognize novelty after looking at only a few examples. However, the enormous development of deep learning is inseparable from large-scale datasets and networks. As a bridge between human ability and artificial intelligence, few-shot learning (FSL) has recently attracted considerable attention [1,2], particularly for image classification [3]. Under the few-shot challenge, the image classification model learns to classify images when only a few samples per class are provided for training.
Most few-shot image classification methods are based on meta-learning [4,5] and metric learning [6,7], which make the model adapt quickly to unseen tasks to improve its generalization ability. Furthermore, some researchers try to avoid model overfitting through data augmentation. Bendre et al. [8] utilize a multimodal method to reconstruct features with semantic and image knowledge from the latent space. Li et al. [9] utilize a conditional Wasserstein Generative Adversarial Network to synthesize various discriminative features to alleviate sample shortage. Current few-shot image classification methods focus on deep neural network training strategies to directly describe the class-level sample distributions. Unlike these methods, Yang et al. [10] recently estimated the ground-truth distribution of the samples by distribution calibration (DC) in a simple, hand-crafted manner, yet achieved very competitive performance. On the one hand, the classification accuracies of the existing methods are still not satisfactory. On the other hand, the distribution calibration relies on manually set hyperparameters that cannot adapt to different application scenarios. The main contributions of this paper are summarized as follows:
• A Prototype-based Representative Mechanism (PRM) is proposed to utilize the few-shot class centers in the distribution calibration, resulting in a more accurate estimation of the ground-truth distributions;
• A Self-adaptive Hyperparameter Optimization Algorithm (SHOA) is proposed for self-adaptive hyperparameter optimization for the distribution calibration of different application scenarios;
• We propose a Prototype-based Self-adaptive Distribution Calibration (PSDC) framework to estimate the ground-truth distributions and improve model performance in the few-shot image classification task;
• Comprehensive experiments are conducted to evaluate the effectiveness of the proposed framework, including comparison with state-of-the-art methods, ablation studies, and visualization verification.

Few-Shot Image Classification
Numerous few-shot image classification methods have been proposed in the last decade. They follow the task mechanism, constructing many tasks from related datasets to simulate the condition of having only a few samples. These typical few-shot learning methods can be divided into four types: optimization-based, metric-based, fine-tuning-based, and generation-based.
The optimization-based methods [4,5] use an alternate optimization strategy to learn how to update model parameters more quickly. As a result, the networks have a good initialization, update direction, and learning rate to adapt quickly to tasks. The metric-based methods classify samples by distinguishing the different distances between the images of the query set and the representatives of few-shot classes [6,14]. The fine-tuning-based methods pre-train a model on base class data and fine-tune a new classifier with a small amount of novel class data [15,16]. In particular, Mangla et al. [17] fine-tune the backbone rather than the classifier using the Manifold Mixup technique.
The generation-based methods [8,9,18-21] build complex networks to generate additional samples or features. Bendre et al. [8] reconstruct features by employing a multimodal strategy including semantic and image information. Li et al. [9] introduce a conditional Wasserstein GAN (WGAN) to synthesize fake features. By contrast, Hong et al. [21] combine the matching procedure with GANs [22] to generate image samples instead of features. Yang et al. [10] propose a novel generation-based method called distribution calibration, which calibrates the biased feature distributions of few-shot classes to approximate the corresponding ground-truth distributions. Sufficient labeled features can then be generated from the calibrated distribution for supervised training. Compared with this calibration strategy, our method estimates the ground-truth distributions more accurately and adapts to different application scenarios.

Simulated Annealing Algorithm
The simulated annealing (SA) algorithm [23] is one of the most favored metaheuristics. Inspired by the metal annealing process, it simulates disordered metal atoms at a high temperature gradually reaching an equilibrium state as the temperature is lowered.
The SA algorithm is a stochastic optimization algorithm that searches the neighborhood of the current solution and decides whether to accept the new solution with a certain probability. The simulated annealing algorithm comprises an inner loop and an outer loop. In the inner loop, a slight perturbation is applied to the current solution for a new solution under the condition of the current temperature value. Then, the new solution is accepted by a certain probability, and the current optimal solution is updated. In the outer loop, the temperature is initially set at a high value and gradually decreases to a pre-set stop temperature.
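The two-loop structure described above can be sketched in a few lines. The function, the neighborhood move, and the constants below are illustrative choices, not the paper's implementation:

```python
import math
import random

def simulated_annealing(objective, x0, neighbor,
                        t_initial=1.0, t_stop=1e-3, gamma=0.9,
                        inner_iters=20, seed=0):
    """Generic simulated annealing: the outer loop cools the temperature,
    the inner loop perturbs the current solution and accepts via Metropolis."""
    rng = random.Random(seed)
    x, j = x0, objective(x0)
    best_x, best_j = x, j
    t = t_initial
    while t > t_stop:                      # outer loop: cooling schedule
        for _ in range(inner_iters):       # inner loop: local search at fixed t
            x_new = neighbor(x, rng)
            j_new = objective(x_new)
            delta = j_new - j
            # always accept improvements; accept worse moves with prob exp(-delta/t)
            if delta < 0 or rng.random() < math.exp(-delta / t):
                x, j = x_new, j_new
                if j < best_j:
                    best_x, best_j = x, j
        t *= gamma                         # geometric cooling
    return best_x, best_j
```

For example, minimizing the toy objective (x - 3)^2 from a starting point of 0 with a small uniform perturbation reliably lands near x = 3, since worse moves are accepted less and less often as the temperature decays.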
Many researchers have recently applied the simulated annealing algorithm in many fields, such as software defect estimation [24], deep feature selection [25], and deep neural networks [26-28]. Rere et al. [26] introduce the SA algorithm for updating the deep convolutional neural network (DCNN). Ayumi et al. [27] improve model performance by optimizing the training process of a DCNN with a variant of the SA algorithm. Hu et al. [28] adopt the SA algorithm to optimize the initial weights of the fully connected layer. Unlike these papers, which apply the simulated annealing algorithm to weight optimization, we introduce its idea into the hyperparameter optimization process.

Method
In the following sections, we present the problem definition of the few-shot classification setting in Section 3.1 and the Prototype-based Representative Mechanism (PRM) in Section 3.2. Section 3.3 describes how to generate new robust samples and train the classifier. Algorithm 1 shows the detailed training process on the N-way-K-shot task, and the pipeline of distribution calibration with PRM is presented in Figure 1. The details of the Self-adaptive Hyperparameter Optimization Algorithm (SHOA) are clarified in Section 3.4. Figure 2 shows the basic block diagram of SHOA, and Figure 3 presents the pipeline of SHOA.
Algorithm 1 Training process on the N-way-K-shot task
Input: base class data D_base and support set S_T = {(X_j, y_j)}_{j=1}^{N×K}.
Output: the optimal parameters of the classifier f_θ.
1: Calculate the mean μ_i and covariance Σ_i of each base class as Equations (1) and (2);
2: Extract the feature vectors of the support set samples with the pre-trained extractor F_θ*(·) as Equation (3);
3: Update the support set with the extracted features as Equation (4);
4: Update the query set with the extracted features as Equation (5);
5: Transform the support set features with Tukey's Ladder of Powers as Equation (6);
6: for each novel class y_i do
7:   Calculate the prototype-based representative x̄_i as Equation (7);
8:   Merge the prototype-based representative into the support set as Equation (8);
9:   Build a set D containing L2 norm values of the difference between x̄_i and {μ_i} as Equation (9);
10:  Build a set I_k containing the k closest base classes as Equation (10);
11:  Calculate the calibrated mean μ̂ and covariance Σ̂ as Equations (11) and (12);
12:  Sample feature vectors for label y_i from the calibrated distribution as Equation (14);
13: end for
14: Train a task-specific classifier f_θ with all sampled features and support set features as Equation (15).

Figure 2. The basic block diagram of the Self-Adaptive Hyperparameter Optimization Algorithm, which contains two loops. In the inner loop, a slight perturbation is applied to the current hyperparameters, which are accepted with a certain probability. In the outer loop, the temperature decreases from a high value to a pre-set stop temperature.

Figure 3. The flow diagram of the Self-Adaptive Hyperparameter Optimization Algorithm. As the temperature decreases, the acceptance probability decreases, helping jump out of the local minimum.

Problem Definition
Following the traditional few-shot classification setting, the whole dataset consists of data-label pairs D = {(X_i, y_i)}, where X_i ∈ R^{H×W×3} is the ith image, y_i ∈ C is the class label of X_i, and C contains the labels of all classes. C can be divided into the label set of base classes C_base and the label set of novel classes C_novel, where C_base ∪ C_novel = C and C_base ∩ C_novel = ∅. The dataset D is accordingly divided into two subsets, namely the base class data D_base and the novel class data D_novel. We adopt the meta-learning fashion with episodic training [4,6,14] to evaluate the rapid adaptation ability of the model. Numerous tasks are sampled from the novel class data D_novel in the N-way-K-shot manner: only K labeled samples are available in each of the N classes randomly chosen from the novel classes. Each task T is a tuple (S_T, Q_T), including a support set S_T = {(X_j, y_j)}_{j=1}^{N×K} and a query set Q_T, where X_j ∈ R^{H×W×3} is the jth image, y_j ∈ C_novel is the class label of X_j, and q samples of each class are used for testing. A task-specific classifier is trained on the support set S_T and evaluated on the query set Q_T.
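The episodic sampling described above can be sketched as follows. The data layout, a dict mapping each novel class label to its list of samples, is an assumption made for illustration:

```python
import random

def sample_task(novel_data, n_way=5, k_shot=1, q_query=15, seed=0):
    """Sample one N-way-K-shot task (support/query split) from novel-class data.
    novel_data maps each class label to its list of samples."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(novel_data), n_way)   # choose N novel classes
    support, query = [], []
    for label in classes:
        samples = rng.sample(novel_data[label], k_shot + q_query)
        support += [(x, label) for x in samples[:k_shot]]   # K labeled shots
        query += [(x, label) for x in samples[k_shot:]]     # q held-out per class
    return support, query
```

A 5-way-1-shot task with q = 15 thus yields 5 support pairs and 75 disjoint query pairs per episode.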

Distribution Calibration with a Prototype-Based Representative Mechanism
A pre-trained feature extractor [17] F_θ*(·) with parameters θ* is used to extract a d-dimensional feature vector for each image. It provides an initial feature space learned from the base class data D_base with sufficient samples. Yang et al. [10] assume that each feature dimension of the data from the same class follows a Gaussian distribution. Thus, the mean vector and the covariance matrix of feature vectors from the ith base class are calculated as:

μ_i = (1/n_i) Σ_{j=1}^{n_i} F_θ*(X_j),   (1)
Σ_i = (1/(n_i − 1)) Σ_{j=1}^{n_i} (F_θ*(X_j) − μ_i)(F_θ*(X_j) − μ_i)^T,   (2)

where n_i is the number of samples in the ith base class, and X_j is the jth image in the ith base class. The same pre-trained feature extractor F_θ*(·) extracts the feature vectors of each sample from the novel classes:

x_j = F_θ*(X_j),   (3)

where X_j is the jth image in the novel class data D_novel. After feature extraction, the support and query sets are updated as follows:

S_T = {(x_j, y_j)}_{j=1}^{N×K},   (4)
Q_T = {(x_j, y_j)}_{j=N×K+1}^{N×(K+q)},   (5)

where x_j ∈ R^d is the jth feature vector, and y_j is its class label. The feature distributions of samples in novel classes may be skewed due to their properties. Therefore, Tukey's Ladder of Powers transformation [29] is introduced to make them more consistent with a Gaussian distribution. The feature vectors of samples in novel classes are transformed as follows:

x̃_j = x_j^β if β ≠ 0; log(x_j) if β = 0,   (6)

where x_j is the jth feature vector, and β is a hyperparameter that modifies the distribution skewness. The prototype-based representative of the ith novel class in the support set can be calculated as:

x̄_i = (1/K) Σ_{j=1}^{K} x̃_j,   (7)

where x̃_j is the jth transformed feature vector of the ith novel class. The prototype-based representatives, which contain more global information about the few-shot classes, are merged into the support set as follows:

S̃_T = S_T ∪ {(x̄_i, y_i)}_{i=1}^{N}.   (8)

After expanding the support set with prototype-based representatives, the statistics are transferred from the k closest base classes to the few-shot class. We use the Euclidean distance between each feature in the support set and the mean feature vectors of the base classes to measure similarity as the foundation for the transfer.
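The Tukey transformation and the prototype step can be sketched as follows. This assumes NumPy arrays of non-negative features (e.g., post-ReLU activations); the function names are ours, not the paper's:

```python
import numpy as np

def tukey_transform(x, beta=0.5):
    """Tukey's Ladder of Powers: x**beta for beta != 0, log(x) otherwise.
    Assumes non-negative features."""
    return np.power(x, beta) if beta != 0 else np.log(x)

def add_prototypes(support_feats, support_labels):
    """Append each class's prototype (the mean of its transformed features)
    to the support set, as in the representative mechanism above."""
    feats, labels = list(support_feats), list(support_labels)
    for c in sorted(set(support_labels)):
        class_feats = support_feats[np.array(support_labels) == c]
        feats.append(class_feats.mean(axis=0))   # prototype carries global info
        labels.append(c)
    return np.stack(feats), labels
```

In a K-shot task this turns each class's K features into K + 1, with the extra vector acting as the class center.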
The k closest base classes are selected as follows:

D = {‖x̄_j − μ_i‖_2 | i ∈ C_base},   (9)
I_k = Topk(−D),   (10)

where x̄_j is the jth feature vector in the support set, I_k contains the class labels of the k most similar base classes, and Topk(·) is the operation that selects the top k elements from the distance set D. Then, the calibrated mean and covariance of the calibrated distribution are given as follows:

μ̂ = (Σ_{i∈I_k} μ_i + x̄_j) / (k + 1),   (11)
Σ̂ = (1/k) Σ_{i∈I_k} Σ_i + α,   (12)

where α is a compensation for the within-class variation of the few-shot classes. The above distribution calibration procedure is performed on each feature vector in the support set S_T for a more precise estimation of the ground-truth distribution.
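The distance, Topk selection, and calibration steps can be sketched as follows. The default values of k and alpha are illustrative only, and the helper name is ours:

```python
import numpy as np

def calibrate(x, base_means, base_covs, k=2, alpha=0.2):
    """Calibrate one (transformed) support feature x using the statistics of
    its k closest base classes; alpha compensates within-class variation."""
    dists = np.linalg.norm(base_means - x, axis=1)        # Euclidean distances
    nearest = np.argsort(dists)[:k]                       # indices of k closest
    mu = (base_means[nearest].sum(axis=0) + x) / (k + 1)  # calibrated mean
    sigma = base_covs[nearest].mean(axis=0) + alpha       # calibrated covariance
    return mu, sigma, nearest
```

Averaging the feature itself with the nearest base-class means pulls the biased few-shot estimate toward statistics computed from many samples.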

Sample Generation and Classifier Training
After transferring the statistics from the k closest base classes to a few-shot class, the calibrated distributions for label y_i form a set as follows:

C_{y_i} = {(μ̂_1, Σ̂_1), . . . , (μ̂_{K+1}, Σ̂_{K+1})},   (13)

where μ̂_i represents the mean of the ith calibrated distribution, and Σ̂_i represents its covariance. The set C_{y_i} contains K + 1 calibrated distributions belonging to the same novel class: one per support sample plus one for the prototype-based representative. More diverse and robust features for label y_i are generated from these calibrated distributions and constructed into a set as follows:

D_{y_i} = {(x, y_i) | x ∼ N(μ̂, Σ̂), ∀(μ̂, Σ̂) ∈ C_{y_i}},   (14)

The total number of sampled feature vectors is a hyperparameter M, and M/(K + 1) feature vectors are sampled from each calibrated distribution. For each few-shot classification task, a task-specific classifier is trained on the original features in the support set and the sampled features. The training loss is given by:

L = Σ_{(x, y) ∈ S̃_T ∪ D_Y} L_CE(f_θ(x), y),   (15)

where D_Y = ∪_{y_i ∈ Y_T} D_{y_i}, Y_T contains all labels in task T, L_CE is the cross-entropy loss, and the parameters of the task-specific classifier are denoted by θ.
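The sampling step can be sketched as follows. The value of M and the function name are illustrative:

```python
import numpy as np

def generate_features(calibrated, m_total=750, seed=0):
    """Draw M/(K+1) feature vectors from each of the K+1 calibrated Gaussians
    of one novel class; m_total is the hyperparameter M."""
    rng = np.random.default_rng(seed)
    per_dist = m_total // len(calibrated)
    samples = [rng.multivariate_normal(mu, sigma, size=per_dist)
               for mu, sigma in calibrated]
    return np.concatenate(samples, axis=0)
```

The resulting synthetic features, together with the original support features, give the classifier a training set of ordinary supervised size.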

Simulated Annealing Algorithm
The simulated annealing algorithm [23] is a probability-based optimization strategy, usually utilized to search for the global optimum of an objective function that may have multiple local optima. In the simulated annealing algorithm, the optimization problem is to find a collection of variables X = {X_1, X_2, . . . , X_n} that maximizes or minimizes the value of the objective function f(X). The hyperparameters of distribution calibration with the prototype-based representative mechanism include the power of Tukey's transformation β, the number of selected similar classes k, the compensation for the within-class variation α, and the number of generated features M. The objective is to seek the optimal hyperparameters that lead to the smallest mean error rate on 10,000 few-shot classification tasks. The objective function can be given by:

J(X) = 1 − Acc(X),   (16)

where Acc(·) is the operator that calculates the mean accuracy on 10,000 classification tasks. The simulated annealing algorithm simulates a solid being heated and then slowly cooled. In the natural sciences, the temperature has physical significance; for the optimization problem, however, the temperature is only a control parameter. The initial temperature and the stop temperature, denoted by T_initial and T_stop, control the number of iterations of the algorithm. The cooling schedule is given by:

T_{t+1} = γ T_t,   (17)

where γ is a constant less than one acting as the cooling coefficient, T_t is the current temperature in the outer loop, and T_{t+1} is the next temperature in the outer loop.

Metropolis Algorithm
The simulated annealing process aims to find parameters that maximize or minimize the objective function. The hyperparameters move into the neighborhood of the current position through a tiny random perturbation. Correspondingly, the difference in the objective function value caused by this perturbation is denoted as ΔJ = J_{n+1} − J_n, where J_{n+1} is the new objective function value resulting from the perturbed hyperparameters, and J_n is the current value before the perturbation. When the objective function value decreases (ΔJ < 0), the new hyperparameters are accepted as the starting point for the next perturbation. When the objective function value increases (ΔJ > 0), the new hyperparameters may still be accepted with a certain probability, which helps escape from local optima. The Metropolis algorithm [30] is introduced to generate the probability that determines whether to accept the new parameters. The acceptance probability is given by:

P = exp(−ΔJ / T_t),   (18)

where ΔJ is the difference in the objective function value, and T_t is the current temperature in the outer loop; together they control the acceptance probability. For the same ΔJ, the probability is larger at higher temperatures and smaller at lower temperatures. This probabilistic acceptance is achieved by comparing P with a random number η sampled from the continuous uniform distribution U(0, 1): when η < P, the new parameters are accepted.
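The Metropolis criterion amounts to a one-line rule:

```python
import math

def acceptance_probability(delta_j, temperature):
    """Metropolis criterion: always accept improvements; accept worse moves
    with probability exp(-delta_j / T), which shrinks as T cools."""
    return 1.0 if delta_j < 0 else math.exp(-delta_j / temperature)
```

For the same small increase in the objective (say ΔJ = 0.01), the acceptance probability is close to 1 at T = 1.0 but nearly 0 at T = 0.01, which is exactly the cooling behavior described above.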

Evaluation Criteria
Experiments are performed on 10,000 tasks randomly sampled from novel classes, following 5-way-1-shot and 5-way-5-shot settings. The task-specific classifiers are evaluated on 15 query images per class in each task. The evaluation metric is the average top-1 accuracy on 10,000 tasks.

Implementation Details
Following the previous work [17], a WideResNet model is trained on base classes as the feature extractor for each dataset. As the task-specific classifier, we use the logistic regression classifier with L2 regularization. After classifier training, we use it for testing novel classes. The configuration used for all experiments consists of an Nvidia Quadro RTX 5000 (16 GB), 64 GB of RAM, Ubuntu 20.04, and torch 1.7.1.

Comparison to State-of-the-Art
We compare the proposed framework with other state-of-the-art methods, including optimization-based, metric-based, fine-tuning-based, and generation-based ones. Tables 1 and 2 present the classification results of our framework for 5-way-1-shot and 5-way-5-shot on miniImageNet and CUB-200-2011, respectively. Table 3 displays the results on CIFAR-FS. All tables show that our framework achieves the best performance for 1-shot and 5-shot compared with the other state-of-the-art methods on miniImageNet, CUB-200-2011, and CIFAR-FS. Table 1. 5-way-1-shot and 5-way-5-shot classification accuracy (%) on miniImageNet with 95% confidence intervals.

Visualization of Generated Samples
A 5-way-1-shot task is randomly sampled from the novel classes of miniImageNet. Figure 4 shows the t-SNE [48] visualization of the ground-truth distributions and the generated features on the task. Figure 4a presents the support set in the task, which includes five samples. We show the features generated by PSDC in Figure 4b, the features generated by DC [10] in Figure 4c, and the ground-truth feature distributions in Figure 4d. From Figure 4, we can observe that the feature distributions generated by PSDC are more aggregated and more consistent with the corresponding ground-truth distributions than those generated by DC. The visualization illustrates why our method improves the accuracy of the few-shot classification task.

Ablation Study
The results of the ablation study are reported in Table 4 to show the effects of the components of the proposed framework, namely the Prototype-based Representative Mechanism (PRM) and the Self-adaptive Hyperparameter Optimization Algorithm (SHOA). The ablation experiments are conducted on 10,000 tasks, and the reported results are the average classification accuracy with a 95% confidence interval. The baseline is the classification performance of the model when no components of the Prototype-based Self-adaptive Distribution Calibration (PSDC) framework are used. In the case of 5-way-1-shot, there are five classes in the support set, and the single sample of each class is itself the prototype-based representative. Therefore, the Prototype-based Representative Mechanism is only applied in the 5-shot setting, and we use "-" to mark the unchanged result in Table 4. In the case of 5-way-5-shot, when the model is trained without SHOA, PRM improves the performance by 0.13% on miniImageNet and by 0.08% on CUB-200-2011. When the model is trained with SHOA, PRM improves the performance by 0.1% and 0.07% on miniImageNet and CUB-200-2011, respectively. Table 4 shows that, when the model is trained without PRM, SHOA improves the performance by 0.16% and 0.13% on miniImageNet for 1-shot and 5-shot, respectively. Moreover, the model performance is improved by 0.14% and 0.29% on CUB-200-2011 for 1-shot and 5-shot, respectively. When the model is trained with PRM, SHOA improves the performance by 0.1% and 0.28% for 5-shot on miniImageNet and CUB-200-2011, respectively. These results illustrate that SHOA can not only self-adaptively optimize hyperparameters for the distribution calibration of different application scenarios (i.e., datasets) but also build a robust distribution calibration model for few-shot image classification.
Table 4 shows that, when the model is trained with PRM and SHOA, model performance compared with baseline can be improved by 0.23% and 0.36% for 5-shot on miniImageNet and CUB-200-2011, respectively. The results of the ablation experiment illustrate that the classification performance of models can be improved by using both PRM and SHOA in the few-shot image classification task.

Hyperparameter Searching
The hyperparameters are searched on the validation set. The optimal hyperparameters achieve the lowest error rate, i.e., the highest accuracy. Before hyperparameter searching, some parameters need to be initialized, including the current and optimal hyperparameters of distribution calibration, the temperature for controlling acceptance probability, and the current and optimal values of the objective function (i.e., accuracy or error rate).
In the inner loop, a slight perturbation is added to a random hyperparameter, resulting in a new error rate, which we compare with the current error rate. In the first case, the new hyperparameters are accepted if the new error rate is less than the current error rate. In the other case, if the new error rate is greater than or equal to the current error rate, the acceptance probability is determined by the current temperature and the difference between the new and current error rates as Equation (18), which helps jump out of the local minimum. If new hyperparameters are accepted and the new error rate is lower than the optimal error rate, the optimal error rate and the optimal hyperparameters are updated with the new error rate and the new hyperparameters.
As the number of outer loops increases, temperature decay is performed according to Equation (17). The acceptance probability decreases to almost zero as the temperature decays. The whole hyperparameter optimization is repeated for a certain number of iterations, and robust hyperparameters that approximate the global optimum can be found. Our method can make the objective function converge to the globally optimal value by jumping out of locally optimal values with a certain probability. Furthermore, it automatically searches for appropriate hyperparameter values rather than requiring them to be set manually, implementing self-adaptive hyperparameter optimization for different application scenarios (i.e., datasets).
We present the process of hyperparameter optimization in Figure 5. Figure 5a-c show the variation of the model error rate on three different validation sets as the temperature iterations increase. The red line displays the evolution of the best model performance. The blue line represents the model performance at the last iteration of the inner loop. In addition, we show the variable acceptance probability at the last iteration of the inner loop as the number of temperature iterations increases in Figure 5d. Figure 5 shows that, as the number of temperature iterations increases, the model classification performance gradually coincides with the optimal classification performance, and the acceptance probability gradually decreases.

Conclusions
This paper proposes a novel few-shot image classification framework, Prototype-based Self-adaptive Distribution Calibration (PSDC), which includes a prototype-based representative mechanism and a self-adaptive hyperparameter optimization algorithm. To the best of our knowledge, PSDC is the first attempt to utilize a self-adaptive strategy in the hyperparameter optimization of distribution calibration. Extensive experiments are conducted on miniImageNet, CUB-200-2011, and CIFAR-FS. Experimental results indicate that PSDC surpasses other few-shot learning methods on the three benchmarks. Furthermore, the visualization of the generated features verifies that PSDC can better estimate the ground-truth distribution. All results verify the effectiveness of the proposed framework and demonstrate that PSDC achieves state-of-the-art performance. However, since the current algorithm for estimating the ground-truth distribution is still hand-crafted, future work will focus on learning distributions with deep neural networks to estimate the ground-truth distributions more accurately.