Effect of Probabilistic Similarity Measure on Metric-Based Few-Shot Classiﬁcation

Abstract: In developing a few-shot classification model using deep networks, the limited number of samples in each class makes it difficult to utilize the statistical characteristics of the class distributions. In this paper, we propose a method to address this difficulty by combining a probabilistic similarity based on intra-class statistics with a metric-based few-shot classification model. Noting that the probabilistic similarity estimated from intra-class statistics and the classifier of conventional few-shot classification models share a common assumption on the class distributions, we propose to apply the probabilistic similarity both to compute the loss for episodic learning of the embedding network and to classify unseen test data. By defining the probabilistic similarity as the probability density of difference vectors between two samples with the same class label, a more reliable estimate of the similarity can be obtained, especially when the number of classes is large. Through experiments on various benchmark datasets, we confirm that the probabilistic similarity can improve classification performance, especially when the number of classes is large.


Introduction
Pattern recognition methods using deep learning techniques have shown good results in many applications [1][2][3][4][5]. However, these results can only be obtained with a sufficiently large amount of training data. Unlike conventional deep learning models, humans can classify patterns with only a small number of samples. In order to realize this ability in a deep learning model, studies on few-shot learning have been attracting attention recently [6][7][8].
In few-shot learning for classification tasks, a classifier is required to recognize classes that are unseen in the learning phase, with a very limited number of samples. To achieve this goal, a number of few-shot classification models have been proposed that are composed of two modules: an embedding module and a classification module [8][9][10][11][12]. The embedding module extracts appropriate features by mapping a given input to an embedding space, and the classification module tries to classify newly given samples (query data) with only a few training samples (support data) by using the features from the embedding module. Since a new learning strategy, called episodic learning, was proposed in [8] to obtain a good embedding function under the few-shot scenario, most subsequent works have mainly focused on designing a good embedding function model that can provide an efficient and general representation for recognizing unseen test classes [13][14][15][16][17].
On the other hand, there has been relatively little interest in the classification module, where linear classifiers or simple distance-based classifiers have mainly been adopted [18][19][20]. This approach seems appropriate for situations in which the number of labeled samples for test classes is very limited, because classifiers with high complexity can easily overfit the few given samples. Although computational experiments have shown that a simple classifier combined with a well-generalized embedding module can achieve good performance [21], there is further room for improvement in the classification module [22][23][24]. In this paper, we try to improve the performance of few-shot classifiers by elaborating on the distance-based classification module.
It is well known that the accuracy of a distance-based classifier can be improved by using a statistical measure such as the Mahalanobis distance [25] rather than simple geometric measures such as the Euclidean distance. Under the few-shot classification setting, however, it is difficult to utilize a statistical measure because accurate estimation of the distribution is hard due to the extremely limited number of samples. Inaccurate estimation of the class distributions leads to a poor distance measure, resulting in low classification accuracy.
To overcome this limitation, we propose to combine the probabilistic similarity based on intra-class statistics [26][27][28][29] with the prototypical network [9] that is a representative few-shot classification model. In [26,27], the probabilistic similarity between two samples is defined as a probability that they belong to the same class, and its probability density function is estimated under the assumption that a data point is generated from two factors: a class-specific factor and a class-independent factor. The class-specific factor can be represented by a prototype vector that is defined as the mean vector of support samples in the prototypical network [9]. The class-independent factor can be considered as an environmental factor that is irrelevant to each class, and thus can be estimated through an episodic learning strategy developed for few-shot learning [8,9,30].
Based on these considerations, we develop a method for applying the probabilistic similarity measure in the learning of the embedding function as well as in the recognition of unseen classes. By exploiting the similarity in the learning of the embedding function, we can also expect to obtain a better feature representation, one that is more suitable to the assumption on the data distribution made by the prototype-based classifier. Additionally, since the distribution of the class-independent factor can be estimated more accurately as the number of classes increases even when each class has few samples, the proposed method is expected to be more effective in the case of a large number of classes. This is an advantage of the proposed method that cannot be expected from the conventional works using Euclidean distance.
The aim and main contributions of our work are summarized below:
• In order to improve the performance of few-shot classification, we propose to combine the probabilistic similarity measure with deep embedding function networks.
• We define an explicit function of probabilistic similarity based on the intra-class statistics and propose a modified episodic learning algorithm that simultaneously performs estimation of the similarity and optimization of the embedding function.
• Whereas the conventional methods have been tested for a limited number of classes, we evaluate the change of performance as the number of classes increases, and confirm the apparent superiority of the proposed method, especially in the case of many classes.
• Although we adopted the prototypical network for the experiments, the proposed method is not constrained by the embedding network model, and thus it can be extended to various forms using more sophisticated deep network models.
In Section 2, we describe the few-shot classification problem and briefly review previous works that address it, focusing on the metric-based method. In Section 3, we explain the probabilistic similarity measure used in our proposed method, and the overall process of the proposed method is described in Section 4. Section 5 presents experimental results on benchmark datasets, comparing its performance with existing methods. Conclusions are drawn in Section 6.

Few-Shot Classification Problem
The few-shot classification task is to classify newly given samples using an extremely limited number of training samples. The set of given training samples is called the support set S, and the set of new samples to be classified is called the query set Q. Usually, we consider the N-way K-shot problem, where the number of classes is N and the number of support samples per class is K. Since the value of K is very small, it is difficult to obtain a good classification model with only support samples. Therefore, in order to develop a deep learning model for few-shot classification, it is common to use a separate dataset that is in the same domain but has completely distinct class labels. In this approach, the main goal of the learning is to find a deep learning model that can recognize query samples from the new test classes with only a few support samples.
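The episode construction described above can be sketched as follows; this is a minimal NumPy example with toy data, and names such as `sample_episode` are ours, not from the paper:

```python
import numpy as np

def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot episode: a support set S and a query set Q."""
    classes = rng.choice(len(data_by_class), size=n_way, replace=False)
    support, query = [], []
    for label, c in enumerate(classes):
        samples = data_by_class[c]
        idx = rng.permutation(len(samples))
        # First K samples form the support set, the next n_query the query set.
        support.append((samples[idx[:k_shot]], label))
        query.append((samples[idx[k_shot:k_shot + n_query]], label))
    return support, query

# Toy dataset: 10 classes, 20 samples each, 8-dimensional "features".
rng = np.random.default_rng(0)
data_by_class = [rng.normal(size=(20, 8)) for _ in range(10)]
support, query = sample_episode(data_by_class, n_way=5, k_shot=5, n_query=5, rng=rng)
```

Repeating this sampling over many episodes is what exposes the model to varied class combinations during episodic learning.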
One of the representative methods for achieving this goal is the metric-based method which tries to find an appropriate metric for classification [8][9][10]12]. As shown in Figure 1, the overall structure of the metric-based method is largely composed of two modules: the embedding network and the few-shot classifier. During the learning phase, a deep network model learns to find an embedding function that maps raw inputs to feature vectors on the embedding metric space. The few-shot classifier then predicts the class of query data based on their similarity to the support data, which is measured on the metric space. The loss from the classifier is then transmitted for learning of the embedding network.
Unlike the usual metric-learning problem, the few-shot learning task assumes that the classes in the test phase are unseen during the learning phase, and thus it is important to acquire a good embedding function that applies generally to unseen test classes. To address this problem, the matching networks [8] introduced the episodic learning strategy, which is one of the meta-learning techniques. In the episodic learning strategy, subsets in the form of an N-way K-shot classification, called episodes, are generated by random selection from the whole training data. The neural network model updates its parameters by learning one episode at a time. By proceeding through numerous episodes, the model learns a variety of cases composed of various classes and samples. In this way, the model is not limited to the given classes but can learn more generally about the domain of the classes. That is, information about unseen classes is obtained through the use of various combinations of classes that share some common factors of the domain.
Based on the episodic learning strategy, Snell et al. proposed the prototypical network [9], which combines a nonlinear embedding function network and a simple distance-based classifier. Under the assumption that there exists an embedding space on which samples from each class are clustered around a single prototype, it tries to find a good embedding network through episodic learning. Once a good embedding space is found, the classification is conducted by simply finding the nearest class prototype, defined as the mean of the support samples. The Euclidean distances between queries and the prototypes on the embedding space are also used for defining the loss function for learning the embedding network, so as to create an embedding space where samples from each class gather near their prototype.
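The prototype computation and the nearest-prototype rule described above can be sketched as follows; the helper names are ours, and the toy embeddings are hypothetical:

```python
import numpy as np

def prototypes(support_emb, support_lab, n_way):
    """Class prototype = mean of the embedded support samples of that class."""
    return np.stack([support_emb[support_lab == k].mean(axis=0) for k in range(n_way)])

def classify_euclidean(query_emb, protos):
    """Assign each query to the nearest prototype under squared Euclidean distance."""
    d = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

rng = np.random.default_rng(1)
# 2-way 3-shot toy embeddings tightly clustered around two distant centers.
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
sup = np.concatenate([centers[k] + 0.1 * rng.normal(size=(3, 2)) for k in range(2)])
lab = np.repeat(np.arange(2), 3)
protos = prototypes(sup, lab, 2)
pred = classify_euclidean(centers + 0.1 * rng.normal(size=(2, 2)), protos)
```

In the actual model, the same distances also feed the softmax loss used to train the embedding network.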
Since the prototypical network [9] has shown better performance than more complicated few-shot learning models [6][7][8], there have been a number of extensions based on the same structure shown in Figure 1 [10,[13][14][15]. Sung et al. [10] added a relation module after the embedding module for more fine-grained classification. Li et al. [13] proposed to use local features as additional information to image-level features in the embedding module. Based on the prototypical networks, Wertheimer et al. [14] used a concatenation of foreground and background vector representations as feature vectors. Kim et al. [15] introduced the variational autoencoder (VAE) structure [31] into the embedding module for training the prototype images.
These works focused on obtaining a good embedding network, and there has not been much interest in the classification module. This is primarily due to the limitations of the few-shot classification task. Even though it is known that alternatives such as the Mahalanobis distance [25] can be adopted instead of a simple deterministic distance, it is difficult to apply a statistical distance because it requires distributional information that is not sufficiently available in few-shot classification tasks.
As an attempt to overcome these difficulties, Fort [22] estimated confidence regions in the embedding space in the form of a Gaussian covariance matrix and used them to construct metrics. Liu et al. [24] proposed a new metric-learning formulation based on the Mahalanobis distance [25] to avoid the tendency to overfit the training classes. However, these methods still have difficulty in estimating the covariance matrix of each class under the few-shot setting. Li et al. [23] proposed a method relatively free from this problem by defining a local covariance that is obtained from local features in the embedding modules. Though this method shows the efficacy of second-order statistics, it needs a more complex classifier with specially designed metrics.
In this paper, we try to find some possibility of improving the classifier by using a probabilistic similarity based on the intra-class statistics, which can be estimated robustly especially when the number of classes increases. Unlike [20], our proposed method does not depend on the structure of the embedding module, and it can be combined with the original prototypical network as well as other sophisticated models.

Probabilistic Similarity Measure
In the probabilistic similarity measure [26][27][28][29], the similarity is defined as the probability that two data x and x′ belong to the same class c_k, which can be written as:

S(x, x′) = P(x ∈ c_k, x′ ∈ c_k).    (1)

An explicit function of this probability can be obtained by defining a generation model of data x with two components [28], a class component c and an environmental component ε, such as:

x = c + ε.    (2)

The environmental component ε originates from environmental variations such as illumination and is assumed to be independent of the class source. On the contrary, the class component c originates from a class-specific source determined differently for each class.
Additionally, Ref. [28] further assumes that the class component for each class c_k can be regarded as a unique prototype c_k, and that the intra-class variations are caused by the environmental component ε. The environmental component is also assumed to be independent of the class and identical regardless of the class label. Although these assumptions may be considered rather strict for application to real data, they are consistent with the assumption placed on the classifier of the prototypical network. More precisely, the simple classifier used in the prototypical network can be regarded as a particular case of the distance-based classifier using the probabilistic similarity of [26][27][28][29].
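The generative assumption above can be illustrated with a small simulation; this is our own sketch, with hypothetical prototypes and a shared environmental covariance, not data from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)
dim, n_classes, n_per_class = 4, 3, 200

# Class-specific prototypes c_k and one shared covariance for the environmental ε.
protos = rng.normal(scale=5.0, size=(n_classes, dim))
A = rng.normal(size=(dim, dim))
env_cov = A @ A.T / dim  # class-independent covariance of ε

# x = c_k + ε : every class shares the same environmental variation.
samples = np.stack([
    protos[k] + rng.multivariate_normal(np.zeros(dim), env_cov, size=n_per_class)
    for k in range(n_classes)
])

# Intra-class scatter is (approximately) identical across classes, while the
# class means differ -- exactly the assumption made by the model.
within = [np.cov(samples[k].T) for k in range(n_classes)]
```

Under this model, removing the class mean from each sample leaves only ε, which motivates the difference-vector construction that follows.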
In order to obtain an explicit form of the probabilistic similarity, let us consider a difference vector between two samples x and x′ belonging to a single class c_k, which can be written as:

δ = x − x′ = (c_k + ε) − (c_k + ε′) = ε − ε′.    (3)

By subtracting two vectors belonging to the same class, the class-specific component disappears and only the environmental components remain in the intra-class difference vector. Then, the probability that the two data belong to the same class can be obtained by estimating the probability density function p(δ).
According to the assumption in [28], all classes have the same environmental component ε. Therefore, all of the difference vectors will follow a single distribution regardless of class labels. We can specify the distribution using this set of difference vectors and the characteristics of the environmental component ε. Noting that the environmental component ε is caused by diverse sources, we can assume that ε follows a Gaussian distribution, and so does the difference vector δ = ε − ε′.
In order to estimate the mean and covariance of the Gaussian pdf p(δ), we compose the set of intra-class difference vectors Ω using support samples, which can be defined as:

Ω = { x_i − x_j | x_i, x_j ∈ S_k, k = 1, ..., N }.    (4)

Then, the mean vector µ_Ω and the covariance matrix Σ_Ω can be estimated from the set Ω, and the density function p(δ) that we want to know can be written as:

p(δ) = (2π)^(−d/2) |Σ_Ω|^(−1/2) exp( −(1/2)(δ − µ_Ω)ᵀ Σ_Ω^(−1) (δ − µ_Ω) ),    (5)

where d is the dimension of the data. Noting that a higher value of p(δ) implies a higher likelihood that the two data making up δ belong to the same class, the similarity measure S_G(x, x′) for two samples x and x′ is defined as a value proportional to p(δ), such as:

S_G(x, x′) = exp( −(1/2)(δ − µ_Ω)ᵀ Σ_Ω^(−1) (δ − µ_Ω) ),  δ = x − x′.    (6)

The efficiency of this similarity value has been confirmed in various application problems [26,27,32]. In the prototypical network, the classifier uses the Euclidean distance, which is the special case of the probabilistic similarity with µ_Ω = 0 and a unit covariance matrix. In this paper, we apply the general covariance matrix in the learning of the embedding function as well as in classification. It should also be remarked that this similarity is different from the conventional Mahalanobis distance [25], which uses the covariance of the original samples x. By using intra-class difference vectors, a larger number of samples can be used to estimate the covariance matrix Σ_Ω, yielding a more accurate estimate.
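Composing Ω and evaluating the Gaussian similarity can be sketched as follows; the helper names are ours, and the small ridge term is an assumption we add to keep the toy covariance invertible (the paper counts N × K² pairs, whereas this sketch drops the zero i = j differences):

```python
import numpy as np

def intra_class_differences(support_emb, support_lab):
    """Set Ω of ordered intra-class difference vectors x_i - x_j (i != j)."""
    diffs = []
    for k in np.unique(support_lab):
        xs = support_emb[support_lab == k]
        for i in range(len(xs)):
            for j in range(len(xs)):
                if i != j:
                    diffs.append(xs[i] - xs[j])
    return np.stack(diffs)

def gaussian_similarity(x, y, cov_inv, mu=0.0):
    """S_G(x, y) ∝ p(δ): Gaussian kernel on the difference vector δ = x - y."""
    d = (x - y) - mu
    return float(np.exp(-0.5 * d @ cov_inv @ d))

rng = np.random.default_rng(7)
emb = rng.normal(size=(10, 3))          # 2-way 5-shot toy embeddings
lab = np.repeat(np.arange(2), 5)
omega = intra_class_differences(emb, lab)
cov = np.cov(omega.T) + 1e-3 * np.eye(3)  # ridge keeps the inverse well-defined
s_self = gaussian_similarity(emb[0], emb[0], np.linalg.inv(cov))
s_other = gaussian_similarity(emb[0], emb[1], np.linalg.inv(cov))
```

Identical inputs give the maximum similarity of 1, and the value decays as the difference vector becomes less probable under p(δ).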

Few-Shot Classification Using Probabilistic Similarity
The assumption for deriving the explicit form of Equation (6) is rather impractical to deal with diverse variations of real data, but it could be effectively applied to the data representation obtained from the well-trained embedding function. Based on this consideration, we propose to combine the probabilistic similarity with the metric-based few-shot classification model. Though the probabilistic similarity does not depend on the structure of the embedding network and can be combined with various few-shot learning models, we adopt the prototypical network, the primary and representative model, in order to focus on the effect of the similarity.
When the probabilistic similarity S_G(x, x′) of Equation (6) is applied to the few-shot classification model, x is a query sample, x′ is a support sample, and the mean µ_Ω of the difference vectors can be set to zero. Additionally, with the few-shot classification model, we have an embedding function f_φ with parameter φ, and the embedding vector f_φ(x) can be used instead of the raw data x. The similarity function is then written as:

S_G(x, x′) = exp( −(1/2)(f_φ(x) − f_φ(x′))ᵀ Σ_Ω^(−1) (f_φ(x) − f_φ(x′)) ).    (7)

With the nearest neighbor classifier, we assign a query x to the class containing the support sample with the maximum similarity value. For the case of the prototype-based classifier, a prototype vector c_k for each class c_k is calculated first by taking the mean of the embedded support vectors f_φ(x) in c_k, which is written as:

c_k = (1/|S_k|) Σ_{x_i ∈ S_k} f_φ(x_i).    (8)

Then, the class label of a query x is determined by the similarity between the embedded query vector f_φ(x) and the prototype c_k for each class c_k:

y(x) = argmax_k S_G(x, c_k).    (9)

Following the format of the distance-based classifier, the similarity function defined above can be rewritten as a distance function, and we finally obtain:

d(f_φ(x), c_k) = (f_φ(x) − c_k)ᵀ Σ_Ω^(−1) (f_φ(x) − c_k).    (10)

Here, the covariance matrix Σ_Ω is a parameter to be estimated during learning of the embedding network as well as during classification. Note that this probabilistic similarity has an advantage in that the number of samples for estimating Σ_Ω is relatively large even under the few-shot situation. Since the set of intra-class difference vectors is used for estimation, the number of samples in Ω is N × K² in the case of the N-way K-shot problem.
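The distance-based classification with the intra-class covariance can be sketched as follows; the numbers are a contrived toy example of ours, chosen to show how the probabilistic measure can disagree with the Euclidean one when the intra-class scatter is anisotropic:

```python
import numpy as np

def proba_distance(query_emb, protos, cov_inv):
    """d(f(x), c_k) = (f(x) - c_k)^T Σ_Ω^{-1} (f(x) - c_k) for every class k."""
    diff = query_emb[:, None, :] - protos[None, :, :]        # (Q, N, D)
    return np.einsum('qnd,de,qne->qn', diff, cov_inv, diff)  # batched quadratic forms

protos = np.array([[0.0, 0.0], [2.0, 2.0]])
query = np.array([[1.6, 0.5]])

# Intra-class scatter is wide along axis 0, tight along axis 1.
cov = np.diag([4.0, 0.25])

d_euc = proba_distance(query, protos, np.eye(2))          # Σ_Ω = I: plain Euclidean
d_pro = proba_distance(query, protos, np.linalg.inv(cov))
```

Here the Euclidean rule assigns the query to class 1, while the probabilistic measure, which tolerates large deviations along the high-variance axis, assigns it to class 0.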
In few-shot classification, it is important to capture the common distributional properties shared by different classes in the learning phase and use them to classify the newly given classes in the test phase. Since the probabilistic similarity measure is derived from the distribution of environmental components shared by all classes, it can be estimated iteratively through episodic learning.
At the t-th iteration of episodic learning, with the set of difference vectors Ω_t, the covariance Σ_t is estimated as:

Σ_t = α Σ_{t−1} + (1 − α) Σ_{Ω_t},    (11)

where α (0 ≤ α ≤ 1) is a user-defined parameter to control the proportion of the estimate obtained at the (t−1)-th episode. In the test phase, we have the set of difference vectors Ω_tst composed of support samples from the test classes, and the covariance Σ_tst is estimated as:

Σ_tst = Σ_{Ω_tst} + α Σ_trn,    (12)

where Σ_trn is the covariance estimated using the whole training set after learning is finished, and it is added to the covariance of the set Ω_tst with a user-defined coefficient α (0 ≤ α ≤ 1). We should note that the estimated covariance can be nearly singular under few-shot settings, especially when the dimension of the embedding vector is larger than the number of samples in Ω. In that case, we need to add a regularization term (e.g., a scaled identity matrix) to prevent a singular condition in its inverse.

Figure 2 shows the overall structure of the proposed few-shot classification model using probabilistic similarity. The overall process follows the conventional metric-based few-shot classifier illustrated in Figure 1, but there is an additional module for obtaining the probabilistic similarity. Under the N-way K-shot classification scenario, a subset for an episode contains examples from N different classes, each of which is decomposed into the support set S_k with K samples and the query set Q_k with the remaining samples (k = 1, ..., N). Using the support set S_k, the prototype vector for each class is calculated and the set of intra-class difference vectors is also composed. The samples in the query set Q_k are given to the few-shot classifier for conducting classification and evaluating the loss value. The loss L for the training episode is defined by using a softmax over distances between queries and prototypes, which can be written as:

L = −log [ exp(−d(f_φ(x), c_y)) / Σ_{k′} exp(−d(f_φ(x), c_{k′})) ],    (13)

where c_y is the prototype of the true class of query x.
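The running covariance estimate and the softmax loss over distances can be sketched as follows; the function names and the toy distance matrix are ours, assuming an update of the form described above (previous estimate blended with the current episode's covariance by α):

```python
import numpy as np

def update_covariance(cov_prev, omega_t, alpha):
    """Blend the previous episode's covariance estimate with the covariance
    of the current difference-vector set Ω_t, weighted by α."""
    cov_t = np.cov(omega_t.T)
    return alpha * cov_prev + (1.0 - alpha) * cov_t

def episode_loss(distances, labels):
    """Softmax over negative distances: L = -mean log p(y | x)."""
    logits = -distances                              # (Q, N)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
omega_t = rng.normal(size=(40, 3))                   # toy Ω_t for one episode
cov = update_covariance(np.eye(3), omega_t, alpha=0.9)
# Two queries whose true-class distances are already small: loss near zero.
loss = episode_loss(np.array([[0.1, 5.0], [4.0, 0.2]]), np.array([0, 1]))
```

The loss is small when each query is much closer to its own prototype than to the others, which is exactly what the embedding network is trained to achieve.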

Overall Process
The proposed episodic learning process is summarized in Algorithm 1.
Algorithm 1: Episodic learning with probabilistic similarity.
for each episode t
    Sample N classes and split each class k into support set S_k and query set Q_k
    Compute the embedded vectors f_φ(x) and the prototype c_k of each class
    Compose the difference-vector set Ω_t and update the covariance Σ_t
    for each class k
        for each query x in Q_k
            Compute the distances to the prototypes and accumulate the loss L
        end for x
    end for k
    Update network parameters φ using a gradient descent optimizer with loss L
end for t

In the test phase, samples from new classes that are not seen during learning are given. Each test class is also decomposed into support data and query data. After calculating the prototype vectors and the similarity function using the embedding function optimized in the learning phase, query samples are assigned to the class of the closest prototype. In this case, the covariance matrix used for distance calculation is obtained using Equation (12).

Experimental Results
In order to verify the performance of our proposed method, we conducted experiments using three datasets: Omniglot [33], Multi-PIE [32], and GTSRB [34]. Since the purpose of the experiments is to see the effect of the probabilistic similarity measure, we mainly compare its performance with the conventional model using Euclidean distance. Each dataset was divided into training data and test data. With the training set, the embedding network was trained using the Adam optimizer. The learning rate started at 10⁻³ and was halved every 5000 episodes. Training continued until convergence of the loss value, which took at least 50,000 episodes. For the performance evaluation, the classification accuracies for 600 test episodes were calculated and averaged over various N-way K-shot settings. Since the proposed method needs at least two samples to compose the difference vector set Ω, we set K = 5, which is a common setting in the conventional works. In addition, we investigated the performance change according to the increase in the number of ways N, which is practically more important but has not been addressed in previous works.

Figure 3 shows some examples of the Omniglot data [33], a dataset of handwritten characters. It consists of 1623 characters collected from 50 alphabets, and each character has 20 samples drawn by different individuals. We follow the procedure of Vinyals et al. [8] for data preparation and augmentation. The original 105 × 105 images are resized to 28 × 28 and rotated by multiples of 90 degrees. By rotating the existing images, we obtained four times as many classes as the original. The embedding network with four convolutional blocks transforms an image into a 64-dimensional feature vector.
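The rotation-based class augmentation can be sketched as follows; the helper name and the synthetic images are ours, standing in for the resized Omniglot characters:

```python
import numpy as np

def rotation_augment(images):
    """Create three extra classes per character by rotating 90/180/270 degrees,
    quadrupling the number of classes as described for Omniglot."""
    return [np.rot90(images, k=r, axes=(1, 2)) for r in range(4)]

# Toy "character": 20 samples of 28x28, like the resized Omniglot images.
imgs = np.arange(20 * 28 * 28, dtype=np.float32).reshape(20, 28, 28)
augmented = rotation_augment(imgs)
```

Each rotated copy is treated as a new class, so the 1623 original characters yield 6492 classes for episodic sampling.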

Since the embedding vector is 64-dimensional, the covariance matrix is 64 × 64, and thus we have 4096 parameters to estimate, which is much larger than the number of data given in the few-shot task. This is prone to cause a singularity in the inversion of the covariance matrix during distance calculation. To avoid this, we added an identity matrix as a regularization term when estimating the covariance of Equation (11). In the test phase, we also added the covariance Σ_trn obtained from the training data, as shown in Equation (12).

Table 1 compares the classification accuracy of the proposed method with the conventional methods under the 5-shot setting. The Omniglot data are one of the representative benchmark sets for few-shot classification, but they are relatively simple. Thus, as shown in Table 1, all the methods show good results, while the proposed one achieves the best. In Figure 4, we compare the performance changes according to the number of test classes. In order to see the effectiveness of the proposed method, we compare its performance with the original prototypical network [9] as well as the Gaussian prototypical network [22], which uses class-wise covariance information. As shown in the graph, the performance degradation of the proposed method is gentler than that of the original prototypical network. Though the Gaussian prototypical network shows better performance than the original one, it can be seen that the proposed intra-class covariance gives a more effective distance measure than the class-wise covariance used in [22].
Since the proposed probabilistic similarity is estimated using intra-class difference vectors, whose number increases in proportion to the number of classes, it is possible to estimate the covariance matrix Σ_tst more accurately as N increases. This acts as a strength of the proposed method, which shows better performance for large N. In particular, when compared with the Gaussian prototypical network using statistical characteristics, the performance gap increases as N increases.

Table 1. 5-shot classification accuracy (%) on Omniglot. Values with * are reported in the original papers [8,10,12,17,19], and the ones with ** are measured by experiments using codes provided by the authors [9,22].

5-Shot Acc. (%)         5-Way      20-Way
MatchingNet [8]         98.9 *     97.0 *
RelationNet [10]        99.6 *     98.6 *
MAML [17]               99.7 *     98.7 *
ConvNet [19]            99.6 *     98.6 *
IMP [12]                99.5 *     98.6 *
ProtoNet [9]            99.50 **   98.40 **
GaussianProtoNet [22]   99.50 **   98.40 **
Proposed                99.71      98.71

5-Shot Acc. (%) 5-Way 20-Way
MatchingNet [8] 98.9 * 97.0 * RelationNet [10] 99.6 * 98.6 * MAML [17] 99.7 * 98.7 * ConvNet [19] 99.6 * 98.6 * IMP [12] 99.5 * 98.6 * ProtoNet [9] 99.50 ** 98.40 ** GaussianProtoNet [22] 99.50 ** 98.40 ** Proposed 99.71 98.71   Figure 5 shows some examples of the Multi-PIE dataset that has been created as a benchmark for facial recognition. A total of 337 subjects participated in data collection, and shooting was conducted in four-time sessions. In the shooting, 20 patterns of lighting effects, 15 poses, and 6 types of emotional expression were mobilized, and over 2000 images were collected per subject in a one-time session. We transform the original data to suit the intention of the experiment according to the previous work [32]. The whole image is cropped so that only the face appears as shown in Figure 6. It is then converted to a black and white image and then resized to 28 × 28 pixels. The Multi-PIE data have rather simple variations compared to the recent benchmark for face verification. However, considering that this experiment is conducted with the simple convolutional network with the purpose of verifying the effect of the probabilistic similarity, Multi-PIE data are appropriate in the sense that it assorted environmental variations such as illumination, poses, expression, and time sessions. Figure 5 shows some examples of the Multi-PIE dataset that has been created as a benchmark for facial recognition. A total of 337 subjects participated in data collection, and shooting was conducted in four-time sessions. In the shooting, 20 patterns of lighting effects, 15 poses, and 6 types of emotional expression were mobilized, and over 2000 images were collected per subject in a one-time session. We transform the original data to suit the intention of the experiment according to the previous work [32]. The whole image is cropped so that only the face appears as shown in Figure 6. It is then converted to a black and white image and then resized to 28 × 28 pixels. 
The Multi-PIE data have rather simple variations compared to the recent benchmark for face verification. However, considering that this experiment is conducted with the simple convolutional network with the purpose of verifying the effect of the probabilistic similarity, Multi-PIE data are appropriate in the sense that it assorted environmental variations such as illumination, poses, expression, and time sessions. The modified dataset consists of a total of 184 classes, and we divide them into 122 training classes and 62 test classes for a few-shot classification problem. The training set contains 600 samples per class, and the test set contains 370 samples per class. For each training episode, 45 queries per class are used. We also should note that each class of Multi-PIE data has much more samples with diverse variations than Omniglot data while only five samples per class are used for support. This may cause some difficulties in estimating the covariance of difference vectors.  Figure 7 compares the performance of the proposed method using prototypical networks. From the graph, we can see that the proposed method can improve the accuracy by using probabilistic similarity, and the effect of performance improvement appears more clearly as the number of ways increases. This result is consistent with our argument that, as the number of ways increases, the accuracy of the estimation increases and thus more sophisticated classification becomes possible. Recognition performance could be suit the intention of the experiment according to the previous work [32]. The whole image is cropped so that only the face appears as shown in Figure 6. It is then converted to a black and white image and then resized to 28 × 28 pixels. The Multi-PIE data have rather simple variations compared to the recent benchmark for face verification. 
However, considering that this experiment is conducted with the simple convolutional network with the purpose of verifying the effect of the probabilistic similarity, Multi-PIE data are appropriate in the sense that it assorted environmental variations such as illumination, poses, expression, and time sessions. The modified dataset consists of a total of 184 classes, and we divide them into 122 training classes and 62 test classes for a few-shot classification problem. The training set contains 600 samples per class, and the test set contains 370 samples per class. For each training episode, 45 queries per class are used. We also should note that each class of Multi-PIE data has much more samples with diverse variations than Omniglot data while only five samples per class are used for support. This may cause some difficulties in estimating the covariance of difference vectors.  Figure 7 compares the performance of the proposed method using prototypical networks. From the graph, we can see that the proposed method can improve the accuracy by using probabilistic similarity, and the effect of performance improvement appears more clearly as the number of ways increases. This result is consistent with our argument that, as the number of ways increases, the accuracy of the estimation increases and thus more sophisticated classification becomes possible. Recognition performance could be The modified dataset consists of a total of 184 classes, and we divide them into 122 training classes and 62 test classes for a few-shot classification problem. The training set contains 600 samples per class, and the test set contains 370 samples per class. For each training episode, 45 queries per class are used. We also should note that each class of Multi-PIE data has much more samples with diverse variations than Omniglot data while only five samples per class are used for support. This may cause some difficulties in estimating the covariance of difference vectors. 
Figure 7 compares the performance of the proposed method using prototypical networks. From the graph, we can see that the proposed method can improve the accuracy by using probabilistic similarity, and the effect of performance improvement appears more clearly as the number of ways increases. This result is consistent with our argument that, as the number of ways increases, the accuracy of the estimation increases and thus more sophisticated classification becomes possible. Recognition performance could be further improved by using a more complex backbone network, but this is somewhat out of the scope of this study. In this experiment, we focused on confirming the effect of probabilistic similarity.

Multi-PIE Face Recognition
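To make the role of the intra-class statistics concrete, the following minimal sketch (our own illustration, not the authors' released code; all function and variable names are hypothetical) pools intra-class difference vectors from the support set to estimate a shared covariance, and then classifies queries by the resulting Mahalanobis distance to the class prototypes, which corresponds to assuming a Gaussian density on the difference vectors:

```python
import numpy as np

def intra_class_covariance(support, labels):
    """Estimate a shared covariance from intra-class difference vectors.

    support: (M, D) embedded support vectors; labels: (M,) class ids.
    Pairwise differences within each class are pooled across all classes,
    so the number of difference vectors grows with the number of ways.
    """
    diffs = []
    for c in np.unique(labels):
        x = support[labels == c]
        for i in range(len(x)):
            for j in range(len(x)):
                if i != j:
                    diffs.append(x[i] - x[j])
    diffs = np.stack(diffs)
    d = diffs.shape[1]
    # Small ridge keeps the estimate invertible when the embedding
    # dimension exceeds the number of difference vectors (few shots/ways).
    return diffs.T @ diffs / len(diffs) + 1e-3 * np.eye(d)

def classify(query, prototypes, cov):
    """Assign each query to the prototype with the highest probabilistic
    similarity, i.e., the smallest Mahalanobis distance under `cov`."""
    inv = np.linalg.inv(cov)
    delta = query[:, None, :] - prototypes[None, :, :]      # (Q, N, D)
    dist = np.einsum('qnd,de,qne->qn', delta, inv, delta)   # (Q, N)
    return dist.argmin(axis=1)
```

With an isotropic covariance this reduces to the Euclidean prototype classifier; the benefit appears when the pooled difference vectors reveal an anisotropic, class-independent variation.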
Appl. Sci. 2021, 11, x FOR PEER REVIEW

GTSRB Traffic Sign Recognition

As the third dataset, we chose a more practical one that is likely to be observed in real applications. The GTSRB dataset [34] consists of various types of traffic sign images, as shown in Figure 8. There are 43 types of signs and 51,839 images in total, which are color images taken from various angles, under various lighting conditions, and at various resolutions. We resize all the images to 84 × 84 pixels. The 43 classes are divided into 22 classes for training and 21 classes for testing. In order to maximize the generality of the embedding network obtained through learning, data augmentation was performed on the training set according to the previous work [13]. Similar to the case of Omniglot, the number of training classes was increased by rotating the training images. Since the images are in color, the raw data format is 84 × 84 × 3, which is converted into a 1600-dimensional feature vector through the embedding network. In the learning phase, the embedding network is trained through episodes in the form of 20-way 5-shot. In the test phase, we start with 5-way 5-shot classification and increase the number of ways by 5, finally reaching 21-way.

Figure 9 shows the change of accuracy according to the number of classes for the proposed method and the prototypical network. On these practical data, the effect of the probabilistic similarity is observed even more clearly. In particular, the superiority of the proposed method becomes clearer as the number of ways increases. The results are consistent with our assumption about the correlation between the number of ways and the accuracy. In order to verify the efficiency of the proposed method compared with state-of-the-art methods, we also conducted experiments for 1-shot classification according to [34]. Since the proposed method cannot obtain the difference vector set Ω_tst with a single support sample, episodic training was carried out with five support samples during the learning phase, and the covariance obtained from the training dataset was used for testing. From Table 2, we can see that the performance of the proposed method was higher than that of most conventional models, except for the VPE model with data augmentation, which is well designed for the specific GTSRB data.

Figure 9. Change of classification accuracy on GTSRB depending on the number of ways.

Table 2. The 1-shot classification accuracies on GTSRB. The values marked with * are quoted from the VPE paper [15], and the ones with ** are measured by experiments using codes provided by the authors [9].

21-Way 1-Shot Acc. (%)   Original Training Data   Train with Data Augmentation
ProtoNet [9]             67.10 **                 74.58 **
QuadNet [20]             45.20 *                  -
SiamNet [6]              22

To summarize the results of the three experiments: when the number of ways is small, the difference in performance from the existing models is not large. However, from Figures 4, 7 and 9, we can see a noticeable difference from the original model for tasks with a larger number of ways, which has not been investigated in previous works. The larger the number of ways, the more samples can be used to estimate the environmental distribution represented by the covariance matrix. Thanks to this advantage of the proposed method based on intra-class statistics, the performance degradation with an increasing number of ways is gentler than that of the conventional model. Although the classification task for a large number of classes has great practical importance, it has rarely been dealt with in conventional works on few-shot classification. This paper is significant in that it presents a method for solving the many-class few-shot classification problem.
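The N-way K-shot evaluation protocol used throughout these experiments can be sketched as follows. This is an illustrative episode sampler under our own naming (not the authors' code): it draws N classes, then splits K support and Q query samples per class, mirroring how the number of ways is increased while the shots stay fixed:

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query):
    """Sample one N-way K-shot episode.

    data_by_class: dict mapping class_id -> list of samples.
    Returns (support, query) as lists of (sample, episode_label) pairs;
    episode labels are re-indexed 0..n_way-1 for each episode.
    """
    classes = random.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label, c in enumerate(classes):
        # Draw support and query without replacement from the same class.
        picks = random.sample(data_by_class[c], k_shot + n_query)
        support += [(x, label) for x in picks[:k_shot]]
        query += [(x, label) for x in picks[k_shot:]]
    return support, query

# Evaluation loop over increasing ways, e.g., 5, 10, 15, 20, 21 on GTSRB:
# for n_way in (5, 10, 15, 20, 21):
#     support, query = sample_episode(test_data, n_way, k_shot=5, n_query=15)
```

Averaging the query accuracy over many such episodes per setting yields curves like those in Figures 4, 7 and 9.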

Conclusions
For conventional metric-based few-shot classification methods, the main focus is to find a good metric space in which the intra-class variations of unseen classes are minimized. Under the premise that this can be achieved by using a deep embedding network and an episodic learning strategy, the classification is performed by a simple distance-based classifier using a standard distance such as the cosine or Euclidean distance. In this paper, we suggest a way of improving the distance-based classifier by using a probabilistic similarity, which is derived from a class-independent environmental factor estimated from intra-class difference vectors. By taking intra-class difference vectors, we can exclude the class-specific components that are hard to estimate with a limited number of samples per class.
Although probabilistic similarity based on intra-class statistics has already been used in classical pattern recognition studies, the conventional works presuppose that a good feature representation for the input data is provided in advance. In the proposed method, however, the feature extraction module (the embedding network) is also trained by using loss signals from the classifier with the probabilistic similarity. Essentially, the probabilistic similarity assumes that all the classes in a domain have the same intra-class variations, and this is consistent with the prototypical network model, which assumes that it is possible to find a good embedding space where each class has a single prototype and its variations are very limited. Owing to this consistency, the proposed method achieves improved performance in the experiments. It is also noteworthy that the proposed algorithm places no constraint on the embedding network model, so better performance can be expected with a more complex embedding network. Finally, since the good performance on problems with many classes and the simplicity of implementation are practical strengths of the proposed method, its application in various practical fields would be an interesting follow-up study.

Data Availability Statement:
No new data were created or analyzed in this study. Data sharing is not applicable to this article.