Electronics
  • Article
  • Open Access

28 October 2022

Few-Shot Classification with Dual-Model Deep Feature Extraction and Similarity Measurement

1 Department of Electrical Engineering, National Taiwan University of Science and Technology, Taipei 106335, Taiwan
2 Advanced Intelligent Image and Vision Technology Research Center, National Taiwan University of Science and Technology, Taipei 106335, Taiwan
* Author to whom correspondence should be addressed.
This article belongs to the Collection Image and Video Analysis and Understanding

Abstract

From traditional machine learning to the latest deep learning classifiers, most models require a large amount of labeled data for optimal training and best performance. Yet, when only limited training samples are available, or when the labels are noisy, severe degradation in accuracy can arise. The proposed work mainly focuses on these practical issues. Herein, the standard datasets Mini-ImageNet, CIFAR-FS, and CUB 200, which exhibit similar issues, are considered. The main goal is to utilize only a few labeled samples in the training stage, extract image features, and then perform feature similarity analysis across all samples. The highlighted aspects of the proposed method are as follows. (1) The main self-supervised learning strategies and augmentation techniques are exploited to obtain the best pretrained model. (2) An improved dual-model mechanism is proposed to train the support and query datasets with multiple training configurations. As examined in the experiments, the dual-model approach achieves superior few-shot classification performance compared with the state-of-the-art methods.

1. Introduction

In the deep learning domain, whether the task is image classification [], object detection [], or segmentation [], the quantity and quality of the dataset are critical to model training. Specifically, for general image classification, these factors largely determine the performance of the model. However, generating large and complex image datasets through manual labelling incurs huge labor costs and requires significant labelling and curation time. Recently, an alternative approach to labelled dataset generation has been to use web crawling techniques to collect numerous images from the Internet along with their associated text descriptions. This can save tremendous labor costs, yet the collected images may suffer from massive mislabeling, degrading the quality of the overall dataset. Moreover, though many open datasets are available, their application scope is limited, and even within a single dataset, issues such as out-of-distribution data and limited data versatility arise. For example, in the case of industrial defect detection, open datasets cannot be used directly. Sometimes, techniques such as transfer learning are not applicable because the source and test datasets have different distributions, imposing a strong domain shift problem. Dataset generation is therefore tricky and tedious, as the labelling team has to coordinate with the production team and decide the product labels based on multiple factors. Hence, it is indeed challenging to obtain sufficient annotated data, and if the dataset is small, many general supervised learning approaches cannot be trained properly, resulting in poor classifiers.
Therefore, the main objective of the proposed work is the development of a few-shot classification algorithm suited for practical applications involving small or incorrectly labelled datasets. As shown in Figure 1, few-shot training involves far fewer images than standard classifiers, yet it can still classify new classes in the testing stage []. The first challenge is learning effectively with a small amount of training data; the key is to enable the model to extract effective information from the small dataset and, accordingly, improve classification performance. The second direction is to maximize the information extracted from each image so that the model can learn efficiently from a few labelled samples. Considering these objectives, a multi-backbone model is proposed in this work to yield good feature extraction and obtain superior performance compared to the state-of-the-art methods.
Figure 1. Example of few-shot training and classification.
This manuscript is organized as follows: Section 2 covers the literature review on existing works, their limitations, and the main contributions of the proposed work. Section 3 briefly describes the datasets used for model training and evaluation. The detailed description of the proposed model is provided in Section 4. The comprehensive experimental analysis on three standard datasets and the overall summary are provided in Section 5 and Section 6, respectively.

3. Few-Shot Learning Datasets

The proposed work was tested on three public standard few-shot learning image datasets, as shown in Figure 5. The Mini-ImageNet dataset [], which contains 60,000 images, was collected from ImageNet. It has a total of 100 categories, comprising 64 training categories, 16 validation categories, and 20 test categories, and each category contains 600 images. To verify the ability of the model to classify a new category with a small number of samples, the training set and the test set of Mini-ImageNet were divided into distinct categories without any overlap.
Figure 5. Few-shot classification datasets.
The second few-shot learning dataset was CIFAR-FS [], containing 60,000 images collected from CIFAR100. The dataset comprises 100 categories, divided into 64 training categories, 16 validation categories, and 20 test categories. Each category contains 600 images, and the image size is 84 × 84. Similar to the previous dataset, the training set and the testing set were divided so as to avoid any overlap. The third dataset used in this work was Caltech-UCSD Birds-200-2011 (CUB 200) [], containing 11,788 bird images across 200 categories. This dataset is one of the most popular for fine-grained visual classification tasks. As opposed to the category settings of the other two few-shot training datasets, the categories of the training and test sets in CUB 200 still overlap, but the amount of data is far less than in the other datasets.

4. Proposed Method

The present work comprises three main elements: self-supervised learning, a dual-model backbone network, and feature similarity assessment. Detailed descriptions of all three are provided in the following subsections.

4.1. Self-Supervised Learning

The main strategy of the proposed work is to exploit the advantage of self-supervised learning (SSL) methods in obtaining the best pretrained model and then to improve the performance using more effective backbone networks with better training strategies. The SSL methods are elaborated below.
The self-supervised learning models work under the common objective of learning representations that are invariant under various distortions. In general, differently distorted input images are fed through a variant of the Siamese network, and a specific loss function is minimized. The most challenging factor is avoiding model collapse, in which the encoder network generates constant or non-informative vectors. To begin with, the well-known framework based on contrastive learning of visual representations, termed SimCLR [], was utilized. Given any training image $x$, the module produces two correlated views of the same image, denoted as $\tilde{x}_i$ and $\tilde{x}_j$, which form a positive pair. Among many data augmentation approaches, crop and resize, flipping, rotation, cutout, Gaussian noise, and color jitter were adopted in this study for training. The model optimization involves minimizing intra-class feature distances and maximizing inter-class feature distances, with the distance metric based on the contrastive loss function.
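As a concrete illustration, a minimal PyTorch/torchvision sketch of such a two-view augmentation pipeline is given below; the specific parameter values (crop scale, rotation angle, jitter strength, erasing ratio) are illustrative assumptions rather than the exact settings used in this work.

```python
import torch
from torchvision import transforms

# Illustrative two-view augmentation pipeline for contrastive pretraining.
# Parameter values below are assumptions, not the paper's reported settings.
ssl_augment = transforms.Compose([
    transforms.RandomResizedCrop(84, scale=(0.2, 1.0)),   # crop and resize
    transforms.RandomHorizontalFlip(p=0.5),               # flipping
    transforms.RandomRotation(degrees=15),                # rotation
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),           # color jitter
    transforms.GaussianBlur(kernel_size=5),               # Gaussian perturbation
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5, scale=(0.02, 0.2)),   # cutout-style erasing
])

def two_views(image):
    """Produce the correlated views x_i and x_j that form a positive pair."""
    return ssl_augment(image), ssl_augment(image)
```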
For a given image set $\{\tilde{x}_k\}$, the positive pair is generated as $\tilde{x}_i$ and $\tilde{x}_j$. The main objective of the contrastive prediction loss is to identify the matching $\tilde{x}_j$ in $\{\tilde{x}_k\}_{k \neq i}$ for a given $\tilde{x}_i$. For each mini-batch of N examples, as two augmentations are carried out for each example, 2N data points are generated. Herein, the normalized temperature-scaled cross-entropy loss (NT-Xent) was adopted as the loss function, defined as follows.
$$\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(x_i, x_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(x_i, x_k)/\tau\right)},$$
where $\mathbb{1}_{[k \neq i]} \in \{0, 1\}$ is an indicator that evaluates to 1 iff $k \neq i$, and $\tau$ and $\mathrm{sim}(z_i, z_k)$ denote the temperature parameter and the cosine similarity, respectively. The temperature parameter is useful to widen the range of the cosine similarity $[-1, 1]$ according to user preference. Herein, $\tau$ was set to 0.1, which expands the scaled similarity $\exp(\mathrm{sim}/\tau)$ to the range from $\exp(-10)$ to $\exp(10)$ and helps to better separate the positive and negative examples. The consolidated loss is computed across all positive pairs, both $(i, j)$ and $(j, i)$, in the mini-batch.
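For reference, a minimal PyTorch sketch of this NT-Xent loss is given below, assuming a batch of N positive pairs stacked into two (N, D) embedding matrices; the function name and tensor layout are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z_a, z_b, tau=0.1):
    """NT-Xent loss for a batch of N positive pairs (2N embeddings in total).

    z_a, z_b: (N, D) embeddings of the two augmented views.
    tau: temperature; 0.1 follows the value quoted in the text.
    """
    z = F.normalize(torch.cat([z_a, z_b], dim=0), dim=1)       # (2N, D), unit norm
    sim = z @ z.t() / tau                                      # cosine similarities / tau
    n = z_a.size(0)
    # Mask out self-similarities so the denominator only ranges over k != i.
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float('-inf'))
    # Index of the positive example for each row: i <-> i + N.
    pos = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, pos)
```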
The SimCLR model has two main limitations. First, it requires a large number of contrastive learning pairs, which is not feasible for small or medium datasets. Second, to obtain optimal performance, it requires training with large batch sizes (up to 4096 or 8192), which demands multiple graphics processing units (GPUs) or tensor processing units (TPUs); these are highly expensive and hard to realize in many real-time applications. Another important problem is model collapse, which results in a poor encoder model. To tackle this issue, subsequent models were based on distillation methods such as simple Siamese representation learning (SimSiam) [] and bootstrap your own latent (BYOL) []. Their architectures and parameter updates were modified to bring asymmetry into the network: the model parameters are updated using only one distorted version of the input, while the other distorted version is used as a fixed target. Though these models avoid collapse empirically, the mechanism by which they do so is not fully understood. More recently, another approach based on H. Barlow's redundancy reduction principle was proposed, as demonstrated in Figure 4b, which is applied to a pair of identical networks as in other SSL models. The method is termed Barlow twins (BTs) [] and can perform well with reduced batch sizes, a deeper projector head, a large embedding, etc. Its main contribution is the introduction of a new loss function, termed the Barlow twins (BTs) loss.
$$\mathcal{L}_{BT} \triangleq \sum_i \left(1 - C_{ii}\right)^2 + \lambda \sum_i \sum_{j \neq i} C_{ij}^2,$$
where $\lambda$ is a positive constant balancing the trade-off between the invariance term and the redundancy reduction term; $C$ refers to the cross-correlation matrix computed between the outputs of the two identical networks for each batch; and the notation $\triangleq$ signifies "equal by definition".
$$C_{ij} \triangleq \frac{\sum_b z^{A}_{b,i}\, z^{B}_{b,j}}{\sqrt{\sum_b \left(z^{A}_{b,i}\right)^2}\, \sqrt{\sum_b \left(z^{B}_{b,j}\right)^2}},$$
where $b$ indexes the batch samples and $i, j$ index the vector dimensions of the network outputs. $C$ is a square matrix whose size equals the dimensionality of the network output.
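A minimal PyTorch sketch of this BTs loss is shown below; the batch-normalization of the embeddings and the value of the trade-off constant are illustrative assumptions rather than the exact settings of this work.

```python
import torch

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Barlow Twins loss following the cross-correlation definition above.

    z_a, z_b: (N, D) embeddings of the two views; lam is an illustrative
    value for the trade-off constant lambda.
    """
    n, d = z_a.shape
    # Standardize each embedding dimension over the batch, then form the
    # (D, D) cross-correlation matrix between the two views.
    z_a = (z_a - z_a.mean(0)) / z_a.std(0)
    z_b = (z_b - z_b.mean(0)) / z_b.std(0)
    c = (z_a.t() @ z_b) / n
    on_diag = (1 - torch.diagonal(c)).pow(2).sum()               # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy-reduction term
    return on_diag + lam * off_diag
```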
The BTs loss function is very effective in eliminating model collapse and provides strong feature learning. Though BTs aims to reduce redundancy at the embedding-vector level, the input images may still contain correlated patterns. In this work, pretraining was carried out using SimCLR, SimSiam, BYOL, and BTs with the additional augmentations used in fine-grained classification problems. An improved pretrained model was obtained through this study, and the detailed comparative results are presented in the Results Section.

4.2. Dual-Model Architecture

Herein, a detailed description of the proposed dual-model architecture is provided. As shown in Figure 6, the overall training structure and process can be divided into two parts: few-shot data extraction and feature similarity assessment. At the beginning of each few-shot training episode, a small subset of data is randomly selected from each category and divided into a support set and a query set. For this randomly selected data, the categories in the support set and the query set are the same. Subsequently, the data of the support set and the query set are passed through different feature extraction networks to obtain a feature embedding for each image.
Figure 6. Few-shot classification with dual-model deep feature extraction and similarity measurement mechanism.
In the subsequent stage of similarity learning and calculation, the feature embeddings of the support set are used to build the most representative feature for each category, while the feature embeddings of the query set are used for similarity calculation, which plays a role similar to the classifier in general supervised learning. In this way, the model learns the similarity among the feature embeddings of each category from only a few samples. In Figure 6, the solid circles of different colors in the feature space represent the feature embeddings of the different categories in the support set, the black solid circles represent the most representative feature embedding of each category, and the white solid circles represent the feature embeddings of the query set. Through distance estimation between the query feature embeddings (white solid circles) and the most representative feature embedding of each category (black solid circles), the model can classify all of the data in the query set. The overall training achieves improved few-shot classification through multiple rounds of few-shot extraction training and similarity judgment.
Figure 7 shows a schematic diagram of the selection method for few-shot training. For instance, consider a 3-way 5-shot task with 5 queries: 3-way refers to the number of categories selected each time before training and testing, 5-shot refers to the number of samples per category in the support set, and 5-query refers to the number of samples per category in the query set. At the beginning of each few-shot training episode, different data are randomly selected as the new support and query sets. This helps to increase the generalization ability of the model to various types of data, which is the key objective of few-shot learning.
Figure 7. Schematic diagram of the selection method.
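A minimal sketch of such an episode sampler is given below; the data structure (a dictionary mapping each class label to its images) and the function name are assumptions for illustration only.

```python
import random

def sample_episode(dataset_by_class, n_way=3, k_shot=5, n_query=5):
    """Randomly draw an N-way K-shot episode with n_query queries per class.

    dataset_by_class: dict mapping class label -> list of images (assumed structure).
    Returns (support, query) lists of (image, episode_label) pairs.
    """
    classes = random.sample(list(dataset_by_class.keys()), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        samples = random.sample(dataset_by_class[cls], k_shot + n_query)
        support += [(img, episode_label) for img in samples[:k_shot]]
        query += [(img, episode_label) for img in samples[k_shot:]]
    return support, query
```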

4.3. Feature Extraction and Similarity Assessment

Figure 8 and Figure 9 show the backbone networks of our feature extraction models for the support set and query set, respectively. The main architecture of the models was inspired by the design of the ConvNeXt [] block, as shown in Figure 10. Compared with the standard ResNet [] baseline, the ConvNeXt block combines the advantages of many models to optimize its feature extraction. For example, it adopts depthwise convolution to improve the learnable features of each channel and adapts design concepts from vision transformers (ViTs) [] such as the Swin transformer (Swin-T) [] to improve the learning performance of the CNN. Because of the different task orientations of the support set and the query set, the support set data are more influential than the query set in few-shot learning. From Figure 8, it can be seen that the number of convolutional filters used in the third convolutional block of the support set backbone is three times larger than that of the query set backbone (a simplified sketch of the two backbones follows Figure 10). As the network learns the query set through limited images, its backbone can be relatively small. In comparison with the single-model backbones of existing networks, this dedicated backbone design for the support and query sets offers more advantages.
Figure 8. Support set feature extraction backbone.
Figure 9. Query set feature extraction backbone.
Figure 10. ConvNeXt block.
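The following simplified sketch illustrates the asymmetric pair of backbones; the stage widths, block counts, and the stand-in for the ConvNeXt block are illustrative assumptions, with the only detail taken from the text being that the third stage of the support backbone is three times wider than that of the query backbone.

```python
import torch.nn as nn

def conv_stage(in_ch, out_ch, blocks=2):
    """Simplified stand-in for a ConvNeXt-style stage (depthwise + pointwise conv)."""
    layers = []
    for i in range(blocks):
        ch = in_ch if i == 0 else out_ch
        layers += [
            nn.Conv2d(ch, ch, kernel_size=7, padding=3, groups=ch),  # depthwise conv
            nn.Conv2d(ch, out_ch, kernel_size=1),                    # pointwise conv
            nn.GELU(),
        ]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

# Channel widths are assumptions; the key point is that the support backbone's
# third stage (384) is three times wider than the query backbone's (128).
query_backbone = nn.Sequential(
    conv_stage(3, 64), conv_stage(64, 128), conv_stage(128, 128), conv_stage(128, 256),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
support_backbone = nn.Sequential(
    conv_stage(3, 64), conv_stage(64, 128), conv_stage(128, 384), conv_stage(384, 256),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
```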
The feature extraction process of the support set and query set is shown in Figure 11 and Figure 12. The feature embedding of each datum in the support set is extracted through the support set feature extraction model, and the most representative feature of each category is computed using Equation (1). The term $C_n$ refers to the center point of each category cluster.
$$C_n = \frac{1}{|S_n|} \sum_{x_d \in S_n} \mathrm{Backbone}(x_d)$$
$$d(x, y) = \sqrt{(x_1 - y_1)^2 + \cdots + (x_n - y_n)^2}$$
$$L(x) = -\log p(y = n \mid x) = -\log \frac{\exp\left(-d(\mathrm{Backbone}(x), C_n)\right)}{\sum_{n'} \exp\left(-d(\mathrm{Backbone}(x), C_{n'})\right)}$$
where $S_n$ denotes the support samples of category $n$, $d(\cdot,\cdot)$ is the Euclidean distance between feature embeddings, and $L(x)$ is the negative log-probability of assigning sample $x$ to its true category $n$.
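The PyTorch sketch below illustrates the prototype computation, distance-based classification, and loss described above; the function names and the use of torch.cdist for the Euclidean distance are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def prototypes(support_emb, support_labels, n_way):
    """C_n: mean support embedding per class (the most representative feature)."""
    return torch.stack([support_emb[support_labels == n].mean(0) for n in range(n_way)])

def classify_queries(query_emb, protos):
    """Predict each query by its nearest prototype under the Euclidean distance."""
    dists = torch.cdist(query_emb, protos)      # (n_query, n_way) pairwise distances
    log_p = (-dists).log_softmax(dim=1)         # softmax over negative distances
    return log_p.argmax(dim=1), log_p           # predictions and log-probabilities

def episode_loss(query_emb, query_labels, protos):
    """L(x): negative log-probability of the true category, averaged over queries."""
    _, log_p = classify_queries(query_emb, protos)
    return F.nll_loss(log_p, query_labels)
```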
Figure 11. Support set deep feature extraction process.
Figure 12. Query set deep feature extraction process.
An illustrative example is provided in Figure 13; since the feature embedding of the query sample is closest to the deep feature of C2, the sample is predicted as category C2.
Figure 13. Query set deep feature classification.

5. Results and Analysis

To perform a comprehensive model evaluation, the three standard datasets Mini-ImageNet, CIFAR-FS, and CUB 200 were used, and the proposed model was compared with many state-of-the-art models. The final model predicts each image in the test dataset, and the class with the highest score is selected. The final evaluation was carried out by comparing the predicted class with the ground-truth label: if the two labels are consistent, the case is counted as correct; otherwise, it is incorrect. The accuracy rate (Acc.) was used as the evaluation criterion, defined as follows.
$$Acc = \frac{N_{correct}}{N_{all}},$$
where $N_{correct}$ refers to the number of correctly classified images and $N_{all}$ represents the total number of images in the test set.

5.1. Pretrained Model Optimization

As the few-shot learning method is trained on very few images, the pretrained model can significantly affect the classification performance. In this work, four prominent SSL methods, i.e., SimCLR, BYOL, SimSiam, and BTs, were considered. The objective was to identify the SSL method that produces the optimal pretrained model. The backbone model was ConvNeXt, and each model was trained for 50 epochs with a batch size of 256. For the experiments, the Mini-ImageNet dataset was used, in which 40% of the images were used for training without labels, 10% of the labelled images were used for fine-tuning, and 1000 images were used for testing. The general classification performance of these methods was tested, and the best approach was selected to obtain the pretrained model for few-shot classification. In addition to the standard augmentations such as crop, resize, flipping, rotation, cutout, Gaussian noise, and color jitter, we also exploited additional augmentations that are popular in fine-grained classifiers, such as random patch swap (RPS) and random jigsaw (RJ) [] (a plausible implementation sketch is given at the end of this subsection). In fine-grained learning, these augmentations are very useful in learning features of different granularities and also help the network localize fine-grained regions. It can also be seen from Table 1 that the new augmentations improve the general classification accuracy, which corresponds to an improved pretrained model.
Table 1. Comparisons of SSL methods.
Overall, BTs with the additional augmentations attained the best classification performance. Hence, instead of directly using the pretrained model trained on ImageNet, it was further fine-tuned using BTs with the additional augmentation technique on 30% of the training images (unlabelled) for all datasets. This process provides a well-generalized pretrained model for each dataset and also boosts the few-shot classification performance.
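Since RPS is described here only by name, the sketch below shows one plausible implementation of a random patch swap on an image tensor; the grid size and number of swapped cells are assumptions, not the settings used in this work.

```python
import torch

def random_patch_swap(image, grid=4, n_swaps=2):
    """Hypothetical random patch swap (RPS): exchange a few grid cells of the image.

    image: (C, H, W) tensor; grid and n_swaps are illustrative assumptions.
    """
    c, h, w = image.shape
    ph, pw = h // grid, w // grid
    out = image.clone()
    for _ in range(n_swaps):
        # Pick two random grid cells and swap their contents.
        (r1, c1), (r2, c2) = torch.randint(0, grid, (2, 2)).tolist()
        a = out[:, r1 * ph:(r1 + 1) * ph, c1 * pw:(c1 + 1) * pw].clone()
        b = out[:, r2 * ph:(r2 + 1) * ph, c2 * pw:(c2 + 1) * pw].clone()
        out[:, r1 * ph:(r1 + 1) * ph, c1 * pw:(c1 + 1) * pw] = b
        out[:, r2 * ph:(r2 + 1) * ph, c2 * pw:(c2 + 1) * pw] = a
    return out
```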

5.2. Model Ablation Studies

In this study, a detailed ablation study was carried out to determine the number of convolutional blocks used in the support and query feature extractors. As seen in Figure 8 and Figure 9, the numbers of convolutional blocks used for the support and query sets were different, and the accuracies for the different combinations are presented in Table 2. All of this testing was performed on the Mini-ImageNet dataset.
To begin with, as in Cases 1 and 2, the backbone models of the support and query sets were given more convolutional layers than the existing setup []. It can be seen that the accuracy was significantly lower for both cases, and it was also observed that, when fewer convolutional blocks were kept for the query set, the model delivered better accuracy. The reduction in accuracy was due to overfitting caused by very deep configurations and limited training data. This is evident from the fact that, in Cases 1 and 2, the model provided high accuracy on the training sets and low accuracy on the test sets, whereas in Cases 4 and 5, the training and test accuracies were nearly equal. Overall, Case 4 was selected for our final training due to its best accuracy, and the query set backbone was kept shallower than the support set backbone. We observed this type of configuration to also be effective in avoiding overfitting or underfitting issues. Moreover, k-fold cross-validation was performed considering different model configurations and image categories. For the experimentation, five-fold cross-validation was conducted using the training set images with different numbers of class labels, as in Figure 14. It can be seen that, in agreement with Table 2, the Case 4 model had the best average accuracy and least variance, whereas Case 3 showed significant degradation in performance with respect to the number of class labels. The comprehensive results of the Case 4 model on the standard few-shot learning datasets and the comparison analysis are provided in the next subsection.
Figure 14. Five-fold cross-validation.
Table 2. Convolutional blocks for support and query set.

5.3. Few-Shot Classification Results

Few-shot classification and few-shot learning follow technically the same process. At the beginning of each test episode, several samples are randomly selected from each category of the overall test dataset to form a support set and a query set, and then the query set is used for prediction. The feature embedding of each image in the support and query sets is obtained through the dual-backbone network. Subsequently, the distance is evaluated between each query feature embedding and the most representative deep feature of each category (calculated from the deep features of the support set). The class label is assigned based on the least distance, and the results for the various datasets are shown below.
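To make the evaluation protocol concrete, the sketch below shows an illustrative evaluation loop that reuses the hypothetical sample_episode, prototypes, and classify_queries helpers from the earlier sketches; the 10,000-episode count and the 5-way 5-shot 15-query setting follow the text, while everything else is an assumption.

```python
import torch

def collate(pairs):
    """Stack a list of (image tensor, label) pairs into batched tensors."""
    return torch.stack([x for x, _ in pairs]), torch.tensor([y for _, y in pairs])

@torch.no_grad()
def evaluate(test_set_by_class, support_net, query_net,
             episodes=10000, n_way=5, k_shot=5, n_query=15):
    """Average accuracy over random episodes (reuses earlier hypothetical helpers)."""
    accs = []
    for _ in range(episodes):
        support, query = sample_episode(test_set_by_class, n_way, k_shot, n_query)
        s_imgs, s_lbls = collate(support)
        q_imgs, q_lbls = collate(query)
        protos = prototypes(support_net(s_imgs), s_lbls, n_way)
        preds, _ = classify_queries(query_net(q_imgs), protos)
        accs.append((preds == q_lbls).float().mean().item())
    return sum(accs) / len(accs)
```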
Table 3 shows the few-shot classification results on Mini-ImageNet (5-way 5-shot). In this experiment, five categories of test images were randomly selected from the 20 categories of the Mini-ImageNet test split, and 5 images per category were randomly selected as the support set and 15 images as the query set for prediction. The few-shot classification in the testing phase was performed 10,000 times, and the average accuracy is listed. From the results, it can be seen that the proposed model performed the best among the existing works, and the overall classification accuracy was 3.11% higher than the current state-of-the-art method. Table 3 also shows the few-shot classification results on Mini-ImageNet (5-way 1-shot). In each few-shot classification, five categories of test images were randomly selected from the 20 categories of the Mini-ImageNet test split, and each category randomly contributed 1 image as the support set and 15 images as the query set for prediction. The few-shot classification in the testing phase was again performed 10,000 times, and the average accuracy was estimated. As only 1 shot was used, the model accuracy was slightly lower than that of the 5-shot classification. However, the proposed method still outperformed the state-of-the-art methods by 2.31%.
Table 3. Classification accuracy on Mini-ImageNet.
Similar experiments for the CIFAR-FS dataset are presented in Table 4, and it can be seen that the proposed model performed 1.96% and 1.46% higher than the existing models for 5-way 5-shot and 5-way 1-shot, respectively.
Table 4. Classification accuracy on CIFAR-FS (5-way 5-shot).
The classification accuracy for the CUB 200 dataset is shown in Table 5, in which the proposed model performed 1.23% and 1.06% higher than the existing models for 5-way 5-shot and 5-way 1-shot, respectively.
Table 5. Classification accuracy on CUB-200 (5-way 5-shot).
Finally, the classification results of the single-model and dual-model approaches on each few-shot dataset are provided in Table 6. It can be seen that the dual-model proposed in this work shows a significant improvement in classification results on the three few-shot datasets compared to the single-model, with the dual-model accuracy at least 1% higher than that of the single-model methods. These ablation results also verify that the dual-model feature extraction architecture achieves superior results in few-shot classification tasks.
Table 6. Classification accuracy comparison of single-model and dual-model.

5.4. Case Studies

To understand the robustness of the model with respect to various image variations, such as cropping, scaling, illumination, color, and background changes, detailed case studies considering images from all three datasets were conducted, as shown in Table 7. Although the Sample 1 and Sample 2 images were taken from the same category, it is visually challenging to classify them because of their huge variation and diversity. However, the proposed model correctly classified such images, which clearly demonstrates the model's capability in handling large intra-class variations and its suitability for many real-time applications.
Table 7. Case studies on model robustness.

6. Conclusions

A new few-shot classification approach was proposed by integrating self-supervised learning, a hybrid convolutional neural network, and progressive training with multiple subsets. Four prominent SSL frameworks, i.e., SimCLR, SimSiam, BYOL, and BTs, were evaluated, and BTs trained with additional fine-grained augmentation was found to yield the best generalized pretrained model. A new hybrid architecture involving a dual-CNN model with the vision-transformer-based augmentation technique was developed. The few-shot training was conducted using multiple subsets and similarity estimation to obtain the best feature embeddings for the query and support sets. Extensive experiments were conducted on the three standard few-shot datasets, Mini-ImageNet, CIFAR-FS, and CUB 200. Moreover, a detailed evaluation was carried out to validate the diversity and robustness of our method. As examined from the results, the proposed method outperformed the existing state-of-the-art methods on all datasets and set a new benchmark accuracy in few-shot classification.

Author Contributions

Conceptualization, J.-M.G.; methodology, W.-H.C.; software, W.-H.C.; validation, S.S.; formal analysis, S.S.; investigation, J.-M.G.; resources, S.S.; data curation, W.-H.C.; writing—original draft preparation, S.S.; writing—review and editing, S.S.; visualization, W.-H.C.; supervision, J.-M.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26–30 June 2016; pp. 770–778.
  2. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26–30 June 2016; pp. 779–788.
  3. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Region-based convolutional networks for accurate object detection and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 142–158.
  4. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 1126–1135.
  5. Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; Wierstra, D. Matching networks for one shot learning. Adv. Neural Inf. Process. Syst. 2016, 3637–3645.
  6. Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.S. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1199–1208.
  7. Zhang, H.; Cao, Z.; Yan, Z.; Zhang, C. Sill-Net: Feature augmentation with separated illumination representation. arXiv 2021, arXiv:2102.03539.
  8. Chen, X.; Wang, G. Few-shot learning by integrating spatial and frequency representation. In Proceedings of the 18th Conference on Robots and Vision (CRV), Burnaby, BC, Canada, 26–28 May 2021; pp. 49–56.
  9. Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. Adv. Neural Inf. Process. Syst. 2017, 4080–4090. Available online: https://dl.acm.org/doi/10.5555/3294996.3295163 (accessed on 25 October 2022).
  10. Chobola, T.; Vašata, D.; Kondik, P. Transfer learning based few-shot classification using optimal transport mapping from preprocessed latent space of backbone neural network. AAAI Workshop Meta-Learn. Meta-DL Chall. PMLR 2021, 29–37.
  11. Hu, Y.; Pateux, S.; Gripon, V. Squeezing backbone feature distributions to the max for efficient few-shot learning. Algorithms 2022, 15, 147.
  12. Bateni, P.; Barber, J.; Van de Meent, J.W.; Wood, F. Enhancing few-shot image classification with unlabelled examples. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, New Orleans, LA, USA, 18–24 June 2022; pp. 2796–2805.
  13. Bendou, Y.; Hu, Y.; Lafargue, R.; Lioi, G.; Pasdeloup, B.; Pateux, S.; Gripon, V. EASY: Ensemble augmented-shot Y-shaped learning: State-of-the-art few-shot classification with simple ingredients. arXiv 2022, arXiv:2201.09699.
  14. Shalam, D.; Korman, S. The self-optimal-transport feature transform. arXiv 2022, arXiv:2204.03065.
  15. Chen, D.; Chen, Y.; Li, Y.; Mao, F.; He, Y.; Xue, H. Self-supervised learning for few-shot image classification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 1745–1749.
  16. Ravi, S.; Larochelle, H. Optimization as a model for few-shot learning. In Proceedings of the ICLR, Toulon, France, 24–26 April 2017.
  17. Bertinetto, L.; Henriques, J.F.; Torr, P.H.; Vedaldi, A. Meta-learning with differentiable closed-form solvers. arXiv 2018, arXiv:1805.08136.
  18. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, Vienna, Austria, 12–18 July 2020; pp. 1597–1607.
  19. Chen, X.; He, K. Exploring simple Siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15750–15758.
  20. Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.H.; Buchatskaya, E.; Doersch, C.; Pires, B.A.; Guo, Z.D.; Gheshlaghi Azar, M.; et al. Bootstrap your own latent: A new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 2021, 21271–21284. Available online: https://dl.acm.org/doi/abs/10.5555/3495724.3497510 (accessed on 25 October 2022).
  21. Zbontar, J.; Jing, L.; Misra, I.; LeCun, Y.; Deny, S. Barlow twins: Self-supervised learning via redundancy reduction. In Proceedings of the International Conference on Machine Learning, Seoul, Korea, 18–24 July 2021; pp. 12310–12320.
  22. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986.
  23. Wightman, R.; Touvron, H.; Jégou, H. ResNet strikes back: An improved training procedure in timm. arXiv 2021, arXiv:2110.00476.
  24. Zhou, D.; Kang, B.; Jin, X.; Yang, L.; Jiang, Z.; Hou, Q.; Feng, J. DeepViT: Towards deeper vision transformer. arXiv 2021, arXiv:2103.11886.
  25. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
  26. Breiki, F.A.; Ridzuan, M.; Grandhe, R. Self-supervised learning for fine-grained image classification. arXiv 2021, arXiv:2107.13973.
  27. Hu, Y.; Pateux, S.; Gripon, V. Adaptive dimension reduction and variational inference for transductive few-shot classification. arXiv 2022, arXiv:2209.08527.
  28. Singh, A.; Jamali-Rad, H. Transductive decoupled variational inference for few-shot classification. arXiv 2022, arXiv:2208.10559.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
