# Predicting the Generalization Ability of a Few-Shot Classifier


## Abstract


## 1. Introduction

In this paper, we address the following question: **can the generalization ability of a few-shot classifier be estimated without using a validation set?**

- To the best of our knowledge, we propose the **first benchmark of generalization measures in the context of transfer-based few-shot learning**.
- We conduct experiments to stress the ability of the measures to correctly predict generalization in different few-shot settings: (i) supervised, where we only have access to a few labeled samples; (ii) semi-supervised, where we have access to both a few labeled samples and a set of unlabeled samples; and (iii) unsupervised, where no label is provided.

## 2. Related Work

#### 2.1. Few-Shot Learning

#### 2.1.1. With Meta-Learning

#### 2.1.2. Without Meta-Learning

#### 2.2. Better Backbone Training

#### 2.2.1. Learning Diverse Visual Features

#### 2.2.2. Using Additional Unlabeled Data Samples

#### 2.2.3. Learning Good Representations

#### 2.3. Evaluating the Generalization Ability

## 3. Background

#### 3.1. Few-Shot Classification: A Transfer-Based Approach

#### 3.2. Studied Settings

#### 3.3. Studied Classifiers

#### 3.3.1. Supervised Setting

#### 3.3.2. Semi-Supervised Setting

#### 3.3.3. Unsupervised Setting

## 4. Predictive Measures

#### 4.1. Supervised Setting

#### 4.1.1. LR Training Loss

**Definition 1.** Given ${y}_{ic}\in \{0,1\}$ indicating whether the label of the data sample $i$ is $c$, and ${p}_{ic}$ the output of the LR indicating the probability of $i$ being labeled $c$, the loss is the cross-entropy:

$$\mathcal{L}=-\frac{1}{NK}\sum_{i=1}^{NK}\sum_{c=1}^{N}{y}_{ic}\log\left({p}_{ic}\right).$$
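This training loss can be computed directly from the support set, for example with scikit-learn's logistic regression. The helper name `lr_training_loss` below is illustrative, not from the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lr_training_loss(features, labels):
    """Cross-entropy loss of a logistic regression on its own training set.

    features: (n_samples, d) array of backbone features
    labels:   (n_samples,) integer class labels in {0, ..., N-1}
    """
    clf = LogisticRegression(max_iter=1000).fit(features, labels)
    probs = clf.predict_proba(features)       # p_ic (columns follow clf.classes_)
    onehot = np.eye(probs.shape[1])[labels]   # y_ic
    eps = 1e-12                               # avoid log(0)
    return -np.mean(np.sum(onehot * np.log(probs + eps), axis=1))
```

A small training loss on the support set is the signal used to predict a high test accuracy in the supervised setting.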

#### 4.1.2. Similarity

**Definition 2.** The cosine similarity within a class $c$ is:

**Definition 3.** The cosine similarity through classes $c$ and $\tilde{c}$ is:

**Definition 4.** The proposed similarity measure is:
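As a plausible sketch only (the paper's exact aggregation of Definitions 2–4 may differ), the measure can be instantiated as the average within-class cosine similarity minus the average cross-class similarity:

```python
import numpy as np

def class_similarity_gap(features, labels):
    """Average within-class cosine similarity minus average cross-class
    cosine similarity. A sketch of the similarity measure; the exact
    normalization used in the paper may differ."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = f @ f.T                                # pairwise cosine similarities
    same = labels[:, None] == labels[None, :]     # same-class mask
    off_diag = ~np.eye(len(f), dtype=bool)        # drop self-similarities
    within = sims[same & off_diag].mean()
    across = sims[~same].mean()
    return within - across
```

A large gap indicates compact, well-separated classes, which is expected to correlate with high accuracy.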

#### 4.2. Unsupervised Setting

#### 4.2.1. Davies-Bouldin Score after a N-means Algorithm

**Definition 5.** Denote by ${\mu}_{c}$ the centroid of a cluster $C$, such that ${\mu}_{c}=\frac{1}{\left|C\right|}{\sum}_{i\in C}{\mathbf{f}}_{i}$. The average distance between the samples in $C$ and the centroid ${\mu}_{c}$ of their cluster is $\frac{1}{\left|C\right|}{\sum}_{i\in C}{\left\Vert {\mathbf{f}}_{i}-{\mu}_{c}\right\Vert}_{2}$.
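A minimal sketch with scikit-learn, assuming the N-means of the paper corresponds to running k-means with N clusters (one per expected class) and then scoring the resulting partition:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def db_score_after_n_means(features, n_classes):
    """Run k-means with N clusters, then compute the Davies-Bouldin score.
    A lower score indicates more compact, better-separated clusters."""
    clusters = KMeans(n_clusters=n_classes, n_init=10,
                      random_state=0).fit_predict(features)
    return davies_bouldin_score(features, clusters)
```

Because the score only uses the unlabeled features and the number of classes, it is applicable in all three settings.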

#### 4.2.2. Laplacian Eigenvalues

**Definition 6.** We consider the graph $\mathcal{G}=\langle \mathcal{V},\mathcal{E},\mathbf{W}\rangle$, where $\mathcal{V}$ is the set of data samples. The adjacency matrix $\mathbf{W}$ is obtained by first computing the cosine similarity between these samples, removing self-loops, and keeping only the $k$ largest values in each row/column. The Laplacian of the graph is given by $\mathbf{L}=\mathbf{D}-\mathbf{W}$, where $\mathbf{D}$ is the degree matrix of the graph: $\mathbf{D}$ is a diagonal matrix with ${\mathbf{D}}_{ii}={\sum}_{j=1}^{NQ}{\mathbf{W}}_{ij}$. The measure we consider is the amplitude of the N-th smallest eigenvalue of $\mathbf{L}$.
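A sketch of this measure in NumPy; the handling of negative cosine similarities and the exact thresholding rule are assumptions of this sketch, not details taken from the paper:

```python
import numpy as np

def nth_laplacian_eigenvalue(features, n_classes, k=10):
    """N-th smallest eigenvalue of the Laplacian L = D - W of a
    k-nearest-neighbor cosine-similarity graph over the samples."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    W = f @ f.T                             # cosine similarities
    np.fill_diagonal(W, 0.0)                # remove self-loops
    W = np.clip(W, 0.0, None)               # assumption: drop negative weights
    # keep only the k largest values in each row, then symmetrize
    kth = np.sort(W, axis=1)[:, -k][:, None]
    W = np.where(W >= kth, W, 0.0)
    W = np.maximum(W, W.T)
    D = np.diag(W.sum(axis=1))              # degree matrix
    eigvals = np.linalg.eigvalsh(D - W)     # ascending order
    return eigvals[n_classes - 1]           # N-th smallest (1-indexed)
```

Intuitively, if the graph splits into N well-separated groups, the N smallest eigenvalues of $\mathbf{L}$ are close to zero, so a small value of this measure suggests the task is easy to cluster.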

#### 4.3. Semi-Supervised Setting

**Definition 7.** Let ${p}_{ic}$ denote the probability that the data sample $i$ is labeled $c$.
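A natural instantiation of an LR-confidence measure built from the ${p}_{ic}$, shown here only as an assumption (the paper's exact aggregation may differ), is the average maximum class probability over the unlabeled samples:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lr_confidence(support_x, support_y, query_x):
    """Average confidence of a LR trained on the labeled support set,
    evaluated on the unlabeled query samples: mean over i of max_c p_ic.
    One plausible instantiation of the measure, not the paper's exact one."""
    clf = LogisticRegression(max_iter=1000).fit(support_x, support_y)
    probs = clf.predict_proba(query_x)      # p_ic on the unlabeled samples
    return probs.max(axis=1).mean()
```

Since it requires both a few labels (to train the LR) and unlabeled samples (to evaluate the probabilities), this measure only applies to the semi-supervised setting.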

## 5. Experiments

#### 5.1. Datasets

#### 5.2. Backbones

The first backbone is a wide residual network [38] (**wideresnet**) of 28 layers and width factor 10, as described in [11]. It was trained on mini-ImageNet with a classification loss (classification error) and an auxiliary self-supervised loss, and fine-tuned using manifold mixup [39]. Its results are among the best reported in the literature. The second backbone is a DenseNet [40] (**densenet**) trained on tiered-ImageNet, from [12]. As advised in the original papers, all feature vectors are divided by their ${\mathrm{L}}_{2}$-norm: given $\mathbf{f}\in \mathcal{F}$, $\mathbf{f}\leftarrow \frac{\mathbf{f}}{\left\Vert \mathbf{f}\right\Vert_{2}}$.
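This normalization is a one-liner; the sketch below adds a guard against zero-norm vectors, which is a safeguard of this sketch rather than a detail from the paper:

```python
import numpy as np

def l2_normalize(features, eps=1e-12):
    """Divide each feature vector by its L2 norm."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, eps)
```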

#### 5.3. Evaluation Metrics

#### 5.4. Correlations in the Supervised Setting

#### 5.5. Correlations in the Unsupervised Setting

#### 5.6. Correlations in the Semi-Supervised Setting

#### 5.7. Predicting Task Accuracy

**Supervised setting:** In Figure 6a, the ROC curve is built by varying a threshold value over the LR loss. When selecting a threshold value at $(0.29,0.81)$, we obtain a confusion matrix on the second set where 1 − specificity becomes $0.14$ and sensitivity becomes $0.45$. As both values are lower, the chosen threshold does not transfer to the second set.

**Unsupervised setting:** In Figure 6b, the ROC curve is built by varying a threshold value over the DB-score. When selecting a threshold value at $(0.30,0.81)$, the confusion matrix shows that 1 − specificity becomes $0.64$ and sensitivity becomes $0.95$. Here, both values are higher. Once again, the chosen threshold does not transfer to the second set.

**Semi-supervised setting:** In Figure 6c, the ROC curve is built by varying a threshold value over the LR confidence. We select the threshold value at $(0.16,0.81)$. In the confusion matrix, 1 − specificity becomes $0.18$ and sensitivity becomes $0.76$. Both values are similar on the two sets, so the chosen threshold generalizes to the second set.
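The procedure used above (pick a threshold on a first set of classes, then check 1 − specificity and sensitivity on a second set) can be sketched with scikit-learn. Function and parameter names are illustrative, and the threshold is selected here by a target false-positive rate rather than by the exact operating points quoted in the text:

```python
import numpy as np
from sklearn.metrics import roc_curve, confusion_matrix

def pick_and_transfer_threshold(labels_a, scores_a, labels_b, scores_b,
                                target_fpr=0.3):
    """Pick the ROC threshold on set A whose false-positive rate is closest
    to target_fpr, then report (1 - specificity, sensitivity) on set B.
    Labels are 1 for tasks above the accuracy threshold; scores must be
    oriented so that larger means more likely positive (e.g. use the
    negated LR loss)."""
    fpr, tpr, thresholds = roc_curve(labels_a, scores_a)
    thr = thresholds[np.argmin(np.abs(fpr - target_fpr))]
    preds_b = (scores_b >= thr).astype(int)
    tn, fp, fn, tp = confusion_matrix(labels_b, preds_b, labels=[0, 1]).ravel()
    return fp / (fp + tn), tp / (tp + fn)
```

If the returned pair on set B is close to the operating point chosen on set A, the threshold generalizes across class sets.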

#### 5.8. Using Per-Sample Confidence to Annotate the Hardest Samples

The features are extracted with **wideresnet** and diffused through a similarity graph. We observe that when the number of labeled samples is small, it is better to select samples at random, probably because classes are more balanced when the annotation is randomized. Above a certain number of labeled samples, it becomes clearly more efficient to choose which samples to label. This is not surprising, as the chosen elements are the ones with the lowest confidences, meaning that the remaining ones are easy to classify.

#### 5.9. Additional Experiments

The features are extracted with **wideresnet** from mini-ImageNet, and diffused through a similarity graph. We observe that, in both the semi-supervised and unsupervised settings, the index of the best eigenvalue is lower than N.

## 6. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## Appendix A. Details about the Training of the Classifiers

## Appendix B. Models Performance on Various Tasks

**Figure A1.** Performance of the models used in Figure 2, Figure 4 and Figure 5. By default, 5-way 5-shot 30-query tasks are generated. Mini/Tiered means that the data come from mini-ImageNet/tiered-ImageNet. LR denotes the accuracies obtained in the supervised setting; Adapted LR, the accuracies obtained in the semi-supervised setting; N-means, the ARIs obtained in the unsupervised setting. For reasons of scale, the ARIs are multiplied by 100.

## Appendix C. Influence of the Number of Nearest Neighbors

**Figure A2.** Influence of the number of neighbors k on the correlations. The features are extracted with **wideresnet** from mini-ImageNet and diffused through a k-nearest neighbors similarity graph. 5-way 5-shot 30-query tasks are generated. In (**a**), the correlations are computed between the measures and the accuracy of a LR on the unlabeled samples. In (**b**), they are computed between the measures and the ARI of a N-means on the unlabeled samples. Each point is obtained over $10,000$ random tasks.

## References

1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM **2017**, 60, 84–90.
2. Aytar, Y.; Vondrick, C.; Torralba, A. Soundnet: Learning sound representations from unlabeled video. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 892–900.
3. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the game of Go without human knowledge. Nature **2017**, 550, 354.
4. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
5. Burt, J.R.; Torosdagli, N.; Khosravan, N.; RaviPrakash, H.; Mortazi, A.; Tissavirasingham, F.; Hussein, S.; Bagci, U. Deep learning beyond cats and dogs: Recent advances in diagnosing breast cancer with deep neural networks. Br. J. Radiol. **2018**, 91, 20170545.
6. Zhao, Z.Q.; Zheng, P.; Xu, S.T.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. **2019**, 30, 3212–3232.
7. Ma, J.; Zhou, C.; Cui, P.; Yang, H.; Zhu, W. Learning disentangled representations for recommendation. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 5711–5722.
8. Gupta, V.; Sambyal, N.; Sharma, A.; Kumar, P. Restoration of artwork using deep neural networks. Evol. Syst. **2019**.
9. Caruana, R.; Lawrence, S.; Giles, C.L. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 3–8 December 2001; pp. 402–408.
10. Guyon, I. A Scaling Law for the Validation-Set Training-Set Size Ratio. Available online: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.1337&rep=rep1&type=pdf (accessed on 9 January 2021).
11. Mangla, P.; Kumari, N.; Sinha, A.; Singh, M.; Krishnamurthy, B.; Balasubramanian, V.N. Charting the right manifold: Manifold mixup for few-shot learning. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 2218–2227.
12. Wang, Y.; Chao, W.L.; Weinberger, K.Q.; van der Maaten, L. SimpleShot: Revisiting Nearest-Neighbor Classification for Few-Shot Learning. arXiv **2019**, arXiv:1911.04623.
13. Finn, C.; Abbeel, P.; Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017.
14. Rusu, A.A.; Rao, D.; Sygnowski, J.; Vinyals, O.; Pascanu, R.; Osindero, S.; Hadsell, R. Meta-Learning with Latent Embedding Optimization. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
15. Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 4077–4087.
16. Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; Wierstra, D. Matching networks for one shot learning. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 3630–3638.
17. Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.; Hospedales, T.M. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1199–1208.
18. Oreshkin, B.; López, P.R.; Lacoste, A. Tadam: Task dependent adaptive metric for improved few-shot learning. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 721–731.
19. Ye, H.J.; Hu, H.; Zhan, D.C.; Sha, F. Learning embedding adaptation for few-shot learning. arXiv **2018**, arXiv:1812.03664.
20. Chen, W.Y.; Liu, Y.C.; Kira, Z.; Wang, Y.C.F.; Huang, J.B. A Closer Look at Few-shot Classification. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
21. Tian, Y.; Wang, Y.; Krishnan, D.; Tenenbaum, J.B.; Isola, P. Rethinking Few-Shot Image Classification: A Good Embedding Is All You Need? arXiv **2020**, arXiv:2003.11539.
22. Milbich, T.; Roth, K.; Bharadhwaj, H.; Sinha, S.; Bengio, Y.; Ommer, B.; Cohen, J.P. DiVA: Diverse Visual Feature Aggregation for Deep Metric Learning. arXiv **2020**, arXiv:2004.13458.
23. Lichtenstein, M.; Sattigeri, P.; Feris, R.; Giryes, R.; Karlinsky, L. TAFSSL: Task-Adaptive Feature Sub-Space Learning for few-shot classification. arXiv **2020**, arXiv:2003.06670.
24. Hu, Y.; Gripon, V.; Pateux, S. Exploiting Unsupervised Inputs for Accurate Few-Shot Classification. arXiv **2020**, arXiv:2001.09849.
25. Bengio, Y.; Courville, A.; Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. **2013**, 35, 1798–1828.
26. Arjovsky, M.; Bottou, L.; Gulrajani, I.; Lopez-Paz, D. Invariant risk minimization. arXiv **2019**, arXiv:1907.02893.
27. Xu, Y.; Zhao, S.; Song, J.; Stewart, R.; Ermon, S. A Theory of Usable Information under Computational Constraints. arXiv **2020**, arXiv:2002.10689.
28. Hjelm, R.D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Bachman, P.; Trischler, A.; Bengio, Y. Learning deep representations by mutual information estimation and maximization. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
29. Wang, Z.; Du, B.; Guo, Y. Domain adaptation with neural embedding matching. IEEE Trans. Neural Netw. Learn. Syst. **2020**, 31, 2387–2397.
30. Lu, J.; Jin, S.; Liang, J.; Zhang, C. Robust Few-Shot Learning for User-Provided Data. IEEE Trans. Neural Netw. Learn. Syst. **2020**.
31. Jiang, Y.; Neyshabur, B.; Mobahi, H.; Krishnan, D.; Bengio, S. Fantastic Generalization Measures and Where to Find Them. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
32. Jiang, Y.; Krishnan, D.; Mobahi, H.; Bengio, S. Predicting the Generalization Gap in Deep Networks with Margin Distributions. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
33. Davies, D.L.; Bouldin, D.W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. **1979**, PAMI-1, 224–227.
34. Shuman, D.I.; Narang, S.K.; Frossard, P.; Ortega, A.; Vandergheynst, P. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Process. Mag. **2013**, 30, 83–98.
35. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. **2015**, 115, 211–252.
36. Ren, M.; Triantafillou, E.; Ravi, S.; Snell, J.; Swersky, K.; Tenenbaum, J.B.; Larochelle, H.; Zemel, R.S. Meta-learning for semi-supervised few-shot classification. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
37. Miller, G.A. WordNet: A lexical database for English. Commun. ACM **1995**, 38, 39–41.
38. Zagoruyko, S.; Komodakis, N. Wide Residual Networks. In Proceedings of the British Machine Vision Conference (BMVC), York, UK, 19–22 September 2016; pp. 87.1–87.12.
39. Verma, V.; Lamb, A.; Beckham, C.; Najafi, A.; Mitliagkas, I.; Lopez-Paz, D.; Bengio, Y. Manifold Mixup: Better Representations by Interpolating Hidden States. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019.
40. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
41. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
42. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. **2011**, 12, 2825–2830.

**Figure 2.** Supervised setting. Study of the linear correlations between the measures and the accuracy of a LR computed on a test set. In (**a**,**b**), the data come from mini-ImageNet; their features are extracted with **wideresnet**. In (**c**,**d**), the data come from tiered-ImageNet; their features are extracted with **densenet**. See Section 5 for details. By default, 5-way 5-shot tasks are generated. In (**a**,**c**), the number of shots varies. In (**b**,**d**), the number of classes varies. Each point is obtained over $10,000$ random tasks.

**Figure 3.** Supervised setting. Each point represents a task. We plot the accuracy of a LR as a function of the loss of the LR on the training samples. In (**a**), we consider $10,000$ random 5-way 5-shot tasks. In (**b**), we consider $10,000$ random 5-way 1-shot tasks. The data samples come from mini-ImageNet; their features are extracted with **wideresnet**.

**Figure 4.** Unsupervised setting. Study of the linear correlations between the measures and the ARI of a N-means algorithm. The ARI is computed on the $NQ$ unlabeled samples available during training. In (**a**,**b**), the data come from mini-ImageNet; their features are extracted with **wideresnet**. In (**c**,**d**), the data come from tiered-ImageNet; their features are extracted with **densenet**. All features are diffused through a similarity graph. See Section 3 for details. By default, 5-way 35-query tasks are generated. In (**a**,**c**), the number of queries varies. In (**b**,**d**), the number of classes varies. Each point is obtained over $10,000$ random tasks.

**Figure 5.** Semi-supervised setting. Study of the linear correlations between the measures and the accuracy of a LR on the $NQ$ unlabeled samples available during training. In (**a**–**d**), the data samples come from mini-ImageNet; their features are extracted with **wideresnet** and diffused through a similarity graph. In (**e**–**h**), the samples come from tiered-ImageNet; their features are extracted with **densenet** and diffused through a similarity graph. See Section 3 for details. By default, 5-way 5-shot 30-query tasks are generated. In (**a**,**e**), the number of queries varies. In (**b**,**f**), the number of shots varies. In (**c**,**g**), the number of classes varies. Each point is obtained over $10,000$ random tasks. In (**d**,**h**), each point represents a task; we plot the accuracy of the LR as a function of the LR confidence. In (**d**), 5-way 5-shot 30-query tasks are generated from mini-ImageNet; in (**h**), from tiered-ImageNet.

**Figure 6.** Task prediction. The ROC curves are computed over 10 classes of mini-ImageNet. The tables are computed on 10 other classes, applying the threshold value denoted by a red point on the curves. Features are extracted with **wideresnet**. In both cases, $10,000$ 5-way 5-shot (30-query) tasks are randomly generated. In (**a**), the variable is the LR loss; in (**b**), the DB-score; and in (**c**), the LR confidence.

**Figure 7.** Using per-sample confidence to label data in a semi-supervised setting. We consider 5000 random 5-way 1-shot 50-query tasks. After a first training, we label either random samples or samples with a low LR confidence. In both cases, the accuracies after a second training are reported.

**Figure 8.** Influence of the proportion p of unlabeled data samples in a class with respect to the other ones, in a semi-supervised setting. The features are extracted with **wideresnet** from mini-ImageNet, and diffused through a similarity graph. We report the linear correlations between the measures and the accuracy of a LR on the unlabeled samples. In (**a**), 2-way 5-shot 50-query tasks are generated; in (**b**), 5-way 5-shot 50-query tasks. The proportion of samples in the other classes is identical. Each point is obtained over $10,000$ random tasks.

**Figure 9.** Analysis of the relevance of eigenvalues with different numbers of classes. In the semi-supervised setting (**a**), 5-shot 30-query tasks are generated; in the unsupervised setting (**b**), 35-query tasks. We report the linear correlations between the eigenvalue N and the accuracy of the LR (**a**)/the ARI of the N-means algorithm (**b**). In both settings, we also report the index of the eigenvalue which enables the best correlation. Each point is obtained over $10,000$ random tasks.

**Table 1.** Summary of the solutions considered to predict the generalization ability of a classifier trained on few examples. The solutions are measures designed to quantify how well a trained model generalizes to unseen data.

| SOLUTIONS | Supervised<br>(N-way K-shot) | Semi-Supervised<br>(N-way K-shot Q-query *) | Unsupervised<br>(N-way Q-query *) |
|---|---|---|---|
| *Using available labels and features of data samples* | | | |
| Training loss of the logistic regression | √ | √ | × |
| Similarities between labeled samples | √ | √ | × |
| Confidence in the output of the logistic regression | × | √ | × |
| *Using only data relationships* | | | |
| Eigenvalues of a graph Laplacian | √ | √ | √ |
| Davies-Bouldin score after a N-means algorithm | √ | √ | √ |


© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Bontonou, M.; Béthune, L.; Gripon, V.
Predicting the Generalization Ability of a Few-Shot Classifier. *Information* **2021**, *12*, 29.
https://doi.org/10.3390/info12010029
