#
The Details Matter: Preventing Class Collapse in Supervised Contrastive Learning^{ †}

^{*}

^{†}

^{‡}

## Abstract

**:**

## 1. Introduction

- We evaluate how well ${L}_{spread}$’s embeddings encode fine-grained subclasses with coarse-to-fine transfer learning. ${L}_{spread}$ achieves up to 4.4 points of lift across four datasets.
- We evaluate how well embeddings produced by ${L}_{spread}$ can recover strata in an unsupervised setting by evaluating robustness against worst-group accuracy and noisy labels. We use our insights about how ${L}_{spread}$ embeds strata of different sizes to improve worst-group robustness by up to 2.5 points and to recover 75% performance when 20% of the labels are noisy.
- We evaluate how well we can differentiate rare strata from common strata by constructing limited subsets of the training data that can achieve the highest performance under a fixed training strategy (the coreset problem). We construct coresets by subsampling points from common strata. Our coresets outperform prior work by 1.0 points when coreset size is 30% of the training set.

## 2. Background

#### 2.1. Data Setup

#### 2.2. Supervised Contrastive Loss

#### 2.2.1. SupCon and Collapsed Embeddings

#### 2.2.2. End Model

## 3. Method

#### 3.1. Theoretical Motivation

**Observation**

**1.**

**Observation**

**2.**

#### 3.2. Modified Contrastive Loss ${L}_{spread}$

#### Qualitative Evaluation

## 4. Geometry of Strata

#### 4.1. Existing Analysis

#### 4.2. Subsampling Strata

**Both strata appear in ${\mathcal{D}}_{t}$**The encoder ${\widehat{f}}_{t}$ is trained on both z and ${z}^{\prime}$. For large N, we can approximate this setting by considering ${\widehat{f}}_{t}$ trained on infinite data from these strata. Points belonging to these strata will be defined in the optimal embedding distribution on the hypersphere, which can be characterized by prior theoretical approaches [2,8,9]. With ${L}_{spread}$, $\delta ({\widehat{f}}_{t},z,{z}^{\prime})$ depends on $\alpha $, which controls the extent of spread in the embedding geometry. With ${L}_{SC}$, points from the two strata would asymptotically map to one location on the hypersphere, and $\delta ({\widehat{f}}_{t},z,{z}^{\prime})$ would converge to 0. This case occurs with probability increasing in $p\left(z\right),p\left({z}^{\prime}\right),$ and t.**One stratum but not the other appears in ${\mathcal{D}}_{t}$**Without loss of generality, suppose that points from z appear in ${\mathcal{D}}_{t}$ but no points from ${z}^{\prime}$ do. To understand $\delta ({\widehat{f}}_{t},z,{z}^{\prime})$, we can consider how the end model $\widehat{p}\left(y\right|{\widehat{f}}_{t}\left(x\right))$ learned using the “source” distribution containing z performs on the “target” distribution of stratum ${z}^{\prime}$ since this downstream classifier is a function of distances in embedding space. Borrowing from the literature in domain adaptation, the difficulty of this out-of-distribution problem depends on both the divergence between source z and target ${z}^{\prime}$ distributions and the capacity of the overall model. The $\mathcal{H}\Delta \mathcal{H}$-divergence from Ben-David et al. [10,11], which is studied in lower bounds in Ben-David and Urner [12], and the discrepancy difference from Mansour et al. [13] capture both concepts. Moreover, the optimal geometries of ${L}_{spread}$ and ${L}_{SC}$ induce different end model capacities and prediction distributions, with data being more separable under ${L}_{SC}$, which can help explain why ${L}_{spread}$ better preserves strata distances. This case occurs with probability increasing in $p\left(z\right)$ and decreasing in $p\left({z}^{\prime}\right)$ and t.**Neither strata appears in ${\mathcal{D}}_{t}$**The distance $\delta ({\widehat{f}}_{t},z,{z}^{\prime})$ in this case is at most $2{D}_{\mathrm{TV}}({\mathcal{P}}_{z},{\mathcal{P}}_{{z}^{\prime}})$ (total variation distance) regardless of how the encoder is trained, although differences in transfer from models learned on $\mathcal{Z}\backslash z,{z}^{\prime}$ to z versus ${z}^{\prime}$ can be further analyzed. This case occurs with probability decreasing in $p\left(z\right),p\left({z}^{\prime}\right),$ and t.

Common strata are more tightly clustered together, while rarer and more semantically distinct strata are far away from them.

#### 4.3. Implications

#### 4.3.1. Theoretical Implications

**Lemma**

**1.**

**Lemma**

**2.**

#### 4.3.2. Practical Implications

## 5. Experiments

- First, in Section 5.2, we use coarse-to-fine transfer learning to evaluate how well the embeddings maintain strata information. We find that ${L}_{spread}$ achieves lift across four datasets.
- In Section 5.3, we evaluate how well ${L}_{spread}$ can detect rare strata in an unsupervised setting. We first use ${L}_{spread}$ to detect rare strata to improve worst-group robustness by up to 2.5 points. We then use rare strata detection to correct noisy labels, recovering 75% performance under 20% noise.
- In Section 5.4, we evaluate how well ${L}_{spread}$ can distinguish points from large strata versus points from small strata. We downsample points from large strata to construct minimal coresets on CIFAR10, outperforming prior work by 1.0 points at 30% labeled data.
- Finally, in Section 5.5, we show that training with ${L}_{spread}$ improves model quality, validating our theoretical claims that preventing class collapse can improve generalization error. We find that ${L}_{spread}$ improves performance in 7 out of 9 cases.

#### 5.1. Datasets and Models

#### 5.2. Coarse-to-Fine Transfer Learning

#### 5.3. Robustness against Worst-Group Accuracy and Noise

#### 5.4. Minimal Coreset Construction

#### 5.5. Model Quality

## 6. Related Work and Discussion

**contrastive learning**, we take inspiration from [21], who use a latent classes view to study self-supervised contrastive learning. Similarly, [22] considers how minimizing the InfoNCE loss recovers a latent data generating model. We initially started from a debiasing angle to study the effects of noise in supervised contrastive learning inspired by [23], but moved to our current strata-based view of noise instead. Recent work has also analyzed contrastive learning from the information-theoretic perspective [24,25,26], but does not fully explain practical behavior [27], so we focus on the geometric perspective in this paper because of the downstream applications. On the geometric side, we are inspired by the theoretical tools from [8] and [2], who study representations on the hypersphere along with [9].

**strata**is strongly inspired by [5,6], who document empirical consequences of hidden strata. We are inspired by empirical work that has demonstrated that detecting subclasses can be important for performance [4,34] and robustness [14,35,36].

**applications**is a field in itself, and we take inspiration from recent work from each. Our noise heuristic is similar to the ELR [37] and takes inspiration from a various work using contrastive learning to correct noisy labels and for semi-supervised learning [38,39,40]. Our coreset algorithm is inspired by recent work in coresets for modern deep networks [19,41,42], and takes inspiration from [18] in particular.

## 7. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Appendix A. Glossary

Symbol | Used for |
---|---|

${L}_{SC}$ | SupCon (see Section 2.2), a supervised contrastive loss introduced by [1]. |

${L}_{spread}$ | Our modified loss function defined in Section 3.2. |

x | Input data $x\in \mathcal{X}$. |

y | Class label $y\in \mathcal{Y}=\{1,\cdots ,K\}$. |

$\mathcal{D}$ | Dataset of N points ${\left\{({x}_{i},{y}_{i})\right\}}_{i=1}^{N}$ drawn i.i.d. from $\mathcal{P}$. |

$h\left(x\right)$ | The class that x belongs to, i.e., $h\left(x\right)$ is a label drawm from $p\left(y\right|x)$. This label |

information is used as input in the supervised contrastive loss. | |

$\widehat{p}\left(y\right|x)$ | The end model’s predicted distribution over y given x. |

z | A stratum is a latent variable $z\in \mathcal{Z}=\{1,\cdots ,C\}$ that further categorizes data |

beyond labels. | |

${S}_{k}$ | The set of all strata corresponding to label k (deterministic). |

$S\left(c\right)$ | The label corresponding to strata c (deterministic). |

${\mathcal{P}}_{z}$ | The distribution of input data belonging to stratum z, i.e., $x\sim p(\xb7|z)$. |

m | The number of strata per class. |

d | Dimension of the embedding space. |

f | The encoder $f:\mathcal{X}\to {\mathbb{R}}^{d}$ maps input data to an embedding space and is learned by |

minimizing the contrastive loss function. | |

${\mathbb{S}}^{d-1}$ | The unit hypersphere, formally $\{v\in {\mathbb{R}}^{d}:\parallel v{\parallel}_{2}=1\}$. |

$\tau $ | Temperature hyperparameter in contrastive loss function. |

$\sigma (x,{x}^{\prime})$ | Notation for $\frac{f{\left(x\right)}^{\top}f\left({x}^{\prime}\right)}{\tau}$. |

$\mathcal{B}$ | Set of batches of labeled data on $\mathcal{D}$. |

$P(i,B)$ | Points in B with the same label as ${x}_{i}$, formally $\{p\in B\backslash i:h(p)=h(i\left)\right\}$. |

${\left\{{v}_{i}\right\}}_{i=1}^{K}$ | A regular simplex inscribed in the hypersphere (see Definition A1). |

W | The weight matrix that parametrizes the downstream linear classifier |

(end model) learned on $f\left(x\right)$. | |

$\widehat{\mathcal{L}}(W,\mathcal{D})$ | The empirical cross entropy loss used to learn W over dataset $\mathcal{D}$ (see (A1)). |

$\mathcal{L}(x,y,f)$ | The generalization error of the end model of predicting output y on x using |

encoder f (see (A2) and (A3)). | |

${L}_{attract}$ | A variant on SupCon that is used in ${L}_{spread}$ that pushes points of a class together |

(see (2)). | |

${L}_{repel}$ | A class-conditional InfoNCE loss that is used in ${L}_{spread}$ to pull apart points |

within a class (see ()). | |

$\alpha $ | Hyperparameter $\alpha \in [0,1]$ controls how to balance ${L}_{attract}$ and ${L}_{repel}$. |

${x}^{aug}$ | An augmentation of data point x. |

$N(i,B)$ | Points in B with a label different from that of ${x}_{i}$, formally $\{a\in B\backslash i:h(a)\ne h(i\left)\right\}$. |

t | Fraction of training data $t\in [0,1]$ that is varied in our thought experiment. |

${\mathcal{D}}_{t}$ | Randomly sampled dataset from $\mathcal{P}$ with size equal to $t\xb7N$ fraction of $\mathcal{D}$. |

${\widehat{f}}_{t}$ | Encoder trained on sampled dataset ${\mathcal{D}}_{t}$. |

$\delta ({\widehat{f}}_{t},z,{z}^{\prime})$ | The distance between centers of strata z and ${z}^{\prime}$ under encoder ${\widehat{f}}_{t}$, |

namely $\delta ({\widehat{f}}_{t},z,{z}^{\prime})={\parallel {\mathbb{E}}_{x\sim {\mathcal{P}}_{z}}\left[{\widehat{f}}_{t}\left(x\right)\right]-{\mathbb{E}}_{x\sim {\mathcal{P}}_{{z}^{\prime}}}\left[{\widehat{f}}_{t}\left(x\right)\right]\parallel}_{2}$. |

## Appendix B. Definitions

**Definition**

**A1**

- 1.
- ${\sum}_{i=1}^{K}{v}_{i}=0$
- 2.
- $\parallel {v}_{i}\parallel =1$ for all i
- 3.
- $\exists {c}_{K}\le 1$ s.t. ${v}_{i}^{\top}{v}_{j}={c}_{K}$ for $i\ne j$

**Definition**

**A2**

## Appendix C. Additional Theoretical

#### Appendix C.1. Transfer Learning on $({x}^{\prime},{y}^{\prime})$

**Observation**

**A1.**

#### Appendix C.2. Probabilities of Strata $z,{z}^{\prime}$ Appearing in Subsampled Dataset

#### Appendix C.3. Performance of Collapsed Embeddings on Coarse-to-Fine Transfer and Original Task

**Lemma**

**A1.**

**Proof.**

## Appendix D. Proofs

#### Appendix D.1. Proofs for Theoretical Motivation

**Lemma**

**A2**

**Proof.**

**Corollary**

**A1.**

**Proof**

**of**

**Observation**

**1.**

**Proof**

**of**

**Observation**

**2.**

#### Appendix D.2. Proofs for Theoretical Implications

**Proof**

**of**

**Lemma**

**1.**

**Proof**

**of**

**Lemma**

**2**.

## Appendix E. Additional Experimental Details

#### Appendix E.1. Datasets

**CIFAR10**,**CIFAR100**, and**MNIST**are all the standard computer vision datasets.**CIFAR10-Coarse**consists of two superclasses: animals (dog, cat, deer, horse, frog, bird) and vehicles (car, truck, plane, boat).**CIFAR100-Coarse**consists of twenty superclasses. We artificially imbalance subclasses to create**CIFAR100-Coarse-U**. For each superclass, we select one subclass to keep all 500 points, select one subclass to subsample to 250 points, select one subclass to subsample to 100 points, and select the remaining two to subsample to 50 points. We use the original CIFAR100 class index to select which subclasses to subsample: the subclass with the lowest original class index keeps all 500 points, the next subclass keeps 250 points, etc.**MNIST-Coarse**consists of two superclasses: <5 and ≥5.**Waterbirds**[14] is a robustness dataset designed to evaluate the effects of spurious correlations on model performance. The waterbirds dataset is constructed by cropping out birds from photos in the Caltech-UCSD Birds dataset [43], and pasting them on backgrounds from the Places dataset [44]. It consists of two categories: water birds and land birds. The water birds are heavily correlated with water backgrounds and the land birds with land backgrounds, but 5% of the water birds are on land backgrounds, and 5% of the land birds are on water backgrounds. These form the (imbalanced) hidden strata.**ISIC**is a public skin cancer dataset for classifying skin lesions [15] as malignant or benign. 48% of the benign images contain a colored patch, which form the hidden strata.

#### Appendix E.2. Hyperparameters

#### Appendix E.3. Applications

#### Appendix E.3.1. Robustness against Worst-Group Performance

#### Appendix E.3.2. Robustness against Noise

#### Appendix E.3.3. Minimal Coreset Construction

## Appendix F. Additional Experimental Results

#### Appendix F.1. Performance of ${L}_{attract}$

**Table A2.**Performance of ${L}_{spread}$ compared to ${L}_{SC}$ and using ${L}_{attract}$ on its own. Best in bold.

End Model Perf. | ||||
---|---|---|---|---|

Dataset | ${\mathit{L}}_{\mathbf{SS}}$ | ${\mathit{L}}_{\mathbf{SC}}$ | ${\mathit{L}}_{\mathbf{attract}}$ | ${\mathit{L}}_{\mathbf{spread}}$ |

CIFAR10 | 89.7 | 90.9 | 91.3 | 91.5 |

CIFAR100 | 68.0 | 67.5 | 68.9 | 69.1 |

#### Appendix F.2. Sample Complexity

**Figure A1.**Performance of training ViT with ${L}_{spread}$ compared to training with ${L}_{SC}$ and ${L}_{SS}$ on CIFAR10 at various amounts of labeled data. ${L}_{spread}$ outperforms the baselines at each point. The cross entropy head here is trained with 1% labeled data to isolate the effect of training data on the contrastive losses.

#### Appendix F.3. Noisy Labels

#### Appendix F.3.1. Debiasing Noisy Contrastive Loss

#### Noise-Aware Triplet Loss

**Lemma**

**A3.**

**Proof.**

**Lemma**

**A4.**

**Proof.**

**Proposition**

**A1.**

**Observation**

**A2.**

#### Appendix F.3.2. Additional Noisy Label Results

**Figure A2.**Performance of models under various amounts of label noise for the contrastive loss head, and various amounts of clean training data for the cross entropy loss.

## References

- Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Mschinot, A.; Liu, C.; Krishnan, D. Supervised Contrastive Learning. Adv. Neural Inf. Process. Syst.
**2020**, 33, 18661–18673. [Google Scholar] - Graf, F.; Hofer, C.; Niethammer, M.; Kwitt, R. Dissecting Supervised Constrastive Learning. Proc. Int. Conf. Mach. Learn. PMLR.
**2021**, 139, 3821–3830. [Google Scholar] - Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; Vinyals, O. Understanding deep learning requires rethinking generalization. Commun. ACM
**2016**, 64, 107–115. [Google Scholar] [CrossRef] - Hoffmann, A.; Kwok, R.; Compton, P. Using subclasses to improve classification learning. In European Conference on Machine Learning; Springer: Berlin/Heidelberg, Germany, 2001; pp. 203–213. [Google Scholar]
- Sohoni, N.; Dunnmon, J.; Angus, G.; Gu, A.; Ré, C. No Subclass Left Behind: Fine-Grained Robustness in Coarse-Grained Classification Problems. Adv. Neural Inf. Process. Syst.
**2020**, 33, 19339–19352. [Google Scholar] - Oakden-Rayner, L.; Dunnmon, J.; Carneiro, G.; Ré, C. Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In Proceedings of the Proceedings of the ACM conference on Health, Inference, and Learning, Toronto, ON, Canada, 2–4 April 2020; pp. 151–159. [Google Scholar]
- Linsker, R. Self-organization in a perceptual network. Computer
**1988**, 21, 105–117. [Google Scholar] [CrossRef] - Wang, T.; Isola, P. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. Proc. Int. Conf. Mach. Learn. PMLR
**2020**, 119, 9929–9939. [Google Scholar] - Robinson, J.; Chuang, C.Y.; Sra, S.; Jegelka, S. Contrastive learning with hard negative samples. arXiv
**2020**, arXiv:2010.04592. [Google Scholar] - Ben-David, S.; Blitzer, J.; Crammer, K.; Kulesza, A.; Pereira, F.; Vaughan, J.W. A theory of learning from different domains. Mach. Learn.
**2010**, 79, 151–175. [Google Scholar] [CrossRef] [Green Version] - Ben-David, S.; Blitzer, J.; Crammer, K.; Pereira, F. Analysis of representations for domain adaptation. Adv. Neural Inf. Process. Syst.
**2007**, 19, 137. [Google Scholar] - Ben-David, S.; Urner, R. On the Hardness of Domain Adaptation and the Utility of Unlabeled Target Samples. In Proceedings of the 23rd International Conference, Lyon, France, 29–31 October 2012; Bshouty, N.H., Stoltz, G., Vayatis, N., Zeugmann, T., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 139–153. [Google Scholar]
- Mansour, Y.; Mohri, M.; Rostamizadeh, A. Domain adaptation: Learning bounds and algorithms. arXiv
**2009**, arXiv:0902.3430. [Google Scholar] - Sagawa, S.; Koh, P.W.; Hashimoto, T.B.; Liang, P. Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization. arXiv
**2019**, arXiv:1911.08731. [Google Scholar] - Codella, N.; Rotemberg, V.; Tschandl, P.; Celebi, M.E.; Dusza, S.; Gutman, D.; Helba, B.; Kalloo, A.; Liopyris, K.; Marchetti, M.; et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic). arXiv
**2019**, arXiv:1902.03368. [Google Scholar] - Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep Learning Face Attributes in the Wild. In Proceedings of the Proceedings of International Conference on Computer Vision (ICCV), Santiago, Chile, 7 December 2015. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv
**2020**, arXiv:2010.11929. [Google Scholar] - Toneva, M.; Sordoni, A.; des Combes, R.T.; Trischler, A.; Bengio, Y.; Gordon, G.J. An Empirical Study of Example Forgetting during Deep Neural Network Learning. arXiv
**2018**, arXiv:1812.05159. [Google Scholar] - Paul, M.; Ganguli, S.; Dziugaite, G.K. Deep Learning on a Data Diet: Finding Important Examples Early in Training. arXiv
**2021**, arXiv:2107.07075. [Google Scholar] - Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. Proc. Int. Conf. Mach. Learn. PMLR
**2020**, 119, 1597–1607. [Google Scholar] - Arora, S.; Khandeparkar, H.; Khodak, M.; Plevrakis, O.; Saunshi, N. A theoretical analysis of contrastive unsupervised representation learning. arXiv
**2019**, arXiv:1902.09229. [Google Scholar] - Zimmermann, R.S.; Sharma, Y.; Schneider, S.; Bethge, M.; Brendel, W. Contrastive Learning Inverts the Data Generating Process. arXiv
**2021**, arXiv:2012.08850. [Google Scholar] - Chuang, C.Y.; Robinson, J.; Torralba, A.; Jegelka, S. Debiased Contrastive Learning. Adv. Neural Inf. Process. Syst.
**2020**, 33, 8765–8775. [Google Scholar] - Oord, A.v.d.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv
**2018**, arXiv:1807.03748. [Google Scholar] - Tian, Y.; Sun, C.; Poole, B.; Krishnan, D.; Schmid, C.; Isola, P. What makes for good views for contrastive learning? arXiv
**2020**, arXiv:2005.10243. [Google Scholar] - Tsai, Y.H.H.; Wu, Y.; Salakhutdinov, R.; Morency, L.P. Self-supervised Learning from a Multi-view Perspective. arXiv
**2020**, arXiv:2006.05576. [Google Scholar] - Tschannen, M.; Djolonga, J.; Rubenstein, P.K.; Gelly, S.; Lucic, M. On Mutual Information Maximization for Representation Learning. arXiv
**2019**, arXiv:1907.13625. [Google Scholar] - He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. arXiv
**2019**, arXiv:1911.05722. [Google Scholar] - Chen, X.; Fan, H.; Girshick, R.; He, K. Improved Baselines with Momentum Contrastive Learning. arXiv
**2020**, arXiv:2003.04297. [Google Scholar] - Goyal, P.; Caron, M.; Lefaudeux, B.; Xu, M.; Wang, P.; Pai, V.; Singh, M.; Liptchinsky, V.; Misra, I.; Joulin, A.; et al. Self-supervised Pretraining of Visual Features in the Wild. arXiv
**2021**, arXiv:2103.01988. [Google Scholar] - Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; Joulin, A. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. Adv. Neural Inf. Process. Syst.
**2020**, 33, 9912–9924. [Google Scholar] - Islam, A.; Chen, C.F.; Panda, R.; Karlinsky, L.; Radke, R.; Feris, R. A Broad Study on the Transferability of Visual Representations with Contrastive Learning. arXiv
**2021**, arXiv:2103.13517. [Google Scholar] - Bukchin, G.; Schwartz, E.; Saenko, K.; Shahar, O.; Feris, R.; Giryes, R.; Karlinsky, L. Fine-grained Angular Contrastive Learning with Coarse Labels. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19 June 2021; 2021. [Google Scholar] [CrossRef]
- d’Eon, G.; d’Eon, J.; Wright, J.R.; Leyton-Brown, K. The Spotlight: A General Method for Discovering Systematic Errors in Deep Learning Models. arXiv
**2021**, arXiv:2107.00758. [Google Scholar] - Duchi, J.; Hashimoto, T.; Namkoong, H. Distributionally robust losses for latent covariate mixtures. arXiv
**2020**, arXiv:2007.13982. [Google Scholar] - Goel, K.; Gu, A.; Li, Y.; Re, C. Model Patching: Closing the Subgroup Performance Gap with Data Augmentation. arXiv
**2020**, arXiv:2008.06775. [Google Scholar] - Liu, S.; Niles-Weed, J.; Razavian, N.; Fernandez-Granda, C. Early-Learning Regularization Prevents Memorization of Noisy Labels. Adv. Neural Inf. Process. Syst.
**2020**, 33, 20331–20342. [Google Scholar] - Li, J.; Xiong, C.; Hoi, S.C. Semi-supervised Learning with Contrastive Graph Regularization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11 October 2021. [Google Scholar]
- Ciortan, M.; Dupuis, R.; Peel, T. A Framework using Contrastive Learning for Classification with Noisy Labels. arXiv
**2021**, arXiv:2104.09563. [Google Scholar] [CrossRef] - Li, J.; Socher, R.; Hoi, S.C. DivideMix: Learning with Noisy Labels as Semi-supervised Learning. arXiv
**2020**, arXiv:2002.07394. [Google Scholar] - Ju, J.; Jung, H.; Oh, Y.; Kim, J. Extending Contrastive Learning to Unsupervised Coreset Selection. arXiv
**2021**, arXiv:2103.03574. [Google Scholar] [CrossRef] - Sener, O.; Savarese, S. Active Learning for Convolutional Neural Networks: A Core-Set Approach. arXiv
**2017**, arXiv:1708.00489. [Google Scholar] - Welinder, P.; Branson, S.; Mita, T.; Wah, C.; Schroff, F.; Belongie, S.; Perona, P. Caltech-UCSD Birds 200. In Technical Report CNS-TR-2010-001; California Institute of Technology: Pasadena, CA, USA, 2010. [Google Scholar]
- Zhou, B.; Lapedriza, A.; Xiao, J.; Torralba, A.; Oliva, A. Learning Deep Features for Scene Recognition using Places Database. Adv. Neural Inf. Process. Syst.
**2014**, 27, 487–495. [Google Scholar] - McInnes, L.; Healy, J.; Saul, N.; Großberger, L. UMAP: Uniform Manifold Approximation and Projection. J. Open Source Softw.
**2018**, 3. [Google Scholar] [CrossRef]

**Figure 1.**Classes contain critical information that is not explicitly encoded in the class labels. Supervised contrastive learning (

**left**) loses this information, since it maps unlabeled strata such as sleeping cats, jumping cats, and swatting cat to a single embedding. We introduce a new loss function ${L}_{spread}$ that prevents class collapse and maintains strata distinctions. ${L}_{spread}$ produces higher-quality embeddings, which we evaluate with three downstream applications.

**Figure 2.**${L}_{spread}$ produces embeddings that are qualitatively better than those produced by ${L}_{SC}$. We show t-SNE visualizations of embeddings for the CIFAR10 test set and report cosine similarity metrics (average intracluster cosine similarities, and similarities between individual points and the class cluster). ${L}_{spread}$ produces lower intraclass cosine similarity and embeds images from rare strata further out over the hypersphere than ${L}_{SC}$.

**Figure 3.**Points from large subclasses cluster tightly; points from small subclasses scatter (CIFAR100-Coarse, unbalanced subclasses).

**Figure 4.**(

**Left**) Performance of models under various amounts of label noise for the contrastive loss head. (

**Right**) Performance of a ResNet18 trained with coresets of various sizes. Our coreset algorithm is competitive with the state-of-the-art in the large coreset regime (from 40–90% coresets), but maintains performance for small coresets (smaller than 40%). At the 10% coreset, our algorithm outperforms [18] by 32 points and matches random sampling.

Dataset | Notes |
---|---|

CIFAR10 | Standard computer vision dataset |

CIFAR10-Coarse | CIFAR10 with animal/vehicle coarse labels |

CIFAR100 | Standard computer vision dataset |

CIFAR100-Coarse | CIFAR100 with standard coarse labels |

CIFAR100-Coarse-U | CIFAR100 with standard coarse labels, but with some fine classes |

sub-sampled | |

MNIST | Standard computer vision dataset |

MNIST-Coarse | MNIST with <5 and ≥5 coarse labels |

Waterbirds | Robustness dataset mixing up images of birds and their |

backgrounds [14] | |

ISIC | Images of skin lesions [15] |

CelebA | Images of celebrity faces [16] |

**Table 2.**Performance of coarse-to-fine transfer on various datasets compared against contrastive baselines. In these tasks, we first train a model on coarse task labels, then freeze the representation and train a model on fine-grained subclass labels. ${L}_{spread}$ produces embeddings that transfer better across all datasets. Best in bold.

Coarse-to-Fine Transfer | |||
---|---|---|---|

Dataset | ${\mathit{L}}_{\mathbf{SS}}$ | ${\mathit{L}}_{\mathbf{SC}}$ | ${\mathit{L}}_{\mathbf{spread}}$ |

CIFAR10-Coarse | 71.7 | 52.5 | 76.1 |

CIFAR100-Coarse | 62.0 | 62.4 | 63.9 |

CIFAR100-Coarse-U | 61.9 | 59.5 | 62.4 |

MNIST-Coarse | 97.1 | 98.8 | 99.0 |

**Table 3.**Unsupervised strata recovery performance (top, F1), and worst-group performance (AUROC for ISIC, Acc for others) using recovered strata. Best in bold.

Sub-Group Recovery | |||
---|---|---|---|

Dataset | Sohoni et al. [5] | ${\mathit{L}}_{\mathbf{SC}}$ | ${\mathit{L}}_{\mathbf{spread}}$ |

Waterbirds | 56.3 | 47.2 | 59.0 |

ISIC | 74.0 | 92.5 | 93.8 |

CelebA | 24.2 | 19.4 | 24.8 |

Worst-Group Robustness | |||

Waterbirds | 88.4 | 86.5 | 89.0 |

ISIC | 92.0 | 93.3 | 92.6 |

CelebA | 55.0 | 66.1 | 67.8 |

**Table 4.**End model performance training with ${L}_{spread}$ on various datasets compared against contrastive baselines. All metrics are accuracy except for ISIC (AUROC). ${L}_{spread}$ produces the best performance in 7 out of 9 cases, and matches the best performance in 1 case. Best in bold.

End Model Perf. | |||
---|---|---|---|

Dataset | ${\mathit{L}}_{\mathbf{SS}}$ | ${\mathit{L}}_{\mathbf{SC}}$ | ${\mathit{L}}_{\mathbf{spread}}$ |

CIFAR10 | 89.7 | 90.9 | 91.5 |

CIFAR10-Coarse | 97.7 | 96.5 | 98.1 |

CIFAR100 | 68.0 | 67.5 | 69.1 |

CIFAR100-Coarse | 76.9 | 77.2 | 78.3 |

CIFAR100-Coarse-U | 72.1 | 71.6 | 72.4 |

MNIST | 99.1 | 99.3 | 99.2 |

MNIST-Coarse | 99.1 | 99.4 | 99.4 |

Waterbirds | 77.8 | 73.9 | 77.9 |

ISIC | 87.8 | 88.7 | 90.0 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Fu, D.Y.; Chen, M.F.; Zhang, M.; Fatahalian, K.; Ré, C.
The Details Matter: Preventing Class Collapse in Supervised Contrastive Learning. *Comput. Sci. Math. Forum* **2022**, *3*, 4.
https://doi.org/10.3390/cmsf2022003004

**AMA Style**

Fu DY, Chen MF, Zhang M, Fatahalian K, Ré C.
The Details Matter: Preventing Class Collapse in Supervised Contrastive Learning. *Computer Sciences & Mathematics Forum*. 2022; 3(1):4.
https://doi.org/10.3390/cmsf2022003004

**Chicago/Turabian Style**

Fu, Daniel Y., Mayee F. Chen, Michael Zhang, Kayvon Fatahalian, and Christopher Ré.
2022. "The Details Matter: Preventing Class Collapse in Supervised Contrastive Learning" *Computer Sciences & Mathematics Forum* 3, no. 1: 4.
https://doi.org/10.3390/cmsf2022003004