The Details Matter: Preventing Class Collapse in Supervised Contrastive Learning

Abstract: Supervised contrastive learning optimizes a loss that pushes together embeddings of points from the same class while pulling apart embeddings of points from different classes. Class collapse (when every point from the same class has the same embedding) minimizes this loss but loses critical information that is not encoded in the class labels. For instance, the "cat" label does not capture unlabeled categories such as breeds, poses, or backgrounds (which we call "strata"). As a result, class collapse produces embeddings that are less useful for downstream applications such as transfer learning and achieves suboptimal generalization error when there are strata. We explore a simple modification to supervised contrastive loss that aims to prevent class collapse by uniformly pulling apart individual points from the same class. We seek to understand the effects of this loss by examining how it embeds strata of different sizes, finding that it clusters larger strata more tightly than smaller strata. As a result, our loss function produces embeddings that better distinguish strata in embedding space, which produces lift on three downstream applications: 4.4 points on coarse-to-fine transfer learning, 2.5 points on worst-group robustness, and 1.0 points on minimal coreset construction. Our loss also produces more accurate models, with up to 4.0 points of lift across 9 tasks.


Introduction
Supervised contrastive learning has emerged as a promising method for training deep models, with strong empirical results over traditional supervised learning [1]. Recent theoretical work has shown that, under certain assumptions, class collapse (when the representation of every point from a class collapses to the same embedding on the hypersphere, as in Figure 1) minimizes the supervised contrastive loss L SC [2]. Furthermore, modern deep networks, which can memorize arbitrary labels [3], are powerful enough to produce class collapse.
Although class collapse minimizes L SC and produces accurate models, it loses information that is not explicitly encoded in the class labels. For example, consider images with the label "cat." As shown in Figure 1, some cats may be sleeping, some may be jumping, and some may be swatting at a bug. We call each of these semantically unique categories of data (some of which are rarer than others, and none of which are explicitly labeled) a stratum. Distinguishing strata is important; it can empirically improve model performance [4] and fine-grained robustness [5]. It is also critical in high-stakes applications such as medical imaging [6]. However, L SC maps the sleeping, jumping, and swatting cats all to a single "cat" embedding, losing strata information. As a result, these embeddings are less useful for downstream applications.

Figure 1. Classes contain critical information that is not explicitly encoded in the class labels. Supervised contrastive learning (left) loses this information, since it maps unlabeled strata such as sleeping cats, jumping cats, and swatting cats to a single embedding. We introduce a new loss function L spread that prevents class collapse and maintains strata distinctions. L spread produces higher-quality embeddings, which we evaluate with three downstream applications.
In Section 3, we present our modification to L SC , which prevents class collapse by changing how embeddings are pushed and pulled apart. L SC pushes together embeddings of points from the same class and pulls apart embeddings of points from different classes. In contrast, our modified loss L spread includes an additional class-conditional InfoNCE loss term that uniformly pulls apart individual points from within the same class. This term on its own encourages points from the same class to be maximally spread apart in embedding space, which discourages class collapse (see Figure 1 middle). Even though L spread does not use strata labels, we observe that it still produces embeddings that qualitatively appear to retain more strata information than those produced by L SC (see Figure 2).
In Section 4, motivated by these empirical observations, we study how well L spread preserves distinctions between strata in the representation space. Previous theoretical tools that study the optimal embedding distribution fail to characterize the geometry of strata. Instead, we propose a simple thought experiment considering the embeddings that the supervised contrastive loss generates when it is trained on a partial sample of the dataset. This setup enables us to distinguish strata based on their sizes by considering how likely it is for them to be represented in the sample (larger strata are more likely to appear in a small sample). In particular, we find that points from rarer and more distinct strata are clustered less tightly than points from common strata, and we show that this clustering property can improve embedding quality and generalization error.
In Section 5, we empirically validate several downstream implications of these insights. First, we demonstrate that L spread produces embeddings that retain more information about strata, resulting in lift on three downstream applications that require strata recovery:
• We evaluate how well L spread 's embeddings encode fine-grained subclasses with coarse-to-fine transfer learning. L spread achieves up to 4.4 points of lift across four datasets.
• We evaluate how well embeddings produced by L spread can recover strata in an unsupervised setting by evaluating robustness against worst-group accuracy and noisy labels. We use our insights about how L spread embeds strata of different sizes to improve worst-group robustness by up to 2.5 points and to recover 75% performance when 20% of the labels are noisy.
• We evaluate how well we can differentiate rare strata from common strata by constructing limited subsets of the training data that can achieve the highest performance under a fixed training strategy (the coreset problem). We construct coresets by subsampling points from common strata. Our coresets outperform prior work by 1.0 points when the coreset size is 30% of the training set.
Next, we find that L spread produces higher-quality models, outperforming L SC by up to 4.0 points across 9 tasks. Finally, we discuss related work in Section 6 and conclude in Section 7.

Figure 2. L spread produces embeddings that are qualitatively better than those produced by L SC . We show t-SNE visualizations of embeddings for the CIFAR10 test set and report cosine similarity metrics (average intracluster cosine similarities, and similarities between individual points and the class cluster). L spread produces lower intraclass cosine similarity and embeds images from rare strata further out over the hypersphere than L SC .

Background
We present our generative model for strata (Section 2.1). Then, we discuss supervised contrastive learning (in particular, the SupCon loss L SC from [1] and its optimal embedding distribution [2]) and the end model for classification (Section 2.2).

Data Setup
We have a labeled input dataset D = {(x_i, y_i)}_{i=1}^N, where (x, y) ∼ P for x ∈ X and y ∈ Y = {1, . . . , K}. For a particular data point x, we denote its label as h(x) ∈ Y with distribution p(y|x). We assume that the data is class-balanced, such that p(y = i) = 1/K for all i ∈ Y. The goal is to learn a model p̂(y|x) on D to classify points.
Data points also belong to categories beyond their labels, called strata. Following [5], we denote a stratum as a latent variable z, which can take on values in Z = {1, . . . , C}. Z can be partitioned into disjoint subsets S 1 , . . . , S K such that if z ∈ S k , then its corresponding label y is equal to k. Let S(c) denote the deterministic label corresponding to stratum c. We model the data-generating process as follows. First, the latent stratum is sampled from the distribution p(z). Then, the data point x is sampled from the distribution P z = p(·|z), and its corresponding label is y = S(z) (see Figure 2 of [5]). We assume that each class has m strata, and that there exist at least two strata z 1 , z 2 with S(z 1 ) = S(z 2 ) and supp(P z1 ) ∩ supp(P z2 ) = ∅.
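This data-generating process can be made concrete with a toy sketch. The class/strata counts, strata frequencies, and Gaussian form of P z below are illustrative assumptions, not values from the paper:

```python
import random

# Toy instance of the generative model from Section 2.1 (illustrative only:
# the counts and probabilities below are made up, not from the paper).
K = 2                                  # number of classes
m = 3                                  # strata per class
C = K * m                              # total number of strata
S = {z: z // m for z in range(C)}      # S(z): deterministic class label of stratum z
# p(z): strata within a class can have very different frequencies ("rare" strata).
# Note the two classes are still balanced in aggregate: each sums to 1/2.
p_z = [0.25, 0.15, 0.10, 0.30, 0.15, 0.05]

def sample_point(rng):
    """Sample (x, y, z): draw the latent stratum z ~ p(z), then x ~ P_z
    (here a 1-D Gaussian centered at the stratum id), and set y = S(z)."""
    z = rng.choices(range(C), weights=p_z)[0]
    x = rng.gauss(mu=float(z), sigma=0.1)
    return x, S[z], z

rng = random.Random(0)
data = [sample_point(rng) for _ in range(1000)]
```

Each point's label is a deterministic function of its stratum, but the stratum itself is never observed by the learner.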

Supervised Contrastive Loss
Supervised contrastive loss pushes together pairs of points from the same class (called positives) and pulls apart pairs of points from different classes (called negatives) to train an encoder f : X → R d . Following previous works, we make three assumptions on the encoder: (1) we restrict the encoder output space to be S d−1 , the unit hypersphere; (2) we assume K ≤ d + 1, which allows Graf et al. [2] to recover optimal embedding geometry; and (3) we assume the encoder f is "infinitely powerful", meaning that any distribution on S d−1 is realizable by f (x).

SupCon and Collapsed Embeddings
We focus on the SupCon loss L SC from [1]. Denote σ(x, x′) = f(x)^⊤ f(x′)/τ, where τ is a temperature hyperparameter. Let B be the set of batches of labeled data on D, and let P(i, B) = {p ∈ B \ i : h(p) = h(i)} be the points in B with the same label as x i . The loss is

L SC = Σ_{B ∈ B} Σ_{i ∈ B} −1/|P(i, B)| Σ_{p ∈ P(i,B)} log ( exp(σ(x_i, x_p)) / Σ_{a ∈ B\i} exp(σ(x_i, x_a)) ),

where P(i, B) forms positive pairs and B \ i forms negative pairs.
The optimal embedding distribution that minimizes L SC has one embedding per class, with the per-class embeddings collectively forming a regular simplex inscribed in the hypersphere (Graf et al. [2]). We describe this property as class collapse and define the distribution of f(x) that satisfies these conditions as collapsed embeddings.
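As a concrete reference, the SupCon loss can be sketched in NumPy. This is a minimal, unbatched transcription for illustration only, not the authors' implementation; the temperature and toy embeddings are arbitrary:

```python
import numpy as np

def supcon_loss(F, y, tau=0.1):
    """SupCon loss L_SC for a single batch (NumPy sketch).
    F: (n, d) array of L2-normalized embeddings f(x_i); y: (n,) integer labels."""
    n = F.shape[0]
    sim = F @ F.T / tau                         # sigma(x_i, x_j) = f(x_i)^T f(x_j) / tau
    np.fill_diagonal(sim, -np.inf)              # exclude i itself from B \ i
    log_denom = np.log(np.exp(sim).sum(axis=1)) # log-sum over all a in B \ i
    loss, count = 0.0, 0
    for i in range(n):
        pos = [p for p in range(n) if p != i and y[p] == y[i]]   # P(i, B)
        if not pos:
            continue
        loss += -np.mean([sim[i, p] - log_denom[i] for p in pos])
        count += 1
    return loss / count

# Collapsed embeddings: every point of a class sits at one simplex vertex.
F_collapsed = np.array([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0]])
# Spread embeddings: points of a class are separated on the hypersphere.
F_spread = np.array([[1.0, 0.0], [0.8, 0.6], [-1.0, 0.0], [-0.8, 0.6]])
y = np.array([0, 0, 1, 1])
```

On this toy batch, the collapsed configuration attains a strictly lower loss than the spread one, mirroring the class-collapse result of Graf et al. [2].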

End Model
After the supervised contrastive loss is used to train an encoder, a linear classifier W ∈ R^{K×d} is trained on top of the representations f(x) by minimizing the empirical cross-entropy loss over softmax scores. We assume that ‖W y ‖ 2 ≤ 1 for each y ∈ Y. The model uses softmax scores constructed from f(x) and W to generate predictions p̂(y|x), which we also write as p̂(y|f(x)). Finally, the generalization error of the model on P is the expected cross-entropy between p̂(y|x) and p(y|x), which we denote by L(x, y, f).
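A minimal sketch of this end-model pipeline (softmax scores over a linear classifier, scored by cross-entropy against the true conditional distribution) might look as follows; averaging over a finite sample in place of the true expectation is a simplification:

```python
import numpy as np

def softmax(u):
    """Numerically stable row-wise softmax."""
    e = np.exp(u - u.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def end_model_predict(W, F):
    """p_hat(y | f(x)): softmax scores of the linear end model over embeddings F."""
    return softmax(F @ W.T)

def generalization_error(W, F, p_true):
    """Expected cross-entropy between p_hat(y|x) and p(y|x),
    estimated over a sample of embeddings F with true conditionals p_true."""
    p_hat = end_model_predict(W, F)
    return float(-(p_true * np.log(p_hat)).sum(axis=1).mean())
```

For instance, with K = 2 classes, rows of W within the unit ball, and a deterministic label (p(y|x) a one-hot vector), this reduces to the usual negative log-likelihood of the correct class.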

Method
We now highlight some theoretical problems with class collapse under our generative model of strata (Section 3.1). We then propose and qualitatively analyze the loss function L spread (Section 3.2).

Theoretical Motivation
We show that the conditions under which collapsed embeddings minimize generalization error on coarse-to-fine transfer and the original task do not hold when distinct strata exist.
Consider the downstream coarse-to-fine transfer task (x, z) of using embeddings f(x) learned on (x, y) to classify points by fine-grained strata. Formally, coarse-to-fine transfer involves learning an end model with weight matrix W ∈ R^{C×d} and fixed f(x) (as described in Section 2.2) on points (x, z), where we assume the data are class-balanced across z.

Observation 1. Class collapse minimizes L(x, z, f) if for all x, (1) p(y = h(x)|x) = 1, meaning that each x is deterministically assigned to one class, and (2) p(z|x) = 1/m where z ∈ S h(x) .

The second condition implies that p(x|z) = p(x|y) for all z ∈ S y , meaning that there is no distinction among strata from the same class. This contradicts our data model described in Section 2.1.
Similarly, we characterize when collapsed embeddings are optimal for the original task (x, y).

Observation 2. Class collapse minimizes L(x, y, f) if, for all x, p(y = h(x)|x) = 1. This contradicts our data model.
Proofs are in Appendix D.1. We also analyze transferability of f on arbitrary new distributions (x , y ) information-theoretically in Appendix C.1, finding that a one-to-one encoder obeys the Infomax principle [7] better than collapsed embeddings on (x , y ). These observations suggest that a distribution over the embeddings that preserves strata distinctions and does not collapse classes is more desirable.

Modified Contrastive Loss L spread
We introduce the loss L spread , a weighted sum of two contrastive losses L attract and L repel : for α ∈ [0, 1], L spread = α L attract + (1 − α) L repel . L attract is a variant of the SupCon loss, which encourages class separation in embedding space as suggested by Graf et al. [2]. L repel is a class-conditional InfoNCE loss, where the positive distribution consists of augmentations and the negative distribution consists of i.i.d. samples from the same class. It encourages points within a class to be spread apart, as suggested by the analysis of the InfoNCE loss by Wang and Isola [8].
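A sketch of how the two terms might combine is below. The class-conditional InfoNCE form of L repel follows the description above, but the convex-combination weighting by α is our reading, since the text only states that L spread is a weighted sum:

```python
import numpy as np

def info_nce_within_class(F, F_aug, y, tau=0.1):
    """Class-conditional InfoNCE (our reading of L_repel): the positive for x_i is
    its augmentation, the negatives are other points of the *same* class."""
    n = F.shape[0]
    terms = []
    for i in range(n):
        same = [j for j in range(n) if j != i and y[j] == y[i]]  # same-class negatives
        pos = F[i] @ F_aug[i] / tau
        denom = np.exp(pos) + sum(np.exp(F[i] @ F[j] / tau) for j in same)
        terms.append(-(pos - np.log(denom)))
    return float(np.mean(terms))

def l_spread(F, F_aug, y, alpha, attract_loss):
    """L_spread as a weighted sum of the two losses; `attract_loss` is any
    SupCon-style callable. The exact alpha/(1 - alpha) scheme is our assumption."""
    return alpha * attract_loss(F, y) + (1.0 - alpha) * info_nce_within_class(F, F_aug, y)
```

Setting α = 1 recovers a pure attraction loss (and, asymptotically, class collapse), while smaller α trades class separation against intra-class spread.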
Qualitative Evaluation. Figure 2 shows t-SNE plots for embeddings produced with L SC versus L spread on the CIFAR10 test set. L spread produces embeddings that are more spread out than those produced by L SC and avoids class collapse. As a result, images from different strata can be better differentiated in embedding space. For example, we show two dogs, one from a common stratum and one from a rare stratum (rare pose). The two dogs are much more distinguishable by distance in the L spread embedding space, which suggests that it helps preserve distinctions between strata.

Geometry of Strata
We first discuss some existing theoretical tools for analyzing contrastive loss geometrically and their shortcomings with respect to understanding how strata are embedded. In Section 4.2, we propose a simple thought experiment about the distances between strata in embedding space when trained under a finite subsample of data to better understand our prior qualitative observations. Then, in Section 4.3, we discuss implications of representations that preserve strata distinctions, showing theoretically how they can yield better generalization error on both coarse-to-fine transfer and the original task and empirically how they allow for new downstream applications.

Existing Analysis
Previous works have studied the geometry of optimal embeddings under contrastive learning [2,8,9], but their techniques cannot analyze strata because strata information is not directly used in the loss function. These works use the infinite encoder assumption, where any distribution on S d−1 is realizable by the encoder f applied to the input data. This allows the minimization of the contrastive loss to be equivalent to an optimization problem over probability measures on the hypersphere. As a result, solving this new problem yields a distribution whose characterization is solely determined by information in the loss function (e.g., labels information [2,9]) and is decoupled from other information about the input data x and hence decoupled from strata.
More precisely, if we denote the measure of x ∈ X as µ X , minimizing the contrastive loss over the mapping f is equal (at the population level) to minimizing over the pushforward measure f#µ X on S d−1 . The infinite encoder assumption allows us to relax the problem and instead consider optimizing over any µ ∈ M(S d−1 ), the Borel set of probability measures on the hypersphere. Then, the optimal µ learned is independent of the distribution of the input data P beyond what is in the relaxed objective function.
This approach using the infinite encoder assumption does not allow for analysis of strata. Strata are unknown at training time and thus cannot be incorporated explicitly into the loss function. Their geometries will not be reflected in the characterization of the optimal distribution obtained from previous theoretical tools. Therefore, we need additional reasoning for our empirical observations that strata distinctions are preserved in embedding space under L spread .

Subsampling Strata
We propose a simple thought experiment based on subsampling the dataset (randomly sampling a fraction of the training data) to analyze strata. Consider the following: we subsample a fraction t ∈ [0, 1] of a training set of N points from P. We use this subsampled dataset D t to learn an encoder f̂ t , and we study the average distance under f̂ t between two strata z and z′ as t varies.
The average distance between z and z′ is δ(f̂ t , z, z′) = E_{x∼P z , x′∼P z′ }[‖f̂ t (x) − f̂ t (x′)‖ 2 ] and depends on whether z and z′ are both in the subsampled dataset. We study the case when z and z′ belong to the same class. We have three cases (with probabilities stated in Appendix C.2) based on strata frequency and t: both, one, or neither of the strata appears in D t .

1. Both strata appear in D t . The encoder f̂ t is trained on both z and z′. For large N, we can approximate this setting by considering f̂ t trained on infinite data from these strata. Points belonging to these strata will be defined in the optimal embedding distribution on the hypersphere, which can be characterized by prior theoretical approaches [2,8,9]. With L spread , δ(f̂ t , z, z′) depends on α, which controls the extent of spread in the embedding geometry. With L SC , points from the two strata would asymptotically map to one location on the hypersphere, and δ(f̂ t , z, z′) would converge to 0. This case occurs with probability increasing in p(z), p(z′), and t.

2. One stratum but not the other appears in D t . Without loss of generality, suppose that points from z appear in D t but no points from z′ do. To understand δ(f̂ t , z, z′), we can consider how the end model p̂(y|f̂ t (x)) learned using the "source" distribution containing z performs on the "target" distribution of stratum z′, since this downstream classifier is a function of distances in embedding space. Borrowing from the domain adaptation literature, the difficulty of this out-of-distribution problem depends on both the divergence between the source z and target z′ distributions and the capacity of the overall model. The H∆H-divergence from Ben-David et al. [10,11], which is studied in lower bounds in Ben-David and Urner [12], and the discrepancy distance from Mansour et al. [13] capture both concepts. Moreover, the optimal geometries of L spread and L SC induce different end model capacities and prediction distributions, with data being more separable under L SC , which can help explain why L spread better preserves strata distances. This case occurs with probability increasing in p(z) and decreasing in p(z′) and t.

3. Neither stratum appears in D t . The distance δ(f̂ t , z, z′) in this case is at most 2D TV (P z , P z′ ) (total variation distance) regardless of how the encoder is trained, although differences in transfer from models learned on Z \ {z, z′} to z versus z′ can be further analyzed. This case occurs with probability decreasing in p(z), p(z′), and t.

We make two observations from these cases. First, if z and z′ are both common strata, then as t increases, the distance between them depends on the optimal asymptotic distribution. Therefore, if we set α = 1 in L spread , these common strata will collapse. Second, if z is a common stratum and z′ is uncommon, the second case occurs frequently over randomly sampled D t , and thus the strata are separated based on the difficulty of the respective out-of-distribution problem. We thus arrive at the following insight from our thought experiment: common strata are clustered tightly together, while rarer and more semantically distinct strata are far away from them.

Figure 3 demonstrates this insight. It shows a t-SNE visualization of embeddings from training on CIFAR100 with coarse superclass labels and with artificially imbalanced subclasses. Points from the largest subclasses (dark blue) cluster tightly, whereas points from the smallest subclasses (light blue) are scattered throughout the embedding space.
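The distance δ studied in this section can be computed directly from embeddings. The sketch below uses synthetic toy clusters (a tight "common" stratum and a scattered "rare" one; all positions and sizes are made up) to illustrate the clustering behavior described above:

```python
import numpy as np

def avg_strata_distance(F, z, z1, z2):
    """delta(f, z1, z2): average Euclidean distance between embeddings of two
    strata, a direct transcription of the quantity studied in Section 4.2."""
    A, B = F[z == z1], F[z == z2]
    # all cross pairs of ||f(x) - f(x')||_2, then the mean
    diffs = A[:, None, :] - B[None, :, :]
    return float(np.linalg.norm(diffs, axis=-1).mean())

# Synthetic embeddings: a tight common stratum and a scattered rare stratum.
rng = np.random.default_rng(0)
common = rng.normal([1.0, 0.0], 0.05, size=(50, 2))   # tightly clustered
rare = rng.normal([-1.0, 0.0], 0.5, size=(5, 2))      # spread out
F = np.vstack([common, rare])
z = np.array([0] * 50 + [1] * 5)
```

On this toy data the rare stratum has a much larger intra-stratum average distance than the common one, and the two strata sit far apart, matching the qualitative picture in Figure 3.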

Implications
We discuss theoretical and practical implications of our subsampling argument. First, we show that on both the coarse-to-fine transfer task (x, z) and the original task (x, y), embeddings that preserve strata yield better generalization error. Second, we discuss practical implications arising from our subsampling argument that enable new applications.

Theoretical Implications
Consider f̂ 1 , the encoder trained on D with all N points using L spread , and suppose a mean classifier is used for the end model, i.e., W y = E x|y [f̂ 1 (x)] and W z = E x|z [f̂ 1 (x)]. On coarse-to-fine transfer, the generalization error depends on how far each stratum center is from the others.

Lemma 1. There exists λ z > 0 such that the generalization error on the coarse-to-fine transfer task is bounded from above by a quantity that decreases as the average distances δ(f̂ 1 , z, z′) between strata grow, where δ(f̂ 1 , z, z′) is defined in Section 4.2 (the exact bound is stated in Appendix C.3).
The larger the distances between strata, the smaller the upper bound on generalization error. We now show that a similar result holds on the original task (x, y), but there is an additional term that penalizes points from the same class being too far apart.

Lemma 2. There exists λ y > 0 such that the generalization error on the original task is bounded from above by a quantity that decreases as distances between strata of different classes grow, with an additional term that penalizes strata of the same class being too far apart (the exact bound is stated in Appendix C.3).

This result suggests that maximizing distances between strata of different classes is desirable, but less so for distances between strata of the same class. Both results illustrate that separating strata to some extent in the embedding space results in better bounds on generalization error. In Appendix C.3, we provide proofs of these results and derive values of the generalization error for these two tasks under class collapse for comparison.

Practical Implications
Our discussion in Section 4.2 suggests that training with L spread better distinguishes strata in embedding space. As a result, we can use differences between strata of different sizes for downstream applications. For example, unsupervised clustering can help recover pseudolabels for unlabeled, rare strata. These pseudolabels can be used as inputs to worst-group robustness algorithms, or used to detect noisy labels, which appear to be rare strata during training (see Section 5.3 for examples). We can also train over subsampled datasets to heuristically distinguish points that come from common strata from points that come from rare strata. We can then downsample points from common strata to construct minimal coresets (see Section 5.4 for examples).

Experiments
This section evaluates L spread on embedding quality and model quality:
• First, in Section 5.2, we use coarse-to-fine transfer learning to evaluate how well the embeddings maintain strata information. We find that L spread achieves lift across four datasets.
• In Section 5.3, we evaluate how well L spread can detect rare strata in an unsupervised setting. We first use L spread to detect rare strata to improve worst-group robustness by up to 2.5 points. We then use rare strata detection to correct noisy labels, recovering 75% performance under 20% noise.
• In Section 5.4, we evaluate how well L spread can distinguish points from large strata versus points from small strata. We downsample points from large strata to construct minimal coresets on CIFAR10, outperforming prior work by 1.0 points at 30% labeled data.
• Finally, in Section 5.5, we show that training with L spread improves model quality, validating our theoretical claims that preventing class collapse can improve generalization error. We find that L spread improves performance in 7 out of 9 cases.

Datasets and Models
Table 1 lists all the datasets we use in our evaluation. CIFAR10, CIFAR100, and MNIST are the standard computer vision datasets. We also use coarse versions of each, wherein classes are combined to create coarse superclasses (animals/vehicles for CIFAR10, standard superclasses for CIFAR100, and <5, ≥5 for MNIST). In CIFAR100-Coarse-U, some subclasses have been artificially imbalanced. Waterbirds, ISIC, and CelebA are image datasets with documented hidden strata [5,14,15,16]. We use a ViT model [17] (4 × 4, 7 layers) for CIFAR and MNIST and a ResNet50 for the rest. For the ViT models, we jointly optimize the contrastive loss with a cross-entropy loss head. For the ResNets, we train the contrastive loss on its own and use linear probing on the final layer. More details are in Appendix E.

Coarse-to-Fine Transfer Learning
In this section, we use coarse-to-fine transfer learning to evaluate how well L spread retains strata information in the embedding space. We train on coarse superclass labels, freeze the weights, and then use transfer learning to train a linear layer with subclass labels. We use this supervised strata recovery setting to isolate how well the embeddings can recover strata in the optimal setting. For baselines, we compare against training with L SC and the SimCLR loss L SS . Table 2 reports the results. We find that L spread produces better embeddings for coarse-to-fine transfer learning than L SC and L SS . Lift over L SC varies from 0.2 points on MNIST (16.7% error reduction) to 23.6 points on CIFAR10. L spread also produces better embeddings than L SS , since L SS does not encode superclass labels in the embedding space.

Table 2. Performance of coarse-to-fine transfer on various datasets compared against contrastive baselines. In these tasks, we first train a model on coarse task labels, then freeze the representation and train a model on fine-grained subclass labels. L spread produces embeddings that transfer better across all datasets. Best in bold.

Robustness Against Worst-Group Accuracy and Noise
In this section, we use robustness to measure how well L spread can recover strata in an unsupervised setting. We use clustering to detect rare strata as an input to worst-group robustness algorithms, and we use a geometric heuristic over embeddings to correct noisy labels.
To evaluate worst-group accuracy, we follow the experimental setup and datasets from Sohoni et al. [5]. We first train a model with class labels. We then cluster the embeddings to produce pseudolabels for hidden strata, which we use as input for a Group-DRO algorithm to optimize worst-group robustness [14]. We use both L SC and cross entropy loss [5] for training the first stage as baselines.
To evaluate robustness against noise, we introduce noisy labels to the contrastive loss head on CIFAR10. We detect noisy labels with a simple geometric heuristic: points with incorrect labels appear to be small strata, so they should be far away from other points of the same class. We then correct noisy points by assigning the label of the nearest cluster in the batch. More details can be found in Appendix E.

Table 3 shows the performance of unsupervised strata recovery and downstream worst-group robustness. We can see that L spread outperforms both L SC and Sohoni et al. [5] on strata recovery. This translates to better worst-group robustness on Waterbirds and CelebA. Figure 4 (left) shows the effect of noisy labels on performance. When noisy labels are uncorrected (purple), performance drops by up to 10 points at 50% noise. Applying our geometric heuristic (red) can recover 4.8 points at 50% noise, even without using L spread . However, L spread recovers an additional 0.9 points at 50% noise, and an additional 1.6 points at 20% noise (blue). In total, L spread recovers 75% performance at 20% noise, whereas L SC only recovers 45% performance.

Table 3. Unsupervised strata recovery performance (top, F1), and worst-group performance (AUROC for ISIC, Acc for others) using recovered strata. Best in bold.

Figure 4 (right). Our coreset algorithm is competitive with the state-of-the-art in the large coreset regime (40-90% coresets), but maintains performance for small coresets (smaller than 40%). At the 10% coreset, our algorithm outperforms [18] by 32 points and matches random sampling.
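A simplified version of this geometric heuristic can be sketched as follows. The centroid-based notion of "cluster" and the quantile threshold are our assumptions; the paper's heuristic assigns the label of the nearest cluster within a batch:

```python
import numpy as np

def correct_noisy_labels(F, y, K, quantile=0.9):
    """Geometric noise-correction heuristic (our simplified reading of Sec. 5.3):
    points far from their own class centroid look like tiny rare strata and are
    likely mislabeled, so reassign them to the nearest class centroid.
    F: (n, d) embeddings; y: (n,) integer labels; K: number of classes."""
    centroids = np.stack([F[y == k].mean(axis=0) for k in range(K)])
    d_own = np.linalg.norm(F - centroids[y], axis=1)     # distance to own centroid
    cutoff = np.quantile(d_own, quantile)                # flag the most distant points
    suspect = d_own > cutoff
    d_all = np.linalg.norm(F[:, None, :] - centroids[None, :, :], axis=-1)
    y_fixed = y.copy()
    y_fixed[suspect] = d_all[suspect].argmin(axis=1)     # relabel to nearest centroid
    return y_fixed
```

On well-separated embeddings, the flipped labels are exactly the far-from-centroid outliers, so relabeling to the nearest centroid undoes most of the noise.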

Minimal Coreset Construction
Now we evaluate how well training on fractional samples of the dataset with L spread can distinguish points from large versus small strata by constructing minimal coresets for CIFAR10. We train a ResNet18 on CIFAR10, following Toneva et al. [18], and compare against baselines from Toneva et al. [18] (Forgetting) and Paul et al. [19] (GradNorm, L2Norm). For our coresets, we train with L spread on subsamples of the dataset and record how often points are correctly classified at the end of each run. We bucket points in the training set by how often the point is correctly classified. We then iteratively remove points from the largest bucket in each class. Our strategy removes easy examples first from the largest coresets, but maintains a set of easy examples in the smallest coresets. Figure 4 (right) shows the results at various coreset sizes. For large coresets, our algorithm outperforms both methods from Paul et al. [19] and is competitive with Toneva et al. [18]. For small coresets, our method outperforms the baselines, providing up to 5.2 points of lift over Toneva et al. [18] at 30% labeled data. Our analysis helps explain this gap; removing too many easy examples hurts performance, since then the easy examples become rare and hard to classify.
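The bucketing-and-removal strategy above can be sketched as below. The round-robin removal order and the use of per-class count sorting in place of explicit buckets are our simplifications, not the paper's exact procedure:

```python
import numpy as np

def build_coreset(correct_counts, labels, keep_fraction):
    """Coreset heuristic (sketch of Sec. 5.4): order points within each class by
    how often they were classified correctly across subsampled training runs,
    then drop the easiest (highest-count) points first, round-robin across
    classes so the coreset stays class-balanced. Returns kept indices."""
    n = len(labels)
    keep = np.ones(n, dtype=bool)
    target = int(keep_fraction * n)
    order_by_class, ptrs = {}, {}
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        order_by_class[c] = idx[np.argsort(-correct_counts[idx])]  # easiest first
        ptrs[c] = 0
    while keep.sum() > target:
        for c in order_by_class:
            if keep.sum() <= target:
                break
            i = order_by_class[c][ptrs[c]]
            keep[i] = False
            ptrs[c] += 1
    return np.where(keep)[0]
```

Because removal proceeds from the easiest points down, small coresets still retain some easy examples per class, which matches the observation that removing too many easy examples hurts performance.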

Model Quality
Finally, we confirm that L spread produces higher-quality models and achieves better sample complexity than both L SC and the SimCLR loss L SS from [20]. Table 4 reports the performance of models across all our datasets. We find that L spread achieves better overall performance compared to models trained with L SC and L SS in 7 out of 9 tasks, and matches performance in 1 task. We find up to 4.0 points of lift over L SC (Waterbirds), and up to 2.2 points of lift (AUROC) over L SS (ISIC). In Appendix F, we additionally evaluate the sample complexity of contrastive losses by training on partial subsamples of CIFAR10. L spread outperforms L SC and L SS throughout.

Table 4. End model performance training with L spread on various datasets compared against contrastive baselines. All metrics are accuracy except for ISIC (AUROC). L spread produces the best performance in 7 out of 9 cases, and matches the best performance in 1 case. Best in bold.

Related Work and Discussion
From work in contrastive learning, we take inspiration from [21], who use a latent-classes view to study self-supervised contrastive learning. Similarly, [22] considers how minimizing the InfoNCE loss recovers a latent data-generating model. We initially started from a debiasing angle to study the effects of noise in supervised contrastive learning inspired by [23], but moved to our current strata-based view of noise instead. Recent work has also analyzed contrastive learning from the information-theoretic perspective [24][25][26], but does not fully explain practical behavior [27], so we focus on the geometric perspective in this paper because of the downstream applications. On the geometric side, we are inspired by the theoretical tools from [8] and [2], who study representations on the hypersphere, along with [9].
Our work builds on the recent wave of empirical interest in contrastive learning [20,[28][29][30][31] and supervised contrastive learning [1]. There has also been empirical work analyzing the transfer performance of contrastive representations and the role of intra-class variability in transfer learning. [32] find that combining supervised and self-supervised contrastive loss improves transfer learning performance, and they hypothesize that this is due to both inter-class separation and intra-class variability. [33] find that combining cross entropy and self-supervised contrastive loss improves coarse-to-fine transfer, also motivated by preserving intra-class variability.
We derive L spread from similar motivations to losses proposed in these works, and we further theoretically study why class collapse can hurt downstream performance. In particular, we study why preserving distinctions of strata in embedding space may be important, with theoretical results corroborating their empirical studies. We further propose a new thought experiment for why a combined loss function may lead to better separation of strata.
Our treatment of strata is strongly inspired by [5,6], who document empirical consequences of hidden strata. We are inspired by empirical work that has demonstrated that detecting subclasses can be important for performance [4,34] and robustness [14,35,36].
Each of our downstream applications is a field in itself, and we take inspiration from recent work in each. Our noise heuristic is similar to ELR [37] and takes inspiration from various works using contrastive learning to correct noisy labels and for semi-supervised learning [38][39][40]. Our coreset algorithm is inspired by recent work in coresets for modern deep networks [19,41,42], and takes inspiration from [18] in particular.

Conclusions
We propose a new supervised contrastive loss function to prevent class collapse and produce higher-quality embeddings. We discuss how our loss function better maintains strata distinctions in embedding space and explore several downstream applications. Future directions include encoding label hierarchies and other forms of knowledge in contrastive loss functions and extending our work to more modalities, models, and applications. We hope that our work inspires further research into fine-grained supervised contrastive loss functions and new theoretical approaches for reasoning about generalization and strata. We provide a glossary in Appendix A, definitions of terms in Appendix B, additional theoretical results in Appendix C, proofs in Appendix D, additional experimental details in Appendix E, and additional experimental results in Appendix F.

Appendix A. Glossary
The glossary is given in Table A1 below.

Table A1. Glossary of variables and symbols used in this paper.

L_SC: SupCon (see Section 2.2), a supervised contrastive loss introduced by [1].
L_spread: Our modified loss function defined in Section 3.2.
h(x): The class that x belongs to, i.e., h(x) is a label drawn from p(y|x). This label information is used as input in the supervised contrastive loss.
p̂(y|x): The end model's predicted distribution over y given x.
z: A stratum, a latent variable z ∈ Z = {1, . . . , C} that further categorizes data beyond labels.
S_k: The set of all strata corresponding to label k (deterministic).
S(c): The label corresponding to stratum c (deterministic).
P_z: The distribution of input data belonging to stratum z, i.e., x ∼ p(·|z).
m: The number of strata per class.
d: Dimension of the embedding space.
f: The encoder f : X → R^d that maps input data to an embedding space and is learned by minimizing the contrastive loss function.
B: Set of batches of labeled data on D.
P(i, B): Points in B with the same label as point i.
{v_i}: A regular simplex inscribed in the hypersphere (see Definition A1).
W: The weight matrix that parametrizes the downstream linear classifier (end model) learned on f(x).
L̂(W, D): The empirical cross entropy loss used to learn W over dataset D (see (A1)).
L(x, y, f): The generalization error of the end model predicting output y on x using encoder f (see (A2) and (A3)).
L_attract: A variant of SupCon used in L_spread that pushes points of a class together (see (2)).
L_repel: A class-conditional InfoNCE loss used in L_spread to pull apart points within a class (see (3)).
α: Hyperparameter α ∈ [0, 1] that controls how to balance L_attract and L_repel.
x_aug: An augmentation of data point x.
t: Fraction of training data t ∈ [0, 1] that is varied in our thought experiment.
D_t: Randomly sampled dataset from P with size t · N, i.e., a fraction t of D.
f_t: Encoder trained on sampled dataset D_t.
δ(f_t, z, z′): The distance between the centers of strata z and z′ under encoder f_t.

Appendix B. Definitions
We restate definitions used in our proofs.
Definition A1 (Regular Simplex). The points {v_i}_{i=1}^K form a regular simplex inscribed in the hypersphere if ‖v_i‖_2 = 1 for all i and v_i^T v_j = −1/(K − 1) for all i ≠ j.

Definition A2 (Downstream model). Once an encoder f(x) is learned, the downstream model consists of a linear classifier W trained using the cross-entropy loss L̂(W, D). Then, the end model's outputs are the probabilities p̂(i|x) = exp(f(x)^T W_i) / ∑_j exp(f(x)^T W_j), and the generalization error is L(x, y, f) = E_{x,y}[−log p̂(y|x)].
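As a quick numerical companion to Definition A1, the following sketch (our own, not code from the paper) builds K unit vectors whose pairwise dot products all equal −1/(K − 1) by centering the standard basis:

```python
import numpy as np

def regular_simplex(K):
    """K unit vectors forming a regular simplex inscribed in the
    hypersphere: pairwise dot products all equal -1/(K-1)."""
    basis = np.eye(K)
    centered = basis - basis.mean(axis=0)  # rows lie in a (K-1)-dim subspace
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)

V = regular_simplex(4)
gram = V @ V.T
print(np.allclose(np.diag(gram), 1.0))                        # True: unit norm
print(np.allclose(gram[~np.eye(4, dtype=bool)], -1.0 / 3.0))  # True: v_i.v_j = -1/(K-1)
```

The same construction is reused implicitly whenever the proofs below evaluate dot products between simplex vertices.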

Appendix C. Additional Theoretical Results
Appendix C.1. Transfer Learning on (x′, y′)

We now show an additional transfer learning result on new tasks (x′, y′). Formally, recall that we learn the encoder f on (x, y) ∼ P. We wish to use it on a new task with target distribution (x′, y′) ∼ P′. We find that an injective encoder f(x) is more appropriate for use on new distributions than collapsed embeddings, based on the Infomax principle [7].
Observation A1. Define f_c(y) as the mapping to collapsed embeddings and f_{1−1}(x) as an injective mapping, both learned on P. Construct a new variable ỹ with joint distribution (x′, ỹ) ∼ p(y|x) · p′(x′) and suppose that ỹ ⊥ y′ | x′. Then, by the data processing inequality, it holds that I(ỹ, y′) ≤ I(x′, y′), where I(·, ·) is the mutual information between two random variables. We apply f_c to ỹ and f_{1−1} to x′ to get that I(f_c(ỹ), y′) ≤ I(f_{1−1}(x′), y′). Therefore, f_{1−1} obeys the Infomax principle [7] better on P′ than f_c. Via Fano's inequality, this statement implies that the Bayes risk for learning y′ from x′ is lower using f_{1−1} than f_c.

Appendix C.2. Probabilities of Strata z, z′ Appearing in Subsampled Dataset
As discussed in Section 4.2, the distance between strata z and z′ in embedding space depends on whether these strata appear in the subsampled dataset D_t that the encoder was trained on. We define the exact probabilities of the three cases presented. Let Pr(z, z′ ∈ D_t) be the probability that both strata are seen, Pr(z ∈ D_t, z′ ∉ D_t) be the probability that only z is seen, and Pr(z, z′ ∉ D_t) be the probability that neither is seen. First, the probability of neither stratum appearing in D_t is easy to compute. In particular, we have that Pr(z, z′ ∉ D_t) = (1 − p(z) − p(z′))^{tN}. This quantity decreases in p(z) and p(z′), confirming that it is less likely for two common strata to not appear in D_t.
Next, the probability of only z appearing is Pr(z ∈ D_t, z′ ∉ D_t) = (1 − p(z′))^{tN} − (1 − p(z) − p(z′))^{tN}. This quantity depends on the difference between p(z) and p(z′), so this case is common when one stratum is common and one is rare. Lastly, the probability of both z and z′ being in D_t is Pr(z, z′ ∈ D_t) = 1 − (1 − p(z))^{tN} − (1 − p(z′))^{tN} + (1 − p(z) − p(z′))^{tN}. This quantity increases in p(z) and p(z′).
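Under an i.i.d. sampling model for D_t (our reading of the setup), these three probabilities follow from inclusion-exclusion and can be sanity-checked in a few lines:

```python
def strata_probs(p_z, p_zp, t, N):
    """Case probabilities for strata z, z' appearing in a subsample D_t
    of size t*N drawn i.i.d. from P (inclusion-exclusion; a sketch of
    Appendix C.2, with the sampling model as our assumption)."""
    n = round(t * N)
    neither = (1 - p_z - p_zp) ** n
    only_z = (1 - p_zp) ** n - neither   # z' never drawn, minus "neither"
    only_zp = (1 - p_z) ** n - neither
    both = 1 - only_z - only_zp - neither
    return both, only_z, only_zp, neither

both, only_z, only_zp, neither = strata_probs(0.05, 0.01, 0.1, 1000)
print(abs(both + only_z + only_zp + neither - 1.0) < 1e-12)  # True: cases partition
print(only_z > only_zp)  # True: the commoner stratum is more likely the only one seen
```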

Appendix C.3. Performance of Collapsed Embeddings on Coarse-to-Fine Transfer and Original Task
Lemma A1. Denote f_c to be the encoder that collapses embeddings such that f_c(x) = v_y for any (x, y) ∼ P. Then, the generalization error on the coarse-to-fine transfer task using f_c and a linear classifier learned using cross entropy loss is at least log(m exp(1) + (C − m) exp(c_K)) − 1, where c_K is the dot product of any two different class-collapsed embeddings. The generalization error on the original task under the same setup is at least log(exp(1) + (K − 1) exp(c_K)) − 1. Proof. We first bound generalization error on the coarse-to-fine transfer task. For collapsed embeddings, f(x) = v_i when h(x) = i, where h(x) is information available at training time that follows the distribution p(y|x). We thus denote the embedding f(x) as v_{h(x)}. Therefore, we write the generalization error with an expectation over h(x) and factorize the expectation according to our generative model.
Furthermore, since the W learned over collapsed embeddings satisfies W_z = v_y for S(z) = y, we have that ∑_{i=1}^C exp(v_y^T W_i) = m exp(1) + (C − m) exp(c_K) for any y, and our expected generalization error is log(m exp(1) + (C − m) exp(c_K)) − E[v_{h(x)}^T W_z]. Since v_{h(x)}^T W_z lies in [c_K, 1], this tells us that the generalization error is at most log(m exp(1) + (C − m) exp(c_K)) − c_K and at least log(m exp(1) + (C − m) exp(c_K)) − 1.
For the original task, we can apply this same approach to the case where m = 1, C = K to get that the average generalization error is log(exp(1) + (K − 1) exp(c_K)) − E[v_{h(x)}^T W_y]. This is at least log(exp(1) + (K − 1) exp(c_K)) − 1 and at most log(exp(1) + (K − 1) exp(c_K)) − c_K.
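The partition-function identity at the heart of this proof can be checked numerically. The sketch below (our own construction; K, C, and m are arbitrary illustrative values) collapses each class to a simplex vertex and compares the sum of exponentiated logits against the closed form:

```python
import numpy as np

# With class-collapsed embeddings v_y on a regular simplex and W_z = v_{S(z)},
# sum_i exp(v_y . W_i) = m*exp(1) + (C - m)*exp(c_K), so the coarse-to-fine
# generalization error is pinned between log(...) - 1 and log(...) - c_K.
K, m = 5, 3                 # illustrative: K classes, m strata per class
C = K * m
basis = np.eye(K)
centered = basis - basis.mean(axis=0)
v = centered / np.linalg.norm(centered, axis=1, keepdims=True)  # regular simplex
c_K = v[0] @ v[1]                                               # = -1/(K-1)
W = np.repeat(v, m, axis=0)                                     # W_z = v_{S(z)}
partition = np.exp(W @ v[0]).sum()                              # any class y works
closed_form = m * np.exp(1) + (C - m) * np.exp(c_K)
print(np.isclose(partition, closed_form))  # True
lower, upper = np.log(closed_form) - 1, np.log(closed_form) - c_K
print(lower < upper)  # True
```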

Appendix D. Proofs
Appendix D.1. Proofs for Theoretical Motivation
We provide proofs for Section 3.1. First, we characterize the optimal linear classifier (for both the coarse-to-fine transfer task and the original task) learned on the collapsed embeddings. Note that this result appears similar to Corollary 1 of [2], but their result minimizes the cross entropy loss over both the encoder and downstream weights (i.e., in a classical supervised setting where only cross entropy is used in training).
Lemma A2 (Downstream linear classifier for coarse-to-fine task). Suppose the dataset D_z is class-balanced across z, and the embeddings satisfy f(x) = v_{h(x)}, where the {v_y} form the regular simplex. Then the optimal weight matrix W ∈ R^{C×d} that minimizes L̂(W, D_z) satisfies W_z = v_y for y = S(z).
Proof. Formally, the convex optimization problem we are solving is to minimize L̂(W, D_z) subject to ‖W_z‖_2^2 = 1 for each z. Forming the Lagrangian of this optimization problem, the stationarity condition w.r.t. W_z is satisfied with a nonnegative dual variable (a ratio with denominator m exp(1) + (C − m) exp(δ) > 0), satisfying the dual constraint. We can further verify complementary slackness and primal feasibility, since ‖W_z‖_2^2 = 1, to confirm that an optimal weight matrix satisfies W_z = v_y for y = S(z).
Corollary A1. When we apply the above proof to the case when m = 1, we recover that the optimal weight matrix W ∈ R^{K×d} that minimizes L̂(W, D) for the original task on (x, y) ∼ P satisfies W_y = v_y for all y ∈ Y.
We now prove Observations 1 and 2. Then, we present an additional result on transfer learning from collapsed embeddings to general tasks of the form (x′, y′) ∼ P′.
Proof of Observation 1. We write out the generalization error for the downstream task, L(x, z, f) = E_{x,z}[−log p̂(z|x)], using our conditions that p(y = h(x)|x) = 1 and p(z|x) = 1/m.
To minimize this, f(x) should be the same across all x where h(x) takes the same value, since p(z|x) does not change for fixed h(x) and thus varying f(x) will not further decrease the value of this expression. Therefore, we rewrite f(x) as f_{h(x)}. Using the fact that y is class balanced, the loss is now an average over the K classes of the per-class cross entropy evaluated at f_y.
We claim that f_y = v_y and W_z = v_y for all S(z) = y minimizes this convex function. In the corresponding Lagrangian, the stationarity condition with respect to W_z is the same as (A6), and we have already demonstrated that the feasibility constraints and complementary slackness are satisfied on W. Substituting W_i = v_{S(i)} and f_y = v_y into the stationarity condition with respect to f_y verifies that it holds as well. Therefore, this solution implies that there is no distinction among the strata within a class.
Proof of Observation 2. This observation follows directly from Observation 1 by repeating the proof approach with z = y, m = 1.
Lastly, suppose it is not true that p(y = h(x)|x) = 1. Then, the generalization error on the original task is L(x, y, f) = −∫_X ∑_{y=1}^K p(x) p(y|x) log p̂(y|f(x)) dx, which is minimized when p̂(y|f(x)) = p(y|x). Intuitively, a model constructed with label information, p(y|h(x)), will not improve over one that uses x itself to approximate p(y|x).

Appendix D.2. Proofs for Theoretical Implications
We provide proofs for Section 4.3.
Proof of Lemma 1. We write out the generalization error, substitute in the definition of the mean classifier, and rewrite the dot product between mean embeddings per stratum in terms of the distance between them (for unit vectors, u^T v = 1 − ‖u − v‖^2/2). This directly gives us our desired bound.
Proof of Lemma 2. The generalization error takes the same form as in Lemma 1. We substitute in the definition of the mean classifier to get the desired bound.

Appendix E. Additional Experimental Details

Appendix E.1. Datasets

We use the following datasets with hidden strata:
• Waterbirds is a robustness benchmark for classifying images of water birds vs. land birds [14]. Most water birds appear with water backgrounds and land birds with land backgrounds, but 5% of the water birds are on land backgrounds, and 5% of the land birds are on water backgrounds. These form the (imbalanced) hidden strata.
• ISIC is a public skin cancer dataset for classifying skin lesions [15] as malignant or benign. 48% of the benign images contain a colored patch, which forms the hidden strata.
• CelebA is an image dataset commonly used as a robustness benchmark [14,16]. The task is blonde/not blonde classification. Only 6% of blonde faces are male, which creates a rare stratum in the blonde class.

Appendix E.3. Applications
We describe additional experimental details for the applications.

Appendix E.3.1. Robustness Against Worst-Group Performance
We follow the evaluation of [5]. First, we train a model on the standard class labels. We evaluate different loss functions for this step, including L spread , L SC , and the cross entropy loss L CE . Then we project embeddings of the training set using a UMAP projection [45], and cluster points to discover unlabeled subgroups. Finally, we use the unlabeled subgroups in a Group-DRO algorithm to optimize worst-group robustness [14].
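A dependency-light sketch of this pipeline's subgroup-discovery step is below. For illustration we cluster raw embeddings with a tiny Lloyd's-algorithm 2-means on synthetic data, rather than the UMAP-projection-then-cluster procedure used in practice; the data and function names are our own:

```python
import numpy as np

def two_means(X, iters=20, seed=0):
    """Minimal 2-means (Lloyd's algorithm) over the rows of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=2, replace=False)].copy()
    assign = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None], axis=2)
        assign = dists.argmin(axis=1)
        for c in range(2):
            if np.any(assign == c):
                centers[c] = X[assign == c].mean(axis=0)
    return assign

# Stand-ins for learned embeddings and class labels.
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 8))
labels = rng.integers(0, 2, size=200)

# Cluster within each class; (class, cluster) pairs become the unlabeled
# subgroups that the Group-DRO step then optimizes over.
groups = np.zeros(len(emb), dtype=int)
for y in np.unique(labels):
    idx = np.flatnonzero(labels == y)
    groups[idx] = 2 * y + two_means(emb[idx], seed=int(y))
```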

Appendix E.3.2. Robustness Against Noise
We use the same training setup as we use to evaluate model quality, and introduce symmetric noise into the labels for the contrastive loss head. We train the cross entropy head with a fraction of the full training set. In Section 5.3, we report results from training with 20% labels to cross entropy. We report additional levels in Appendix F.
We detect noisy labels with a simple geometric heuristic: for each point, we compute the cosine similarity between the embedding of the point and the center of all the other points in the batch that have the same class. We compare this similarity value to the average cosine similarity with points in the batch from every other class, and rank the points by the difference between these two values. Points with incorrect labels have a small difference between these two values (they appear to be small strata, so they are far away from points of the same class). Given the noise level as an input, we rank the points by this heuristic and mark the fraction of the batch with the smallest scores as noisy. We then correct their labels by adopting the label of the closest cluster center.
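A minimal per-batch sketch of this heuristic (the toy data and function names are our own; we show only the detection step, not the label correction):

```python
import numpy as np

def noise_scores(emb, labels):
    """For each point: cosine similarity to the center of the other
    same-class points in the batch, minus its average cosine similarity
    to points from other classes. Small scores flag likely label noise."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    n = len(unit)
    scores = np.empty(n)
    for i in range(n):
        same = (labels == labels[i]) & (np.arange(n) != i)
        other = labels != labels[i]
        center = unit[same].mean(axis=0)
        center /= np.linalg.norm(center)
        scores[i] = unit[i] @ center - (unit[other] @ unit[i]).mean()
    return scores

def flag_noisy(emb, labels, noise_rate):
    """Rank by the heuristic and mark the lowest-scoring fraction as noisy."""
    k = int(noise_rate * len(emb))
    return np.argsort(noise_scores(emb, labels))[:k]

# Toy batch: two tight clusters, one point mislabeled.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal([5.0, 0.0], 0.1, (20, 2)),
                 rng.normal([-5.0, 0.0], 0.1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
labels[3] = 1                           # inject a noisy label
print(flag_noisy(emb, labels, 1 / 40))  # [3]
```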
Appendix E.3.3. Coreset Construction

Our coreset algorithm proceeds in two parts. First, we give each point a difficulty rating based on how likely we are to classify it correctly under partial training. Then we subsample the easiest points to construct minimal coresets.
First, we mirror the setup from our thought experiment and train with L_spread on random samples of t% of the CIFAR10 training set, taking three random samples for each of t ∈ {10, 20, 50} (and we train the cross entropy head with 1% labeled data). For each run, we record which points are classified correctly by the cross entropy head at the end of training, and bucket points in the training set by how often each point was correctly classified. To construct a coreset of size t%, we iteratively remove points from the largest bucket in each class. Our strategy removes easy examples first from the largest coresets, but maintains a set of easy examples in the smallest coresets.
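The bucketing-and-subsampling step can be sketched as follows; the tie-breaking rule and bucket granularity are our own choices, since the text does not pin them down:

```python
import numpy as np

def build_coreset(correct_counts, labels, keep_frac):
    """Sketch of the subsampling step: per class, repeatedly drop a point
    from the largest 'difficulty bucket' (the most common correctness
    count) until keep_frac of the data remains."""
    keep = set(range(len(labels)))
    target = int(keep_frac * len(labels))
    while len(keep) > target:
        for y in np.unique(labels):
            cls = [i for i in keep if labels[i] == y]
            if not cls or len(keep) <= target:
                continue
            counts = np.array([correct_counts[i] for i in cls])
            vals, freq = np.unique(counts, return_counts=True)
            bucket = vals[freq.argmax()]  # largest bucket in this class
            keep.discard(next(i for i in cls if correct_counts[i] == bucket))
    return sorted(keep)

# correct_counts[i] = how often point i was classified correctly across
# the partial-training runs (synthetic values for illustration).
rng = np.random.default_rng(0)
labels = np.array([0] * 10 + [1] * 10)
correct_counts = rng.integers(0, 10, size=20)
core = build_coreset(correct_counts, labels, keep_frac=0.5)
print(len(core))  # 10
```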

Appendix F. Additional Experimental Results
In this section, we report three sets of additional experimental results: the performance of using L attract on its own to train models, sample complexity of L spread compared to L SC , and additional noisy label results (including a bonus de-noising algorithm).

Appendix F.1. Performance of L attract
In an early iteration of this project, we experienced success with using L_attract on its own to train models, before realizing the benefits of adding an additional term to prevent class collapse. As an ablation, we report the performance of using L_attract on its own in Table A2. L_attract can outperform L_SC, but L_spread outperforms both. We do not report the results here, but L_attract also performs significantly worse than L_SC on downstream applications, since it more directly encourages class collapse.

Appendix F.2. Sample Complexity

Figure A1 shows the performance of training ViT models with various amounts of labeled data for L_spread, L_SC, and L_SS. In these experiments, we train the cross entropy head with 1% labeled data to isolate the effect of training data on the contrastive losses themselves.
L spread outperforms L SC and L SS throughout. At 10% labeled data, L spread outperforms L SS by 13.9 points, and outperforms L SC by 0.5 points. By 100% labeled data (for the contrastive head), L spread outperforms L SS by 25.4 points, and outperforms L SC by 10.3 points.

Appendix F.3. Noisy Labels
In Section 5.3, we reported results from training the contrastive loss head with noisy labels and the cross entropy loss with clean labels from 20% of the training data.
In this section, we first discuss a de-noising algorithm inspired by [23] that we initially developed to correct for noisy labels, but that we did not observe strong empirical results from. We hope that reporting this result inspires future work into improving contrastive learning.
We then report additional results with larger amounts of training data for the cross entropy head.

Appendix F.3.1. Debiasing Noisy Contrastive Loss

First, we consider the triplet loss and show how to debias it in expectation under noise. Then we present an extension to supervised contrastive loss.

Noise-Aware Triplet Loss
Consider the triplet loss in (A7). Now suppose that we do not have access to true labels but instead have noisy labels ỹ := h̃(x) produced by a weak classifier h̃. We adopt a simple model of symmetric noise where p = Pr(noisy label is correct).
We use ỹ to construct P̃+ and P̃− as p(x+ | h̃(x) = h̃(x+)) and p(x− | h̃(x) ≠ h̃(x−)). For simplicity, we start by looking at how the triplet loss in (A7) is impacted when noise is not addressed in the binary setting. Define L^noisy_triplet as L_triplet used with P̃+ and P̃−. Lemma A3. When class-conditional noise is uncorrected, L^noisy_triplet is equivalent in expectation to a weighted combination of triplet terms over the possible true pairings of the noisy positive and negative. Proof. We split L^noisy_triplet depending on whether the noisy positive and negative pairs are truly positive and negative.
Define p = Pr(noisy label is correct). Note that the probability that the noisy positive is truly positive and the noisy negative truly negative is p^3 + (1 − p)^3 (i.e., all three points' labels are correct or all are flipped, such that their relative pairings are correct). In addition, the other three probabilities above are all equal to p(1 − p).
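A quick sanity check that these four cases partition the sample space for any noise rate:

```python
# The consistent-pairing case has probability p^3 + (1-p)^3, and each of
# the three remaining cases has probability p*(1-p); expanding
# (p + (1-p))^3 = p^3 + (1-p)^3 + 3*p*(1-p) shows they sum to 1.
for p in [0.0, 0.1, 0.5, 0.7, 0.95, 1.0]:
    consistent = p**3 + (1 - p) ** 3
    total = consistent + 3 * p * (1 - p)
    assert abs(total - 1.0) < 1e-12
print("case probabilities sum to 1")
```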
We now show that there exists a weighted loss function that in expectation equals L triplet .
We now show the general case for debiasing L attract , which uses more negative samples.
Proposition A1. Define m = n + 1 (the "batch size" in the denominator), and let w+ and w− be defined in the same way as before. Let w = {w_0, . . . , w_m} ∈ R^{m+1} be the solution to the system Pw = e_2, where e_2 is the standard basis vector in R^{m+1} whose 2nd index is 1 and all others are 0. The (i, j)th element of P is P_ij = p Q_{i,j} + (1 − p) Q_{m−i,j}. Then, E[L̃_attract] = L_attract, where L̃_attract denotes the reweighted noisy loss.
We do not present the proof of Proposition A1, but the steps are very similar to those for the triplet loss case. We also note that a different form of E[L̃_attract] must be computed for the multi-class case, which we do not present here (but which can be derived through computation).
Observation A2. Note that the values of Q_{i,j} have high variance in the noise rate as m increases. Additionally, the number of terms in the summation defining Q_{i,j} increases combinatorially with m. We found this de-noising algorithm very unstable as a result.

Appendix F.3.2. Additional Noisy Label Results
Now we report the performance of denoising algorithms with additional amounts of labeled data for the cross entropy loss head. We also report the performance of using L_attract to debias noisy labels. Figure A2 shows the results. Our geometric correction together with L_spread works the most consistently. Using the geometric correction with L_SC can be unreliable, since L_SC can memorize noisy labels early in training. The expectation-based debiasing algorithm for L_attract occasionally shows promise but is unreliable, and is very sensitive to having the correct noise rate as an input. Figure A2. Performance of models under various amounts of label noise for the contrastive loss head, and various amounts of clean training data for the cross entropy loss.