1. Introduction
Supervised contrastive learning has emerged as a promising method for training deep models, with strong empirical results over traditional supervised learning [1]. Recent theoretical work has shown that under certain assumptions, class collapse—when the representation of every point from a class collapses to the same embedding on the hypersphere, as in Figure 1—minimizes the supervised contrastive loss [2]. Furthermore, modern deep networks, which can memorize arbitrary labels [3], are powerful enough to produce class collapse.
Although class collapse minimizes the supervised contrastive loss and produces accurate models, it loses information that is not explicitly encoded in the class labels. For example, consider images with the label “cat.” As shown in Figure 1, some cats may be sleeping, some may be jumping, and some may be swatting at a bug. We call each of these semantically-unique categories of data—some of which are rarer than others, and none of which are explicitly labeled—a stratum. Distinguishing strata is important; it can empirically improve model performance [4] and fine-grained robustness [5]. It is also critical in high-stakes applications such as medical imaging [6]. However, class collapse maps the sleeping, jumping, and swatting cats all to a single “cat” embedding, losing strata information. As a result, these embeddings are less useful for common downstream applications in the modern machine learning landscape, such as transfer learning.
In this paper, we explore a simple modification to the supervised contrastive loss that prevents class collapse. We study how this modification affects embedding quality by considering how strata are represented in embedding space. We evaluate our loss both in terms of embedding quality, measured through three downstream applications, and end model quality.
In Section 3, we present our modification to the supervised contrastive loss, which prevents class collapse by changing how embeddings are pushed and pulled apart. The supervised contrastive loss pushes together embeddings of points from the same class and pulls apart embeddings of points from different classes. In contrast, our modified loss includes an additional class-conditional InfoNCE loss term that uniformly pulls apart individual points from within the same class. This term on its own encourages points from the same class to be maximally spread apart in embedding space, which discourages class collapse (see Figure 1, middle). Even though our modified loss does not use strata labels, we observe that it still produces embeddings that qualitatively appear to retain more strata information than those produced by the supervised contrastive loss (see Figure 2).
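As a concrete illustration, here is a minimal NumPy sketch of a combined loss of this form. The function names, the hyperparameter `alpha` weighting the spreading term, and the use of two augmented views as the within-class positives are our assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def _logsumexp(a):
    # Row-wise log-sum-exp that tolerates -inf entries (masked-out pairs).
    m = a.max(axis=1, keepdims=True)
    return m + np.log(np.exp(a - m).sum(axis=1, keepdims=True))

def modified_supcon_loss(z1, z2, labels, tau=0.1, alpha=1.0):
    """Supervised contrastive loss plus a class-conditional InfoNCE term.

    z1, z2: unit-norm embeddings of two augmented views, shape (n, d).
    labels: integer class labels, shape (n,).
    alpha:  weight on the within-class spreading term (our naming).
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)              # (2n, d)
    y = np.concatenate([labels, labels])              # (2n,)
    sim = (z @ z.T) / tau                             # pairwise similarities
    eye = np.eye(2 * n, dtype=bool)
    sim[eye] = -np.inf                                # exclude self-pairs

    # Supervised contrastive term: positives are all other same-class points;
    # the denominator ranges over every other point in the batch.
    pos = (y[:, None] == y[None, :]) & ~eye
    log_prob = sim - _logsumexp(sim)
    supcon = -np.where(pos, log_prob, 0.0).sum(axis=1) / pos.sum(axis=1)

    # Class-conditional InfoNCE term: the only positive is the other view of
    # the same point, and the denominator is restricted to the same class,
    # which uniformly pushes apart individual points within each class.
    view = np.roll(eye, n, axis=1)                    # pairs (i, i+n mod 2n)
    lse_cc = _logsumexp(np.where(pos, sim, -np.inf)).ravel()
    infonce = -(sim[view] - lse_cc)

    return float((supcon + alpha * infonce).mean())
```

Setting `alpha = 0` recovers the supervised contrastive term alone; larger values spread same-class points further apart.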
In Section 4, motivated by these empirical observations, we study how well our modified loss preserves distinctions between strata in the representation space. Previous theoretical tools that study the optimal embedding distribution fail to characterize the geometry of strata. Instead, we propose a simple thought experiment considering the embeddings that the supervised contrastive loss generates when it is trained on a partial sample of the dataset. This setup enables us to distinguish strata based on their sizes by considering how likely they are to be represented in the sample (larger strata are more likely to appear in a small sample). In particular, we find that points from rarer and more distinct strata are clustered less tightly than points from common strata, and we show that this clustering property can improve embedding quality and generalization error.
In Section 5, we empirically validate several downstream implications of these insights. First, we demonstrate that our modified loss produces embeddings that retain more information about strata, resulting in lift on three downstream applications that require strata recovery:
We evaluate how well our loss's embeddings encode fine-grained subclasses with coarse-to-fine transfer learning. Our loss achieves up to 4.4 points of lift across four datasets.
We evaluate how well embeddings produced by our loss can recover strata in an unsupervised setting by evaluating robustness in terms of worst-group accuracy and noisy labels. We use our insights about how it embeds strata of different sizes to improve worst-group robustness by up to 2.5 points and to recover 75% of performance when 20% of the labels are noisy.
We evaluate how well we can differentiate rare strata from common strata by constructing limited subsets of the training data that achieve the highest performance under a fixed training strategy (the coreset problem). We construct coresets by subsampling points from common strata. Our coresets outperform prior work by 1.0 points when the coreset size is 30% of the training set.
Next, we find that our modified loss produces higher-quality models, outperforming the supervised contrastive loss by up to 4.0 points across 9 tasks. Finally, we discuss related work in Section 6 and conclude in Section 7.
4. Geometry of Strata
We first discuss some existing theoretical tools for analyzing contrastive loss geometrically and their shortcomings with respect to understanding how strata are embedded. In Section 4.2, we propose a simple thought experiment about the distances between strata in embedding space when the encoder is trained on a finite subsample of the data, to better understand our prior qualitative observations. Then, in Section 4.3, we discuss implications of representations that preserve strata distinctions, showing theoretically how they can yield better generalization error on both coarse-to-fine transfer and the original task, and empirically how they allow for new downstream applications.
4.1. Existing Analysis
Previous works have studied the geometry of optimal embeddings under contrastive learning [2, 8, 9], but their techniques cannot analyze strata because strata information is not directly used in the loss function. These works use the infinite encoder assumption, under which any distribution on the hypersphere is realizable by the encoder f applied to the input data. This makes minimizing the contrastive loss equivalent to an optimization problem over probability measures on the hypersphere. As a result, solving this new problem yields a distribution whose characterization is solely determined by information in the loss function (e.g., label information [2, 9]) and is decoupled from other information about the input data x, and hence decoupled from strata.
More precisely, minimizing the contrastive loss over the mapping f is, at the population level, equivalent to minimizing over the pushforward measure of the input distribution under f. The infinite encoder assumption allows us to relax the problem and instead optimize over all Borel probability measures on the hypersphere. The optimal learned measure is then independent of the distribution of the input data beyond what appears in the relaxed objective function.
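In symbols (our notation, not necessarily the original's), the relaxation reads:

```latex
% Our notation: \mu is the input distribution, f_{\#}\mu its pushforward
% under the encoder f, and \mathcal{P}(S^{d-1}) the set of Borel
% probability measures on the hypersphere.
\min_{f} \; \mathcal{L}\bigl(f_{\#}\mu\bigr)
\quad\longrightarrow\quad
\min_{\nu \in \mathcal{P}(S^{d-1})} \; \mathcal{L}(\nu)
```

The relaxed problem on the right no longer refers to the inputs x at all, which is why its solution cannot reflect strata structure.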
This approach using the infinite encoder assumption does not allow for analysis of strata. Strata are unknown at training time and thus cannot be incorporated explicitly into the loss function, so their geometries are not reflected in the characterization of the optimal distribution obtained from previous theoretical tools. Therefore, we need additional reasoning to explain our empirical observations that strata distinctions are preserved in embedding space under our modified loss.
4.2. Subsampling Strata
We propose a simple thought experiment based on subsampling the dataset—randomly sampling a fraction of the training data—to analyze strata. Consider the following: we subsample a fraction t of a training set of N points. We use this subsampled dataset to learn an encoder, and we study the average distance in the resulting embedding space between two strata z and z' as t varies.
The average distance between z and z' depends on whether z and z' are both in the subsampled dataset. We study the case when z and z' belong to the same class. There are three cases (with probabilities stated in Appendix C.2) based on strata frequency and t—when both, one, or neither of the strata appears in the subsample:
Both strata appear in the subsample. The encoder is trained on both z and z'. For large N, we can approximate this setting by considering an encoder trained on infinite data from these strata. Points belonging to these strata will be defined in the optimal embedding distribution on the hypersphere, which can be characterized by prior theoretical approaches [2, 8, 9]. Under our modified loss, the average distance between z and z' depends on the weight of the spreading term, which controls the extent of spread in the embedding geometry. Under the supervised contrastive loss, points from the two strata would asymptotically map to one location on the hypersphere, and the average distance would converge to 0. This case occurs with probability increasing in the frequencies of the strata and in t.
One stratum but not the other appears in the subsample. Without loss of generality, suppose that points from z appear in the subsample but no points from z' do. To understand the average distance, we can consider how the end model learned using the “source” distribution containing z performs on the “target” distribution of stratum z', since this downstream classifier is a function of distances in embedding space. Borrowing from the literature on domain adaptation, the difficulty of this out-of-distribution problem depends on both the divergence between the source z and target z' distributions and the capacity of the overall model. The H-divergence from Ben-David et al. [10, 11], which is studied in lower bounds in Ben-David and Urner [12], and the discrepancy distance from Mansour et al. [13] capture both concepts. Moreover, the optimal geometries under our modified loss and the supervised contrastive loss induce different end model capacities and prediction distributions, with data being more separable under our modified loss, which can help explain why it better preserves strata distances. This case occurs with probability increasing in the frequency of z and decreasing in the frequency of z' and in t.
Neither stratum appears in the subsample. The average distance in this case is bounded in terms of the total variation distance between the strata distributions, regardless of how the encoder is trained, although differences in transfer from models learned on the subsample to z versus z' can be further analyzed. This case occurs with probability decreasing in the frequencies of the strata and in t.
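As a rough illustration of how the three case probabilities trade off (not the exact statements in Appendix C.2), we can treat the subsample as tN i.i.d. draws and, as a simplifying assumption, treat the two presence events as independent:

```python
def case_probabilities(p_z, p_zp, t, N):
    """Approximate probabilities of the three cases in Section 4.2.

    p_z, p_zp: frequencies of strata z and z'.
    t: subsample fraction; N: training set size.
    Assumes tN i.i.d. draws and independent presence events (a
    simplification; Appendix C.2 gives the exact statements).
    """
    m = int(t * N)                        # subsample size
    absent_z = (1.0 - p_z) ** m           # z never drawn
    absent_zp = (1.0 - p_zp) ** m         # z' never drawn
    both = (1.0 - absent_z) * (1.0 - absent_zp)
    neither = absent_z * absent_zp
    one = 1.0 - both - neither            # exactly one stratum appears
    return both, one, neither
```

For a common z and a rare z', increasing t drives up the probability that both appear, while for small t the "one stratum" case dominates, matching the discussion above.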
We make two observations from these cases. First, if z and z' are both common strata, then as t increases, the distance between them depends on the optimal asymptotic distribution. Therefore, if we set the weight on the spreading term to zero in our modified loss, these common strata will collapse. Second, if z is a common stratum and z' is uncommon, the second case occurs frequently over randomly sampled subsets, and thus the strata are separated based on the difficulty of the respective out-of-distribution problem. We thus arrive at the following insight from our thought experiment:
Common strata are more tightly clustered together, while rarer and more semantically distinct strata are far away from them.
Figure 3 demonstrates this insight. It shows a t-SNE visualization of embeddings from training on CIFAR100 with coarse superclass labels and with artificially imbalanced subclasses. Points from the largest subclasses (dark blue) cluster tightly, whereas points from the smallest subclasses (light blue) are scattered throughout the embedding space.
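The average distance between two strata can be estimated directly from learned embeddings; a minimal sketch (the function name is ours):

```python
import numpy as np

def avg_strata_distance(emb, strata, z, z_prime):
    """Mean Euclidean distance over all cross-pairs of embeddings
    from strata z and z_prime (the quantity studied in this section).

    emb: embeddings, shape (n, d); strata: stratum labels, shape (n,).
    """
    a = emb[strata == z]
    b = emb[strata == z_prime]
    diffs = a[:, None, :] - b[None, :, :]      # (|z|, |z'|, d)
    return float(np.linalg.norm(diffs, axis=-1).mean())
```

Applied with z = z_prime, the same function measures how tightly a single stratum clusters, so it can be used to reproduce the common-versus-rare comparison in Figure 3.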
4.3. Implications
We discuss theoretical and practical implications of our subsampling argument. First, we show that embeddings that preserve strata yield better generalization error on both the coarse-to-fine transfer task and the original task. Second, we discuss practical implications of the argument that enable new applications.
4.3.1. Theoretical Implications
Consider the encoder trained on all N points of the dataset using our modified loss, and suppose a mean classifier is used for the end model: each class is represented by the mean of its embeddings, and a point is assigned to the class with the nearest mean. On coarse-to-fine transfer, generalization error depends on how far each stratum center is from the others.
Lemma 1 (informal). The generalization error on the coarse-to-fine transfer task is bounded above by a decreasing function of the average distances between strata defined in Section 4.2. The larger the distances between strata, the smaller the upper bound on generalization error. We now show that a similar result holds on the original task, but with an additional term that penalizes points from the same class being too far apart.
Lemma 2 (informal). The generalization error on the original task is bounded above by an expression whose first term penalizes large distances between strata of the same class and whose remaining terms decrease in the distances between strata of different classes. This result suggests that maximizing distances between strata of different classes is desirable, but less so for distances between strata of the same class, as reflected in the first term of the bound. Both results illustrate that separating strata to some extent in the embedding space yields better bounds on generalization error. In Appendix C.3, we provide proofs of these results and derive values of the generalization error for these two tasks under class collapse for comparison.
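The mean classifier assumed in this subsection admits a direct implementation; a minimal sketch (function names are ours):

```python
import numpy as np

def fit_mean_classifier(emb, labels):
    """Compute one centroid (mean embedding) per class."""
    classes = np.unique(labels)
    centroids = np.stack([emb[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_mean_classifier(emb, classes, centroids):
    """Assign each point to the class with the nearest centroid."""
    d = np.linalg.norm(emb[:, None, :] - centroids[None, :, :], axis=-1)
    return classes[d.argmin(axis=1)]
```

Because predictions depend only on distances to class means, the generalization bounds above reduce to statements about how far stratum centers sit from each other in embedding space.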
4.3.2. Practical Implications
Our discussion in Section 4.2 suggests that training with our modified loss better distinguishes strata in embedding space. As a result, we can use differences between strata of different sizes for downstream applications. For example, unsupervised clustering can help recover pseudolabels for unlabeled, rare strata. These pseudolabels can be used as inputs to worst-group robustness algorithms, or used to detect noisy labels, which appear to be rare strata during training (see Section 5.3 for examples). We can also train over subsampled datasets to heuristically distinguish points that come from common strata from points that come from rare strata. We can then downsample points from common strata to construct minimal coresets (see Section 5.4 for examples).
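Both uses can be sketched together, with a minimal k-means standing in for any clustering method; the function names and the equal-budget downsampling heuristic are our assumptions, not the exact algorithms evaluated later:

```python
import numpy as np

def kmeans_pseudolabels(emb, k, iters=50, seed=0):
    """Minimal k-means over embeddings; cluster ids serve as strata
    pseudolabels for downstream robustness or noise-detection methods."""
    rng = np.random.default_rng(seed)
    centers = emb[rng.choice(len(emb), size=k, replace=False)].copy()
    for _ in range(iters):
        d = np.linalg.norm(emb[:, None, :] - centers[None, :, :], axis=-1)
        assign = d.argmin(axis=1)                 # nearest center per point
        for j in range(k):
            pts = emb[assign == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)     # recompute centers
    return assign

def coreset_indices(assign, budget, seed=0):
    """Downsample common (large) clusters so each cluster contributes
    roughly equally, keeping about `budget` points overall."""
    rng = np.random.default_rng(seed)
    clusters = np.unique(assign)
    per = max(1, budget // len(clusters))
    keep = []
    for c in clusters:
        idx = np.flatnonzero(assign == c)
        keep.extend(rng.choice(idx, size=min(per, len(idx)), replace=False))
    return np.array(sorted(keep))
```

Rare strata survive the downsampling almost untouched, while points from common strata are aggressively thinned, which is the intuition behind the coreset construction in Section 5.4.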
6. Related Work and Discussion
From work in contrastive learning, we take inspiration from [21], who use a latent classes view to study self-supervised contrastive learning. Similarly, [22] considers how minimizing the InfoNCE loss recovers a latent data-generating model. We initially approached this problem from a debiasing angle, studying the effects of noise in supervised contrastive learning inspired by [23], but moved to our current strata-based view of noise instead. Recent work has also analyzed contrastive learning from an information-theoretic perspective [24, 25, 26], but this perspective does not fully explain practical behavior [27], so we focus on the geometric perspective in this paper because of its downstream applications. On the geometric side, we are inspired by the theoretical tools from [8] and [2], who study representations on the hypersphere, along with [9].
Our work builds on the recent wave of empirical interest in contrastive learning [20, 28, 29, 30, 31] and supervised contrastive learning [1]. There has also been empirical work analyzing the transfer performance of contrastive representations and the role of intra-class variability in transfer learning. [32] find that combining supervised and self-supervised contrastive losses improves transfer learning performance, and they hypothesize that this is due to both inter-class separation and intra-class variability. [33] find that combining cross-entropy and self-supervised contrastive losses improves coarse-to-fine transfer, also motivated by preserving intra-class variability.
Our loss derives from similar motivations to the losses proposed in these works, and we further theoretically study why class collapse can hurt downstream performance. In particular, we study why preserving distinctions between strata in embedding space may be important, with theoretical results corroborating their empirical studies. We further propose a new thought experiment for why a combined loss function may lead to better separation of strata.
Our treatment of strata is strongly inspired by [5, 6], who document empirical consequences of hidden strata. We are inspired by empirical work that has demonstrated that detecting subclasses can be important for performance [4, 34] and robustness [14, 35, 36].
Each of our downstream applications is a field in itself, and we take inspiration from recent work in each. Our noise heuristic is similar to ELR [37] and takes inspiration from various works using contrastive learning to correct noisy labels and for semi-supervised learning [38, 39, 40]. Our coreset algorithm is inspired by recent work on coresets for modern deep networks [19, 41, 42], and takes inspiration from [18] in particular.