A Framework using Contrastive Learning for Classification with Noisy Labels

We propose a framework that uses contrastive learning as a pre-training task to perform image classification in the presence of noisy labels. Recent strategies, such as pseudo-labeling, sample selection with Gaussian Mixture Models, and weighted supervised contrastive learning, are combined into a fine-tuning phase following the pre-training. This paper provides an extensive empirical study showing that a preliminary contrastive learning step brings a significant gain in performance when using different loss functions: non-robust, robust, and early-learning regularized. Our experiments, performed on standard benchmarks and real-world datasets, demonstrate that: i) contrastive pre-training increases the robustness of any loss function to noisy labels, and ii) the additional fine-tuning phase can further improve accuracy, but at the cost of additional complexity.


Introduction
Collecting large and well-annotated datasets for image classification tasks represents a challenge, as high-quality human annotations are expensive and time-consuming. Alternative methods exist, such as web crawlers [27]. Nevertheless, these methods generate noisy labels that decrease the performance of deep neural networks, which tend to overfit noisy labels due to their high capacity [44]. That is why developing efficient noisy-label learning (NLL) techniques is of great importance.
Various strategies have been proposed to deal with NLL: i) noise transition matrix methods [9,33,41] estimate the noise probability and correct the loss function, ii) a small and clean subset can help to avoid overfitting [14], iii) sample selection identifies true-labeled samples [10,15,22], and iv) robust loss functions solve the classification problem only by adapting the loss function to be less sensitive to noisy labels [26,37,47]. Some methods also combine several strategies (e.g., ELR+ [22], DivideMix [25]): two networks, semi-supervised learning, label correction, or mixup. They show the most promising results but introduce a large number of hyperparameters. That is why we explore improvement strategies for robust loss functions. They are simpler to integrate and faster to train but, as illustrated in Figure 1, they tend to overfit and have lower performance for high noise ratios.
Meanwhile, new self-supervised learning algorithms for image representations have recently been developed [5,12]. Such algorithms extract representations (or features) in unsupervised settings. These representations can then be used for downstream tasks such as classification. Methods based on contrastive learning compete with fully supervised learning while fine-tuning only on a small fraction of all available labels. Therefore, using contrastive learning for NLL appears promising. In this work, contrastive learning is used to pre-train the classifier to improve its robustness.
The key contributions of this work are:
• A framework increasing the robustness of any loss function to noisy labels by adding a contrastive pre-training task.
• The adaptation of the supervised contrastive loss to use sample weight values representing the probability of correctness for each sample in the training set.
• An extensive empirical study identifying and benchmarking additional state-of-the-art strategies to boost the performance of pre-trained models: pseudo-labeling, sample selection with GMM, weighted supervised contrastive learning, and mixup with bootstrapping.

Related works
Existing approaches dealing with NLL and contrastive learning in computer vision are briefly reviewed. Further details can be found in Le-Khac et al. [21] and Song et al. [36].

Noise tolerant classification
Sample Selection: This method identifies noisy and clean samples within the training data. Several strategies leverage the interactions between multiple networks to identify the probably correct labels [10,15,22]. Recent works [1,35] exploit the small-loss trick to identify clean and noisy samples by considering a certain number of small-loss training samples as true-labeled samples. This approach can be justified by the memorization effect: deep neural networks first fit the training data with clean labels during a so-called early learning phase, before overfitting the noisy samples during the memorization phase [2,25].
Robust Loss Function: Commonly used loss functions, such as Cross Entropy (CE) or Focal Loss, are not robust to noisy labels. Therefore, new loss functions have been designed. Such robust loss functions can easily be incorporated into existing pipelines to improve performance in the presence of noisy labels. The symmetric cross entropy [37] has been proposed by adding a reverse CE loss to the initial CE. This combination improves the accuracy of the model compared to classical loss functions. Ma et al. [26] show theoretically that normalization can convert classical loss functions into loss functions robust to noisy labels. The combination of two robust loss functions can also improve robustness. However, the performance of normalized loss functions remains quite low for high noise rates, as illustrated in Figure 1.
Semi-supervised: Semi-supervised approaches deal with both labeled and unlabeled data. Recent works [22,30,38] combine sample selection with semi-supervised methods: the possibly noisy samples are treated as unlabeled and the possibly clean samples are treated as labeled. Such approaches leverage information contained in noisy data, for instance by using MixMatch [3]. Semi-supervised approaches show competitive results. However, they use several hyperparameters that can be sensitive to changes in data or noise type [31,36].
Contrastive learning: Recent developments in self-supervised and contrastive learning [23,31,46] inspire new approaches in NLL. Li et al. [23] employed features learned by contrastive learning to detect out-of-distribution samples.

Contrastive learning for vision data
Contrastive learning extracts features by comparing each data sample with different samples. The central idea is to bring different instances of the same input image closer and spread instances from different images apart. The inputs are usually divided into positive (similar inputs) and negative pairs (dissimilar inputs). Frameworks have recently been developed, such as CPCv2 [13], SimCLR [5], and MoCo [12]. Once the self-supervised model is trained, the extracted representations can be used for downstream tasks. In this work, the representations are used for noisy label classification.
Chen et al. [5] demonstrate that large sets of negatives (and large batches) are crucial in learning good representations. However, large batches are limited by GPU memory. Maintaining a memory bank accumulating a large number of negative representations is an elegant solution decoupling the batch size from the number of negatives [28]. Nevertheless, these representations get outdated in a few iterations. The Momentum Encoder [12] addresses this issue by generating a dynamic memory queue of representations. Other strategies aim at obtaining more meaningful negative samples to reduce the memory/batch size [16].
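The momentum encoder keeps the queued representations consistent by updating the key encoder as an exponential moving average of the query encoder. The function below is a minimal NumPy sketch of that update rule (parameter lists as plain arrays; names are ours, not taken from the MoCo code):

```python
import numpy as np

def momentum_update(key_params, query_params, m=0.999):
    """Exponential moving average update of the momentum (key) encoder
    from the query encoder, as described in MoCo [12].
    key_params / query_params: lists of parameter arrays."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]
```

With m close to 1, the key encoder evolves slowly, so representations stored in the queue stay comparable across iterations.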
Preliminaries
Let D = {(x_i, ỹ_i)}_{i=1}^{n}, ỹ_i ∈ {1, …, K}, denote a noisy input dataset with an unknown number of samples incorrectly labelled. The associated true and unobservable labels are written y_i. The images x_i are of size d and the classification problem has K classes. The goal is to train a deep neural network (DNN) f. Training with a robust loss function consists of minimizing the empirical risk defined by the robust loss in order to find the set of optimal parameters θ. The one-hot encoding of the label is denoted by the distribution q(k|x) for a sample x and a class k, such that q(ỹ_i|x_i) = 1 and q(k|x_i) = 0 for k ≠ ỹ_i, ∀i ∈ {1, …, n}. The probability vector of f is given by the softmax function p(k|x) = e^{z_k} / Σ_{j=1}^{K} e^{z_j}, where z_k denotes the logit output with respect to class k.
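The two basic quantities used throughout the paper, the softmax probability vector p(k|x) and the one-hot label distribution q(k|x), can be written in a few lines (a plain NumPy sketch, not tied to any framework):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the K logits z_k: p(k|x) = e^{z_k} / sum_j e^{z_j}
    e = np.exp(z - z.max())
    return e / e.sum()

def one_hot(y, num_classes):
    # q(k|x): one-hot distribution of the observed label y
    q = np.zeros(num_classes)
    q[y] = 1.0
    return q
```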

Classification with robust loss functions
The method employs noise-robust losses to train the classifier in the presence of noisy labels. Such losses improve the classification accuracy compared to the commonly used Cross Entropy (CE), as illustrated in Figure 1. In this section, the general empirical risk for a given mini-batch B is defined by

L = (1 / |B|) Σ_{i ∈ B} l_i,

where the term l_i is specified by each loss function.
The classical CE is used as a baseline loss function that is not robust to noisy labels [8] and is defined as:

l_CE = - Σ_{k=1}^{K} q(k|x_i) log p(k|x_i).

As presented in Section 2, Ma et al. [26] introduce robust loss functions called Active Passive Losses that do not suffer from underfitting. We investigate the combination of the Normalized Focal Loss (NFL) and the Reverse Cross Entropy (RCE), called NFL+RCE, which shows promising results on various benchmarks. The NFL is defined as:

l_NFL = ( - (1 - p(ỹ_i|x_i))^γ log p(ỹ_i|x_i) ) / ( - Σ_{k=1}^{K} (1 - p(k|x_i))^γ log p(k|x_i) ),

where γ ≥ 0 is a hyperparameter. The RCE loss is:

l_RCE = - Σ_{k=1}^{K} p(k|x_i) log q(k|x_i),

where log q(k|x_i) is truncated to a negative constant A for k ≠ ỹ_i to avoid log(0). The final combination following the framework simply gives a different weight to each loss:

l = α l_NFL + β l_RCE.

The two hyperparameters α and β control the balance between more active learning and less passive learning. For simplicity, α and β are set to 1.0 without any tuning.
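The NFL+RCE combination above can be sketched for a single sample as follows. This is our reading of the Active Passive Loss of Ma et al. [26], not their reference implementation; the truncation constant A = -4 is an assumption commonly used for RCE:

```python
import numpy as np

A = -4.0  # truncation constant replacing log(0) in the reverse cross entropy (assumed)

def nfl_rce(p, y, gamma=0.5, alpha=1.0, beta=1.0):
    """Single-sample sketch of the NFL+RCE loss.
    p: softmax probability vector, y: observed (possibly noisy) label index."""
    fl = -((1.0 - p) ** gamma) * np.log(p)  # focal loss evaluated for every class
    nfl = fl[y] / fl.sum()                  # normalization over classes (the robust part)
    rce = -A * (1.0 - p[y])                 # RCE: log q = 0 for k = y, A otherwise
    return alpha * nfl + beta * rce
```

Note that both terms are small when the model is confident in the observed label and bounded when it is not, which limits the gradient contribution of likely mislabeled samples.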
Liu et al. [25] propose another framework to deal with noisy annotations based on the "early learning" phase. The loss, called Early Learning Regularization (ELR), adds a regularization term to capitalize on early learning. ELR is not strictly speaking a robust loss but belongs to robust penalization and label correction methods. The penalization term corrects the CE based on estimated soft labels identified with semi-supervised learning techniques. It prevents memorization of false labels by steering the model towards these targets. The regularization term maximizes the inner product between model outputs and targets:

l_ELR = l_CE + λ log( 1 - Σ_{k=1}^{K} p(k|x_i) t(k|x_i) ),

where λ controls the strength of the regularization. The target t is not set equal to the model output but is estimated with temporal ensembling from semi-supervised methods. Let t(k|x_i)^(l) denote the target for example x_i at iteration l of training with a momentum β:

t(k|x_i)^(l) = β t(k|x_i)^(l-1) + (1 - β) p(k|x_i).
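A one-sample sketch of ELR, simplified from the batch formulation of the paper, combines the temporal-ensembling target update with the regularized loss (the default β = 0.7 and λ = 3.0 match the values used in our experiments):

```python
import numpy as np

def elr_step(p, y, target, beta=0.7, lam=3.0):
    """Simplified single-sample ELR step (an illustrative reduction, not the
    paper's batch implementation).
    p: current softmax output, y: observed label, target: running target t."""
    target = beta * target + (1.0 - beta) * p        # temporal ensembling update
    ce = -np.log(p[y])                               # cross entropy term
    reg = lam * np.log(1.0 - np.dot(p, target))      # early-learning regularizer
    return ce + reg, target
```

Minimizing the log(1 - ⟨p, t⟩) term pushes the inner product between the output and the target towards 1, which is what prevents drifting towards memorized false labels.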

Contrastive learning
Contrastive learning methods learn representations by contrasting positive and negative examples. A typical framework is composed of several blocks [7]:
• Data augmentation: Data augmentation is used to decouple the pretext task from the network architecture. Chen et al. [5] broadly study the impact of data augmentation. We follow their suggestion combining random crop (and flip), color distortion, Gaussian blur, and gray-scaling.
• Encoding: The encoder extracts features (or representations) from augmented data samples. A classical choice of encoder for image data is the ResNet model [11]. The final goal of the contrastive approach is to find correct weights for the encoder.
• Loss function: The loss function usually combines positive and negative pairs. The Noise Contrastive Estimation (NCE) and its variants are popular choices. The general formulation for such a loss function is defined for the i-th pair as [40]:

l_i = - log( exp(z_i · z_{j(i)} / τ) / Σ_{a ∈ A(i)} exp(z_i · z_a / τ) ),

where z is a feature vector, I is the set of indexes in the mini-batch, i is the index of the anchor, j(i) is the index of an augmented version of the anchor source image, A(i) = I \ {i}, and τ is a temperature scaling the dot product. The denominator includes one positive and K negative pairs.
• Projection head: This step is not used in all frameworks. The projection head maps the representation to a lower-dimensional space and acts as an intermediate layer between the representation and the embedding pairs. Chen et al. [5,6] show that the projection head helps to improve the representation quality.
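The contrastive loss above can be computed for a whole mini-batch of 2N augmented views with a few matrix operations; a minimal NumPy sketch (assuming L2-normalized features, as in SimCLR):

```python
import numpy as np

def info_nce(z, pair, tau=0.5):
    """Sketch of the contrastive loss of Eq. (7).
    z: L2-normalized feature matrix of shape (2N, d);
    pair[i] = j(i), the index of anchor i's augmented view."""
    sim = z @ z.T / tau                    # scaled pairwise similarities
    np.fill_diagonal(sim, -np.inf)         # A(i) = I \ {i}: exclude the anchor itself
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(z)), pair].mean()
```

The loss is low when each anchor is most similar to its own augmented view and dissimilar to every other sample in the batch.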

A framework coupling contrastive learning and noisy labels
As illustrated in Figure 2, our method classifies noisy samples in a two-phase process. First, a classifier pre-trained with contrastive learning produces train-set pseudo-labels (pre-training phase, panel a), which are used during the training of a subsequent fine-tuning phase (panel b). The underlying intuition is that the predicted pseudo-labels are more accurate than the original noisy labels. The contrastive learning performed in the first phase (panel a1) improves the performance of the classifier (panel a2), which is sensitive to noisy labels; the resulting model can also be used in a standalone way, with a reduced number of hyperparameters, without the subsequent fine-tuning phase.
The second phase leverages the pseudo-labels predicted by the pre-training in all of its steps (b1-b3). To mitigate the effect of potentially incorrect pseudo-labels, a Gaussian Mixture Model (GMM, panel b1) with two components follows the small-loss trick to predict the probability of correctness for each sample. This value is used as a weight in a supervised contrastive step (panel b2), performed to improve the learned representations by taking advantage of the label information. A classification head is added to the contrastive model in order to produce the final predictions (panel b3). The fine-tuning phase can be seen as an adaptation of the pre-training phase to handle pseudo-labels. To maximize the impact of the contrastive learning on the subsequent classification, the supervised training is performed in two steps: a warm-up step, updating only the classifier layer (while keeping the encoder frozen), is followed by the full model training. We compared three different loss functions for the supervised classification: the classical CE, the robust NFL+RCE, and the ELR loss.

Sample selection and correction with pseudo-labels
Pseudo-labels represent the one-hot encoded model predictions on the training set. Pseudo-labels were initially used in semi-supervised learning to produce annotations for unlabelled data; in the noisy label setting, various techniques (e.g., DivideMix) identify a subset with a high likelihood of correctness and treat the remaining samples as the unlabeled counterpart in semi-supervised learning. In this work, we elaborate on the observation that the training set labels predicted after training the model with a noise-robust loss function (i.e., the pseudo-labels) are more accurate than the noisy ground truth. This observation is supported by the results in Figure 3, depicting the accuracy of pseudo-labels predicted on CIFAR100 contaminated with various levels of asymmetric (panel a) and symmetric (panel b) noise. The pseudo-labels are more accurate than the corrupted ground truth in both settings and bring a higher gain in performance as the noise ratio increases.

Figure 3: Accuracy of pseudo-labels on all simulated settings with asymmetric (a) and symmetric (b) noise, evaluated on CIFAR100. The correctness of the ground truth is represented on the x axis, the accuracy of the predicted pseudo-labels on the y axis. In all experiments, the pseudo-labels have a higher accuracy than the corrupted ground truth, and this gain increases with the noise ratio.

As proposed by other approaches [1], the loss value on train samples can be used to discriminate between clean and mislabeled samples. The sample correctness probability is computed by fitting a two-component GMM on the distribution of losses [22]. The posterior probability of the clean component is used as a sample weight:

w_i = p(k = 0 | l_i),

where l_i is the loss for sample i and k = 0 is the GMM component associated with the clean samples (lowest mean loss). Figure 4 depicts the evolution of the clean training set identified by the GMM on an example: its accuracy grows from 0.6 to 0.93 while its size stabilizes at 60% of the training set.
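The GMM-based weighting step can be sketched with scikit-learn's GaussianMixture (an illustrative sketch; the paper's implementation details may differ):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def clean_probability(losses):
    """Fit a two-component GMM on per-sample losses and return the posterior
    probability of the low-mean (clean) component, used as the weight w_i."""
    losses = np.asarray(losses, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(losses)
    clean = int(np.argmin(gmm.means_.ravel()))  # component with the smallest mean loss
    return gmm.predict_proba(losses)[:, clean]
```

Samples with small losses receive weights close to 1 (likely clean), while large-loss samples receive weights close to 0 (likely mislabeled).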

Weighted supervised contrastive learning
A modification of the contrastive loss defined in Equation 7 has been proposed to leverage label information [18]:

l_i^sup = - (1 / |P(i)|) Σ_{p ∈ P(i)} log( exp(z_i · z_p / τ) / Σ_{a ∈ A(i)} exp(z_i · z_a / τ) ),

where P(i) = {j ∈ I \ {i} : y_j = y_i}, with y_i the prediction of the model for input x_i.
As explained in the previous section, the loss values of the training set samples are used to fit a GMM with two components, corresponding to correctly and incorrectly labeled samples. We adapted the supervised representation loss to employ w, a weighting factor representing the sample probability of membership in the correctly labeled component. Thus, likely mislabeled samples with large loss values contribute only marginally to the supervised representations:

l_i^w-sup = - (1 / Σ_{p ∈ P(i)} w_{p,i}) Σ_{p ∈ P(i)} w_{p,i} log( exp(z_i · z_p / τ) / Σ_{a ∈ A(i)} exp(z_i · z_a / τ) ),

where w_{p,i} is a modified version of w_p such that w_{p,i} = 1 if p = j(i), else w_{p,i} = w_p. If all samples are considered noisy, Equation 10 simplifies to the classical unsupervised contrastive loss in Equation 7.
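The weighted supervised contrastive loss can be sketched as follows. For brevity, this sketch omits the special case w_{p,i} = 1 for the augmented view j(i) and weights every positive by its own clean-probability (a simplification of Eq. 10, not the exact formulation):

```python
import numpy as np

def weighted_supcon(z, labels, w, tau=0.5):
    """Simplified sketch of the weighted supervised contrastive loss.
    z: L2-normalized features (N, d); labels: pseudo-labels;
    w[p]: GMM probability that sample p is correctly labeled."""
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)           # exclude the anchor from A(i)
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    loss = 0.0
    for i in range(len(z)):
        pos = [p for p in range(len(z)) if p != i and labels[p] == labels[i]]
        wp = np.array([w[p] for p in pos])
        # weighted average of the positive log-probabilities for anchor i
        loss += -(wp * log_prob[i, pos]).sum() / max(wp.sum(), 1e-12)
    return loss / len(z)
```

When all weights are zero except the augmented views, the positive set collapses to the single pair (i, j(i)) and the loss reduces to the unsupervised contrastive loss.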

Experiments
The framework is assessed on three benchmarks and the contribution of each block identified in Figure 2 is analyzed.

Datasets
CIFAR10 and CIFAR100 [20]. These experiments assess the accuracy of the method against synthetic label noise. The two datasets are contaminated with simulated symmetric or asymmetric label noise, reproducing the heuristic in Ma et al. [26]. The symmetric noise consists of corrupting an equal, arbitrary ratio of labels for each class. The noise level varies from 0.2 to 0.8. For asymmetric noise [25,33], sample labels are flipped within a specific set of classes, thus producing confusion between predetermined pairs of labels. For CIFAR100, 20 super-classes have been created, each consisting of 5 sub-classes. The label flipping is performed circularly within each super-class. The asymmetric noise ratio is explored between 0.2 and 0.4.
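The two noise models can be simulated in a few lines; a sketch under the assumption that asymmetric noise flips each selected label circularly to the next class within its group (the exact class pairs used by the benchmark may differ):

```python
import numpy as np

def symmetric_noise(labels, ratio, num_classes, seed=0):
    """Corrupt a given ratio of labels, replacing each by a different random class."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    idx = rng.choice(len(labels), size=int(ratio * len(labels)), replace=False)
    for i in idx:
        noisy[i] = rng.choice([c for c in range(num_classes) if c != labels[i]])
    return noisy

def asymmetric_noise(labels, ratio, num_classes, seed=0):
    """Circular flip y -> y + 1 (mod K) for a given ratio of samples,
    mimicking confusion between predetermined class pairs."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    idx = rng.choice(len(labels), size=int(ratio * len(labels)), replace=False)
    noisy[idx] = (labels[idx] + 1) % num_classes
    return noisy
```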
Webvision [24]. This is a real-world dataset with noisy labels. It contains 2.4 million images crawled from the web (Google and Flickr) that share the same 1,000 classes as the ImageNet dataset. The noise ratio varies from 0.5% to 88%, depending on the class. In order to speed up the training time, we used mini Webvision [15], consisting of only the top 50 classes of the Google subset (66,000 images).

Clothing1M [42]. Clothing1M is a large real-world dataset consisting of 1 million images in 14 classes of clothing articles. Being gathered from e-commerce websites, Clothing1M embeds an unknown ratio of label noise. Additional validation and test sets, consisting of 14k and 10k clean labeled samples, have been made available. In order to speed up the training time, we selected a subset of 56,000 images keeping the initial class distribution.

Both Webvision and Clothing1M images were resized to 128 × 128. Therefore, the reported results may differ from other papers cropping the images to a 224 × 224 resolution.

Settings
We use the contrastive SimCLR framework [5] with a ResNet18 [11] (without ImageNet pre-training) as encoder. A projection head was added after the encoder for the contrastive learning, with the following architecture: a multi-layer perceptron with one hidden layer and a ReLU non-linearity. The classifier following the contrastive learning step has a simple multilayer architecture: a single hidden layer with batch normalization and a ReLU activation function. A comparison with a linear classifier is provided in the supplementary materials.
For all supervised classification, we use the SGD optimizer with momentum 0.9 and cosine learning rate annealing. The NFL hyperparameter γ is set to 0.5. Unlike the original paper, the ELR hyperparameters do not depend on the noise type: the regularization coefficient λ_elr and the momentum β are set to 3.0 and 0.7. Details on the experimental setting can be found in the supplementary materials.
All code is implemented in the PyTorch framework [32]. The experiments for CIFAR are performed with a single Nvidia TITAN V-12GB and the experiments for Webvision and Clothing1M with a single Nvidia Tesla V100-32GB, demonstrating the accessibility of the method. Our implementation has been made available along with the supplementary materials.

Results
All experiments presented in this section evaluate our method's performance with the top-1 accuracy score.

Impact of contrastive pre-training
To evaluate the impact of the contrastive pre-training on the classification model, the proposed method (pre-training phase) is compared with a baseline classifier trained for 200 epochs without contrastive learning. For each simulated dataset, we compare robust losses (i.e., NFL+RCE and ELR) and cross entropy. Results for CIFAR10 and CIFAR100 are reported in Table 1 for different levels of symmetric and asymmetric noise. The pre-training improves the accuracy of the three baselines on both datasets for all types and ratios of label noise. The largest differences are observed for the noisiest case, with 80% noise, where the pre-training outperforms the baselines by large margins: between 10 and 75 accuracy points for CIFAR10 and between 5 and 30 for CIFAR100.

In addition to the comparisons with ELR and NFL+RCE performed using our implementations (column Base in Table 1), we present the results reported by other recent competing methods. As shown in the introduction, numerous contributions have been made to the field in the last years. Six recent representative methods are selected for comparison: Taks [34], Co-teaching+ [43], ELR [25], DivideMix [22], SELF [30], and JoCoR [39]. The results are presented in Table 2. The difference between the scores reported by ELR and those obtained with our run (using the same implementation but slightly different hyperparameters and a ResNet18 instead of a ResNet34) suggests that the method is less stable on data contaminated with asymmetric noise and sensitive to small changes in hyperparameters. Moreover, ELR proposes hyperparameter values that differ depending on the type of dataset (i.e., CIFAR10/CIFAR100) and the underlying noise (i.e., symmetric/asymmetric), identified after a hyperparameter search exercise.

Webvision and Clothing1M results are presented in Table 3. The contrastive framework outperforms the respective baselines for the three loss functions. Because the images have a reduced size and, for Clothing1M, we use a smaller training set, the direct comparison with competing methods is less relevant. However, the observed gap in performance is significant and promising for training images with higher resolution. Moreover, a ResNet50 model has been trained with our framework on the Webvision dataset at a higher resolution (224 × 224). The accuracy reaches 75.7% and 76.2% for CE and ELR respectively. These results are very close to the values reported by DivideMix (77.3%) and ELR+ (77.8%) using a larger model, Inception-ResNet-v2 (the difference is more than 4% on the ImageNet benchmark [4]). Supported by this first set of experiments, the preliminary pre-training with contrastive learning shows strong performance. The accuracy of both traditional and robust-loss classification models is significantly improved.

Sensitivity to the hyperparameters
Estimating the best hyperparameters is complex for datasets with noisy labels, as clean validation sets are not available. For instance, Ortego et al. [31] show that two efficient methods (e.g., ELR and DivideMix) can be sensitive to specific hyperparameters. Therefore, a sensitivity study has been carried out to estimate the stability of the framework with respect to the learning rate. Figure 5 depicts the sensitivity on CIFAR100 with 80% noise. CE and NFL+RCE seem to have opposite behaviors. CE reaches competitive results with small learning rates but is prone to overfitting for higher learning rates. The NFL+RCE loss tends to underfit for the lowest learning rates but is quite robust for higher values. The ELR loss has the smallest sensitivity to the learning rate over the investigated range but does not reach the best values obtained with CE or NFL+RCE. We can assume that the regularization term coupled with pre-training is very efficient: it prevents the memorization of false labels observed with CE. Results for other noise ratios are documented in the supplementary materials.
This sensitivity analysis is limited to the learning rate. Investigating the impact of other hyperparameters, such as the momentum β or the regularization factor λ_elr, could be interesting. In their original papers, ELR and NFL+RCE reach respectively 25.2% and 30.3% with other hyperparameters. These values are still far from the improvements brought by the contrastive pre-training, but it suggests that the results could be improved with different hyperparameters.
Our empirical results indicate that the analyzed methods may be sensitive to hyperparameters. Despite their promised robustness to label noise, the analyzed robust losses are also affected by overfitting or underfitting. Our experiments have been built upon the parameters recommended in each issuing paper (e.g., ELR, SimCLR) but, since the individual building blocks can be affected by small variations in input parameters, the performance of our method may also be impacted. Finding a relevant method to estimate proper hyperparameters in NLL remains a challenge. In the absence of a clean validation set, identifying when overfitting starts also remains an open problem. This is demonstrated by our studies on the behaviour of the (also noise-corrupted) validation set and on two other recently proposed methods, analyzing the stability of the loss function on the train set and the changes in the upstream layers. These experiments are detailed in the supplementary materials.

Impact of the fine-tuning phase
Experimental results on synthetic label noise, depicted in Figure 6, show that continuing the presented pre-training block (Figure 2) with the fine-tuning phase increases the accuracy in over 65% of cases on CIFAR10 and over 80% of cases on CIFAR100. For both datasets, data with asymmetric noise benefit more from this approach than data with symmetric noise. All experiments only use the input parameters proposed in the loss-issuing papers.
The sample selection also has a positive impact on the two real-world datasets, as shown in Table 3 by the "Finetune" columns. The average accuracy improvement is about 1.8%. Only the ELR loss function slightly decreases the performance on Clothing1M.
Enriching pretrained models with sample weighting and selection, pseudo-labels instead of corrupted targets, and supervised contrastive pre-training can improve the classification accuracy. However, such an approach raises the question of a trade-off between complexity, accuracy improvement, and computation time.

Discussion and limits of the framework
In addition to the presented fine-tuning phase, we evaluated the performance of other promising techniques, such as dynamic bootstrapping with mixup [1]. This strategy has been developed to help convergence under extreme label noise conditions. Details can be found in the supplementary materials. The improvement that dynamic bootstrapping can bring when used after pre-training is depicted in Figure 7. In most cases, this technique improves the accuracy, as indicated by the positive accuracy gain scores, measuring the difference between the accuracy after dynamic bootstrapping and the accuracy of the pre-training phase. ELR and CE benefit most from this addition.

One of the major drawbacks of our method is the extra computational time needed to learn representations with contrastive learning. A detailed study comparing the execution time of our framework with 6 other competing methods is provided in the supplementary materials. The pre-training phase doubles the execution time of a reference baseline, consisting of performing only a single classification step, while the entire framework increases the execution time to 3 to 4 times the baseline value. However, contrastive learning does not increase the need for GPU memory if the batch size is limited [12,29]. The computational time could be reduced by initializing the contrastive step with pretrained weights from ImageNet.
Most state-of-the-art approaches also leverage computationally expensive settings, consisting of larger models (e.g., ResNet50), dual model training, or data augmentation such as mixup. In this work, we explored the limits of a restricted computational setting, consisting of a single GPU and 8GB RAM. All experiments use a ResNet18 model and batch sizes of 256, and for the real-world datasets the images have been rescaled (e.g., 128 × 128 instead of 224 × 224). We also foresee that the contrastive learning step could be improved by images with higher resolutions, as smaller details could be identified in the representation embedding.
There remain multiple open problems for future research, such as: i) identifying the start of the memorization phase in the absence of a clean dataset, ii) studying the impact of contrastive learning on other models for noisy labels such as DivideMix, iii) comparing the SimCLR approach in the context of noisy labels with other contrastive frameworks (the impact of MoCo is studied in the supplementary materials) and other self-supervised approaches, and iv) obtaining a better theoretical understanding of the interaction between the initial state precomputed with contrastive learning and the classifier in the presence of noisy labels. Moreover, the analysis carried out in this work should be validated in larger settings, in particular on Clothing1M with a ResNet50, higher resolutions, and the full dataset.

Conclusions
In this work, we presented a contrastive learning framework optimized with several adaptations for noisy label classification. Supported by an extensive range of experiments, we conclude that a preliminary representation pretraining improves the performance of both traditional and robust-loss classification models. Additionally, multiple techniques can be used to fine-tune and further optimize these results; however, no approach provides a systematic, significant improvement on all types of datasets and label noise. The cross-entropy penalized by Early-Learning Regularization (ELR) shows the best overall results on both synthetic noise and real-world datasets.
However, the training phases remain sensitive to the input configuration. Overfitting is the common weakness of all studied models. When trained with tuned parameters, even traditional (cross-entropy) models provide competitive results, while robust losses are less sensitive. The typical noisy label adaptations, such as sample selection or weighting, the usage of pseudo-labels, or supervised contrastive losses, improve the performance to a lesser extent but increase the framework's complexity. We hope that this work will promote the use of contrastive learning to improve the robustness of the classification process with noisy labels.

Even if pretraining the encoder increases the accuracy for both contrastive methods, the two approaches do not have the same behavior. In particular, the best parameters for the classifier optimizer seem to be different. This raises several questions about the difference between the two representations and what properties of these representations improve the robustness of the classifier.

C.2 Sensitivity to the learning rate
We perform a hyperparameter search on the CIFAR100 dataset. The learning rate is chosen in {10^-3, 10^-2, 10^-1, 10^0}. Results are presented in Figure 8. The configuration with 80% noise is clearly the most sensitive case, in particular for the NFL+RCE loss and CE. The ELR method is quite robust over the investigated range.

C.3 Impact of the classifier architecture
The impact of the two classifier architectures is detailed in Table 7. The multilayer architecture performs better on datasets contaminated with a significant amount of asymmetric noise.

D Dynamic bootstrapping with mixup
In addition to the presented fine-tuning phase, we also evaluated the performance of other techniques recently proposed for noisy label classification. The weights w computed by the sample selection phase can also be combined with a mixup data augmentation strategy [45]. A specific strategy for noisy labels, called dynamic bootstrapping with mixup [1], has been developed to help convergence under extreme label noise conditions. The convex combination of a sample pair x_p (loss l_p) and x_q (loss l_q) is weighted by the probabilities w of belonging to the clean dataset:

x = (w_p / (w_p + w_q)) x_p + (w_q / (w_p + w_q)) x_q,   (11)

l = (w_p / (w_p + w_q)) l_p + (w_q / (w_p + w_q)) l_q.
The associated CE is corrected according to the weights:

l_corr = - Σ_{k=1}^{K} ( w_i q(k|x_i) + (1 - w_i) z(k|x_i) ) log p(k|x_i),

where z(k|x_i) = 1 if k = argmax_k p(k|x_i) and zero otherwise. If the GMM probabilities are well estimated, combining one noisy sample with one clean sample leads to a large weight for the clean sample and a small weight for the noisy sample. Clean-clean and noisy-noisy cases remain similar to a classical mixup with weights around 0.5.
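The weighted convex combination of Equation 11 can be sketched in a couple of lines; the interpolation coefficient is driven by the GMM clean-probabilities rather than a Beta-sampled lambda as in classical mixup:

```python
import numpy as np

def weighted_mixup(x_p, x_q, w_p, w_q):
    """Dynamic-bootstrapping mixup combination (Eq. 11).
    w_p, w_q: GMM probabilities that x_p, x_q are correctly labeled."""
    lam = w_p / (w_p + w_q)         # clean sample dominates the mixture
    return lam * x_p + (1.0 - lam) * x_q
```

The same coefficient is applied to the pair of losses l_p and l_q, so a likely-clean sample dominates both the mixed input and the mixed loss.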
The dynamic bootstrapping for ELR is derived by replacing the CE term with this corrected version. Regarding the robust loss function NFL+RCE, the two losses have to be modified analogously, where q is the one-hot encoding of the label (the zero entries are set to a small value to avoid log(0)).
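The weighting scheme of Eq. (11) can be sketched as follows. `weighted_mixup` is a hypothetical helper, assuming w_p and w_q are the GMM clean-probabilities; the same convex weight is applied to both the inputs and the losses:

```python
import numpy as np

def weighted_mixup(x_p, x_q, w_p, w_q):
    """Mix a pair of samples with weights proportional to their GMM
    clean-probabilities (Eq. 11): the sample more likely to be clean
    dominates the combination. Returns the mixed input and the weight,
    which is reused to combine the two per-sample losses."""
    lam = w_p / (w_p + w_q)
    x = lam * x_p + (1.0 - lam) * x_q
    return x, lam
```

With one clean sample (w_p near 1) and one noisy sample (w_q near 0), lam is close to 1 and the clean sample dominates; two samples of similar probability recover a classical mixup with lam near 0.5.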

E Classification warmup
This section compares the classification accuracy of models trained with and without a warm-up phase after representation learning. The warm-up phase consists of freezing the entire model except for the classification head. Figure 9 depicts the gain in performance brought by the warm-up phase. When using the default values, its inclusion is beneficial only for significant amounts of symmetric noise. Our experiments were performed using only the recommended classifier learning rates detailed in the experimental setup. Using different learning rates for the warm-up phase and for the subsequent training of all weights (encoder and classifier) could have a different impact on the warm-up phase.
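A minimal sketch of such a warm-up, assuming a PyTorch encoder/head split (the function name and module layout are illustrative, not the paper's code):

```python
import torch.nn as nn

def freeze_encoder_for_warmup(encoder: nn.Module, head: nn.Module):
    """Warm-up phase: freeze every encoder weight and train only the
    classification head (a linear probe on the frozen representation).
    Returns the parameter list to hand to the warm-up optimizer."""
    for p in encoder.parameters():
        p.requires_grad = False
    for p in head.parameters():
        p.requires_grad = True
    return [p for p in head.parameters() if p.requires_grad]
```

After the warm-up, unfreezing the encoder's parameters resumes training of the full model.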

F Execution time analysis
In order to estimate our method's computational cost, we compared the execution time of both approaches (performing only the pre-training phase, and the pre-training followed by fine-tuning) with the execution time of the baseline. The factor by which our methods are slower than the baseline is reported in Table 8. We provide similar metrics for the methods reporting this information (i.e. Taks, Co-teaching+, JoCoR). As expected, the pre-training doubles the execution time of the baseline since, in addition to training the classifier, a contrastive learning phase has to be performed beforehand. The entire framework introduces a computational cost 3 to 4.5 times higher. However, all methods leveraging pre-trained models (using, for instance, supervised pre-training) hide a similar computational cost.

G An attempt to prevent overfitting with early stopping
Overfitting is the common weakness of all studied models. Several strategies for understanding and preventing overfitting have been explored: i) analysing the model behaviour on a validation set, ii) identifying the start of the memorization phase using Training Stop Point [17], and iii) characterizing changes in the model using Centered Kernel Alignment [19].
A clean validation set is generally used to find the best moment for early stopping and to estimate hyperparameters. However, we assume that clean validation samples are not available. Therefore, the methods must be robust to overfitting and to a wide range of hyperparameter values.
As typical noisy-label settings lack a clean reference set, we contrasted the behavior of the model on a corrupted validation set with that on a clean test set, where overfitting can easily be identified. Train/validation sets were generated using 5 cross-validation folds. Figure 10 (panel a) depicts the evolution of accuracy scores on the corrupted train/validation sets as well as on the test set. After the first 50 epochs, the model starts overfitting as the test accuracy drops by 10%. The accuracy on the corrupted train set continues to increase as the model memorizes the input labels. On the corrupted validation set, a plateau followed by a loss of performance is indicative of the same phenomenon, but without always being aligned with the overfitting phase observed on the test set. The memorization of the train-set labels prevents the model from generalizing to the corrupted validation set and explains the significant difference between the train and validation accuracies.

Centered Kernel Alignment (CKA) [19] provides a similarity index comparing representations between layers of different trained models. In particular, CKA can consistently identify correspondences between layers trained from different initializations. The objective is twofold: i) observing whether a specific behavior can be identified during overfitting, and ii) comparing the CKA values with and without contrastive pre-training. The CKA index is computed at three different locations in the network: the input layer, the middle of the network, and the final layer. Figure 13 shows the CKA similarity computed between the initialization/pre-trained model and the same layer at different epochs during training. It is interesting to note that the first layer of the pre-trained model remains very similar to the same layer computed by contrastive learning. Such behavior was expected in order to improve the robustness against noisy labels: if contrastive learning can extract good representations for semi-supervised or transfer learning, staying close to such representations can also help to avoid learning noisy labels. As expected, all layers of the model trained from a random initialization vary much more during training.
The training phase of the pre-trained model reaches its maximum accuracy around epoch 50, but the CKA values of the middle and last layers continue to drop until epoch 130. On the other hand, the CKA values of the randomly initialized model remain stable after 150 epochs, when the test accuracy almost reaches its maximum value. At first glance, the CKA behavior cannot be related to overfitting.
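For reference, the linear variant of the CKA index can be sketched in a few lines of NumPy (a simplified version of the similarity from [19]; the function name and input shapes are illustrative):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA similarity between two representation matrices of shape
    (n_samples, n_features). Returns a value in [0, 1]; 1 means the two
    representations are identical up to rotation and isotropic scaling."""
    X = X - X.mean(axis=0)  # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, 'fro') ** 2
    norm_x = np.linalg.norm(X.T @ X, 'fro')
    norm_y = np.linalg.norm(Y.T @ Y, 'fro')
    return hsic / (norm_x * norm_y)
```

In the experiments above, X and Y would be the activations of the same layer extracted at two different training epochs (or from the pre-trained versus the current model) on a common batch of inputs.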
None of the studied approaches provides a solution preventing overfitting across all our experiments, and this problem remains an open question.

Figure 1 :
Figure 1: Top-1 test accuracy for a ResNet18 trained on the CIFAR-100 dataset with a symmetric noise of 80% for three losses: Cross Entropy (CE), Normalized Focal Loss + Reverse Cross Entropy (NFL+RCE), and Early Learning Regularization (ELR).

Figure 2 :
Figure 2: Overview of the framework consisting of two phases: pre-training (panel a) and fine-tuning (panel b). After a contrastive learning phase (a1), a classifier (a2) is trained to predict train-set pseudo-labels y. The fine-tuning phase uses y as a new ground truth. First, a GMM model (b1) predicts the probability of correctness for each sample, used as a corrective weight factor in a supervised contrastive training (panel b2). The final predictions y_final are produced by the classifier (b3).

Figure 4 :
Figure 4: Accuracy on the entire training set (in blue) compared to the clean train subset (in red); the relative size of the clean subset (in %) is depicted in green. The example is performed on CIFAR100 with 40% symmetric noise.

Figure 6 :
Figure 6: Accuracy gain when performing the fine-tuning phase after the pre-training block (computed as the difference between fine-tuning accuracy and pre-training accuracy). The plot gathers the results for all noise ratios on CIFAR10 (panels a, b) and CIFAR100 (c, d) with symmetric (first column) and asymmetric (second column) noise.

Figure 7 :
Figure 7: Top-1 accuracy gain for the dynamic bootstrapping on CIFAR100 with asymmetric (a) and symmetric noise (b). Dynamic bootstrapping is an alternative to the proposed fine-tuning phase.

Figure 9 :
Figure 9: Gain in performance when using a supplementary classifier warm-up phase before training the entire model on CIFAR100 with symmetric (panel a) and asymmetric noise (panel b).

Figure 12 :
Figure 12: Evolution of train loss and test accuracy on CIFAR, 60% symmetric noise. The theoretical conditions of higher variance on the train loss, associated with the start of the memorization phase as suggested by TSP, are not fulfilled.
CKA from an encoder pre-trained with contrastive learning. CKA from a random initialization.

Figure 13 :
Figure 13: CKA similarity for a model trained with NFL+RCE loss function on CIFAR100 with 80% noise.

Table 1 :
Results on both CIFAR10 and CIFAR100 using symmetric noise (0.2 - 0.8) and asymmetric noise (0.2 - 0.4). We compare training from scratch with training from a pre-trained representation. Best scores are in bold for each noise scenario and each loss.

Table 2 :
Accuracy scores compared with 6 methods (Taks, Co-teaching+, ELR, DivideMix, SELF, and JoCoR) on CIFAR10 (C10) and CIFAR100 (C100). The cases most affected by dropout are presented, with symmetric (S) and asymmetric (A) noise. Top-2 scores are in bold. The best scores are obtained by DivideMix, which surpasses all other techniques. One can note that DivideMix uses a PreAct ResNet18 while we use a classical ResNet18. Moreover, a recent study [31] attempted to replicate these values and reported significantly lower results on CIFAR100 (i.e. 49.5% instead of 59.6% on symmetric data and 50.9% instead of 72.1% on asymmetric data). Our framework compares favourably with the other competing methods, both on symmetric and asymmetric noise.

Table 3 :
Top-1 accuracy for mini-WebVision and Clothing1M. Best scores are in bold for each dataset and each loss. Pre-t represents the pre-training phase, while Fine-tune refers to the results after the fine-tuning step.

Table 6 :
Top-1 accuracy on CIFAR100 with 80% noise. Two contrastive learning frameworks are evaluated for the pre-training: SimCLR and MoCo. The third column gives the accuracy for a classifier with a smaller learning rate.

Table 8 :
Comparison of execution times, reported as a factor with respect to the training time of the baseline (the supervised training of the model with the CE loss). The abbreviation Ours (Pre-t) indicates the pre-training phase, while Ours (Fine-tune) refers to the pre-training phase followed by fine-tuning.