Prediction Consistency Regularization for Learning with Noise Labels Based on Contrastive Clustering

In classification tasks, label noise has a significant impact on model performance, primarily by disrupting prediction consistency and thereby reducing classification accuracy. This work introduces a novel prediction consistency regularization that mitigates the impact of label noise on neural networks by imposing constraints on the prediction consistency of similar samples. However, determining which samples should be considered similar is a primary challenge. We formalize similar-sample identification as a clustering problem and employ twin contrastive clustering (TCC) to address it. To ensure similarity between samples within each cluster, we enhance TCC by adjusting the clustering prior distribution using label information. Based on the adjusted TCC's clustering results, we first construct a prototype for each cluster and then formulate a prototype-based regularization term that enhances prediction consistency with respect to the prototype within each cluster and counteracts the adverse effects of label noise. We conducted comprehensive experiments on benchmark datasets to evaluate the effectiveness of our method under various scenarios with different noise rates. The results demonstrate an improvement in classification accuracy. Subsequent analytical experiments confirm that the proposed regularization term effectively mitigates noise and that the adjusted TCC improves the quality of similar-sample recognition.


Introduction
In recent years, neural network-based methods have achieved unprecedented success in the fundamental machine-learning task of classification [1][2][3]. However, the effectiveness of these models depends on the quality of labeled datasets, which often contain mistakes known as label noise, resulting from various factors [4]. For example, automatically collecting image labels through methods like web scraping cannot guarantee the correctness of all labels [5]. Similarly, in biostatistics, measurement errors are quite common [6]. The large number of parameters in neural networks grants them significant model capacity, but it also makes it easy for the networks to overfit noisy labels, ultimately resulting in poor model performance. Developing methods suitable for learning with noisy labels has significant implications for fields such as image analysis and medical applications [4].
A well-trained model is expected to yield consistent outputs for similar inputs. However, recent work [7] reveals that models trained on datasets with label noise exhibit significant variations in predictions when faced with two different augmentations of the same image. In classification tasks, the consistency between probability distributions can be measured using the cross-entropy function. From this perspective, label noise leads to an abnormally increased cross-entropy between the predicted probability distributions for similar inputs. To address this anomaly, recent research [7,8] suggests introducing regularization terms on top of the classification loss to combat the adverse effects of noise.
These regularization terms, known as prediction consistency regularization terms, aim to minimize prediction variance among similar samples. However, building consistency through regularization relies on identifying samples in the training dataset that share the same label. The mismatch between assigned labels and their true counterparts hinders the accurate identification of all samples sharing the same true label. A more general alternative is to treat samples that are close enough as sharing the same label, effectively transforming the problem from identifying samples with the same label into recognizing similar samples.
Determining sample similarity in datasets with straightforward structures is relatively easy. However, for more complex data types, such as images, this task becomes more challenging. For such complex-structure data, a feasible approach is to map the data into a representation space with a simpler structure and then search for similar samples by analyzing the relationships between these representations. Recently, contrastive learning [9][10][11][12] has gained significant attention as a set of representation learning methods. It can provide representations that are independent of label noise and have the potential to identify similar samples. Nevertheless, contrastive learning is primarily used for unsupervised pretraining, with its core objective being the acquisition of transferable representations. This objective differs significantly from the core goal of classification tasks and brings two potential risks when applying self-supervised learning to label-noise classification: (1) The process of self-supervised representation learning does not involve label information, implying that samples with similar self-supervised representations may not necessarily share the same labels. (2) Mainstream contrastive learning frameworks emphasize obtaining transferable representations; identifying similar samples then requires computing similarities between the representations of all samples, leading to an additional computational burden.
This work proposes the twin-contrastive-clustering-based prediction consistency regularization (TPCR) to effectively handle label noise for image data. The proposed method consists of two main components. On the one hand, to accurately and efficiently identify similar samples and reduce the potential risks associated with self-supervised learning, TPCR adopts twin contrastive clustering (TCC) [12] as the framework for representation learning. We improve TCC by integrating label information, enabling it to produce representations that reflect label consistency, thereby addressing the first potential risk. Since TCC's pretext task involves clustering input samples into different groups, samples belonging to the same cluster can be considered inherently similar without the need for additional calculations, thus avoiding the second potential risk. On the other hand, based on the refined TCC's clustering results, this paper designs a prototype-based regularization method that improves classification consistency within the same cluster by penalizing the cross-entropy between model outputs and the prototypes. Ultimately, these measures help alleviate the adverse effects of label noise on model performance.
The main structure of this paper includes the following sections: In Section 2, Section 2.1 discusses related work on noisy label classification, while Section 2.2 introduces contrastive learning. In Section 3, Section 3.1 introduces the relevant notation; Section 3.2 describes TCC; Section 3.3 presents the adaptive modifications made to TCC; and Section 3.4 presents the proposed regularization term and provides an overview of the overall model training process. Section 4 focuses on the experiments, with Section 4.1 discussing the performance of the proposed method under simulated noise, Section 4.2 presenting the performance on real noisy data, Section 4.3 analyzing the sensitivity to key hyperparameters, and Section 4.4 conducting ablation experiments on the proposed components. Finally, we summarize the paper and discuss future research directions.

Learning with Noisy Labels
We focus on methodologies pertaining to noise-robust loss functions, which align closely with the framework of the proposed method. Ghosh et al. [13] proved that mean-absolute error (MAE)-based loss functions are tolerant to label noise under specific conditions, while the traditional cross-entropy loss exhibits high susceptibility to label noise. The innovative mean-absolute error (IMAE) [14] introduced nonlinear transformations into MAE's weighting scheme through the exponential function, establishing a more effective learning process for extracting meaningful patterns. Expanding the scope of noise-tolerant loss functions, Liu et al. [15] generalize the robustness of existing binary loss functions to accommodate multi-category classification scenarios. Furthermore, the generalized cross-entropy (GCE) [16] introduces a unique perspective by employing the negative Box-Cox transformation as a loss function. The symmetric cross-entropy (SCE) [17] introduces a novel component in the form of reverse cross-entropy, enhancing the conventional loss by promoting symmetry in predictions. The generalized Jensen-Shannon divergence (GJS) [7] is applied to improve sample-level prediction consistency. The neighborhood consistency regularization (NCR) [8] introduces a regularization term aimed at reducing the difference between the prediction of each sample and those of its nearest neighbors. Another approach is embodied by early learning regularization (ELR) [18]. ELR introduces a distinctive regularization term that guides the model towards reproducing its past outputs, building on the early-learning phenomenon [19].
While MAE [13], IMAE [14], GCE [16], and SCE [17] mainly focus on modifying the cross-entropy function and may struggle in extreme label noise conditions, GJS [7], NCR [8], and ELR [18] emphasize prediction consistency, with GJS and ELR concentrating on individual-sample-level consistency and NCR on nearest neighbors, which may not suffice in severe noise scenarios. TPCR instead targets cluster-level prediction consistency, setting it apart from the aforementioned approaches.
In recent years, a notable trend has emerged in the form of contrastive-learning-based methodologies tailored to address the challenges posed by noisy labels. These methods, including C2D [30] and the method in [31], use contrastive learning for pre-trained model initialization. Furthermore, MOIT [32], SelCL [33], Mopro [34], ProtoMix [35], and TCL [36] exploit representations derived from contrastive learning to selectively identify confident samples or generate pseudo-labels. Ctrr [37] introduces a novel contrastive regularization mechanism applied to representations. Finally, the co-learning method [25] fuses label-dependent information from supervised learning with feature-dependent insights derived from contrastive learning, thereby combining the strengths of both paradigms.
These methods are based on the contrastive-instance framework, and the clustering-based framework has not been fully utilized; moreover, the above methods are highly dependent on calculating similarity between samples, which introduces additional computational overhead. In contrast to these methods, TPCR utilizes a clustering-based framework and requires no additional similarity computation between samples.

Method
In this section, we detail our proposed methodology, beginning with an overview of the noisy-label classification problem and notation in Section 3.1, followed by an explanation of twin contrastive clustering (TCC) [12] in Section 3.2. These sections introduce the foundation for TPCR. Section 3.3 presents modifications to TCC informed by label information, while Section 3.4 presents the novel regularization term based on the clustering outcomes of the adjusted TCC.

Problem Formulation
Consider a classification problem with $C$ classes, and denote the input space as $\mathcal{X} \subset \mathbb{R}^{d_1}$ and the label space as $\mathcal{Y} = \{1, 2, \ldots, C\}$. Generally, models are trained on the clean dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, with $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$, and $N$ representing the dataset's sample size. When learning with noisy labels, we only have access to the noisy dataset $\tilde{D} = \{(x_1, \tilde{y}_1), (x_2, \tilde{y}_2), \ldots, (x_N, \tilde{y}_N)\}$, where $\tilde{y}_i \in \mathcal{Y}$ is noisy; that is, for some samples $\tilde{y}_i \neq y_i$, and the label does not correctly reflect the visual content of the corresponding input. During training, only noisy labels are available, and it remains unknown whether $\tilde{y}_i$ is noisy ($\tilde{y}_i \neq y_i$) or clean ($\tilde{y}_i = y_i$). The objective is to train a model that achieves high accuracy on the true labels despite the presence of an unspecified number of noisy labels in the training set.
The neural network model for this classification task is denoted as $f_\theta : \mathcal{X} \to \mathbb{R}^C$, where $\theta$ denotes the trainable parameters of the network. This model captures the conditional probability distribution of $y_i$. Specifically, the model first maps the input $x_i$ to a logits vector $f_\theta(x_i)$, which the softmax function turns into $\hat{\mathbf{y}}_i = (\hat{y}_{i1}, \hat{y}_{i2}, \ldots, \hat{y}_{iC})^\top$, where $\hat{y}_{ic}$ ($1 \le c \le C$) can be viewed as the probability of $x_i$ belonging to the $c$-th category. When learning with noisy labels, this model employs a noisy classification loss function:
$$L_{ce} = \frac{1}{N} \sum_{i=1}^{N} \ell_{ce}(\tilde{\mathbf{y}}_i, \hat{\mathbf{y}}_i) \tag{1}$$
where $\ell_{ce}$ is the cross-entropy function and $\tilde{\mathbf{y}}_i$ is the one-hot vector corresponding to $\tilde{y}_i$. Notably, $\tilde{\mathbf{y}}_i$ and $\hat{\mathbf{y}}_i$ can also represent the probability mass function of a categorical distribution. For the sake of brevity, we will use the term 'probability vector' to refer to the probability mass function of a categorical distribution in the subsequent sections. With label noise, optimizing Equation (1) leads to overfitting the label noise, which reduces the prediction accuracy on clean labels.
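The noisy classification loss described above can be illustrated with a minimal, dependency-free sketch. The function names (`softmax`, `cross_entropy`, `noisy_ce_loss`) are hypothetical, labels are 0-indexed here rather than 1-indexed as in the text, and averaging over the batch is an assumption of this sketch.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, pred, eps=1e-12):
    """l_ce(target, pred) = -sum_c target_c * log(pred_c)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))

def one_hot(label, num_classes):
    """One-hot probability vector for a 0-indexed label."""
    return [1.0 if c == label else 0.0 for c in range(num_classes)]

def noisy_ce_loss(logits_batch, noisy_labels):
    """Average cross-entropy between one-hot noisy labels and softmax outputs."""
    num_classes = len(logits_batch[0])
    losses = [
        cross_entropy(one_hot(y, num_classes), softmax(z))
        for z, y in zip(logits_batch, noisy_labels)
    ]
    return sum(losses) / len(losses)

# Toy batch: two samples, three classes (assumed values for illustration).
logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
loss_value = noisy_ce_loss(logits, noisy_labels=[0, 2])
```

Because the loss only sees the noisy labels, a flipped label pulls the prediction toward the wrong class exactly as strongly as a clean label pulls it toward the right one, which is the overfitting risk the text describes.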

Twin Contrastive Clustering
In order to identify similar samples, we need to obtain representations of the samples and cluster them. This study adopts twin contrastive clustering (TCC) [12] as the contrastive learning framework. Prior to introducing TCC, we first describe the contrastive-instance method that underpins TCC's methodology.
Contrastive learning leverages the unlabeled dataset $D_x = \{x_1, x_2, \ldots, x_N\}$, obtained by ignoring label information from datasets $D$ or $\tilde{D}$. Contrastive learning relies on pretext tasks for supervision [38], broadly categorized into contrastive-instance and clustering-based categories. The contrastive-instance approach involves identifying two augmented versions of the same input as belonging to the same category, serving as single-sample recognition. Specifically, after random augmentations, $x_i$ yields two variants, $x_i^{(1)}$ and $x_i^{(2)}$, which are then transformed by a neural network model $h_\phi$ into the representations $z_i = h_\phi(x_i^{(1)})$ and $v_i = h_\phi(x_i^{(2)})$. The probability of $x_i$ being identified as itself is expressed as:
$$p_1(i \mid x_i) = \frac{\exp(v_i^\top z_i / \tau)}{\sum_{j=1}^{N} \exp(v_j^\top z_i / \tau)} \tag{2}$$
Here, $\tau$ represents the temperature hyperparameter, which controls the concentration level [12]. Contrastive-instance methods construct the loss function via Equation (2) and thereby learn valuable representations.
Moving to TCC, after generating instance-level representations, it clusters samples and then formulates a loss function centered around the clustering outcomes. This loss function combines cluster-level and instance-level parts. We first introduce the clustering process of TCC. To allocate the $N$ samples in $D_x$ into $K$ clusters, TCC employs learnable clustering centers $\{\mu_k\}_{k=1}^{K}$ with $\|\mu_k\|_2 = 1$, where $\|\cdot\|_2$ refers to the $L_2$-norm. Using the dot product to measure similarity between $z_i$ and $\mu_k$, the membership probability of $x_i$ in cluster $k$ is calculated as:
$$\pi_{ik} = \frac{\exp(\mu_k^\top z_i)}{\sum_{k'=1}^{K} \exp(\mu_{k'}^\top z_i)} \tag{3}$$
For convenience, we use $\pi_i = (\pi_{i1}, \pi_{i2}, \ldots, \pi_{iK})^\top$ to denote the clustering assignment probabilities of $x_i$. Note that $\pi_{ik}$ also reflects the degree of relevance of $x_i$ to the $k$-th cluster. With it being the aggregation weight, the representation $r_k$ for the $k$-th cluster can be expressed as follows:
$$r_k = \frac{\sum_{i} \pi_{ik} z_i}{\left\| \sum_{i} \pi_{ik} z_i \right\|_2} \tag{4}$$
Here, $L_2$-normalization is adopted because normalized representations benefit contrastive learning [27]. Analogous to Equation (2), TCC employs the representations $v_i$ to generate an additional set of cluster-level representations, denoted as $\tilde{r}_k$. Utilizing both $r_k$ and $\tilde{r}_k$, TCC's cluster-level contrastive objective is formulated as:
$$L_r = -\frac{1}{K} \sum_{k=1}^{K} \log \frac{\exp(\tilde{r}_k^\top r_k / \tau)}{\sum_{k'=1}^{K} \exp(\tilde{r}_{k'}^\top r_k / \tau)} \tag{5}$$
Minimizing this equation enhances the similarity of representations of the same cluster ($r_k$ and $\tilde{r}_k$), while reducing the similarity across different clusters ($r_k$ and $\tilde{r}_{k'}$, $k \neq k'$), thereby fostering meaningful representations and clustering outcomes.
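The clustering step described above (softmax membership over dot products with cluster centers, membership-weighted and L2-normalized cluster representations, and a cluster-level contrastive loss between the two views) can be sketched in plain Python. This is an illustrative sketch, not TCC's reference implementation: the helper names are hypothetical, the loss is averaged over clusters by assumption, and reusing the same membership probabilities for the second view is a simplification made here.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [v / norm for v in vec]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cluster_probs(z, centers):
    """pi_ik: softmax over dot products between z_i and each center mu_k."""
    return softmax([dot(mu, z) for mu in centers])

def cluster_representations(reps, pis):
    """r_k: L2-normalized, pi_ik-weighted sum of instance representations."""
    num_clusters, dim = len(pis[0]), len(reps[0])
    out = []
    for k in range(num_clusters):
        agg = [sum(pi[k] * r[d] for r, pi in zip(reps, pis)) for d in range(dim)]
        out.append(l2_normalize(agg))
    return out

def cluster_contrastive_loss(reps, reps_other, tau=0.2):
    """Mean over k of -log softmax(reps_other . reps[k] / tau) at index k."""
    loss = 0.0
    for k, r in enumerate(reps):
        probs = softmax([dot(other, r) / tau for other in reps_other])
        loss -= math.log(probs[k])
    return loss / len(reps)

# Toy data: three samples, two views, two unit-norm cluster centers.
centers = [[1.0, 0.0], [0.0, 1.0]]
zs = [l2_normalize(z) for z in ([1.0, 0.1], [0.9, 0.2], [0.1, 1.0])]
vs = [l2_normalize(v) for v in ([1.0, 0.0], [0.8, 0.3], [0.0, 1.0])]
pis = [cluster_probs(z, centers) for z in zs]
reps_z = cluster_representations(zs, pis)
reps_v = cluster_representations(vs, pis)  # same pi for both views: an assumption
loss = cluster_contrastive_loss(reps_z, reps_v)
```

The loss is small when each cluster representation matches its counterpart from the other view and large when it matches a different cluster, which is the behavior the text attributes to the cluster-level objective.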
In addition to the cluster-level contrastive loss function $L_r$, TCC also contains an instance-level contrastive loss, the evidence lower bound (ELBO) loss, which is derived from the lower bound of $\log p_1(i \mid x_i)$. Denote $p_3(i \mid x_i, k)$ as the instance identification probability within the context of the $k$-th cluster, with $p_0$ denoting the prior, which follows a uniform distribution. The relationship between the instance identification probability $\log p_1(i \mid x_i)$ and its lower bound is captured by the following inequality:
$$\log p_1(i \mid x_i) \ge \mathbb{E}_{k \sim \pi_i}\left[\log p_3(i \mid x_i, k)\right] - KL(\pi_i \| p_0) \tag{6}$$
where $KL(\cdot \| \cdot)$ represents the Kullback-Leibler divergence. The detailed derivation of the inequality can be found in Appendix A. The right-hand side of this inequality, the ELBO, incorporates the clustering probability $\pi_i$ and enhances the clustering performance of TCC. Based on Equation (6), the ELBO loss $L_{elbo}$ for TCC is formulated as:
$$L_{elbo} = \frac{1}{N} \sum_{i=1}^{N} \left( -\mathbb{E}_{k \sim \pi_i}\left[\log p_3(i \mid x_i, k)\right] + KL(\pi_i \| p_0) \right) \tag{7}$$
By minimizing $L_{elbo}$, TCC maximizes the lower bound of $\log p_1(i \mid x_i)$, thereby elevating $p_1(i \mid x_i)$. Based on $L_{elbo}$ and $L_r$, the loss function for TCC is represented as $L_{TCC} = L_r + L_{elbo}$.

Injecting Label Information to TCC
The ELBO loss $L_{elbo}$ is crucial for TCC to generate effective instance-level representations and meaningful clustering results. To align the clustering results more closely with category information, this subsection introduces modifications to $L_{elbo}$.
Note that the KL divergence term $KL(\pi_i \| p_0)$ in Equation (7) involves the clustering prior distribution $p_0$, which is simply set as the discrete uniform distribution for lack of meaningful prior information. To enhance the consistency between clustering and classification, a feasible approach is to replace the non-informative prior distribution with a meaningful clustering distribution derived from labels. To implement this replacement strategy, it is necessary to construct a new clustering prior probability distribution related to label information. This motivates us to reflect on the correspondence between classes and clusters.
Following the established notation, the total numbers of classes and clusters are denoted as $C$ and $K$, respectively. A one-to-one correspondence between classes and clusters is feasible when $C = K$, resulting in clustering outcomes that mirror the classification task, whereby each cluster corresponds to a single class. If $K < C$, a single cluster may encompass multiple categories, diminishing the utility of clustering in identifying similar samples; such configurations are thus excluded from consideration. When $K > C$, a one-to-one correspondence between clusters and classes cannot be achieved. To extend the concept of correspondence, it is possible to make one class correspond to multiple clusters. This is equivalent to splitting one class into several sub-classes and then associating each sub-class with a cluster. Moreover, a small $K$ would pose challenges to TCC training, and $K$ usually takes a larger value. Hence, it can be assumed that $K > C$.
To delineate the one-to-many relationships between classes and clusters, we introduce an alignment matrix $M \in \mathbb{R}^{K \times C}$. Ideally, $M$ is expected to realize the transition from the classification probabilities $\mathbf{y}_i$ to the clustering assignment probabilities $\pi_i$; specifically, $\pi_i = M \mathbf{y}_i$. For the $k$-th element of the clustering assignment probabilities, the relationship $\pi_{ik} = M_{k,\cdot}\, \mathbf{y}_i$ should hold, where $\mathbf{y}_i = (y_{i1}, y_{i2}, \ldots, y_{iC})^\top$ and $M_{k,\cdot}$ represents the $k$-th row of $M$. For each $\pi_{ik}$, the contribution of $y_{ic}$ to $\pi_{ik}$ is determined by the $c$-th element of $M_{k,\cdot}$, denoted as $M_{k,c}$. Specifically, if cluster $k$ is associated with class $c$, then $y_{ic}$ should influence $\pi_{ik}$, signifying that $M_{k,c} > 0$; otherwise, $M_{k,c} = 0$.
To construct the alignment matrix $M$, we need to clarify the class correspondence for each cluster. Intuitively, the class corresponding to a cluster should be the majority class label among the samples within that cluster. We refer to the label of the majority of samples as the main class of this cluster. In label-noise classification tasks, there is no access to the true class labels $y_i$ and the corresponding vectors $\mathbf{y}_i$ for individual samples. Thus, we resort to using $\hat{\mathbf{y}}_i$ to deduce the class label for each sample, thereby determining the main class for each cluster. Specifically, for the samples within the $k$-th cluster, we estimate the class index for each sample as $\arg\max_c \hat{y}_{ic}$. By aggregating these estimates, we identify the most frequent class, denoted as $m_k$, which is considered the main class for the $k$-th cluster. Upon estimating the main class for all clusters, the alignment matrix $M$ is formulated as:
$$M'_{k,c} = \mathbb{I}(m_k = c), \qquad M_{k,c} = \frac{M'_{k,c}}{\sum_{k'=1}^{K} M'_{k',c}} \tag{8}$$
Here, $M'_{k,c}$ indicates the relevance of the $k$-th cluster to the $c$-th class, and $M_{k,c}$ is the result of column-wise normalization of $M'_{k,c}$, which ensures that $M \hat{\mathbf{y}}_i$ still satisfies the conditions of a probability distribution.
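The construction of the alignment matrix (majority vote per cluster, indicator matrix, column-wise normalization) can be sketched as follows. Function names are hypothetical, labels and clusters are 0-indexed, and the handling of empty clusters (`None` main class) and of classes that no cluster claims (an all-zero column) is an assumption of this sketch that the text does not specify.

```python
from collections import Counter

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def main_classes(cluster_ids, pred_probs, num_clusters):
    """m_k: most frequent predicted class among the samples of cluster k."""
    votes = [Counter() for _ in range(num_clusters)]
    for k, probs in zip(cluster_ids, pred_probs):
        votes[k][argmax(probs)] += 1
    # Empty clusters get None (assumed convention).
    return [v.most_common(1)[0][0] if v else None for v in votes]

def alignment_matrix(main, num_classes):
    """Indicator of (m_k == c), normalized so each non-empty column sums to 1."""
    raw = [[1.0 if m == c else 0.0 for c in range(num_classes)] for m in main]
    col_sums = [sum(row[c] for row in raw) for c in range(num_classes)]
    return [
        [raw[k][c] / col_sums[c] if col_sums[c] > 0 else 0.0
         for c in range(num_classes)]
        for k in range(len(main))
    ]

def apply_alignment(matrix, probs):
    """Map a class probability vector to a cluster prior: M @ y_hat."""
    return [sum(m * p for m, p in zip(row, probs)) for row in matrix]

# Toy example: six samples, three clusters, two classes.
cluster_ids = [0, 0, 0, 1, 2, 2]
preds = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.2, 0.8], [0.7, 0.3], [0.6, 0.4]]
main = main_classes(cluster_ids, preds, num_clusters=3)
M = alignment_matrix(main, num_classes=2)
prior = apply_alignment(M, [0.8, 0.2])
```

In the toy example, class 0 is the main class of two clusters, so its probability mass is split evenly between them, and the resulting cluster prior still sums to one.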
The KL divergence term in $L_{elbo}$ can be transformed as follows:
$$KL(\pi_i \| p_0) = \ell_{ce}(\pi_i, p_0) - H(\pi_i) \tag{9}$$
where $H(\cdot)$ denotes the entropy. In the KL divergence term, only the cross-entropy term involves $p_0$. With $M \hat{\mathbf{y}}_i$ as the new prior, we replace the cross-entropy term with $\ell_{ce}(\pi_i, M \hat{\mathbf{y}}_i)$.
Simply replacing the prior distribution introduces new pitfalls, since $\hat{\mathbf{y}}_i$ may be misled by noise. To mitigate the impact of noisy labels, a confidence threshold $\gamma$ is introduced to filter out significantly erroneous label information. Specifically, we introduce an indicator function $\mathbb{I}(\max_c \hat{y}_{ic} > \gamma)$. Only $\hat{\mathbf{y}}_i$ satisfying $\max_c \hat{y}_{ic} > \gamma$ is used to guide clustering.
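The confidence-gated, label-informed prior can be illustrated with a small sketch. It assumes that the gated cross-entropy term simply replaces the cross-entropy part of the KL divergence while the entropy part is kept, and that unconfident samples contribute only the entropy part; both choices are assumptions of this sketch, as are the function names.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """l_ce(p, q) = -sum_k p_k * log(q_k)."""
    return -sum(a * math.log(b + eps) for a, b in zip(p, q))

def entropy(p, eps=1e-12):
    """H(p) = -sum_k p_k * log(p_k)."""
    return -sum(a * math.log(a + eps) for a in p)

def gated_prior_term(pi, y_hat, M, gamma):
    """I(max_c y_hat_c > gamma) * l_ce(pi, M y_hat) - H(pi).

    M y_hat is treated as a constant target; in an autograd framework it
    would be detached so that only pi receives gradients.
    """
    prior = [sum(m * y for m, y in zip(row, y_hat)) for row in M]
    confident = max(y_hat) > gamma
    ce = cross_entropy(pi, prior) if confident else 0.0
    return ce - entropy(pi)

# Toy alignment matrix for three clusters and two classes (assumed values).
M = [[0.5, 0.0], [0.0, 1.0], [0.5, 0.0]]
y_hat = [0.8, 0.2]                      # confident prediction for class 0
matched = gated_prior_term([0.4, 0.2, 0.4], y_hat, M, gamma=0.5)
mismatched = gated_prior_term([0.1, 0.8, 0.1], y_hat, M, gamma=0.5)
filtered = gated_prior_term([0.1, 0.8, 0.1], y_hat, M, gamma=0.9)
```

When the sample passes the threshold, the term equals the KL divergence between the cluster assignment and the label-derived prior, so it vanishes when the two agree and grows as they diverge.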
Replacing the cross-entropy term in Equation (9) and applying the confidence filter, we obtain:
$$\mathbb{I}\left(\max_c \hat{y}_{ic} > \gamma\right) \ell_{ce}(\pi_i, M \hat{\mathbf{y}}_i) - H(\pi_i) \tag{10}$$
Compared to $KL(\pi_i \| p_0)$, Equation (10) introduces category information as a prior into the clustering process, facilitating category-consistent clustering outcomes. It is crucial to note that, during optimization, $M \hat{\mathbf{y}}_i$ is treated as fixed, and only $\pi_i$ is updated. The modified ELBO loss is expressed as:
$$L'_{elbo} = \frac{1}{N} \sum_{i=1}^{N} \left( -\mathbb{E}_{k \sim \pi_i}\left[\log p_3(i \mid x_i, k)\right] + \mathbb{I}\left(\max_c \hat{y}_{ic} > \gamma\right) \ell_{ce}(\pi_i, M \hat{\mathbf{y}}_i) - H(\pi_i) \right) \tag{11}$$
Another key element of $L'_{elbo}$ lies in the construction of $p_3(i \mid x_i, k)$. In the original TCC, $p_3(i \mid x_i, k)$ is parameterized with a small neural network. However, this small neural network adds extra parameters, potentially leading to instability in model training. To enhance the stability of the training process, we utilize the concatenation operation to generate joint representations, which are subsequently employed to parameterize $p_3(i \mid x_i, k)$. Additionally, the expectation computation involves the reparameterization trick [39,40]. Specific details can be found in Appendix B. Finally, the modified TCC loss takes the following form:
$$L'_{TCC} = L_r + L'_{elbo} \tag{12}$$

Prediction Consistency Regularization Based on Clustering
In the previous section, we adjusted the ELBO loss of TCC to incorporate classification information into the clustering process. In this section, we present a novel regularization term based on the clustering results.
The purpose of the regularization term is to eliminate class prediction discrepancies among similar samples. In the clustering process of TCC, by evaluating the similarity between representations and $\mu_k$, samples with similar representations are aggregated into the $k$-th cluster. Consequently, from the perspective of representations, samples belonging to the same cluster can be regarded as similar samples. Therefore, the regularization term should ensure that all samples within a cluster have similar class predictions. To achieve this, the most intuitive approach is to constrain differences in class predictions between all pairs of samples. This intuitive approach involves a high computational cost, whereas a prototype-based approach is more efficient. To develop the prototype-based regularization term, we first generate the prediction center for each cluster and then encourage all class predictions within a cluster to be close to the corresponding prediction center.
To generate the prediction center, $\hat{\mathbf{y}}_i$ is utilized as the substitute clean label. Note that $\hat{\mathbf{y}}_i$ may contain errors, and not all clustering results have the same reliability. We adopt a weighted averaging approach to overcome potentially misleading information. Specifically, for $x_i$, we denote its cluster index as $a_i = \arg\max_k \pi_{ik}$ and the corresponding cluster confidence as $\alpha_i = \pi_{i a_i}$. Let $\mathcal{M}_k$ be the set of indices of samples belonging to the $k$-th cluster; then the prediction center for the $k$-th cluster is defined as:
$$\nu_k = \frac{\sum_{i \in \mathcal{M}_k} \alpha_i \hat{\mathbf{y}}_i}{\sum_{i \in \mathcal{M}_k} \alpha_i} \tag{13}$$
This definition implies that $\nu_k$ remains a probability mass function. Therefore, $\nu_k$ can also be understood as an aggregated classification distribution, where the clustering confidence $\alpha_i$ is the aggregation weight. Based on the clustering prediction centers, we construct the regularization term as follows:
$$R = \frac{1}{\sum_{i=1}^{N} \alpha_i} \sum_{i=1}^{N} \alpha_i\, \ell_{ce}(\nu_{a_i}, \hat{\mathbf{y}}_i) \tag{14}$$
Here, $\ell_{ce}$ represents the cross-entropy function, $a_i$ is the clustering assignment for $x_i$, and $\nu_{a_i}$ is the prediction center of the $a_i$-th cluster. Alternative metrics such as the inner product [18] could be utilized to quantify the disparity between $\hat{\mathbf{y}}_i$ and its associated prediction center. Equation (14) is also formulated in a weighted averaging manner, which allows samples with higher clustering confidence to have a greater impact and helps mitigate the potential impact of clustering errors. Finally, we obtain the following overall loss:
$$L = L_{ce} + L'_{TCC} + \lambda R \tag{15}$$
where $L_{ce}$ is the classification loss based on noisy labels, $L'_{TCC}$ is the adjusted TCC loss, $R$ is the regularization term, and $\lambda$ is the regularization strength parameter.
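The prototype-based regularizer can be sketched as below: hard cluster assignments and confidences are read off the membership probabilities, prediction centers are confidence-weighted averages of the class predictions, and the penalty is a confidence-weighted cross-entropy to the assigned center. Normalizing by the sum of confidences and using the center as the cross-entropy target are assumptions of this sketch, and the function names are hypothetical.

```python
import math

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def cross_entropy(target, pred, eps=1e-12):
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))

def prediction_centers(pis, y_hats):
    """nu_k: alpha_i-weighted mean of y_hat_i over samples assigned to cluster k."""
    num_clusters, num_classes = len(pis[0]), len(y_hats[0])
    assign = [argmax(pi) for pi in pis]               # a_i
    alpha = [pi[a] for pi, a in zip(pis, assign)]     # alpha_i = pi_{i, a_i}
    sums = [[0.0] * num_classes for _ in range(num_clusters)]
    weights = [0.0] * num_clusters
    for a, w, y in zip(assign, alpha, y_hats):
        weights[a] += w
        for c in range(num_classes):
            sums[a][c] += w * y[c]
    # Clusters without samples have no center (None), an assumed convention.
    centers = [
        [s / weights[k] for s in sums[k]] if weights[k] > 0 else None
        for k in range(num_clusters)
    ]
    return centers, assign, alpha

def consistency_regularizer(pis, y_hats):
    """Confidence-weighted average of l_ce(nu_{a_i}, y_hat_i)."""
    centers, assign, alpha = prediction_centers(pis, y_hats)
    total = sum(w * cross_entropy(centers[a], y)
                for a, w, y in zip(assign, alpha, y_hats))
    return total / sum(alpha)

# Toy example: three samples, two clusters, two classes.
pis = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]]
consistent = [[0.6, 0.4], [0.6, 0.4], [0.1, 0.9]]
inconsistent = [[0.9, 0.1], [0.3, 0.7], [0.1, 0.9]]
r_consistent = consistency_regularizer(pis, consistent)
r_inconsistent = consistency_regularizer(pis, inconsistent)
```

In the toy example, the two samples in the first cluster either agree or disagree in their predictions, and the regularizer is larger in the disagreeing case, which is the behavior the regularization term is meant to penalize.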
The proposed regularization term relies on the quality of clustering. However, ensuring high-quality clustering during the initial stages of training is often challenging. To prevent the adverse effects of poor clustering results, we introduce a warm-up phase during which the objective function does not include the regularization term. Our training framework is summarized in Algorithm 1. To improve the alignment between clustering and classification while reducing the number of parameters, prior studies frequently share part of the parameters between $f_\theta$ and $h_\phi$. This choice is also adopted in this work. More precisely, $h_\phi$ is structured as an encoder built on the backbone network, while $f_\theta$ is the composition of the same backbone network and a classification head.
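The warm-up behavior can be summarized in a few lines. This is only a sketch of how the objective might be switched: the function name and argument names are hypothetical, and the scalar inputs stand in for the already-computed classification loss, adjusted TCC loss, and regularization term.

```python
def total_loss(l_ce, l_tcc, reg, epoch, lam=1.0, warmup_epochs=50):
    """Combine the three terms, omitting the regularizer during warm-up."""
    if epoch < warmup_epochs:
        # Warm-up: clustering is not yet reliable, so R is left out.
        return l_ce + l_tcc
    return l_ce + l_tcc + lam * reg

# Example values (assumed): the regularizer only contributes after warm-up.
early = total_loss(1.0, 2.0, 4.0, epoch=10)
late = total_loss(1.0, 2.0, 4.0, epoch=50, lam=0.5)
```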

Experiment
In this section, we present a series of experiments using synthetic and real-world noisy datasets to confirm the effectiveness of our approach.

Evaluation on Synthetic Noise
We assess the performance of our method on two synthetic noisy datasets, namely CIFAR-10 and CIFAR-100 [41]. Each of these datasets comprises 50,000 training images and 10,000 test images, all with dimensions of 32 × 32 × 3. CIFAR-10 consists of 10 distinct classes, while CIFAR-100 contains 100 classes. We consider two types of synthetic noisy labels, symmetric and asymmetric noise, following the conventions set by previous studies [7,18]. Symmetric noise replaces the labels of a predefined fraction of the training set, known as the noise rate, with randomly chosen labels. Asymmetric noise, on the other hand, takes class semantics into account, and labels are only changed to similar classes. For CIFAR-10, label flips are performed based on mappings such as "truck → automobile, bird → airplane, deer → horse, cat → dog". Meanwhile, in CIFAR-100, label flips occur within superclasses in a circular fashion. Our experiments cover various levels of noise. Symmetric noise rates include {0.2, 0.4, 0.6, 0.8}, while asymmetric noise rates include {0.2, 0.3, 0.4}.
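The two synthetic noise types can be illustrated with a short sketch. It is not the authors' data pipeline: the function names are hypothetical, the symmetric variant draws the replacement uniformly over all classes (so a sample may keep its original label), and the asymmetric variant uses only the four CIFAR-10 flips quoted above with the standard CIFAR-10 class indices.

```python
import random

# Standard CIFAR-10 indices: airplane=0, automobile=1, bird=2, cat=3,
# deer=4, dog=5, frog=6, horse=7, ship=8, truck=9.
CIFAR10_FLIPS = {9: 1, 2: 0, 4: 7, 3: 5}  # truck->automobile, bird->airplane, ...

def symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """With probability noise_rate, replace a label by a uniformly random class."""
    rng = random.Random(seed)
    return [
        rng.randrange(num_classes) if rng.random() < noise_rate else y
        for y in labels
    ]

def asymmetric_noise(labels, flips, noise_rate, seed=0):
    """With probability noise_rate, flip a label to its semantically similar class."""
    rng = random.Random(seed)
    return [
        flips[y] if y in flips and rng.random() < noise_rate else y
        for y in labels
    ]

clean = [9, 2, 4, 3, 0, 6]
noisy_sym = symmetric_noise(clean, num_classes=10, noise_rate=0.4)
noisy_asym = asymmetric_noise(clean, CIFAR10_FLIPS, noise_rate=0.4)
```

Classes without an entry in the flip table are never corrupted by the asymmetric variant, which is why asymmetric noise rates are usually kept below 0.5.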
For CIFAR, we use ResNet-34 [42] as the backbone network, and the dimension of the output is 128. The classification heads are single-layer networks. We employ the SGD optimizer with a momentum of 0.9 and apply a cosine learning rate decay strategy. The initial learning rate is set at 0.1, and the final learning rate is set at 0.0001. The weight decay is set at $5.0 \times 10^{-4}$. We use a batch size of 256 for all experiments. The temperature parameter $\tau$ in $L'_{TCC}$ is set at 0.2. Before utilizing TPCR, the network is warmed up for 50 epochs. Including the warm-up stage, the network is trained for 350 epochs on CIFAR.
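The cosine learning rate decay between the stated initial and final rates can be written as a small function. This is a generic sketch of cosine annealing, not the authors' code: the function name is hypothetical, the schedule is assumed to be stepped once per epoch, and it is assumed to span all 350 epochs including warm-up.

```python
import math

def cosine_lr(epoch, total_epochs=350, lr_init=0.1, lr_final=1e-4):
    """Cosine decay from lr_init at epoch 0 to lr_final at the last epoch."""
    progress = epoch / (total_epochs - 1)
    return lr_final + 0.5 * (lr_init - lr_final) * (1.0 + math.cos(math.pi * progress))

# Learning rate at the start, at the end of the 50-epoch warm-up, and at the end.
schedule = [cosine_lr(e) for e in (0, 50, 349)]
```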
The batch size of 256 poses a limitation for clustering and contrastive learning. To address this constraint, we use memory banks [9,12] to compute $L'_{TCC}$. For individual representations, the memory bank's size is 25,600. For cluster-level representations, the size of the memory bank is set as 100 × K. Following previous work [10,25], we use random crop, random horizontal flip, and color jitter as augmentation strategies.
For CIFAR, we set the threshold γ at 0.2 in a quantile style. The number of clusters K is set at 160 for CIFAR-10 and 200 for CIFAR-100. For CIFAR-10, λ is set as 1.0 and 0.25 for asymmetric and symmetric noise, respectively. For CIFAR-100, λ is set as 1.0 and 0.5 for asymmetric and symmetric noise, respectively.
We compare our method to other relevant methods: (1) Standard CE; (2) Forward [43]; (3) GCE [16]; (4) SCE [17]; (5) ELR [18]; (6) GJS [7]; (7) Co-learning [25]. Except for Standard CE, each method employs noise-robust loss functions. Specifically, ELR and GJS are associated with prediction consistency regularization techniques, whereas co-learning utilizes a contrastive learning framework. We re-implement ELR, GJS, and co-learning using publicly available code. To ensure a fair comparison, we present the results of GJS without using RandAug and CutOut data augmentations. All methods employ ResNet-34 [42] as the backbone network. All the experiments are repeated five times with different random seeds, and we report the mean and standard deviation of the best test accuracy. To further demonstrate the efficacy of TPCR, we also report the mean and standard deviation at the last epoch, denoted as TPCR(f).
Tables 1 and 2 present the test accuracies for CIFAR-10 and CIFAR-100, respectively. As illustrated in Tables 1 and 2, TPCR exhibits competitive performance when compared to other state-of-the-art (SOTA) methods on the CIFAR datasets, thus affirming its effectiveness across various noise scenarios. In particular, for both CIFAR-10 and CIFAR-100, TPCR's performance is on par with that of ELR and GJS at low noise levels. However, in the presence of high noise levels, TPCR outperforms ELR [18] and GJS [7].

Evaluation on Real-World Noise
We also validated our method on a real-world noisy dataset, Animal-10N [44]. Animal-10N consists of 50,000 training images with complex and confusing appearances, along with 5000 test images, each with a resolution of 64 × 64 × 3 pixels. This dataset comprises 10 classes, with an estimated noise level of approximately 8%. The experimental setting on Animal-10N is the same as in the experiments on CIFAR-10, except for λ = 0.75. We compare our method to other related methods: (1) Standard CE; (2) Decoupling [20]; (3) Co-teaching [21]; (4) Co-teaching+ [22]; (5) JoCoR [23]; (6) Co-learning [25]. Except for Standard CE, the other methods rely on the integration of multiple models or tasks, akin to TPCR. We run TPCR five times and calculate the mean and standard deviation of the best accuracy. We also report the mean and standard deviation of the accuracy at the last epoch (denoted as TPCR(f)). The results of the other methods are taken from [25]. All methods use ResNet-34 [42] as the backbone. As shown in Table 3, TPCR surpasses other SOTA methods on Animal-10N, validating the effectiveness of TPCR in real-noise scenarios.

Sensitivity of Hyperparameters
The proposed TPCR involves two crucial hyperparameters: λ and K. λ controls the strength of the regularization term. A λ that is too small may prove insufficient for effectively combating noise, while an excessively large λ could obscure valuable information contained within the noisy labels. On the other hand, K controls the number of clusters. A K that is too small can lead to the collapse of the contrastive learning process, which is detrimental to clustering. Moreover, a small K may fail to guarantee the quality of clusters. Conversely, an excessively large K can result in a limited number of samples within each cluster, diminishing the effectiveness of the regularization term.
We conducted an analysis to assess the influence of the regularization strength λ on classification results under 0.4 asymmetric noise (abbreviated as @A.4) and 0.8 symmetric noise (abbreviated as @S.8) settings for both the CIFAR-10 and CIFAR-100 datasets. The results, depicted in Figure 1, illustrate the evolution of test accuracy during training with varying values of λ. Notably, the optimal λ value for achieving the highest classification accuracy differs between datasets and noise settings. Generally, neither excessively small nor excessively large values of λ yield the best classification accuracy. Furthermore, Figure 1 reveals that different data settings exhibit varying degrees of sensitivity to λ. Specifically, for CIFAR-10 with 0.4 asymmetric noise, λ in the range of {0.5, 1.0, 1.5} achieves comparable classification outcomes. In contrast, for CIFAR-100 with 0.8 symmetric noise, the preferred value of λ is 0.5. These differences in sensitivity to λ underscore the varying levels of difficulty in mitigating label noise across different scenarios.
Subsequently, we investigated the impact of the number of clusters K on our method's performance on both CIFAR-10 and CIFAR-100 under 0.4 asymmetric and 0.8 symmetric noise settings. The results are presented in Figure 2a-d. As anticipated, excessively small values of K prove detrimental to the final classification accuracy. Beyond that, our method demonstrates resilience to variations in K. For CIFAR-10, high classification accuracies can be achieved with K ∈ {160, 320}; for CIFAR-100, high classification accuracies can be obtained with K ∈ {200, 400}. Moreover, K emerges as a critical parameter that significantly affects clustering performance. To assess the impact of K on clustering performance, we introduce the purity metric, defined as follows:
$$\text{Purity} = \frac{\sum_{i=1}^{N} \alpha_i\, \mathbb{I}\left(y_i = \arg\max_c \nu_{a_i c}\right)}{\sum_{i=1}^{N} \alpha_i} \tag{16}$$
Here, $\mathbb{I}$ is the indicator function, $\alpha_i$ is the clustering confidence for $x_i$, $a_i$ is the clustering assignment, and $\nu_{a_i}$ is the prediction center of the $a_i$-th cluster. The purity metric reflects the degree of consistency between the true labels of individual samples and the prediction center. A purity value of 1 indicates perfect alignment between true labels and cluster predictions, while a value of 0 signifies no consistency between them. The changes in training set purity with varying K are depicted in Figure 2e,f. In CIFAR-10 with 0.8 symmetric noise, selecting a small K, such as 10, results in lower purity. A potential reason is that a small number of clusters cannot guarantee that all samples within a cluster share the same label, which reduces purity. Reduced purity, in turn, affects the efficacy of the regularization term, leading to diminished classification accuracy. Higher clustering purity can be obtained when K ∈ {160, 320}. Combining accuracy and purity, 160 and 320 can be used as the recommended values of K for CIFAR-10.
On CIFAR-100 with 0.8 symmetric noise, increasing the number of clusters from 100 to 200 is accompanied by improvements in purity and classification performance. However, further increasing K may lead to a decline in purity during later stages of training, indicating a reduction in clustering performance. An intriguing observation is that on CIFAR-100 with 0.8 symmetric noise, a decrease in purity does not necessarily result in an equivalent decrease in prediction accuracy. This may be attributed to the fact that cluster prediction centers employ soft labels. Consequently, even if the maximum probability of the prediction center does not align with the true sample label, as long as the probability associated with the true label is sufficiently high, it can still assist in mitigating label noise. Combining purity and accuracy, the most appropriate value for K on CIFAR-100 is 200.
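The purity metric described in this section can be illustrated with a small sketch. The exact formula is not reproduced here, so the code below assumes one plausible reading: a confidence-weighted fraction of samples whose cluster prediction center has its highest probability on the sample's true label. The weighting by α_i, the normalization by the sum of confidences, and the function name `weighted_purity` are all assumptions made for illustration, not the authors' definition.

```python
from typing import Sequence


def weighted_purity(
    true_labels: Sequence[int],
    assignments: Sequence[int],
    confidences: Sequence[float],
    centers: Sequence[Sequence[float]],
) -> float:
    """Assumed form of purity: sum_i alpha_i * I[argmax(nu_{a_i}) == y_i] / sum_i alpha_i.

    true_labels: clean label y_i of each sample
    assignments: cluster index a_i of each sample
    confidences: clustering confidence alpha_i of each sample
    centers:     soft-label prediction center nu_k of each cluster
    """
    matched = 0.0
    total = 0.0
    for y, a, alpha in zip(true_labels, assignments, confidences):
        center = centers[a]
        # Class with the highest probability in the cluster's prediction center.
        predicted = max(range(len(center)), key=center.__getitem__)
        if predicted == y:
            matched += alpha
        total += alpha
    # Return 0 when there is no confidence mass, to avoid dividing by zero.
    return matched / total if total > 0 else 0.0


# Toy data: two clusters over three classes (values are made up).
centers = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
labels = [0, 0, 2, 1]
assignments = [0, 0, 1, 1]

uniform = weighted_purity(labels, assignments, [1.0, 1.0, 1.0, 1.0], centers)
downweighted = weighted_purity(labels, assignments, [1.0, 1.0, 1.0, 0.5], centers)
print(uniform, downweighted)
```

Under this reading the value lies in [0, 1], matching the description above: it equals 1 when every cluster's prediction center agrees with the true labels of its members and 0 when none do. Lowering the confidence of a mismatched sample raises the score, which is the only effect α_i has in this sketch.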

Ablation Study
In this section, we conduct an ablation study to validate the effectiveness of the proposed strategies, including the following configurations: (1) Removal of contrastive learning and using only cross-entropy as the regularization term; (2) No adjustment to the evidence lower bound (ELBO) and using the original TCC loss; (3) Direct replacement of the KL divergence term in ELBO without filtering; (4) Removal of the regularization term. Figure 3a,b show the change in test accuracy during training under various configurations, while Figure 3c,d illustrate the evolution of cluster purity during training when the TCC-like loss is included. As shown in the figures, removing any of these components leads to a decrease in the final classification accuracy, confirming the effectiveness of each proposed component.
To elaborate, not adjusting the prior distribution in the ELBO leads to a decrease in cluster purity, which in turn causes a decline in classification accuracy. Merely substituting the prior distribution without applying any filtering results in a significant decrease in both cluster purity and classification accuracy. This can be attributed to the fact that in the early training stages, when classification predictions are not yet accurate, the prior distribution is also significantly biased, which is detrimental to the learning of TCC. Removing the regularization term initially improves classification accuracy during early training stages because the TCC loss provides some resistance to label noise by constraining representations. However, as training progresses, the TCC loss alone cannot completely mitigate label noise, and classification accuracy eventually decreases as the model overfits the noise. An intriguing observation is that in Figure 3c,d, removing the regularization term improves cluster purity. One possible explanation is that eliminating the regularization term simplifies the optimization objective, leading to enhanced clustering performance.

Representations Evaluation
In this section, we compare the representations generated by TPCR with those of other methods. All methods are trained on CIFAR-10 with 0.8 symmetric noise, and we extract the representations at the output of the backbone networks. We then visualize the training set representations in a 2-D space using t-SNE [45]. Figure 4 displays these representations, with distinct colors representing different classes. Compared with standard cross-entropy (CE) training, all other methods, including TPCR, succeed in learning meaningful representations. Notably, TPCR's representations clearly separate the categories, unlike those of ELR and co-learning, which exhibit areas of overlap among different classes. This highlights TPCR's ability to capture distinct and accurate class representations.
To further quantify the quality of the representations obtained from different methods, we employ these representations for k-nearest neighbor (k-NN) classification. Specifically, we derive representations from both the CIFAR-10 test and training set images, and then assess the test set's classification accuracy using a k-NN classifier based on Euclidean distance within the representation space. To ensure a comprehensive comparison of representation quality, we experiment with multiple values for the number of nearest neighbors and label the training set with clean labels, model-predicted labels, and noisy labels. The results, presented in Table 4, reveal that TPCR consistently achieves the highest classification accuracy across all configurations. This performance underscores the quality of TPCR's representations relative to the other methods.
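The k-NN evaluation described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, assuming that representations are plain lists of floats and that ties in the vote go to the label seen first among the nearest neighbors; the function names `knn_predict` and `knn_accuracy` and the toy data are hypothetical and not taken from the paper.

```python
import math
from collections import Counter
from typing import Sequence


def knn_predict(
    train_feats: Sequence[Sequence[float]],
    train_labels: Sequence[int],
    query: Sequence[float],
    k: int,
) -> int:
    """Predict a label by majority vote among the k nearest training points."""
    # Sort training indices by Euclidean distance to the query (stable sort).
    order = sorted(
        range(len(train_feats)),
        key=lambda i: math.dist(train_feats[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    # On a tie, most_common returns the label that was counted first,
    # i.e. the one belonging to the nearer neighbor.
    return votes.most_common(1)[0][0]


def knn_accuracy(train_feats, train_labels, test_feats, test_labels, k: int) -> float:
    """Fraction of test points whose k-NN prediction equals the test label."""
    correct = sum(
        knn_predict(train_feats, train_labels, q, k) == y
        for q, y in zip(test_feats, test_labels)
    )
    return correct / len(test_labels)


# Toy 2-D "representations" forming two well-separated groups.
train_feats = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
clean_labels = [0, 0, 1, 1]
noisy_labels = [0, 1, 1, 1]  # one training label flipped
test_feats = [(0.2, 0.1), (5.5, 5.2)]
test_labels = [0, 1]

acc_clean = knn_accuracy(train_feats, clean_labels, test_feats, test_labels, k=3)
acc_noisy = knn_accuracy(train_feats, noisy_labels, test_feats, test_labels, k=3)
print(acc_clean, acc_noisy)
```

The same function is called with different label sources for the training set, which mirrors the comparison across clean, model-predicted, and noisy labels. With well-separated representations, the clean-label accuracy stays high, while a flipped training label can change the vote for nearby test points.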

Training Time Analysis
In Table 5, we compare the training time of TPCR with that of three state-of-the-art methods on CIFAR-10 with 0.8 symmetric noise, using a single Nvidia RTX 3090 GPU. TPCR and co-learning are based on contrastive learning and therefore take longer to train than ELR and GJS. Notably, TPCR's design obviates the need to compute distances between sample pairs during training, resulting in shorter training times than co-learning.

Discussion
This paper introduces TPCR as a powerful strategy to handle label noise. TPCR leverages the prediction consistency of multiple instances within the cluster to provide an effective defense mechanism against the adverse effects of noisy labels. To identify similar samples, TPCR has made adjustments to TCC. The modified TCC enables the pretext task of contrastive learning to determine similar samples directly, eliminating the inherent additional computational requirements. Based on the identification of similar samples, we designed the prototypical regularization to guide model training and combat label noise. Experimental results confirm the effectiveness of our method in mitigating noise-induced disruptions. The analysis of experiments demonstrates that the proposed method's effectiveness stems from the accurate identification of similar samples and the effective design of the regularization term.
While TPCR demonstrates a significant impact, this study has some limitations and potential extensions. Primarily, TPCR's application has been confined to image data. Nevertheless, the regularization term proposed has the potential for broad applicability across various types of mislabeled data. The challenge lies in adapting twin contrastive clustering (TCC), currently tailored for image data through contrastive learning, to other data modalities. Exploring how to extend TPCR beyond image data presents a promising avenue for future research. Indeed, recent advances in contrastive learning frameworks for non-image data [46][47][48] suggest the feasibility of such an extension. These developments indicate the potential for applying TPCR to more diverse fields, including gene expression and electronic health records, in forthcoming studies.
Furthermore, the design of TPCR's prediction center and the metric used by the regularization term are relatively straightforward. Constructing better prediction centers and difference metrics is another research direction that could further enhance noise resilience.

Algorithm 1: Training Algorithm
Input: Noisy dataset D, total number of training epochs S, warm-up epochs S_1, µ, f_θ and h_ϕ
Output: Classification network f_θ
for s ← 1 to S do
    if s ≤ S_1 then
        repeat
            Randomly sample a mini-batch B from D;
            Calculate L_ce and L′_TCC on B;
            L ← L_ce + L′_TCC;
            Update θ, ϕ and µ with SGD optimizer;
        until an epoch finished;
    else
        Calculate ν_1, ν_2, ···, ν_K on D;
        repeat
            Randomly sample a mini-batch B from D;
            Calculate L_ce, L′_TCC and R on B;
            L ← L_ce + L′_TCC + λR;
            Update θ, ϕ and µ with SGD optimizer;
        until an epoch finished;
    end
end
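The two-phase schedule of Algorithm 1 can be sketched as a plain Python loop. This is only a skeleton under stated assumptions: the callbacks `compute_losses`, `compute_centers`, and `update` are hypothetical placeholders for the loss computation (L_ce, L′_TCC, R), the construction of the prediction centers ν_1, ..., ν_K, and the SGD step, none of which are implemented here. Mini-batches are also iterated in a fixed order rather than sampled randomly.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple


def run_schedule(
    num_epochs: int,
    warmup_epochs: int,
    lam: float,
    batches: Sequence[Any],
    compute_losses: Callable[[Any, Optional[Any]], Tuple[float, float, float]],
    compute_centers: Callable[[], Any],
    update: Callable[[float], None],
) -> List[Tuple[int, float]]:
    """Run the warm-up phase, then the regularized phase; return (epoch, loss) per step."""
    history = []
    for epoch in range(1, num_epochs + 1):
        centers = None
        if epoch > warmup_epochs:
            # After warm-up, prediction centers are recomputed on the full
            # dataset once at the start of every epoch.
            centers = compute_centers()
        for batch in batches:
            l_ce, l_tcc, reg = compute_losses(batch, centers)
            loss = l_ce + l_tcc
            if centers is not None:
                # The regularizer R is only added after the warm-up epochs.
                loss += lam * reg
            update(loss)  # stands in for the SGD update of theta, phi and mu
            history.append((epoch, loss))
    return history


# Stub callbacks with constant values, purely to show the control flow.
center_calls = []
history = run_schedule(
    num_epochs=3,
    warmup_epochs=1,
    lam=0.5,
    batches=[0, 1],
    compute_losses=lambda batch, centers: (1.0, 0.5, 2.0),
    compute_centers=lambda: center_calls.append(1) or "centers",
    update=lambda loss: None,
)
print(history)
```

With the stub values, the loss is L_ce + L′_TCC during the warm-up epoch and L_ce + L′_TCC + λR afterwards, and the centers are recomputed once per post-warm-up epoch.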

Figure 1. Sensitivity of λ. We show the evolution of test accuracy during training with varying values of λ.

Figure 2. Sensitivity of K. (a-d) show the evolution of test accuracy, while (e,f) show the evolution of purity on the training set.

Figure 3. Ablation study. (a,b) show the evolution of test accuracy, while (c,d) show the evolution of purity on the training set.

Figure 4. t-SNE visualization of learned representations on the CIFAR-10 training set with 0.8 symmetric noise. Each color represents a distinct class, and all points are colored according to clean labels.

Table 1. Test accuracies (%) on CIFAR-10 with different noise settings. All methods use the same backbone, ResNet-34. All results are shown as mean ± std.

Table 2. Test accuracies (%) on CIFAR-100 with different noise settings. All methods use the same backbone, ResNet-34. All results are shown as mean ± std.

Table 4. Test accuracies (%) of k-NN classifier based on representations. k is the number of nearest neighbors.

Table 5. Comparison of total training time in hours on CIFAR-10 with 0.8 symmetric noise.