CoDC: Accurate Learning with Noisy Labels via Disagreement and Consistency

Inspired by the biological nervous system, deep neural networks (DNNs) are able to achieve remarkable performance in various tasks. However, they struggle to handle label noise, which can poison the memorization effects of DNNs. Co-teaching-based methods are popular in learning with noisy labels. These methods cross-train two DNNs based on the small-loss criterion and employ a strategy using either "disagreement" or "consistency" to obtain divergence between the two networks. However, these methods are sample-inefficient for generalization in noisy scenarios. In this paper, we propose CoDC, a novel Co-teaching-based method for accurate learning with label noise via both Disagreement and Consistency strategies. Specifically, CoDC maintains disagreement at the feature level and consistency at the prediction level using a balance loss function. Additionally, a weighted cross-entropy loss is proposed based on information derived from the historical training process. Moreover, the valuable knowledge involved in "large-loss" samples is further exploited by assigning pseudo-labels. Comprehensive experiments were conducted on both synthetic and real-world noise and under various noise types. CoDC achieved 72.81% accuracy on the Clothing1M dataset and 76.96% (Top-1) accuracy on the WebVision1.0 dataset. These superior results demonstrate the effectiveness and robustness of CoDC when learning with noisy labels.


Introduction
Inspired by the biological nervous system, deep neural networks (DNNs) have been successfully utilized in various tasks [1][2][3][4][5][6], particularly in computer vision [7][8][9][10][11][12]. This is made possible by large-scale datasets with accurate labels, although collecting them can be challenging and costly, especially in professional fields that require personnel with relevant expertise to label samples. As a result, researchers often seek cost-effective solutions such as crowdsourcing [13], web-crawling [14], and search engine queries [15] to build datasets. However, these methods can introduce noise into the labels, which can seriously affect the generalization of DNNs during the training process [16].
Learning with noisy labels has become a popular research topic; finding ways to reduce the impact of noisy labels on networks is key to solving the problem. Existing methods include designing robust loss functions and sample selection techniques, which are often combined to create a more effective model. Co-teaching [17], a recent approach, trains two networks that each treat small-loss samples as clean and then use them to update each other's parameters. By starting with different random parameters, the networks are able to filter different types of errors, resulting in a more robust model. Other approaches, such as Co-teaching+ [18] and CoDis [19], have introduced disagreement strategies for sample selection and training, based on the view that the differences between the two networks are beneficial to their overall robustness. JoCoR [31], on the other hand, uses joint training and co-regularization to select samples with more consistent predictions and trains both networks simultaneously, arguing that disagreement strategies may not always select truly clean samples. MLC [20] provides a detailed analysis of the significance of maintaining disagreement.
However, these methods constrain DNNs to either a "disagreement" or a "consistency" perspective, meaning that valuable knowledge involved in "high-loss" samples is ignored. Thus, we propose a new method called CoDC (accurate learning with noisy labels via Disagreement and Consistency). Specifically, we train two networks with a disagreement loss at the feature level to prevent the two networks from degenerating into a single self-training network. At the prediction level, we apply a consistency loss to ensure that the predictions of the two networks remain consistent. The joint application of the disagreement and consistency losses constrains the networks and the sample selection, enabling them to maintain a balance between disagreement and consistency and enhancing the generalization ability of the resulting network.
Furthermore, cross-entropy loss functions have been shown to exhibit an overfitting phenomenon [13], and DNNs have strong learning ability, making it inevitable that they will fit part of the label noise during training. By analyzing the historical training behavior, we re-weight the cross-entropy loss to obtain a modified classification loss, which reduces the overfitting of the network. Instead of discarding noisy samples directly, we use a peer network to guide learning, further improving the model's robustness. The main contributions of this work can be summarized as follows:

• A weighted cross-entropy loss is proposed to mitigate the overfitting phenomenon; the loss weights are learned directly from information derived from the historical training process.

• We propose a novel method called CoDC to alleviate the negative effect of label noise. CoDC maintains disagreement at the feature level and consistency at the prediction level using a balance loss function, which can effectively improve the generalization of networks.

Related Work
Robust loss. Robust loss functions have been widely explored in supervised learning to address the issue of overfitting caused by noisy labels. Several loss functions have been proposed to improve robustness against noisy labels, including Mean Absolute Error (MAE) [13], Reverse Cross-Entropy (RCE) [21], Generalized Cross-Entropy (GCE) [22], Active Passive Loss (APL) [23], Negative Learning+ (NL+), and Positive Learning+ (PL+) [24]. Ghosh et al. [13] demonstrated the effectiveness of symmetric losses in handling noisy labels and introduced the MAE loss. Wang et al. [21] proposed RCE, which performs well in multi-class tasks but may result in underfitting. Zhang et al. [22] designed GCE by combining MAE and CE, further enhancing robustness against noisy labels and mitigating underfitting issues. Ma et al. [23] introduced the APL loss to improve the convergence of the loss function. Kim et al. [24] proposed NL+ and PL+ to address the underfitting problem of robust loss functions by analyzing the loss gradient.
Label correction. Label correction is a crucial technique for improving the performance of deep learning models trained on noisy datasets. Label smoothing [25], a commonly used method, is known to act as a regularization technique that helps the model parameters converge to a small-norm solution [26]. Generating pseudo-labels for noisy samples is another approach that can make full use of the information in all samples. The Meta Pseudo-Labels [27] method treats suspected noisy samples in the training data as unlabeled data for semi-supervised learning. PENCIL [28] uses the given labels to learn from the network and updates and corrects noisy labels through the network's predictions, thereby improving the prediction ability of the model. However, the effectiveness of label correction methods is highly dependent on the model's accuracy: if the model's accuracy is poor, erroneous labels may be generated constantly, resulting in the accumulation of errors. Co-LDL [29] trains on high-loss samples using label distribution learning and enhances the learned representations with a self-supervised module, which further boosts model performance and the use of training data.
Small-loss samples. Existing methods mainly apply the small-loss criterion for sample selection: they consider small-loss samples to be clean and large-loss samples to be noisy, then adopt different strategies for learning. MentorNet [30] pretrains a teacher network to select clean samples, which then guide the student network; however, this method suffers from cumulative error caused by sample selection bias. Co-teaching [17] trains two networks, each of which selects small-loss samples to send to its peer network for learning. However, due to continuous peer-to-peer learning, the parameters of the two networks gradually become consistent. Therefore, Co-teaching+ [18] only chooses the samples with divergent predictions between the two networks for learning. Similarly, CoDis [19] applies a variance regularization term and selects samples with large variance to send to the peer network for learning in order to avoid consistency between the two networks. However, JoCoR [31] assumes that few of the samples with divergent predictions are likely to be clean and therefore trains the two networks jointly using samples with consistent predictions; as a result, the two networks easily converge to a consensus or even produce the same prediction errors. Liu et al. [20] analyzed how an appropriate "difference" between the two networks in Co-teaching benefits model robustness. In contrast, our proposed approach takes both the differences and the consistency of the two networks into account.
In addition, DivideMix [32] fits the losses of all samples with a Gaussian mixture model (GMM), based on the idea that the loss distribution of all samples conforms to two Gaussian components. Clean samples belong to the component with the lower central value, while noisy samples belong to the component with the higher central value, allowing the noisy and clean samples to be divided. LongReMix [33] selects clean samples with high confidence through a two-stage selection method. Kong et al. [34] proposed a penalty label and introduced new sample selection criteria to screen clean samples. However, these methods essentially apply the small-loss sample selection principle.

Preliminary
In this paper, we consider a classification task with $C$ classes. The dataset contains noisy labels: $D = \{(x_i, \tilde{y}_i)\}_{i=1}^{n}$, where $n$ is the sample size, $x_i$ is the $i$-th sample, and $\tilde{y}_i \in [C]$ is the (possibly noisy) label. The true, unobservable label of $x_i$ is denoted by $y_i$. As in Co-teaching, two networks are trained, denoted $f_1 = f(x, \theta_1)$ and $f_2 = f(x, \theta_2)$; they share the same architecture but are initialized with different random parameters. In this method, $p_1 = (p_1^1, p_1^2, \ldots, p_1^C)$ and $p_2 = (p_2^1, p_2^2, \ldots, p_2^C)$ denote the respective prediction probabilities of the two networks for a sample. The aim is to learn a robust classifier that can be trained on $D$ to obtain correct predictions.

Training with Disagreement and Consistency
In this section, we introduce the training functions used in our method, including the modified classification loss, the balance loss, and peer network guided learning.
Modified classification loss. During the initial training phase, Cross-Entropy (CE) is utilized as the classification loss. Considering that the two networks possess distinct learning capabilities, the cross-entropies of both networks are summed, jointly minimizing the distance between each network's predictions on clean samples and the given labels. This approach helps to minimize erroneous information. The classification loss can be represented as follows.
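As a hedged reconstruction (the paper's exact equation may differ), the summed cross-entropy described above can be written with our own notation ($L_{cls}$ and the one-hot indicator are our assumptions) as:

$$ L_{cls}(x_i) = \ell_{CE}\big(p_1(x_i), \tilde{y}_i\big) + \ell_{CE}\big(p_2(x_i), \tilde{y}_i\big) = -\sum_{c=1}^{C} \mathbb{1}[\tilde{y}_i = c]\,\big(\log p_1^{c}(x_i) + \log p_2^{c}(x_i)\big) $$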
In previous research, it has been observed that CE can overfit noisy labels [13]. The small-loss sample selection strategy can result in incorrect selection of clean samples, and as the number of epochs increases, CE may cause the network to fit the noisy labels, which negatively impacts the model's generalization. To address this issue, we propose a modified loss function that optimizes the weight of CE for each sample and reduces noise fitting by exploiting the historical behavior of the sample loss. Specifically, once the model separates clean and noisy samples, the classification loss of each sample should remain close to its current value and should not fluctuate greatly during the rest of the learning process. Due to overfitting, the loss of noisy samples with large losses may gradually decrease, causing the classifier to fail to correctly distinguish between clean and noisy samples.
Therefore, we modify the classification loss to stabilize the sample loss after the network separates clean and noisy samples. We define a set that records the value of the classification loss of sample x in each epoch of training, where t is the current number of training epochs and s is the length of the set, and we define L_history based on this set. The difference between the loss of the current sample and the loss of its training history, denoted by L_c^t − L_history, indicates the stability of training. To ensure stable training and prevent the model from overfitting on noisy samples, we use the resulting modified classification loss L_mcl to train the model; minimizing L_mcl enhances the stability of the training process.
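The exact definitions are not given above; one plausible instantiation, consistent with the description (a running average of the recorded per-epoch losses and a weight that shrinks as the current loss deviates from its history; both forms are our assumptions rather than the paper's equations), is:

$$ L_{history}(x) = \frac{1}{s}\sum_{j=t-s}^{t-1} L_{c}^{\,j}(x), \qquad L_{mcl}(x) = \exp\!\big(-\lvert L_{c}^{\,t}(x) - L_{history}(x)\rvert\big)\, L_{c}^{\,t}(x) $$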
The balance loss. Co-teaching [17] has been successful in the field of learning with noisy labels. The approach involves training two networks simultaneously, with each network feeding what it deems to be clean samples to its peer network for learning. Because the initial random parameters of the two networks are different, they can filter different types of errors, which improves the generalization ability of the networks. However, as the two networks learn from each other, their parameters gradually converge, resulting in a decreased ability to filter different errors [18,20].

To control the disagreement between the two networks at the feature level and encourage them to learn more feature knowledge, we introduce the disagreement loss L_dis. The cosine similarity between the output features of the two networks is used to measure this feature difference, where z_1 and z_2 are the feature representations of the sample in network one and network two, respectively, and K is the dimension of the features. We apply L_dis in the final loss function to control the difference between the two networks, encourage the networks to learn more at the feature level, and inhibit the convergence of the two networks.
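The disagreement equation is not shown above; based on the description, a plausible form is the cosine similarity of the two K-dimensional feature vectors (whether the loss is this similarity itself or a transformation of it is our assumption):

$$ L_{dis} = \cos(z_1, z_2) = \frac{\sum_{k=1}^{K} z_{1}^{k} z_{2}^{k}}{\sqrt{\sum_{k=1}^{K} (z_{1}^{k})^{2}}\;\sqrt{\sum_{k=1}^{K} (z_{2}^{k})^{2}}} $$

Minimizing this similarity pushes the two feature representations apart, matching the stated goal of maintaining disagreement at the feature level.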
In combination with the theory summarized above, we encourage the two networks to maintain differentiation at the feature level, preventing them from reaching a consensus and ensuring that more knowledge is learned. However, from the view of agreement maximization principles [35], different networks should agree on the prediction of clean samples. Thus, although differentiation in features is encouraged, the two networks should be consistent in their predictions on clean samples. We design the consistency loss L_con to minimize the divergence between the two networks' predictions p_1 and p_2. Specifically, we apply the Jensen-Shannon (JS) divergence, which measures the difference between two probability distributions, to capture the consistency of the two networks' predictions.
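The equation is not shown above; the standard definition of the JS divergence between the two prediction distributions, which matches the description, is:

$$ L_{con} = \mathrm{JS}(p_1 \,\|\, p_2) = \tfrac{1}{2}\,\mathrm{KL}\!\left(p_1 \,\middle\|\, \tfrac{p_1 + p_2}{2}\right) + \tfrac{1}{2}\,\mathrm{KL}\!\left(p_2 \,\middle\|\, \tfrac{p_1 + p_2}{2}\right) $$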
We weight L_dis and L_con to obtain L_balance.
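Equation (6) is not shown above; a plausible weighted combination, with β as a placeholder balancing coefficient of our own rather than the paper's notation, is:

$$ L_{balance} = \beta\, L_{dis} + (1 - \beta)\, L_{con} $$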
The significance of Equation (6) is that we aim for the two networks to maintain differences at the feature level while making the same predictions.
Peer network guided learning. Many previous works simply discard noisy samples, resulting in insufficient training due to the reduced number of training samples. Semi-supervised learning strategies can be used to generate pseudo-labels that replace the given labels for robust learning. However, the network's generalization ability affects the quality of the pseudo-labels. Moreover, the network tends to fall into self-cognitive errors as the number of iterations increases, causing self-generated pseudo-labels to accumulate errors. Therefore, we use a peer network to guide learning. The peer network's prediction p is used for pseudo-label learning, and the cross-entropy loss is formulated with a sharpening function ε(p, T), which is utilized to enhance the guidance capabilities of the peer network, where T is the temperature parameter. To align with Jo-SRC [36], we set τ = 0.1. However, there may be label errors due to the network's limited generalization ability. To prevent the network from fitting to these errors, a small weight λ is applied; in our experiments, λ was typically set to 0.5. Additionally, we employ weak data augmentation techniques such as random cropping and random flipping to further enhance the guidance capabilities of the peer network. Peer-to-peer guided learning is described as follows.
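The pseudo-label loss equations are not shown above; a common instantiation of this kind of temperature sharpening and peer-guided cross-entropy (our assumed forms, with each network learning from its peer's sharpened prediction) is:

$$ \varepsilon(p, T)_c = \frac{p_c^{\,1/T}}{\sum_{j=1}^{C} p_j^{\,1/T}}, \qquad \ell_{pl1} = -\sum_{c=1}^{C} \varepsilon(p_2, T)_c \log p_1^{c}, \qquad \ell_{pl2} = -\sum_{c=1}^{C} \varepsilon(p_1, T)_c \log p_2^{c} $$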
$$ L_{pl} = \lambda\,(\ell_{pl1} + \ell_{pl2}) $$

Training loss. Finally, the training loss is constructed as follows.
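Equation (9) itself is not shown above; based on the surrounding description (selected clean samples trained with the classification and balance terms, remaining samples trained with the peer-guided loss), one plausible composition is:

$$ L = \sum_{x \in D_{clean}} \big(\alpha\, L_{mcl}(x) + L_{balance}(x)\big) + \sum_{x \in D_{noisy}} L_{pl}(x) $$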
Unlike Co-teaching, Equation (9) is used to train the two networks simultaneously. Small-loss samples are more likely to be clean; therefore, we are more inclined to select them as clean samples for learning. As mentioned above, we also prefer samples that exhibit different features in the two networks while yielding consistent predictions, as such samples are considered cleaner. Therefore, we define the loss used for sample selection during training as follows.
$$ L_{select} = \alpha\, L_{mcl} + L_{balance} \qquad (10) $$

As in Equation (6), L_balance is used to ensure a balance between feature differences and prediction consistency, with α controlling the weight of the modified classification loss.
Equation (10) is used to select clean samples for learning, while the remaining samples are considered noisy. The noisy samples are learned through the peer network guided learning method to enhance the network's robustness.
Considering the ability of deep networks to learn clean samples first, we warm up the network at the beginning of training using all samples with the cross-entropy loss. After the warm-up period, we select the top 1 − τ fraction of small-loss samples as clean samples and treat the rest as noisy samples, where τ represents the noise rate of the dataset. Our sample selection method is depicted in Figure 1, and the pseudocode is presented in Algorithm 1. For each sample, the classification loss, disagreement loss, and consistency loss are calculated, and the small-loss samples are taken as clean samples. In general, the essence of this method is to maintain a balance of disagreement and consistency between the two networks to achieve the best performance.
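As a concrete illustration, the following is a minimal sketch (not the authors' code) of the per-batch selection step described above; the per-sample loss tensors and the balance weight beta are assumptions of ours:

```python
import torch

def select_clean_indices(l_mcl, l_dis, l_con, noise_rate, alpha=1.0, beta=0.5):
    """Rank samples by the selection loss and keep the top (1 - noise_rate) fraction.

    l_mcl, l_dis, l_con: per-sample loss tensors of shape (batch_size,).
    Returns (clean_idx, noisy_idx); the noisy indices are later handled by
    peer-network-guided pseudo-label learning.
    """
    l_balance = beta * l_dis + (1.0 - beta) * l_con   # assumed form of the balance loss
    l_select = alpha * l_mcl + l_balance              # Equation (10)
    num_keep = int((1.0 - noise_rate) * l_select.numel())
    order = torch.argsort(l_select)                   # ascending: small-loss samples first
    return order[:num_keep], order[num_keep:]
```

In the full method, the clean subset updates both networks jointly with the training loss, while the remaining samples receive pseudo-labels from the peer network.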

Datasets and Implementation Details
Datasets and noise types. We used four benchmark datasets (F-MNIST [37] (http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/, accessed on 12 November 2023), SVHN [38], CIFAR-10, and CIFAR-100) and two real-world noisy datasets (Clothing1M and WebVision1.0). Clothing1M is made up of one million training images collected from online shopping sites and contains fourteen classes with labels generated from the surrounding text. WebVision1.0 is a large dataset with real-world noisy labels; it contains over 2.4 million images crawled from the internet, covering the 1000 classes of ImageNet ILSVRC2012 [42]. To facilitate comparison with other methods, we followed previous work and used the first 50 classes of the Google image subset for both training and testing.
Clothing1M and WebVision1.0 are real-world datasets that contain noisy labels. All of the benchmark datasets are clean; thus, we set up a label transition matrix [25] and manually added noise to test the validity of the method. Our experiments mainly followed [43], and we conducted experiments with a variety of noise types: symmetric noise, pairflip noise, tridiagonal noise, and asymmetric noise. Noise rates of 20% and 40% were tested for each type of noise. It is necessary to ensure that the number of clean samples is greater than the number of noisy samples; otherwise, it would be impossible to distinguish the real class of the samples [23]. Each noise type was constructed as follows: (1) Symmetric noise: the clean labels in each class are uniformly flipped to labels of other (wrong) classes. (2) Asymmetric noise: considers visual similarity in the flipping process, which is closer to real-world noise; for example, the labels of cats and dogs are swapped, as are the labels of planes and birds. Asymmetric noise is an unbalanced type of noise. (3) Pairflip noise: realized by flipping the clean labels in each class to the adjacent class. (4) Tridiagonal noise: realized by two consecutive pairflips of two classes in opposite directions.
The label transition matrix is shown in Figure 2, taking six categories and a 40% noise rate as an example.
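For reference, the following is a minimal sketch of how symmetric and pairflip transition matrices of this kind are typically constructed and applied; it is an illustration under the definitions above, not the exact matrices used in the paper:

```python
import numpy as np

def symmetric_T(num_classes, noise_rate):
    """Flip each clean label uniformly to the other classes with total probability noise_rate."""
    T = np.full((num_classes, num_classes), noise_rate / (num_classes - 1))
    np.fill_diagonal(T, 1.0 - noise_rate)
    return T

def pairflip_T(num_classes, noise_rate):
    """Flip each clean label to the adjacent class with probability noise_rate."""
    T = np.eye(num_classes) * (1.0 - noise_rate)
    for c in range(num_classes):
        T[c, (c + 1) % num_classes] = noise_rate
    return T

def corrupt_labels(labels, T, seed=0):
    """Sample a noisy label for each clean label according to the transition matrix T."""
    rng = np.random.default_rng(seed)
    return np.array([rng.choice(len(T), p=T[y]) for y in labels])
```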
In addition, we conducted experiments on noisy long-tailed datasets [19]. We first built a long-tailed distribution dataset: specifically, we reduced the number of samples in different classes such that the dataset followed a long-tailed distribution with class imbalance [44]. In this paper, we use two ways of simulating the long-tailed distribution, exponential simulation (exp) and linear simulation (line), as shown in Figure 3. Taking SVHN as an example, the two resulting datasets are named SVHN-exp and SVHN-line, respectively. We further added asymmetric noise to the long-tailed datasets and tested four different noise intensities.
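The exact imbalance factor is not stated above; the following sketch illustrates the two simulation schemes (exponential and linear decay of per-class sample counts), with the imbalance factor as a hypothetical parameter:

```python
import numpy as np

def long_tail_counts(n_per_class, num_classes, imbalance=10.0, mode="exp"):
    """Per-class sample counts for a long-tailed version of a balanced dataset.

    mode="exp": counts decay exponentially from n_per_class down to n_per_class / imbalance.
    mode="line": counts decay linearly over the classes instead.
    """
    ranks = np.arange(num_classes) / max(num_classes - 1, 1)
    if mode == "exp":
        counts = n_per_class * (1.0 / imbalance) ** ranks
    else:  # "line"
        counts = n_per_class * (1.0 - (1.0 - 1.0 / imbalance) * ranks)
    return counts.astype(int)
```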
Models and parameters. Following previous works [31], we used a nine-layer CNN for F-MNIST, SVHN, CIFAR-10, and CIFAR-100. We used an SGD optimizer with momentum 0.9, an initial learning rate of 0.02, and a weight decay of 0.0005. The learning rate was reduced by a factor of 10 at the 100th and 150th epochs. The batch size was set to 128, and the total number of training epochs was 200. For the real-world datasets, we trained a ResNet-18 [45] pretrained on ImageNet for Clothing1M and an Inception-ResNet-v2 [46] for WebVision. We trained the network for 100 epochs using an SGD optimizer with an initial learning rate of 0.002. The learning rate was reduced by a factor of 10 at the 30th and 60th epochs. The batch size was set to 64; the other settings were the same as those for the benchmark datasets.
We implemented all the methods with default parameters using PyTorch 1.13.1 and conducted all experiments on an NVIDIA RTX 3090 GPU.

Comparison with SOTA Methods
We compared our method with the standard CE baseline and more recent SOTA methods, including: (1) Standard: trains a single network and uses only the standard cross-entropy loss. (2) Co-teaching [17]: trains two networks simultaneously; the two networks guide each other during learning. (3) Co-teaching+ [18]: trains two networks simultaneously while considering the small-loss samples among those on which the two networks' predictions diverge. (4) JoCoR [31]: trains two networks simultaneously and applies a co-regularization method to maximize the consistency between the two networks. (5) Co-learning [47]: a simple and effective method for learning with noisy labels; it combines supervised and self-supervised learning to regularize the network and improve generalization performance. (6) Co-LDL [29]: an end-to-end framework that trains on high-loss samples using label distribution learning and enhances the learned representations with a self-supervised module, further boosting model performance and the use of training data. (7) CoDis [19]: trains two networks simultaneously and applies a covariance regularization method to maintain the divergence between the two networks. (8) Bare [48]: proposes an adaptive sample selection strategy to provide robustness against label noise.
Results on benchmark datasets. Table 1 shows the experimental results on the four benchmark datasets with four noise types and two noise rates. The test accuracy is reported as the mean and standard deviation (%) calculated over the last ten epochs of training. Based on the experimental results, we can conclude that our method is superior to state-of-the-art methods on the benchmark datasets. For example, on the CIFAR-10 dataset with symmetric noise and a 40% noise rate, our method scores 1.26% higher than Co-LDL and 10.75% higher than JoCoR. For pairflip noise with a 40% noise rate, our method scores 4.43% higher than Co-LDL and 21.17% higher than JoCoR. Because JoCoR only considers the consistency of the two networks, errors accumulate easily under asymmetric and pairflip noise, which ultimately affects the generalization of the networks. This experiment proves the superiority of our method on the benchmark datasets.
Results on noisy long-tailed datasets. We tested the model on noisy long-tailed datasets with four noise rates. It can be seen that our method achieves the best performance in most cases. Although its accuracy is not higher than that of all advanced methods, our method remains competitive. Table 2 shows the results on the noisy long-tailed datasets. Figure 4 shows the test accuracy as a function of the number of epochs. For the results shown in Figure 4, each experiment was repeated three times; the curve represents the average accuracy of the three runs, while the shaded region represents the error bar for the standard deviation. From the curves, it can be seen that our method performs well under both high and low noise rates. The small area of the shaded region indicates that our method is stable and less affected by random factors.
Results on real-world datasets. Tables 3 and 4 show the results on the real-world datasets Clothing1M and WebVision. In addition, we evaluated the generalization ability of these methods on ImageNet ILSVRC12. Our method achieves the best performance on the real-world datasets among the compared methods.

Ablation Study
We conducted ablation experiments on the CIFAR-100 dataset under symmetric noise at a 40% noise rate to evaluate our design featuring the modified classification loss, balance loss, and peer network guided learning. The accuracy when the networks lack a sample selection strategy is only 33.20%, due to the influence of the noisy labels. When the sample selection strategy is added, the accuracy increases to 64.74%. The modified classification loss, which alleviates the overfitting caused by CE, improves the accuracy by 1.17%. When the balance loss is added, the accuracy improves by a further 2.47%, which confirms our assumption that maintaining a balance between disagreement and consistency in the two networks can improve their learning performance. Finally, the addition of peer network guided learning further improves the accuracy of the networks. The experimental results are shown in Table 5.

Comparison of Running Time
We compared the running time on the CIFAR-10 dataset under symmetric noise at a 40% noise rate. All the methods were trained for 200 epochs in the same experimental environment. The results are shown in Table 6. Our method outperforms all the compared methods in accuracy while consuming less time than Co-LDL and Bare and a similar amount of time to Co-teaching and Co-teaching+.

Conclusions
The existing co-teaching methods train two networks simultaneously; however, as iteration progresses, the parameters of the two networks gradually converge and they degenerate into a self-training network. To address this problem, we propose using a disagreement loss and a consistency loss to balance the relationship between the two networks. Our ablation experiments confirm that balancing this relationship can improve the generalization ability of the resulting model. In addition, to alleviate the overfitting phenomenon when using cross-entropy, we propose a modified classification loss based on analyzing the change in the historical classification loss. Finally, we use a peer network to guide the learning of noisy samples, ensuring that all samples are utilized for training. A large number of experiments demonstrate the effectiveness of the proposed method, providing a new research direction for co-teaching. In our future work, we intend to further explore the relationship between the two networks based on the idea of co-teaching in order to enhance their performance when learning with noisy labels; for example, more reasonable methods could be employed to balance the disagreement and consistency between the two networks, and more accurate pseudo-label correction strategies could be utilized.

Algorithm 1. CoDC algorithm. Input: two networks with weights θ_1 and θ_2, learning rate η, noise rate τ, epochs T_warmup and T_max, and iterations per epoch t_max. For T = 1 → T_max: shuffle the training dataset D; for t = 1 → t_max: fetch a mini-batch D_n from D and update θ_1 and θ_2.

Figure 1. Sample selection of CoDC. For each sample, the classification loss, disagreement loss, and consistency loss are calculated, and the small-loss samples are taken as clean samples. In general, the essence of this method is to maintain a balance of disagreement and consistency between the two networks to achieve the best performance.

Figure 3. Construction of the two long-tailed distribution datasets.

Figure 4. Test accuracy versus the number of epochs on the four long-tailed noisy datasets. Each experiment was repeated three times; the shaded regions show the error bars for the standard deviation.

Table 2. Mean and standard deviation of test accuracy (%) on four noisy long-tailed datasets with four different noise rates (20%, 30%, 40%, 45%). The test accuracy is calculated over the last ten epochs. The best mean results are in bold.

Table 4. Top-1 and Top-5 test accuracy (%) on the WebVision1.0 and ILSVRC12 datasets. The best results are in bold.

Table 5. Ablation study on CIFAR-100 under symmetric noise at a 40% noise rate.

Table 6. Comparison of running time between the proposed method and four co-teaching-based methods.