JoSDW: Combating Noisy Labels by Dynamic Weight

The real world is full of noisy labels, which cause neural networks to perform poorly because deep neural networks (DNNs) are prone to overfitting label noise. Learning with noisy labels is a challenging problem in weakly supervised learning. The most advanced existing methods mainly adopt a small-loss sample selection strategy, i.e., they select the small-loss portion of the samples to train the network. However, previous work stops there: it neglects how the small-loss selection behaves as training of the DNN progresses and across different stages, as well as how the two collaboratively trained networks move from disagreement to agreement, and it does not perform a second classification on this basis. We train the network using a contrastive learning method. Specifically, a small-loss sample selection strategy with dynamic weights is designed. This strategy increases the proportion of agreement based on network predictions, gradually reducing the weight of complex samples while increasing the weight of pure samples. Extensive experiments verify the superiority of our method.


Introduction
Deep neural networks have achieved outstanding results in various artificial intelligence tasks, largely thanks to the massive, high-quality labeled datasets available today. However, accurately labeling large amounts of data is a very time-consuming and labor-intensive task. To resolve this problem, data labeling companies have begun to seek cheaper alternatives, such as querying commercial search engines [1], gathering web label information [2], employing machine-generated labels [3], or having a single annotator mark each sample [4]. Although cheap and efficient, these alternative methods are always accompanied by samples with noisy labels. Even in industry, noisy labels are generated due to errors in prop models or subtle differences in the cutting process [5]. Existing studies have shown that deep networks are liable to overfit label noise during training, and their generalizability is low, resulting in a substantial drop in DNN performance [6]. With the advent of 5G, the amount of data is increasing rapidly, making the study of learning with noisy labels all the more necessary [7].
The existence of noisy labels severely restricts the deployment and development of neural network models in industry. Initially, data preprocessing was used to deal with noisy labels, but it is inefficient. On the one hand, research attempts to improve the quality of the data at the modeling stage, for example through boosting ensembles [8]. On the other hand, a large number of weakly supervised algorithms for learning with noise have been developed. Existing noisy-label learning methods mainly use sample selection and loss correction. Loss correction estimates a noise transition matrix and then corrects the loss function with it. However, it is challenging to estimate the noise transition matrix correctly. Some methods use

• This article selects reliable samples through a small-loss sample strategy using relative loss and multi-class loss, subdivides them, and proposes distinguishing pure samples from complex samples based on prediction consistency across the multiple views of a sample.
• In this paper, a dynamic weight is set between the pure samples and the complex samples to reduce the weight of noisy samples. As training of the neural network deepens, the complex-sample weight is gradually reduced. The dynamic weight is determined from the results of the previous round of iterative training.
• By providing comprehensive experimental results, we show that our method outperforms the most advanced methods on noisy datasets. In addition, extensive ablation studies are conducted to verify the effectiveness of our method.
The remainder of the paper is structured as follows: Section 2 provides an overview of agreement and disagreement. Section 3 describes the framework of JoSDW. The experimental results of our method are demonstrated in Section 4. Finally, we conclude this paper in Section 5.

Table 1. Reference baselines and their descriptions.

Baselines: Description
F-correction: corrects the label predictions by the label transition matrix.
Decoupling: only uses instances with different predictions from the two classifiers to update the parameters.
Co-teaching: trains two networks simultaneously and lets them cross-update.
Co-teaching+: trains two networks simultaneously and lets them cross-update when the two networks' predictions disagree.
JoCOR: trains two networks at the same time and sets a contrastive loss between them to help the two networks reach an "Agreement".

Table 2. Comparison of baselines.

Co-Teaching Co-Teaching+ JoCOR JoSDW
Small Loss
Samples with small loss are more likely to be clean [10,12,13,19], so we can train on these instances to obtain a classifier that is resistant to noisy labels. However, our experiments combining the "Agreement" strategy with small-loss samples show that the cleanliness of the small-loss samples can be further improved.

Disagreement and Agreement
The agreement strategy is inspired by semi-supervised learning. JoCOR proposes the agreement maximization principle, which encourages two different classifiers to make closer predictions through an explicit regularization method. Samples on which the two networks' predictions agree are more credible, and the consistency of the network predictions should be promoted as much as possible. The method adds a Jensen-Shannon (JS) divergence between the two networks' prediction results to judge the distance between them.
Disagreement was proposed by Decoupling, and its key idea is to decouple "when to update" from "how to update". In previous studies, the network was updated all the time. Under the "Disagreement" strategy, "when to update" depends on the disagreement between the two networks rather than happening all the time. Co-teaching+ combines the "Disagreement" strategy with the cross-update of Co-teaching and proposes that the prediction divergence of the two networks can better help network training and improve the robustness of the network. Therefore, Co-teaching+ only cross-updates on the instances with inconsistent predictions among the small-loss samples. In this study, we combine "Agreement" and "Disagreement": the "Agreement" part is called the pure sample, the "Disagreement" part is called the complex sample, and dynamic weights are set for the complex samples.
We strongly endorse maximizing agreement: combined with small-loss selection, it yields pure samples that are cleaner than small-loss samples alone. In the early stage of network training, the complex samples can help the network train better, which solves the problem of "when to update". However, as training of the network model proceeds, the complex samples may no longer be a good guide for model updating.

Contrastive Learning
Contrastive learning, a recently proposed unsupervised learning paradigm [20-24], has achieved state-of-the-art performance on a range of tasks. The key differences between these approaches are the data augmentation strategy and the contrastive loss they use. In summary, most contrastive learning approaches first construct positive and negative pairs at the instance level via a sequence of data augmentations. The contrastive loss can then be used to maximize positive-pair similarity while minimizing negative-pair similarity, using losses such as NT-Xent [20], Triplet [25], and NCE [26].
The key idea of contrastive learning is to bring similar samples closer and push dissimilar samples farther apart. Given data x, the goal of contrastive learning is to learn an encoder f such that

score(f(x), f(x+)) >> score(f(x), f(x−)),

where x is called the pinpoint (anchor) data, x+ is a positive sample similar to x, x− is a negative sample dissimilar to x, and score() is a measurement function for the similarity of positive and negative samples. The score() function is often the Euclidean distance, cosine similarity, etc.
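As a minimal illustration of the score() condition above, cosine similarity can serve as the measurement function. The embeddings and names here are invented for the example:

```python
import math

def cosine_score(u, v):
    """Cosine similarity, a common choice for score() in contrastive learning."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy encoder outputs: the anchor and its positive view point in similar
# directions, the negative does not.
f_x   = [1.0, 0.2]
f_pos = [0.9, 0.3]
f_neg = [-0.8, 1.0]

assert cosine_score(f_x, f_pos) > cosine_score(f_x, f_neg)
```

A real contrastive loss such as NT-Xent builds on exactly this score, applied over a batch of augmented pairs with a temperature parameter.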

The Proposed Method
As mentioned before, the essence of co-teaching is to seek the "Agreement" of the two networks within their "Disagreement". Therefore, this article claims that as co-teaching training deepens, clean labels tend to concentrate in the "Agreement" part, whose label noise rate gradually decreases. In our experiments, two neural networks with the same architecture but distinct initializations were used as classifiers, and both classifiers predict on the training samples. According to the small-loss sample strategy and the classifiers' results, the samples are divided into three types: pure samples, complex samples, and dirty samples. We process these samples differently, as follows. Consider multi-class classification with M classes. There are N samples in the dataset D = {(x_i, y_i)}_{i=1}^{N}, where x_i represents the i-th instance and y_i ∈ {1, . . . , M} its given label. We formulate the proposed JoSDW approach with two deep neural networks denoted by f(x, Θ1) and f(x, Θ2), while p1 = [p1^1, p1^2, . . . , p1^M] and p2 = [p2^1, p2^2, . . . , p2^M] denote the prediction probabilities for instance x_i, i.e., the outputs of the "softmax" layers of Θ1 and Θ2.
As shown in Figure 1, in our method JoSDW, the dataset D = {(x_i, y_i)}_{i=1}^{N} is fed into two different networks, f(x, Θ1) and f(x, Θ2). The loss L(x_i) is calculated based on p1 and p2. In each mini-batch, the L(x_i) are sorted from small to large. The large-loss part, called dirty samples, is considered more likely to contain noisy-label instances, so we drop it. The small-loss part is divided into pure samples and complex samples according to the consistency of the predicted labels.

Figure 1. The framework of JoSDW. For JoSDW, each network can predict labels on its own. We determine whether a sample is a small-loss sample based on the respective cross-entropy loss of the two networks and the joint loss between them. Then, the samples are classified into pure samples and complex samples by the consistency of the two networks' predicted labels.
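The batch-level split described above can be sketched as follows. This is an illustrative simplification; the function and variable names are ours, not from the paper's implementation:

```python
def partition_batch(losses, preds1, preds2, forget_rate):
    """Split a mini-batch into pure / complex / dirty index lists.

    losses[i]     : joint loss of instance i
    preds1/preds2 : predicted labels of the two networks
    forget_rate   : fraction of the batch treated as dirty (largest losses)
    """
    # Sort instance indices by loss, smallest first.
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    n_keep = int(len(order) * (1.0 - forget_rate))
    small_loss, dirty = order[:n_keep], order[n_keep:]
    # Agreement within the small-loss part -> pure; disagreement -> complex.
    pure = [i for i in small_loss if preds1[i] == preds2[i]]
    complex_ = [i for i in small_loss if preds1[i] != preds2[i]]
    return pure, complex_, dirty

pure, cx, dirty = partition_batch(
    losses=[0.1, 0.9, 0.2, 0.5],
    preds1=[0, 1, 2, 3],
    preds2=[0, 2, 1, 3],
    forget_rate=0.25,
)
# pure -> [0, 3], cx -> [2], dirty -> [1]
```

The dirty indices are dropped; pure and complex indices feed the weighted loss described later.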

Network
For each network, predicted pseudo labels are generated separately, but during training the two networks are trained through a pseudo-conjoined paradigm: their parameters are different, but they are updated through a joint loss.

Loss Function
Our small-loss sample selection is as follows. In the loss function, the first part, L_con, is the contrastive loss between the predictions of the two networks, which achieves co-regularization; the second part, L_sup, represents the traditional supervised learning loss of the two networks.


• Classification loss
For multi-class classification tasks, we use the cross-entropy loss as the supervised part to minimize the distance between the prediction and the label:

L_sup(x_i) = L_CE(p1(x_i), y_i) + L_CE(p2(x_i), y_i).

• Contrastive loss
According to the agreement maximization principle [27,28], different networks will agree on the labels of most examples, and they struggle to agree on incorrect labels. The contrastive term is calculated as the symmetric KL divergence between the two networks' predictions:

L_con(x_i) = D_KL(p1 || p2) + D_KL(p2 || p1).
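A minimal sketch of the two loss terms, assuming a JoCOR-style combination of the supervised and contrastive parts with a mixing coefficient `lam` (the exact weighting used in the paper may differ):

```python
import math

def cross_entropy(p, label):
    """CE of one softmax distribution p against the integer label."""
    return -math.log(p[label])

def sym_kl(p, q):
    """Symmetric KL between two prediction distributions (co-regularization).
    Assumes strictly positive probabilities."""
    kl = lambda a, b: sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))
    return kl(p, q) + kl(q, p)

def joint_loss(p1, p2, label, lam=0.5):
    """(1 - lam) * supervised CE of both networks + lam * contrastive term."""
    sup = cross_entropy(p1, label) + cross_entropy(p2, label)
    return (1 - lam) * sup + lam * sym_kl(p1, p2)
```

When the two networks agree exactly, the contrastive term vanishes and only the supervised part remains, which is why agreeing samples tend to have smaller joint loss.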

Sample Selection
• Small-loss sample selection
Traditional sample selection uses a small-loss strategy. Since a DNN tends to fit simple samples first [18], small-loss samples are more likely to be clean. This standard method usually selects a predefined proportion of small-loss samples in each mini-batch. The forget rate is an important parameter in the small-loss strategy. Following [14,15,17], the forget rate is set to the noise rate, which is assumed to be known, and it has an initialization process: the forget rate ϕ(e) increases gradually to the noise rate σ over the first k (= 10) epochs e. However, when training the DNN, the noise ratio in different mini-batches inevitably fluctuates. As shown in the figure above, as the training of the DNN deepens, the agreement part of the small-loss selection quickly rises in cleanliness and then stabilizes, while the disagreement part has a clean rate even lower than 80%. Combined with the prediction accuracy curves, these noisy samples begin to affect the accuracy of the DNN model and reduce prediction accuracy.
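The forget-rate initialization ϕ(e) can be sketched as a simple linear ramp. The ramp shape is a plausible reading of the schedule described above, not a detail confirmed by the text:

```python
def forget_rate(epoch, noise_rate, ramp_epochs=10):
    """Linearly ramp the forget rate phi(e) up to the (assumed known)
    noise rate sigma over the first ramp_epochs epochs."""
    return noise_rate * min(epoch / ramp_epochs, 1.0)

# With sigma = 0.5 and k = 10: 0.0 at epoch 0, 0.25 at epoch 5, 0.5 thereafter.
```

The ramp avoids discarding samples aggressively before the network has learned anything, since early in training even clean samples can have large losses.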

• Pure samples
The small-loss samples are subdivided a second time. Because co-teaching uses the comparative training of the two networks to move the model from "Disagreement" to "Agreement", and because a DNN tends to fit simple samples first [18], we have reason to believe that a sample whose label quickly reaches an "Agreement" is more likely to be credible. In the initial training stage, the label purity rate of the "Agreement" part does not differ much from that of the disagreement part, but as DNN training deepens, a large gap opens.

• Complex samples
Using only the high-purity pure samples has limitations. First, although the small-loss strategy combined with "Agreement"-based secondary classification lets us quickly screen out high-purity samples, it greatly reduces the total number of input samples, which causes the network model to overfit. Second, if only high-clean-rate samples are considered, the network model converges too slowly. Therefore, we also introduce the disagreement part, named the complex samples.

Dynamic Weight
As the purity rate of the pure samples gradually increases during training, the purity rate of the complex samples gradually decreases. Therefore, we apply dynamic weights between the pure samples and the complex samples. As the network model fits more deeply, the influence of the complex samples on the network parameters is gradually reduced:

L = L_pure + λ · L_complex,

where λ represents a dynamic weight parameter computed from N_Disagreement, the number of disagreement samples in the previous round of training, and N_Complex, the number of complex samples in the previous round of training.
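Since the text does not fully specify the formula for λ, the sketch below assumes one plausible ratio-based reading — complex samples in the previous epoch relative to disagreement samples in the previous epoch — and should be treated as illustrative only:

```python
def dynamic_weight(n_complex_prev, n_disagreement_prev):
    """Assumed ratio-based dynamic weight: the fewer complex samples survive
    relative to total disagreements in the previous epoch, the smaller the
    complex-sample contribution. The exact formula is our assumption."""
    if n_disagreement_prev == 0:
        return 0.0
    return n_complex_prev / n_disagreement_prev

def weighted_loss(loss_pure, loss_complex, lam):
    """Combine pure- and complex-sample losses with the dynamic weight."""
    return loss_pure + lam * loss_complex
```

As training deepens and fewer disagreement samples qualify as useful complex samples, λ shrinks and the pure samples dominate the update, matching the behavior described above.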

Experiments
In this section, a series of experimental results are presented. We test the effectiveness of our proposed algorithm on three benchmark datasets: MNIST [29], CIFAR-10, and CIFAR-100 [30]. The main information on these datasets is shown in Table 3. In past research, these datasets were commonly used to assess learning with noisy labels [9,31,32]. We compare JoSDW with the following state-of-the-art algorithms, implement all methods with default parameters in PyTorch, and perform all experiments on an NVIDIA 2080 Ti. To explore performance under different noise conditions, we use the following four noise settings: Symmetry-20%, Symmetry-50%, Symmetry-80%, and Asymmetry-40%. Figure 2 shows examples of these noise ratios. In terms of network structure, we use a two-layer MLP for MNIST and a seven-layer CNN for CIFAR-10 and CIFAR-100; the architectures are shown in Table 4. For the optimizer, we use Adam with momentum = 0.9. The initial learning rate is 0.001, and the batch size is 128. We run 200 epochs in total, with the learning rate gradually decaying to 0 from epoch 80 to epoch 200.
Symmetric noise, also called random or uniform noise, is a common setup in noisy-label experiments. Let M be the number of classes and α the noise ratio. The probability that a label stays true is P_true = 1 − α, the probability that it becomes noisy is P_noisy = α, and each incorrect class receives probability α/(M − 1). Asymmetric noise is closer to real-world label noise because of class-dependent flipping; for example, on MNIST, the asymmetric noise maps 2→4, 3→1, 6→3.
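The symmetric noise model described above can be reproduced in a few lines. This is an illustrative corruption routine, not the paper's exact code:

```python
import random

def inject_symmetric_noise(labels, num_classes, noise_ratio, seed=0):
    """Flip each label with probability noise_ratio, choosing uniformly
    among the remaining M - 1 classes (symmetric/uniform label noise)."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_ratio:
            # Pick any class except the true one, uniformly.
            choices = [c for c in range(num_classes) if c != y]
            noisy.append(rng.choice(choices))
        else:
            noisy.append(y)
    return noisy
```

Asymmetric noise would instead apply a fixed class-to-class mapping (e.g., 2→4, 3→1, 6→3 on MNIST) with probability α, reflecting confusions between visually similar classes.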

Results on MNIST
On the left side of Figures 3-6, the comparison of test accuracy on MNIST is shown. In these four plots, we can see the memorization effect of the network: accuracy first reaches a high level and then gradually decreases. Therefore, a solid, reliable training approach should be able to stop or slow down this decrease. Here, JoSDW consistently outperforms all the other baselines in each of the four cases.
We plot label precision vs. epochs on the right side of Figures 3-6. These experiments demonstrate the characteristics of JoSDW. Although the peak accuracy of JoSDW during training is not as high as that of other algorithms, the dynamic weights allow JoSDW's test accuracy to come from behind, with no substantial drop in accuracy.
Table 5 compares the test accuracy of the various algorithms. In the conventional Symmetry-20% and Symmetry-50% cases, all the new methods are clearly superior to the standard method, which proves their robustness. However, when the label noise reaches the Symmetry-80% case, the performance of the disagreement-based Decoupling and Co-teaching+ declines substantially, while JoSDW remains considerably better than the other methods.

Results on CIFAR-10
The right-hand sides of Figures 3-6 show label precision vs. epochs. From the experimental results, it can be seen that Decoupling and Co-teaching+ cannot effectively screen out reliable sample labels, while JoSDW, JoCOR, and Co-teaching maintain excellent performance. The small-loss selection strategy with secondary subdivision and dynamic weighting achieves higher label precision than the traditional small-loss selection strategy in the middle and late stages of training, and it performs well in the Symmetry-80% and Asymmetry-40% cases. This shows that JoSDW can better find clean examples. Table 6 shows the test accuracy on CIFAR-10. JoSDW performs best again in all four cases. For the Symmetry-20% case, Co-teaching+ performs better than Co-teaching and Decoupling; for the other three cases, Co-teaching+ cannot even match Co-teaching. JoSDW is superior to all other comparison methods in terms of both test accuracy and label precision. In terms of label precision, Decoupling and Co-teaching+, which rely on disagreement, do not find clean instances well. When the label noise reaches Symmetry-80%, JoSDW is considerably better than the other methods in both test accuracy and label precision; under Asymmetry-40%, although JoSDW is substantially better in label precision, it has no obvious advantage in test accuracy. Additionally, although some methods outperform JoSDW in the initial stage, JoSDW consistently outperforms them in all subsequent epochs.

Results on CIFAR-100
Table 7 displays the test accuracy on CIFAR-100. Test accuracy and label precision vs. epochs are shown in Figures 11-14. The MNIST and CIFAR-10 datasets have just 10 classes, whereas CIFAR-100 has 100. Nevertheless, JoSDW still achieves high test accuracy on this dataset. In the simplest Symmetric-20% and Symmetric-50% cases, the effect of JoSDW is considerably better than that of the other methods. In the most difficult Symmetry-80% case, JoSDW can still obtain higher test accuracy than JoCOR and Co-teaching.


Ablation Study
Ablation studies were carried out to determine the impacts of secondary subdivision and dynamic weighting. We conduct experiments on the MNIST dataset with Symmetry-50% noise and the CIFAR-10 dataset with Symmetry-20% noise. To eliminate the influence of secondary subdivision and dynamic weighting, we train on all samples obtained by the small-loss strategy. To verify the effect of dynamic weighting, we use fixed weighting coefficients for the pure samples and the complex samples. According to the previous analysis, these two mechanisms should each play their role in the training process.
Sample selection: The state-of-the-art performance of JoSDW is largely due to precise and dependable sample selection. We use graphs and tables to show the accuracy of sample selection. The graph shows the accuracy of clean-sample selection, verifying the advantage of our sample selection method. It can be seen from the figure that JoSDW selects clean samples accurately and reliably. In every case, JoSDW outperforms the most advanced sample selection algorithms at picking clean samples. In addition, in demanding scenarios (i.e., Asymmetry-40%), although all other methods struggle to find clean samples, the accuracy of JoSDW's clean-sample selection steadily improves as training progresses. These findings support the validity of our clean-sample selection method. The table reports the best epoch's and the last epoch's selection accuracy, respectively. The results confirm the effectiveness of JoSDW in selecting pure samples and complex samples. Tables 8 and 9 show the impact of the various phases of our strategy. JoSDW-S denotes training with pure samples and complex samples; JoSDW-SD denotes training with clean samples and complex samples with dynamic weights; finally, JoSDW denotes the proposed approach in its final form. The experimental process is shown in Figure 15.


Discussion
In this section, we discuss the problem of sample classification. As shown in Figure 16, the pure samples always maintain a very high purity rate, while the behavior of the complex samples is more complicated, declining rapidly from the dataset's average purity rate. Dirty samples always have low purity. This result justifies our idea of distinguishing the two sample types. The complex samples can guide model training at the beginning and solve the problem of "when to update". In the middle and late stages of training, some complex samples that can guide model training become pure samples; the remaining complex samples contain an increasing proportion of noisy labels and gradually lose the ability to guide model updates.

Conclusions
This paper studies the problem of noisy-label learning in deep learning, an important issue for the cheap and fast deployment of deep learning. We propose a robust method, JoSDW, to improve the performance of deep neural networks under noisy labels. The key idea of JoSDW is to train two neural networks with the same structure simultaneously and select highly clean sample labels through a small-loss sample selection strategy. The samples obtained in this first screening are then further classified by the prediction results of the two neural networks: those with the same prediction are pure samples, and those with different predictions are called complex samples. The networks are updated simultaneously according to their respective cross-entropy losses and a joint loss. We conduct experiments on three datasets (MNIST, CIFAR-10, CIFAR-100) to show that JoSDW can train deep models robustly under noisy labels.
In the future, our work can proceed in three directions. First, we will explore label correction methods to recover discarded low-confidence sample labels and improve sample utilization. Second, we will try ensembles of noisy-label learning methods, combined with methods such as boosting ensembles [8] and Nested Dropout [18]. Third, current noisy-learning methods are mainly applied to artificially corrupted datasets; with the rise of unsupervised learning, we will focus on combining pseudo-labels generated by clustering with noisy-label learning.

Data Availability Statement: Publicly available datasets were analyzed in this study. The data can be found here: http://yann.lecun.com/exdb/mnist/, https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 30 December 2021).

Conflicts of Interest:
The authors declare no conflict of interest.