Agreement and Disagreement-Based Co-Learning with Dual Network for Hyperspectral Image Classification with Noisy Labels

Abstract: Deep learning-based label noise learning methods provide promising solutions for hyperspectral image (HSI) classification with noisy labels. Current label noise learning methods based on deep learning improve their performance by modifying a single aspect, such as designing a robust loss function, revamping the network structure, or adding a noise adaptation layer. However, these methods face difficulties in coping with relatively high-noise situations. To address this issue, this paper proposes a unified label noise learning framework with a dual-network structure. The goal is to enhance the model's robustness to label noise by utilizing two networks to guide each other. Specifically, to prevent the dual-network training from degenerating into self-training, the "disagreement" strategy is incorporated into co-learning. Then, the "agreement" strategy is introduced into the model to ensure that the model iterates in the right direction under high-noise conditions. To this end, an agreement and disagreement-based co-learning (ADCL) framework is proposed for HSI classification with noisy labels. In addition, a joint loss function consisting of a supervision loss of the two networks and a relative loss between the two networks is designed for the dual-network structure. Extensive experiments are conducted on three public HSI datasets to demonstrate the robustness of the proposed method to label noise. Specifically, our method obtains the highest overall accuracies of 98.62%, 90.89%, and 99.02% on the three datasets, respectively, representing improvements of 2.58%, 2.27%, and 0.86% over the second-best method. In future research, the authors suggest using more networks as backbones to implement the ADCL framework.


Introduction
With the development of spectral imaging technology, hyperspectral images (HSIs) have been widely used in various fields such as agricultural monitoring [1], food quality inspection [2], urban ground object recognition and classification [3], post-disaster change detection [4], and soil heavy metal detection [5]. The classification task is essential for these hyperspectral remote sensing applications. In the past few years, traditional machine learning methods such as support vector machines [6], random forests [7], extreme learning machines [8], and sparse representation classifiers [9] have played an important role in HSI classification. Recently, deep learning has brought new prosperity to HSI classification with its powerful representation learning ability [10,11]. In general, both traditional and deep learning-based methods use a certain number of accurately labeled samples to train reliable models. However, in real situations, the available training sets usually contain mislabeled samples, which is known as the label noise problem. Label noise is not conducive to training effective models.
Many attempts have been made to address the label noise problem in HSI classification. Representative traditional methods include the following. Given that noisy labels are usually located in low-density regions, Refs. [12-14] investigated a series of density peak-based noisy label detection methods. To increase the detection accuracy of noisy labels, Ref. [15] proposed a random label propagation algorithm (RLPA) to detect label noise. The main idea of RLPA is to randomly divide the data, perform label propagation several times, and then ensemble the outcomes of the multiple label propagations for noisy label detection. To overcome the shortcoming that RLPA is sensitive to the superpixel segmentation scale, a multi-scale superpixel segmentation method and a new similarity graph construction approach were proposed [16]. To overcome the influence of random noise and edge noise on label information, the spectral-spatial sparse graph was introduced into RLPA to construct an adaptive label propagation algorithm [17]. Since ensemble learning can enhance the robustness of a model, Ref. [18] proposed an adapted random forest that can take mislabeled training labels into account.
Compared to traditional label noise learning methods, deep learning-based label noise learning methods have more advantages owing to the powerful discriminative feature representation learning abilities of deep neural networks. Recently, researchers have investigated deep learning methods for HSI classification in the presence of label noise. For instance, an entropic optimal transport loss was designed for end-to-end deep neural networks to improve their robustness to label noise [19]. To enhance the robustness of the classification model, Ref. [20] investigated a novel dual-channel network structure and a noise-robust loss function. Ref. [21] designed a superpixel-guided sample network framework with an end-to-end training style for handling label noise, comprising two stages: sample selection and sample correction. Ref. [22] proposed a lightweight heterogeneous kernel convolution (HetConv3D), which uses two different types of convolutional kernels, to improve the robustness of the network to label noise. Ref. [23] employed both labeled and unlabeled data to build a unified deep learning network, which was shown to be robust to noisy labels. To handle label noise and limited samples simultaneously, Ref. [24] presented a novel dual-level deep spatial manifold representation (SMR) network for HSI classification, embedding SMR-based feature extraction and classifier blocks into one framework. Ref. [25] investigated the robustness of several loss functions for convolutional neural networks and proposed an HSI pixel-to-image sampling method to prevent overfitting on label noise. To address the inaccurate supervision caused by label noise, selective complementary learning was introduced into convolutional neural networks for HSI classification with noisy labels [26].
The above methods have made positive contributions to HSI classification in the presence of label noise. Traditional methods typically detect and remove mislabeled samples before constructing classification models, while deep learning-based methods do not require this step. Instead, deep learning-based methods account for the effect of label noise on model construction and design robust loss functions or specific network structures to improve robustness to label noise. However, current deep learning-based methods for HSI classification with noisy labels still have limitations. For example, they often focus on improving a single aspect, such as designing a robust loss function, revamping the network structure, or adding a noise adaptation layer, which may not be sufficient to handle relatively high noise rates. This issue deserves further study.
To address the above issue, we propose a unified label noise learning framework that can be adapted to various deep neural networks. Inspired by collaborative learning, we design a disagreement-based co-learning (DCL) framework with a dual-network structure, in which the "disagreement" strategy is incorporated into co-learning. In DCL, the two networks cross-propagate their own losses to the peer network through the "disagreement" strategy, which prevents the dual-network training from degenerating into self-training. However, the "disagreement" strategy can only select a subset of the training samples, which are not guaranteed to have real labels, especially under high label noise. Therefore, we introduce the idea of "agreement" from co-training into DCL and propose an agreement and disagreement-based co-learning (ADCL) framework for HSI classification with noisy labels. Additionally, a joint loss function is designed for our dual-network framework. The designed loss consists of a supervision loss of the two networks and a relative loss between the two networks.
The remainder of this paper is organized as follows. Related work and contributions are described in Section 2. A detailed description of ADCL is given in Section 3. Experimental results and analysis are reported in Section 4. Section 5 presents the discussion. The conclusions of this work are given in Section 6.

Label Noise Learning Based on Deep Learning
Deep learning-based label noise learning has been extensively studied in the fields of machine learning and computer vision. Surveys on this topic can be found in [27-29]. Deep learning-based label noise learning approaches can be roughly divided into five categories: (1) Robust network architecture: adding a noise adaptation layer [30] or designing a specific architecture [31] to improve the reliability of estimating label transition probabilities and to mimic the label transition behavior in deep network learning. (2) Robust loss function: developing a loss function that is robust to label noise [32,33].
Generally, robust loss functions attempt to achieve a small risk on a training set with label noise. Current studies of robust loss functions mainly build on the mean absolute error loss and the cross-entropy loss. (3) Robust regularization: adding a regularization term to the optimization objective to alleviate the overfitting of deep learning on training samples with label noise. Regularization techniques include explicit regularization (such as weight decay [34] and dropout [35]) and implicit regularization (such as mini-batch stochastic gradient descent [36] and data augmentation [37]). (4) Loss adjustment: adjusting the loss of all training samples to reduce the effects of label noise. Unlike robust loss functions, loss adjustment modifies the update rules to minimize the negative effects of label noise. Loss adjustment includes loss correction [38], loss reweighting [39], and label refurbishment [40]. (5) Sample selection: selecting true-labeled samples from a training set with noisy labels. The aim of sample selection is to update deep neural networks with the selected clean samples. Sample selection generally includes multi-network collaborative learning [21,41], multi-round iterative learning [42], and combinations with other learning paradigms [43].
To handle noisy label data for building extraction, Ref. [46] proposed a general deep neural network model that is adaptive to label noise, consisting of a base network and an additional probability transition module. To suppress the impact of label noise on the semantic segmentation of RS images, Ref. [47] constructed a general network framework by combining an attention mechanism and a noise-robust loss function. (2) Robust loss function [19,20,25,47-50]: Ref. [47] added two hyperparameters to the symmetric cross-entropy loss function for label noise learning. Refs. [48,49] proposed two novel loss functions for deep learning: the robust normalized softmax loss, used for the characterization of RS images based on deep metric learning, and the noise-tolerant deep neighborhood embedding, which accurately encodes the semantic relationships among RS scenes. Ref. [50] constructed a joint loss consisting of a cross-entropy loss with the updated label and a cross-entropy loss with the original noisy label. (3) Label correction [45,50-54]: For road extraction from RS images, Ref. [45] introduced a label probability sequence into a sequence deep learning framework for correcting erroneous labels. Ref. [50] utilized the information entropy to measure the uncertainty of the prediction, which served as a basis for label correction. Ref. [51] adopted unsupervised clustering to recognize the sample's label, and the network trained on augmented samples with clean labels was used to further correct noisy labels. Similarly, for object detection in aerial images, Ref. [52] designed a new noise filter named probability differential to recognize and correct mislabeled samples. Ref. [53] used the initial pixel-level labels to train an under-trained initial network that served as the starting point for network updating and initial label correction. In addition, Ref. [54] proposed a novel adaptive multi-feature collaborative representation classifier to correct the labels of uncertain samples. (4) Hybrid approach [21,23,26,51,55-58]: Both [51] and [55] introduced unsupervised methods into label noise learning. In [55], an unsupervised method was combined with domain adaptation for HSI classification. In addition, complementary learning was combined with deep learning for HSI classification [26] and RS scene classification [56]. Recently, Ref. [57] incorporated knowledge distillation into label noise learning to improve building extraction. To obtain datasets containing less noise, Ref. [58] introduced semi-supervised learning into the objective learning framework to produce a low-noise dataset.

Co-Training in Remote Sensing
Co-training, originally proposed by Blum and Mitchell [59], uses two sufficiently redundant and conditionally independent views to improve the generalization performance of a model. In the past few years, researchers have studied the theory of co-training and developed various variants of co-training. Recently, some studies have incorporated co-training into deep learning for label noise learning [60-63].
In the remote sensing field, the idea of co-training has been introduced into several tasks, such as land cover mapping, image segmentation, and image classification and recognition [64-69]. For example, Ref. [64] proposed an improved co-training method for semi-supervised HSI classification, which used spectral features and two-dimensional Gabor features as two different views to train collaboratively. Similarly, Ref. [65] implemented a co-training paradigm with the P-N learning method, in which the P-expert assumes that adjacent pixels in space have the same class label, while the N-expert believes that pixels with similar spectra have the same class label. Co-training was then combined with a deep stacked autoencoder for semi-supervised HSI classification [66]. In RS applications, Ref. [67] proposed a conditional co-training method and applied it to RS image segmentation in coastal areas. Refs. [68,69] proposed novel co-training methods for land cover mapping and crop mapping, respectively.

Contributions
Compared to previous work, we improve the label noise robustness of the model by addressing both the network structure and the loss function. Specifically, we construct a unified dual-network structure that leverages the mutual information between the two networks to guide each other. In addition, we design a more robust loss function for this specific network structure. The main contributions of our work are as follows: (1) A new framework incorporating the "disagreement" strategy into co-learning, named DCL, is proposed for HSI classification with noisy labels. (2) A stronger framework that introduces an "agreement" strategy into DCL, termed ADCL, is designed. (3) A joint loss function is proposed for the dual-network structure. (4) Extensive experiments on public HSI data sets demonstrate the effectiveness of the proposed method.

Proposed ADCL Method
The loss function is an essential component of deep neural networks, and a noise-robust loss function can significantly improve their performance. In the proposed ADCL framework, two networks with the same structure are used, and a joint loss function is designed to make the framework more robust. The loss function takes into account both the supervision information and the mutual guidance information between the two networks.
The main idea of ADCL is to have the two networks guide each other in learning. To achieve this goal, in the training process of ADCL, the two networks predict all samples, and the samples with inconsistent predictions constitute the disagreement data. At the end of forward propagation, each deep network selects the data with small loss from the disagreement data to minimize its own loss. In the back propagation process, each network uses the small-loss data from the peer network to update its weight parameters. To make the model more powerful, in addition to selecting its own small-loss data from the disagreement data, each network also adds the data classified identically by the two networks into the peer network's back propagation. This design makes full use of the mutual information between the two networks, enhancing the noise tolerance of the model.
Taking a CNN as the backbone for our ADCL, Figure 1 shows the overall framework of ADCL. Firstly, the training data with noisy labels are fed into two CNNs, A and B. Secondly, after training on one mini-batch, disagreement and consistent data predictions are generated by the two networks. Thirdly, the two networks select their own small-loss data according to the designed loss. Fourthly, each network uses the small-loss data from the peer network and the consistent data from the two networks to update its own convolution kernels. Finally, after multiple epochs of training and updating, the two trained networks are combined to generate the classification map. In the next subsections, we will introduce the designed joint loss, then detail the proposed ADCL framework, and finally present a formula analysis of the proposed method.
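The per-mini-batch data selection described above can be sketched as follows. This is a minimal NumPy illustration; the function name `adcl_step` and the array-based interface are ours, not the paper's, and per-sample losses are assumed to be precomputed by each network.

```python
import numpy as np

def adcl_step(pred_a, pred_b, loss_a, loss_b, keep_ratio):
    """One ADCL data-selection step (illustrative sketch).

    pred_a, pred_b : predicted class indices of a mini-batch from networks A and B
    loss_a, loss_b : per-sample losses of the same mini-batch under A and B
    keep_ratio     : fraction lambda(e) of small-loss disagreement samples to keep
    Returns the sample indices each network should use to update itself
    (supplied by the peer network plus the consistent data).
    """
    disagree = np.where(pred_a != pred_b)[0]    # "disagreement" data
    consistent = np.where(pred_a == pred_b)[0]  # "agreement" (consistent) data
    k = max(1, int(round(keep_ratio * len(disagree)))) if len(disagree) else 0
    # each network keeps its own small-loss subset of the disagreement data
    small_a = disagree[np.argsort(loss_a[disagree])[:k]]
    small_b = disagree[np.argsort(loss_b[disagree])[:k]]
    # A is updated with B's small-loss data plus the consistent data, and vice versa
    update_a = np.concatenate([small_b, consistent])
    update_b = np.concatenate([small_a, consistent])
    return update_a, update_b
```

In a full training loop, the returned index sets would select the samples whose losses are back-propagated through each network, as described above.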

Joint Loss
In the case of a dual network, the most straightforward way to construct a loss function is to apply independent regularization when training each individual network. Although regularization can improve generalization performance by promoting consistency between the two networks, it can still be influenced by the memory effect of label noise [35]. Therefore, we adopt a joint loss function based on regularization techniques in this work.
Let $T = \{x_i, y_i\}_{i=1}^{N}$ be the training set with $N$ samples, where $x_i$ represents the $i$-th sample and $y_i \in \{1, \ldots, C\}$ is its observed label. The joint loss function is designed as

$l = l_S + \beta \, l_R, \qquad (1)$

where $l_S$ represents the supervision loss under the two networks, $l_R$ represents the relative loss, and the parameter $\beta$ balances the supervision loss and the relative loss.
The symmetric cross-entropy (SCE) adds a reverse cross-entropy (RCE) term to the cross-entropy (CE), so as to gain a certain robustness to label noise [70]. This paper adopts SCE to construct the supervision loss $l_S$. Before introducing SCE, the relationship between CE and the Kullback-Leibler (KL) divergence is analyzed first, and then the definition of SCE is introduced. Based on SCE, the supervision loss $l_S$ is constructed.
For each sample $x_i$, the class predictive distribution produced by a classifier is denoted as $p(c|x_i)$, and $q(c|x_i)$ represents the ground-truth distribution of the sample $x_i$ over the observed label. The CE loss is defined as

$l_{CE} = H(q, p) = -\sum_{c=1}^{C} q(c|x_i) \log p(c|x_i), \qquad (2)$

where $q(c|x_i) = 1$ when $c = y_i$, and $q(c|x_i) = 0$ otherwise. The relationship between the cross-entropy $H(q, p)$ and the KL divergence can be written as

$H(q, p) = H(q) + KL(q \| p). \qquad (3)$

Generally, $H(q)$ is a constant for a given ground-truth distribution, so it can be omitted from Formula (3) to obtain Formula (2). From the perspective of KL divergence, the essence of classification is to learn a prediction distribution $p(c|x_i)$ that is close to the ground-truth distribution $q(c|x_i)$, i.e., one that minimizes the KL divergence between the two distributions.
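As a quick numerical check of Formula (3), the following snippet verifies $H(q, p) = H(q) + KL(q \| p)$ for a pair of example distributions. A soft $q$ is used so that $H(q)$ is nonzero; for a one-hot $q$, $H(q) = 0$ and minimizing the CE is equivalent to minimizing $KL(q \| p)$.

```python
import numpy as np

q = np.array([0.7, 0.2, 0.1])      # example ground-truth distribution
p = np.array([0.5, 0.3, 0.2])      # example predicted distribution

H_qp = -np.sum(q * np.log(p))      # cross-entropy H(q, p)
H_q = -np.sum(q * np.log(q))       # entropy H(q)
KL_qp = np.sum(q * np.log(q / p))  # KL divergence KL(q || p)

# Formula (3): cross-entropy decomposes into entropy plus KL divergence
assert np.isclose(H_qp, H_q + KL_qp)
```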
In the case of label noise, $q(c|x_i)$, taken as the ground-truth distribution, does not represent the real class probability distribution. On the contrary, $p(c|x_i)$ to a certain extent reflects the true distribution. Therefore, in addition to treating $q(c|x_i)$ as the ground truth, we also need to consider the KL divergence in the other direction, namely $KL(p \| q)$. Thus, the symmetric KL divergence is written as

$SKL(q, p) = KL(q \| p) + KL(p \| q). \qquad (4)$

According to the relationship between the KL divergence and CE, the SCE and its corresponding loss can be written as follows:

$SCE = H(q, p) + H(p, q), \qquad (5)$

$l_{SCE} = -\sum_{c=1}^{C} q(c|x_i) \log p(c|x_i) - \sum_{c=1}^{C} p(c|x_i) \log q(c|x_i). \qquad (6)$

For the two networks A and B, the supervision loss $l_S$ constructed with the SCE loss is defined as

$l_S = l_{SCE}^{A} + l_{SCE}^{B}. \qquad (7)$

Generally speaking, the two networks can filter out the errors caused by noisy labels due to their different learning abilities, enabling the model to iterate forward stably. As can be seen from Formula (7), the supervision loss $l_S$ combines the losses under the two networks, and each individual network uses the SCE loss with good noise resistance, which enables $l_S$ to optimize the model in the right direction.
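A minimal NumPy sketch of the SCE-based supervision loss follows. It assumes one-hot ground-truth labels; following common SCE practice, $\log 0$ in the reverse term is clipped to a constant $A$, and an unweighted sum of the two networks' SCE losses is assumed (the paper's exact weighting may differ).

```python
import numpy as np

def sce_loss(p, y, num_classes, A=-4.0):
    """Symmetric cross-entropy for one sample: CE term H(q, p) plus the
    reverse term H(p, q), with log 0 in the reverse term clipped to A."""
    q = np.eye(num_classes)[y]                        # one-hot ground-truth q
    ce = -np.sum(q * np.log(np.clip(p, 1e-12, 1.0)))  # H(q, p)
    rce = -np.sum(p * np.maximum(np.log(np.clip(q, 1e-12, 1.0)), A))  # H(p, q)
    return ce + rce

def supervision_loss(p_a, p_b, y, num_classes):
    # Sketch of Formula (7): l_S combines the SCE losses under networks A and B
    return sce_loss(p_a, y, num_classes) + sce_loss(p_b, y, num_classes)
```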
In addition to the supervision loss $l_S$ on the two networks, which allows the model to be more stable, the relative loss $l_R$ between the two networks is also useful for identifying noisy labels. According to the principle of consistency maximization, different models will agree on the correct labels for most samples, while they are unlikely to agree on wrong labels. Suppose the predictive distributions of sample $x_i$ on the two networks A and B are $p_A$ and $p_B$, respectively; we use R-Drop [71] to regularize the model predictions by minimizing the bidirectional KL divergence between these two predictive distributions for the sample $x_i$. The R-Drop-based relative loss is defined as

$l_R = \frac{1}{2}\left( KL(p_A \| p_B) + KL(p_B \| p_A) \right). \qquad (8)$

It can be seen from Formula (8) that the relative loss between the two networks is paired and depends only on the predictive distributions. The relative loss reflects the degree of consistency of the two networks' discrimination of the same sample.
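The relative loss of Formula (8) can be sketched in NumPy as follows (an illustrative implementation over two predictive distributions; clipping guards against $\log 0$):

```python
import numpy as np

def relative_loss(p_a, p_b, eps=1e-12):
    """Bidirectional KL divergence between the predictive distributions of
    networks A and B for one sample (R-Drop-style relative loss)."""
    p_a = np.clip(p_a, eps, 1.0)
    p_b = np.clip(p_b, eps, 1.0)
    kl_ab = np.sum(p_a * np.log(p_a / p_b))  # KL(p_A || p_B)
    kl_ba = np.sum(p_b * np.log(p_b / p_a))  # KL(p_B || p_A)
    return 0.5 * (kl_ab + kl_ba)
```

By construction the loss is symmetric in the pair, zero when the two networks agree exactly, and positive otherwise, matching the consistency interpretation above.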
Through the above analysis, the supervision loss and relative loss are obtained by Formulas (7) and (8), respectively. To this end, the joint loss induced by Formula (1) can be written as

$l = l_{SCE}^{A} + l_{SCE}^{B} + \frac{\beta}{2}\left( KL(p_A \| p_B) + KL(p_B \| p_A) \right). \qquad (9)$

Agreement and Disagreement-Based Co-Learning Framework
Different learners utilize their own unique structures to learn decision boundaries and thus enjoy distinct learning abilities. Therefore, they are expected to exhibit distinct abilities to filter label noise when learning from data with noisy labels. In this work, we propose an HSI classification method based on co-learning with a dual network, so that the two networks can exchange and select samples with small losses; that is, network A (respectively, B) is updated with mini-batch data selected from B (respectively, A). If the selected samples are not completely "clean", the two networks adapt to correct each other's training errors. This is similar to cross-validation; since errors from one network are not propagated directly back to itself, the method based on co-learning with a dual network can be expected to handle higher noise.
As the number of iterations increases, the two networks reach an agreement, and co-learning decays into two self-training networks. To make the learning more robust, we incorporate the "disagreement" strategy into co-learning and put forward a more robust learning paradigm, namely disagreement-based co-learning (DCL). The training process of DCL includes two update steps: a data update and a parameter update. First, in the data update stage, the two deep networks predict all samples in a mini-batch and retain the data with inconsistent prediction results, which maintains the divergence of the two deep networks trained by DCL. Then, in the parameter update stage, each deep network chooses data with small loss from the disagreement data to minimize its own loss and utilizes the small-loss data from the peer network to update its own weight parameters. However, the "disagreement" strategy cannot guarantee real supervision information. Therefore, we leverage the "agreement" strategy in co-training to improve DCL and propose an agreement and disagreement-based co-learning (ADCL) framework for HSI classification. During the parameter update of ADCL, each network selects its own small-loss data from the disagreement data and adds the data on which the two networks produce the same classification results into the peer network for back propagation.
Figure 2 shows the detailed procedure of forward propagation and back propagation. For the $t$-th mini-batch $M^{(t)}$, the two networks A and B predict the mini-batch according to the parameters $w_A$ and $w_B$, respectively. The disagreement data $D^{(t)}$, i.e., the samples with inconsistent predictions from the two networks, are determined by Formula (10):

$D^{(t)} = \{ x_i \in M^{(t)} : \hat{y}_i^A \neq \hat{y}_i^B \}, \qquad (10)$

where $\hat{y}_i^A$ and $\hat{y}_i^B$ denote the class labels predicted for $x_i$ by networks A and B, respectively.
At the end of forward propagation, the small-loss subsets $D_A^{(t)}$ and $D_B^{(t)}$ are determined by Formula (11) so that the losses of networks A and B are minimized:

$D_A^{(t)} = \arg\min_{D' \subseteq D^{(t)},\, |D'| \geq \lambda(e)|D^{(t)}|} l_A(D'), \qquad D_B^{(t)} = \arg\min_{D' \subseteq D^{(t)},\, |D'| \geq \lambda(e)|D^{(t)}|} l_B(D'), \qquad (11)$
where $\lambda(e)$ is used to control how much small-loss data should be chosen in each epoch $e$.
Because of the memory effect, a deep network first fits the data without label noise and then gradually fits the data with label noise. Formula (12) relates the noise rate $r$ to the parameter $\lambda(e)$, which controls the amount of small-loss data to be chosen in each epoch:

$\lambda(e) = 1 - \min\left\{ \frac{e}{E_k}\, r, \; r \right\}, \qquad (12)$
where $E_k$ and $E_{max}$ represent a constant and the largest epoch value, respectively. As can be seen from Formula (12), $\lambda(e)$ is large at the beginning of the training phase, which maintains more small-loss data. As the epochs increase, less small-loss data are retained in each mini-batch. The gradual decrease of $\lambda(e)$ alleviates, to a great extent, the overfitting of the deep networks on noisy data.
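One plausible instantiation of the $\lambda(e)$ schedule described above is a co-teaching-style linear decay controlled by the noise rate $r$ and the warm-up constant $E_k$ (a sketch under this assumption; the paper's exact constants may differ):

```python
def small_loss_ratio(epoch, noise_rate, E_k):
    """lambda(e): fraction of small-loss disagreement data kept at a given
    epoch. Starts near 1, decays linearly to 1 - r over the first E_k
    epochs, then stays constant (assumed co-teaching-style form)."""
    return 1.0 - min(epoch / E_k, 1.0) * noise_rate
```

This keeps almost all small-loss data early in training, when the network has mostly fit clean samples, and retains only a $1 - r$ fraction later, when noisy samples start to be memorized.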
In back propagation, we use Formula (13) to update the network weight parameters, which ensures that real distributions play a role in training under a high noise rate:

$w_A \leftarrow w_A - \eta \nabla l\big(D_B^{(t)} \cup C^{(t)}; w_A\big), \qquad w_B \leftarrow w_B - \eta \nabla l\big(D_A^{(t)} \cup C^{(t)}; w_B\big), \qquad (13)$

where $\eta$ denotes the learning rate. It can be seen that when updating the weight parameters of networks A and B, not only the disagreement data but also the consistent data are used to calculate the loss. The consistent data $C^{(t)}$ in Formula (13) can be obtained by Formula (14):

$C^{(t)} = \{ x_i \in M^{(t)} : \hat{y}_i^A = \hat{y}_i^B \}. \qquad (14)$
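The cross-update can be illustrated on a toy model. In this sketch, a linear model with squared loss stands in for the CNNs, and the small-loss and consistent index sets are chosen by hand rather than by the selection rule, purely to show how A learns from B's data (plus the consistent data) and vice versa:

```python
import numpy as np

def sgd_step(w, X, y, lr=0.01):
    # one gradient step on the squared loss 0.5 * mean((X @ w - y)^2)
    grad = X.T @ (X @ w - y) / len(y)
    return w - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = X @ np.array([1.0, -2.0, 0.5])   # noiseless toy targets
w_a = np.zeros(3)
w_b = np.zeros(3)

idx_small_a = np.array([0, 1, 2])    # A's small-loss disagreement data (hand-picked)
idx_small_b = np.array([3, 4, 5])    # B's small-loss disagreement data (hand-picked)
idx_consistent = np.array([6, 7])    # consistent data

# cross update: A is trained on B's selection plus the consistent data,
# and B on A's selection plus the consistent data
sel_a = np.concatenate([idx_small_b, idx_consistent])
sel_b = np.concatenate([idx_small_a, idx_consistent])
w_a = sgd_step(w_a, X[sel_a], y[sel_a])
w_b = sgd_step(w_b, X[sel_b], y[sel_b])
```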

Formula Analysis
In this subsection, we use formulas to analyze the main procedure of the proposed method. We use a 2D convolutional neural network as the backbone to describe the method.
The symbols A and B represent two convolutional neural networks with initial convolutional kernels $w_A$ and $w_B$, $l$ denotes the $l$-th layer, and $\delta$ denotes the error term (gradient). Suppose that the data are divided into $m$ mini-batches. For the $i$-th mini-batch data $D$, the training process can be described as follows.

Forward propagation:
1. Assign the input data $x$ to the input neurons $a^1_A$ and $a^1_B$: $a^1_A = x$, $a^1_B = x$.
2. For the second layer to the $(L-1)$-th layer, perform forward propagation according to the following three cases:
2.1. If the current layer is a convolutional layer, then $z^l = a^{l-1} * w^l + b^l$ and $a^l = \sigma(z^l)$.
2.2. If the current layer is a pooling layer, then $a^l = \mathrm{pool}(a^{l-1})$.
2.3. If the current layer is a fully connected layer, then $z^l = w^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$.
3. Obtain the small-loss data $D_A$ and $D_B$ through Formula (11), and obtain the consistent data $C$ by Formula (14).
Back propagation:
1. Compute the gradients of the output layer, $\delta^L_A(D_B + C)$ and $\delta^L_B(D_A + C)$.
2. For the $(L-1)$-th layer to the second layer, perform backward propagation according to the following three cases:
2.1. If the current layer is a fully connected layer, then $\delta^l = (w^{l+1})^T \delta^{l+1} \odot \sigma'(z^l)$.
2.2. If the previous layer is a pooling layer, then $\delta^l = \mathrm{upsample}(\delta^{l+1}) \odot \sigma'(z^l)$.
2.3. If the previous layer is a convolutional layer, then $\delta^l = \delta^{l+1} * \mathrm{rot180}(w^{l+1}) \odot \sigma'(z^l)$.
3. Then, we can update the model parameters:
3.1. If the current layer is a fully connected layer, $w^l \leftarrow w^l - \eta \, \delta^l (a^{l-1})^T$ and $b^l \leftarrow b^l - \eta \, \delta^l$.
3.2. If the current layer is a convolutional layer, $w^l \leftarrow w^l - \eta \, (a^{l-1} * \delta^l)$ and $b^l \leftarrow b^l - \eta \sum_{u,v} (\delta^l)_{u,v}$.
The above are the main steps of the proposed method. It can be seen that the dual-network structure uses the information of the peer network for mutual guidance during training, and the network structure is easy to implement.
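The convolutional forward step in case 2.1 can be illustrated with a minimal single-channel "valid" convolution (a NumPy sketch; a ReLU stands in for the activation $\sigma$, and the function name is ours):

```python
import numpy as np

def conv2d_forward(a_prev, W, b):
    """Minimal 'valid' 2D convolution forward pass for one channel:
    z^l = a^{l-1} * W + b, followed by a ReLU activation a^l = max(z^l, 0)."""
    H, Wd = a_prev.shape
    k = W.shape[0]
    z = np.zeros((H - k + 1, Wd - k + 1))
    for i in range(z.shape[0]):
        for j in range(z.shape[1]):
            # dot product between the kernel and the current input window
            z[i, j] = np.sum(a_prev[i:i + k, j:j + k] * W) + b
    return np.maximum(z, 0.0), z  # activation a^l and pre-activation z^l
```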

HSI Data Sets
To demonstrate the effectiveness of our proposed method, we conducted experiments on three publicly available HSI data sets.The detailed descriptions of the three data sets are provided below:

For each data set, we randomly selected 10% of the samples as the training set, and the remaining 90% of the samples were treated as the testing set. Detailed descriptions of the three HSI data sets are given in Tables 1-3.

Experiment Settings
In our experiments, we used a 2D CNN as the backbone to implement our ADCL. For simplicity, we denote the 2D CNN-based ADCL as 2D-ADCL. In 2D-ADCL, the Adam optimizer was adopted to drive the training process. We set the learning rate to 0.001 with 150 epochs. We implemented 2D-ADCL in PyTorch 1.8.1, and a single NVIDIA RTX 3070 GPU with CUDA 11.1 was used to accelerate training.
To generate training sets with different label noise levels, we randomly selected a portion of samples from the training set and uniformly reassigned each of them to one of the other classes. We set different noise ratios r to obtain training sets with different levels of label noise. We used several metrics, including overall accuracy (OA), average accuracy (AA), and the kappa coefficient (k), to evaluate the classification performance of the proposed method. Detailed calculations of these metrics are given below.
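The noise injection procedure can be sketched as follows; a minimal illustration, with the function and parameter names (`add_symmetric_noise`, `noise_ratio`) chosen here for clarity rather than taken from the paper.

```python
import numpy as np

def add_symmetric_noise(labels, noise_ratio, num_classes, seed=0):
    """Randomly flip a noise_ratio fraction of integer labels to a
    different, uniformly chosen class (symmetric label noise)."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    n = len(noisy)
    # pick the samples to corrupt, without replacement
    noisy_idx = rng.choice(n, size=int(n * noise_ratio), replace=False)
    for i in noisy_idx:
        # uniformly pick any class other than the current one
        choices = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(choices)
    return noisy
```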
The overall accuracy can be calculated by Formula (15):
OA = (1/N) ∑_{i=1}^{C} M_i, (15)
where C represents the number of classes, N represents the total number of test samples, and M_i represents the number of correctly classified samples in the i-th class. The calculation of average accuracy is shown in Formula (16):
AA = (1/C) ∑_{i=1}^{C} UA_i, (16)
where UA_i = M_ii / ∑_{j=1}^{C} M_ij is the ratio of the number of correctly classified samples in the i-th class to the total number of samples in the i-th class, and M_ij represents the number of samples of the i-th class that are classified as the j-th class.
The kappa coefficient can be obtained by Formula (17):
k = (OA − p_e) / (1 − p_e), (17)
where p_e = (1/N²) ∑_{i=1}^{C} (∑_{j=1}^{C} M_ij)(∑_{j=1}^{C} M_ji) is the expected agreement by chance.
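The three metrics can be computed from a confusion matrix as in the following sketch, which assumes the standard definitions of OA, AA, and kappa.

```python
import numpy as np

def classification_metrics(conf):
    """OA, AA, and kappa from a C x C confusion matrix, where conf[i, j]
    counts samples of class i classified as class j."""
    conf = np.asarray(conf, dtype=float)
    n = conf.sum()
    oa = np.trace(conf) / n                    # overall accuracy
    ua = np.diag(conf) / conf.sum(axis=1)      # per-class accuracy UA_i
    aa = ua.mean()                             # average accuracy
    # expected chance agreement from row/column marginals
    pe = (conf.sum(axis=1) * conf.sum(axis=0)).sum() / n**2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```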

Evaluation of the Joint Loss Function
To evaluate the performance of the designed joint loss function, we set the noise ratio r to 0.3 for the experiments, i.e., 30% of the training samples were randomly assigned wrong labels. The proposed joint loss function has a parameter β that balances the supervision loss and the relative loss in the joint loss. We conducted experiments with values of β varying from 0.05 to 0.95 with a step of 0.05. The OA curves of 2D-ADCL on the three HSI data sets are illustrated in Figure 6. As can be seen from Figure 6, a relatively small β obtains better performance than a large β, which means that the relative loss deserves more attention.
The designed loss is related to the CE loss, SCE loss, and R-Drop loss. We compared the proposed joint loss with the CE, SCE, and R-Drop losses on the three HSI data sets, where β was set to 0.15 for the proposed joint loss. The classification accuracies of 2D-ADCL with the different loss functions on the three data sets are shown in Table 4. The results in Table 4 demonstrate that the proposed joint loss obtains the best performance, the R-Drop loss achieves suboptimal performance, and the other two loss functions perform less well.
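One plausible form of the joint loss consistent with this description is sketched below. The exact terms used in the paper are not reproduced here, so the cross-entropy supervision terms and the symmetric-KL relative term (in the spirit of R-Drop) are assumptions.

```python
import numpy as np

def cross_entropy(p, labels):
    """Mean cross-entropy of predicted distributions p against integer labels."""
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def symmetric_kl(p, q):
    """Symmetric KL divergence between two batches of class distributions."""
    kl_pq = np.sum(p * np.log((p + 1e-12) / (q + 1e-12)), axis=1)
    kl_qp = np.sum(q * np.log((q + 1e-12) / (p + 1e-12)), axis=1)
    return np.mean(kl_pq + kl_qp) / 2.0

def joint_loss(p_a, p_b, labels, beta=0.15):
    """Sketch of a joint loss: beta weights the supervision (CE) terms of
    the two networks, and (1 - beta) weights the relative (agreement) term
    between them; a small beta emphasizes the relative loss."""
    supervision = cross_entropy(p_a, labels) + cross_entropy(p_b, labels)
    relative = symmetric_kl(p_a, p_b)
    return beta * supervision + (1.0 - beta) * relative
```

When the two networks agree exactly, the relative term vanishes and only the supervision terms remain, which matches the intuition that disagreement between the networks is what the relative loss penalizes.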

Comparison with State-of-the-Art Methods
We compared 2D-ADCL with state-of-the-art methods to demonstrate its performance. The comparison methods include RLPA [15], DPNLD [12], SALP [17], DCRN [20], and S3Net [21]. The detailed settings for these methods were consistent with their corresponding references. It should be noted that the SVM classifier was used as the base classifier for RLPA, DPNLD, and SALP, while a 2D CNN was adopted as the backbone network for S3Net. The noise rate r was set to 0.3. We repeated each algorithm ten times to obtain the average results. Tables 5-7 show the OAs, AAs, kappa coefficients, and class-specific accuracies of the comparison methods on the SV, HOU, and KSC data sets, respectively. The best classification accuracies of the different methods are highlighted in bold.
The classification result maps of the different methods on the SV, HOU, and KSC data sets are illustrated in Figures 7-9, respectively. Several results can be observed from Tables 5-7 and Figures 7-9. Firstly, 2D-ADCL achieves higher class-specific accuracy than the other methods in most cases. Specifically, 2D-ADCL attains 13, 11, and 10 best class-specific accuracies on the SV, HOU, and KSC data sets, respectively. Secondly, 2D-ADCL achieves the best OAs, AAs, and kappa coefficients on all data sets. The average OA of 2D-ADCL is more than two percentage points higher than that of the second-place method and more than 10 percentage points higher than that of the last-place method.
Thirdly, the accuracies of the deep learning-based methods (DCRN, S3Net, and 2D-ADCL) are significantly higher than those of the traditional methods (RLPA, DPNLD, and SALP). Fourthly, the classification maps of the different methods on the three data sets demonstrate that 2D-ADCL achieves satisfactory classification results.



Performance Evaluation under Different Noise Rates
In order to study the effect of the noise rate on classification performance, we conducted experiments with noise rates ranging from 0.1 to 0.7 with a step of 0.05. When the noise rate was 0.1, the training set contained only 10% noisy samples; the number of noisy samples then increased gradually with the noise rate, reaching 70% when the noise rate was 0.7. We ran all methods ten times to obtain average results. The OA curves of the different methods on the SV, HOU, and KSC data sets are plotted in Figure 10. As seen from Figure 10, for each data set, the classification result of 2D-ADCL is consistently better than that of the other methods in terms of OA. The average OA of all comparison methods decreases as the noise rate increases. It can be seen that the deep learning-based methods (DCRN, S3Net, and 2D-ADCL) show a lower rate of OA attenuation than the traditional methods (RLPA, DPNLD, and SALP). When the noise rate increases from 0.1 to 0.7, the OA attenuation of 2D-ADCL on the SV, HOU, and KSC data sets is approximately within 5%, 8%, and 4%, respectively. This indicates that ADCL is robust to high noise rates.

Computational Cost
Computational cost is also an important metric for evaluating classification algorithms. We set the noise rate to 0.5 to compare the running times of the different methods, including training time and test time. Table 8 displays the running times of the different methods on the SV, HOU, and KSC data sets. The results in Table 8 show that the running times of the comparison methods range from tens to hundreds of seconds, and none of them require excessive running time. Another finding is that the running times of the deep learning-based methods are longer than those of the traditional methods, since deep learning methods need more time for training. Among the three deep learning methods (DCRN, S3Net, and 2D-ADCL), the running time of 2D-ADCL is slightly longer than that of the other two, which is still within an acceptable range.

Further Analysis
In order to investigate the role of the "agreement" strategy in ADCL, we compared ADCL with DCL, where DCL only adopts the small-loss data from the peer network to update the weight parameters. The other settings of DCL are the same as those of ADCL. Table 9 shows the classification results of the two methods with a noise rate of 0.5. As seen from the results in Table 9, the average OAs, AAs, and kappa coefficients of 2D-ADCL on the three data sets are higher than those of 2D-DCL, indicating that the "agreement" strategy plays an important role in learning from label noise.

Discussion
Previous experiments revealed some important findings that require further discussion. As demonstrated in Section 4.3, compared with related loss functions, the proposed joint loss function performs better because it makes full use of both the networks' own supervision information and the mutual information between the two networks. Additionally, the experimental results illustrate that the relative loss in the joint loss plays the more important role, because the supervision information from the peer network is more effective than a network's own supervision information in the presence of label noise.
As shown in Sections 4.4 and 4.5, compared with state-of-the-art methods, 2D-ADCL obtains better classification performance in terms of OA, AA, and kappa coefficient. In addition, 2D-ADCL is more robust to high noise rates. This can be attributed to several factors: the unified framework with a dual network that leverages the mutual guidance of the two networks, the "disagreement" and "agreement" strategies that enhance the model's discrimination ability, and the designed loss function that improves the model's robustness to label noise.
The experimental results in Section 4.6 indicate that the running time of 2D-ADCL is acceptable. The main reason is that the proposed framework is simple and does not involve complicated network structures. The results in Section 4.7 suggest that the "agreement" strategy is crucial in label noise learning, particularly in the presence of a high noise ratio, as the agreement data play a significant role.

Conclusions
In this paper, we proposed an ADCL framework for HSI classification with noisy labels. The proposed ADCL adopts a unified framework with a dual-network structure for label noise learning. The experimental results demonstrated the effectiveness of the proposed method. The previous results and analysis can be summarized in the following four conclusions:

•
The proposed framework, based on a dual-network structure, proved to be robust to label noise and can achieve good classification performance even at a high noise rate.

•
The designed joint loss function, composed of the supervision loss and the relative loss, demonstrated good robustness to label noise. This is because, when there is label noise, the self-supervised information of each network may not be completely accurate, but the mutual supervision information from both networks helps to correct and improve the accuracy of the predictions.

•
In terms of time efficiency, the proposed method is acceptable because it does not use a complex network beyond the dual-network structure.

•
The "agreement" strategy plays an important role in improving the classification accuracy, as it helps mitigate the problem of difficult convergence of neural networks when there is a high ratio of label noise.
A limitation of this work is that ADCL requires an estimate of the noise rate to determine the small-loss data, which may not be feasible in some scenarios. Future research could explore small-loss data selection methods that are independent of the noise rate. Additionally, this work only used a 2D CNN as the backbone for ADCL; other advanced neural networks could be adopted to further improve the performance of the proposed framework.

Figure 1 .
Figure 1. The framework of ADCL. The data set with noisy labels is first fed into two convolutional neural networks A and B. Then, each network uses the small-loss data from the peer network and the consistent data from the two networks to update its own parameters. At last, the trained networks A and B are fused to classify the data.

Figure 2 .
Figure 2. Detailed procedure of forward propagation and back propagation.


( 1 )
Salinas Valley (SV) [72]: The SV data set was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the agricultural area known as Salinas Valley, California, USA, in 1998. The data set contains 512 × 217 pixels characterized by 224 spectral bands. A total of 204 bands were used for the experiments after removing 20 redundant ones. The spatial resolution of SV is 3.7 m per pixel, and the land cover contains 16 classes. The three-band pseudocolor image of the SV and its corresponding reference map are illustrated in Figure 3.

Figure 3 .
Figure 3. Pseudocolor image and reference map of the SV. (a) Three-band pseudocolor image. The image is generated by using bands 180, 27 and 17 as the R, G, and B channels, respectively. (b) Reference map. The number represents the class number, where 0 represents the background.
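The pseudocolor composites in Figures 3-5 can be generated along these lines; a minimal sketch assuming the HSI cube is stored as an (H, W, B) array (the band indices in the captions may need adjusting for 0- vs 1-based numbering).

```python
import numpy as np

def pseudocolor(cube, bands=(180, 27, 17)):
    """Build a three-band pseudocolor image from an HSI cube of shape
    (H, W, B) by mapping the given bands to the R, G, and B channels
    and stretching each channel to [0, 1]."""
    rgb = cube[:, :, list(bands)].astype(float)
    for c in range(3):
        ch = rgb[:, :, c]
        rgb[:, :, c] = (ch - ch.min()) / (ch.max() - ch.min() + 1e-12)
    return rgb
```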


( 2 )
Houston (HOU) [73]: The HOU data set was obtained by the ITRES CASI-1500 sensor and provided by the 2013 IEEE GRSS Data Fusion Competition. The data set contains 349 × 1905 pixels characterized by 144 spectral bands ranging from 364 to 1046 nm. The spatial resolution of HOU is 2.5 m per pixel, and the land cover includes 15 classes. The three-band pseudocolor image of the HOU and its corresponding reference map are shown in Figure 4.


Figure 4 .
Figure 4. Pseudocolor image and reference map of the HOU. (a) Three-band pseudocolor image. The image is generated by using bands 70, 50 and 20 as the R, G, and B channels, respectively. (b) Reference map. The number represents the class number, where 0 represents the background.

( 3 )
Kennedy Space Center (KSC) [72]: The KSC data set was acquired by the AVIRIS sensor over the KSC, Florida, on 23 March 1996. The data set contains 512 × 614 pixels characterized by 224 spectral bands. A total of 176 bands were retained for our experiment after removing water absorption bands and low signal-to-noise ratio bands. The spatial resolution of KSC is 3.7 m per pixel, and the land cover includes 13 classes. The three-band pseudocolor image of the KSC and its corresponding reference map are shown in Figure 5.

Figure 5 .
Figure 5. Pseudocolor image and reference map of the KSC. (a) Three-band pseudocolor image. The image is generated by using bands 28, 19 and 10 as the R, G, and B channels, respectively. (b) Reference map. The number represents the class number, where 0 represents the background.


Figure 6 .
Figure 6. Classification accuracies (%) of 2D-ADCL on the three data sets under different values of β, where β ranges from 0.05 to 0.95 with a step size of 0.05.

Figure 10 .
Figure 10. The influence of the noise rate (r) on the classification accuracy. The horizontal axis represents the noise rate ranging from 0.1 to 0.7, and the vertical axis represents the OA of RLPA, DPNLD, SALP, DCRN, S3Net, and 2D-ADCL. (a) SV. (b) HOU. (c) KSC.

Table 1 .
The class information and data partition of SV data set.


Table 2 .
The class information and data partition of HOU data set.

Table 3 .
The class information and data partition of KSC data set.

Table 4 .
Classification accuracy of different loss functions on the three data sets. The comparison loss functions include CE, SCE, and R-Drop. The OAs, AAs, and kappas of the different methods are reported.


Table 8 .
Running times (s) on the three data sets. Running times of RLPA, DPNLD, SALP, DCRN, S3Net, and 2D-ADCL, where the running time consists of training time and testing time.

Table 9 .
Classification results in terms of OA, AA, and kappa on the three data sets. Classification accuracies obtained by 2D-DCL and 2D-ADCL with 50% noisy labels in the training set.