Hyperspectral Images Weakly Supervised Classification with Noisy Labels

The deep network model relies on sufficient training samples to achieve superior processing performance, which limits its application in hyperspectral image (HSI) classification. In order to perform HSI classification with noisy labels, a robust weakly supervised feature learning (WSFL) architecture combined with multi-model attention is proposed. Specifically, the input noisy labeled data are first passed through multiple groups of residual spectral attention models and multi-granularity residual spatial attention models, enabling WSFL to refine and optimize the extracted spectral and spatial features, with a focus on extracting clean-sample information and reducing the model's dependence on labels. Finally, the fused and optimized spectral-spatial features are mapped to a multilayer perceptron (MLP) classifier to strengthen the model's constraint on noisy samples. Experimental results on public datasets, including Pavia Center, WHU-Hi-LongKou, and Hangzhou, show that WSFL classifies noisy labels better than strong models such as the spectral-spatial residual network (SSRN) and the dual-channel residual network (DCRN). On the Hangzhou dataset, the classification accuracy of WSFL exceeds DCRN by 6.02% and SSRN by 7.85%, respectively.


Introduction
Hyperspectral remote sensing uses a large number of narrow electromagnetic wave channels to obtain spatial, radiometric, and spectral information of ground objects, covering the visible and infrared bands of the electromagnetic spectrum [1,2]. Due to the "combination of image and spectrum" characteristic of hyperspectral images, they contain a much richer description of the ground surface. By fully utilizing this characteristic, accurate classification of ground objects can be achieved [3]. Therefore, hyperspectral remote sensing has been widely applied in urban planning [4], environmental monitoring [5], and precision agriculture [6,7].
Early hyperspectral image classification mostly used supervised classification methods, whose performance relied on high-quality labels [8]. Pal et al. mapped hyperspectral data to a high-dimensional space and found an optimal separating hyperplane in that space to maximize the margin between classes and achieve the best classification effect [9]. Cui et al. considered the relationships between hyperspectral data classes, effectively combining a sparse representation classifier with K-nearest neighbors to increase classification accuracy [10].
Due to intra-class complexity and the scarcity of labeled samples, it is challenging to achieve high-precision ground object classification through spectral features alone [11]. Gu et al. proposed a multi-kernel learning (MKL) architecture to learn spectral and spatial information, combined with a support vector machine (SVM) [12]. Liu et al. proposed a multi-morphic superpixel method to extract spectral and spatial features.

In this paper, multiple sets of residual spectral attention models are designed in the spectral dimension. The later-stage spectral features are differentiated across multiple sets of spectral feature spaces to avoid excessive concentration of single-layer spectral features and memorization of noisy samples. In addition, in order to obtain more high-quality features and reduce the model's dependence on samples, a multi-granularity residual spatial attention model is designed in the spatial dimension; in the multi-granularity space, spatial features are further refined to obtain finer spatial features. Finally, in order to eliminate the adverse effects of local connectivity in the model and focus on the spatial structure information of more data, an MLP model is introduced, with a focus on learning spectral-spatial features to enhance the overall model's feature capture ability. The main contributions are summarized as follows:
1. This paper proposes a weakly supervised feature learning architecture combined with multi-model attention, which builds a more robust network that classifies noisy samples more stably and accurately;
2. In order to enhance the constraint of the spectral dimension on noisy samples, multiple sets of residual spectral attention models are designed to enhance the ability to learn clean samples and weaken the model's fitting of noisy samples;
3. In order to improve the utilization of clean samples in weakly supervised models, a multi-granularity residual spatial attention model is designed to gradually extract clean-sample information from the spatial dimension and obtain more significant features;
4. We introduce an MLP model to further extract spectral-spatial features, eliminate the adverse effects of local connectivity in the model, pay more attention to the spatial structure information of the data, and improve the overall model's resistance to noise.

Methodology
In this section, we introduce the main architecture of WSFL in detail, as shown in Figure 1, including the multi-group residual spectral attention model (MGRSAM), the multi-granularity residual spatial attention model (MRSAM), the MLP model, the noise loss function, and the Lion optimizer. In addition, samples labeled incorrectly are called noisy samples, and samples labeled correctly are called clean samples. First, the 3D data cube is input to MGRSAM and MRSAM to extract spectral and spatial features. In MGRSAM, the first two convolutional layers perform coarse feature extraction in the spectral dimension. Subsequently, the extracted features are mapped to multiple sets of spectral feature spaces through the group convolution (GConv) layer to reduce the model's ability to fit noisy samples. In addition, the features of the first layer are mapped into this space by means of skip connections, which alleviates the gradient degradation caused by increasing network depth. Secondly, the output features are mapped to the spectral feature attention space, focusing on extracting clean samples' features and suppressing the influence of noisy samples.

Spectral and Spatial Feature Extraction
In order to improve the robustness and generalization ability of image classification with noisy labels, this paper addresses two aspects separately. On the one hand, in response to the fact that neural networks easily memorize clean samples in the early stage and gradually memorize noisy samples in the later stage, this paper designs multiple sets of residual spectral attention models in the spectral dimension of hyperspectral data. In early training, rough extraction of spectral-dimension features is performed to obtain higher-quality feature maps and enhance noise resistance. In later training, in order to avoid the model overfitting the features of noisy samples, the input features are mapped to multiple sets of spectral feature spaces, and the later spectral features are processed in a grouped manner. Each set of spectral features is finely extracted to avoid a single layer of spectral features becoming too concentrated and thereby memorizing the noisy samples. Meanwhile, while reducing the fitting of noisy samples, the ability to fit clean samples is strengthened, and more clean spectral features are learned through the spectral attention model.

On the other hand, a multi-granularity residual spatial attention model is designed in the spatial dimension of HSI to solve the problem that supervised learning depends too heavily on labeled samples and easily overfits noisy samples. In early training of the spatial dimension, learning more discriminative spatial features through the spatial attention space weakens the spatial feature weights of noisy samples. Secondly, we map the input features to a multi-granularity space, extract important features in the spatial domain, obtain the similarity features of a large number of positive and negative sample pairs, mine the feature representation information of the dataset, obtain predicted labels, and reduce the dependence of supervised learning on labels. These two parts are introduced in detail as follows.

Spectral Feature Extraction
This paper carefully designs the network architecture in the spectral dimension of hyperspectral data. In early training of the spectral dimension, the focus is on obtaining higher-quality feature maps to improve the early-stage noise resistance. Then, in order to prevent the model from overfitting noisy samples in later training, different channels are grouped to avoid excessive concentration of noise features in one layer, reducing the spectral dimension's tendency to overfit noisy samples in the later stage. However, this approach also reduces the ability to fit clean samples. To address this, the spectral feature attention space is used to focus on the more discriminative clean-sample features among numerous features, while suppressing unnecessary noise information and enhancing the spectral dimension's ability to fit clean samples in the later stage. The multiple residual spectral attention models are composed of convolutional layers, spectral feature attention spaces, multiple spectral feature spaces, and residual blocks, as shown in Figure 2.
In the early training process of the spectral dimension, we use a 1 × 1 × 7 convolutional layer for coarse feature extraction. The stride of the first convolutional kernel is (1, 1, 2) to remove information redundancy in adjacent bands, allowing the model to focus on more important spectral features while maintaining the original spatial correlation, improving the early noise resistance of the spectral-dimension model.
In the later stage of spectral-dimension training, when the concatenated features of the spectral dimension are mapped to multiple sets of spectral feature spaces, different channels are grouped to prevent noise features from being too concentrated in one layer, and to avoid that layer fitting noisy samples and affecting the overall later-stage training of the spectral dimension. Multiple spectral feature spaces reduce the ability of later models to fit noisy samples. In addition, feature extraction on each group of spectral features can better explore spectral information and enhance the noise resistance of the multiple spectral feature spaces.
As shown in Figure 3, there are multiple groups of spectral feature spaces, where Group = 3 and the size of each group of feature maps is H × W × C1/3, corresponding to the height, width, and number of channels. The size of each group of convolution kernels is h1 × w1 × C1/3, corresponding to the height, width, and number of channels of the convolution kernels. Convolution is performed within the corresponding group, and the final output is obtained by stacking the per-group output features.
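As an illustration, the grouped mapping of Figure 3 can be sketched in NumPy with 1 × 1 convolutions (a minimal hypothetical example; the paper's kernels are h1 × w1 × C1/3 and have spatial extent, which is omitted here):

```python
import numpy as np

def group_conv_1x1(x, weights):
    """Split the channels into len(weights) groups, apply an independent
    1x1 convolution (channel mixing) to each group, then stack the
    per-group outputs, as in the multiple spectral feature spaces."""
    groups = np.split(x, len(weights), axis=-1)       # each H x W x C/G
    outs = [g @ w for g, w in zip(groups, weights)]   # 1x1 conv == per-group channel mixing
    return np.concatenate(outs, axis=-1)

H, W, C, G = 7, 7, 48, 3
x = np.random.randn(H, W, C)
ws = [np.random.randn(C // G, C // G) for _ in range(G)]  # per-group mixing weights
y = group_conv_1x1(x, ws)
print(y.shape)  # (7, 7, 48)
```

Because each group only sees its own C/3 channels, noise features concentrated in one group cannot influence the features learned in the others.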
Although the fitting ability for noisy samples can be effectively reduced in multiple sets of spectral feature spaces, this also reduces the fitting ability for clean samples. In order to enhance the feature capture ability for clean samples in the later stage of the spectral dimension, the features Zc output from the multiple sets of spectral feature spaces are mapped to the spectral attention space. In this space, different features have different weights, and the allocation of weights is based on similarity: features with high similarity are considered clean-sample features, while features with low similarity are considered noisy-sample features.
Firstly, we transform Zc from top to bottom to obtain Kc, Qc, and Vc, representing the key vector, query vector, and value vector, respectively. The subscript c denotes the channel attention module, and P × P is the spatial dimension with a channel count of 48. We calculate the similarity between Kc and Qc, as shown in Formula (1).
Then, the Softmax function is used to obtain the pixel weight matrix Wc, where Wc(i, j) represents the similarity of pixel i to pixel j.
The spectral attention feature is obtained by multiplying Vc by Wc^T; weighting Vc with Wc^T in this way yields a discriminative feature.
Finally, we transform the channel attention feature into Ac*, so that its dimension matches the input feature, and add the spectral attention feature to the input for later training, as shown in Formula (4).
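The channel-attention pipeline described by Formulas (1)-(4) can be sketched as follows (a NumPy illustration under assumed shapes, not the paper's implementation; plain linear maps stand in for the convolutional transforms of Zc):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spectral_attention(z, wk, wq, wv):
    """Channel (spectral) attention sketch.  z: C x N feature map, with
    N = P*P spatial positions.  Channels similar to many others (assumed
    clean-sample features) receive larger attention weights."""
    k, q, v = wk @ z, wq @ z, wv @ z   # K_c, Q_c, V_c obtained from Z_c
    w = softmax(k @ q.T, axis=-1)      # similarity of K_c and Q_c (Formula (1)), Softmax -> W_c
    a = w.T @ v                        # spectral attention feature: V_c weighted by W_c^T
    return z + a                       # A_c* added back to the input (Formula (4))

C, P = 48, 7
z = np.random.randn(C, P * P)
wk, wq, wv = (np.random.randn(C, C) for _ in range(3))
out = spectral_attention(z, wk, wq, wv)
print(out.shape)  # (48, 49)
```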

Spatial Feature Extraction
For spatial feature extraction, residual networks can effectively extract spatial features and prevent overfitting. However, the presence of noisy samples easily leads to part of the noisy-sample features being transmitted to lower layers after each skip connection. In order to obtain more significant semantic features, noise information around the target pixel is suppressed at different spatial positions, highlighting clean-sample features, improving the model's efficiency of feature utilization, and reducing its dependence on annotated data.
This paper designs a network architecture in the hyperspectral spatial dimension. In early training of the spatial dimension, the spatial feature attention space is used to generate a weight value for each pixel in the input patch. This suppresses the negative impact of noisy samples on feature extraction and thereby strengthens spatial texture features. The size of the patch is 7 × 7. Compared with smaller neighborhoods, larger neighborhoods mean that the input contains more spatial information, which also increases the number of noisy samples. Hence, relying solely on the spatial feature attention space cannot completely reduce the interference of noisy samples in the spatial dimension. Therefore, in later training of the spatial dimension, each layer of feature maps is separated through the multi-granularity space, and each feature map is further subdivided into 3 × 3 regions, where multi-granularity refers to processing and extracting features of the feature maps at different levels. In the multi-granularity space, emphasis is placed on the ground feature information within each granularity to obtain more discriminative spatial features and further enhance the feature capture ability for clean samples. The multi-granularity residual spatial attention model consists of convolutional layers, spatial feature attention spaces, multi-granularity spaces, and residual blocks, with the architecture shown in Figure 4.
In early training of the spatial dimension, 3D convolution is first used to map the hyperspectral data to the multi-granularity residual spatial attention model with only one band; the convolution kernel size is 1 × 1 × 102. Then coarse feature extraction is performed using the first convolutional layer, and the 7 × 7 neighborhood patch is mapped to the spatial feature attention space. In order to better process spatial information, both the second and third convolutional layers use 2D convolutions of size 3 × 3 to extract spatial features, and each convolution is followed by a BatchNorm (BN) layer and a ReLU layer.
Firstly, a convolution operation is performed on the input Zs to obtain three feature maps, Ks, Qs, and Vs, from top to bottom, where the subscript s denotes the spatial attention module; WK, WQ, and WV are the weights of the corresponding convolutional layers, and BK, BQ, and BV are the bias terms. The three feature maps are obtained as shown in Formula (5).
We transform the three feature maps to obtain Ks^T, Qs, and Vs, and multiply Ks^T by Qs to calculate the correlation of pixels in the spatial feature map, as shown in Formula (6).
Then, Softmax is used to obtain the pixel weight matrix Ws, where Ws(i, j) represents the impact of pixel i on pixel j. A larger weight value indicates a stronger correlation between spatial pixels, as shown in Formula (7).
Subsequently, the spatial attention features are obtained by multiplying Vs by Ws^T; the spatial features with large weights are more helpful for improving classification results, as shown in Formula (8).
As = Ws^T Vs (8)

Finally, we transform the spatial attention feature into As*, and add the spatial attention feature to the input until convergence, as shown in Formula (9).
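Collecting the steps above, the spatial attention computation can be written compactly (a reconstruction consistent with the surrounding definitions; $\ast$ denotes the convolutions of Formula (5), and $E_s$ is a name introduced here for the intermediate correlation map):

```latex
\begin{aligned}
K_s &= W_K \ast Z_s + B_K,\quad Q_s = W_Q \ast Z_s + B_Q,\quad V_s = W_V \ast Z_s + B_V, && (5)\\
E_s &= K_s^{T} Q_s, && (6)\\
W_s(i,j) &= \frac{\exp\bigl(E_s(i,j)\bigr)}{\sum_{j}\exp\bigl(E_s(i,j)\bigr)}, && (7)\\
A_s &= W_s^{T} V_s, && (8)\\
Z_s^{*} &= Z_s + A_s^{*}. && (9)
\end{aligned}
```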
During later training of the spatial dimension, we use residual blocks to concatenate the output of the first-layer network with the output Zs* of the spatial feature attention space at a spatial size of 7 × 7, which is divided into 3 × 3 regions in the multi-granularity space. The regions form different particles, with varying levels of information contained within each particle. The purpose of multi-granularity is to prevent noisy-sample features from concentrating in one part of the feature map, where they would affect the feature information of adjacent clean samples. The multi-granularity space is shown in Figure 5. After multi-granularity partitioning, the features exist in the form of particles, weakening the interference between particles and re-extracting more significant feature information from each particle. The particle size is the size of the convolution kernel in the depthwise convolution. Assuming that the feature map of the multi-granularity residual spatial attention model is I ∈ R^(w×h×c), the depthwise convolution divides the feature map into several semantic tokens of different granularity, where w is the width of the feature map, h is its height, and c is the number of bands. Ti can be obtained by Formula (10).
Here, i represents the i-th granularity branch, and DWConv2D represents a two-dimensional depthwise convolution operation. By setting the size of the depthwise convolution kernel, the granularity can be adjusted.
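Formula (10)'s depthwise convolution can be illustrated with a direct NumPy sketch (toy sizes; each band keeps its own kernel, so particles in one band do not mix with other bands, and the granularity equals the kernel size):

```python
import numpy as np

def dwconv2d(img, kernels):
    """Depthwise 2D convolution (valid padding): every band c of img
    (w x h x c) is convolved only with its own k x k kernel."""
    w, h, c = img.shape
    k = kernels.shape[0]
    out = np.zeros((w - k + 1, h - k + 1, c))
    for ch in range(c):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j, ch] = np.sum(img[i:i+k, j:j+k, ch] * kernels[:, :, ch])
    return out

img = np.random.randn(7, 7, 4)       # toy feature map I in R^{w x h x c}
kernels = np.random.randn(3, 3, 4)   # one 3x3 kernel per band -> 3x3 granularity
t = dwconv2d(img, kernels)
print(t.shape)  # (5, 5, 4)
```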

Spectral-Spatial Feature Extraction
At present, most algorithms for noisy label classification use only a single network model, which has a weaker constraint on noisy labels and a higher misjudgment rate than multiple network models. For these reasons, after redesigning the architectures of the spectral and spatial dimensions, this paper also uses an MLP model to further process the fused spectral-spatial features. The MLP model is cascaded with the multiple sets of residual spectral attention models and the multi-granularity residual spatial attention model to form the weakly supervised feature learning architecture, so that noisy labels are constrained by different models in different dimensions.

As a neural network with fewer constraints, the MLP can eliminate the adverse effects of local connectivity, giving the model strong discrimination ability for small differences in the local field of view; it effectively extracts deep features, accurately acquires spectral-spatial structure information, and further reduces the interference of noisy samples in the model [31]. Therefore, this paper introduces the MLP neural network as the final model for processing the spectral-spatial dimension. In this stage, the concat function is first used to integrate the spectral and spatial information of the data, combining spectral information of dimension 128 × 7 × 7 with spatial information of dimension 24 × 7 × 7 into feature maps of dimension 152 × 7 × 7 that fuse both. An average pooling layer of size 7 × 7 then reduces the size of the feature maps while maintaining spatial information, reducing the number of parameters to be optimized and yielding a vector of size 152 × 1. Finally, the vector is input to an MLP composed of a fully connected layer, the GELU activation function, and a Dropout layer, and propagates forward to complete the final classification. Through the multilayer perceptron classifier, the spectral-spatial feature information can be further exploited, and the feature information of noisy labels can be constrained to the greatest extent.
Next, the MLP is introduced, as shown in Figure 6. It consists of three parts: the fully connected (FC) layer, the GELU activation function, and the Dropout layer, in which the layers are fully connected. By using the GELU activation function, negative inputs are mapped to non-zero values, avoiding the dead-neuron problem of the ReLU activation function and retaining the model's feature information in negative signals, which increases the MLP's ability to learn small differences within local features. In addition, the Dropout layer randomly discards the values of 0.1% of the neurons to avoid overfitting.
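The head described above can be sketched as follows (a NumPy illustration at inference time; the hidden width of 64 is an assumption not stated in the text, the class count of 9 follows the Pavia Center ground truth, and Dropout is omitted since it is inactive at inference):

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp_head(feat, w1, b1, w2, b2):
    """7x7 average pooling collapses the fused 152 x 7 x 7 spectral-spatial
    map to a 152-vector, then FC -> GELU -> FC produces class logits."""
    v = feat.mean(axis=(1, 2))   # global average pool over 7x7 -> (152,)
    h = gelu(w1 @ v + b1)        # hidden fully connected layer + GELU
    return w2 @ h + b2           # class logits

rng = np.random.default_rng(0)
feat = rng.standard_normal((152, 7, 7))               # concat of 128 spectral + 24 spatial maps
w1, b1 = rng.standard_normal((64, 152)), np.zeros(64)  # hidden width 64 (assumed)
w2, b2 = rng.standard_normal((9, 64)), np.zeros(9)     # 9 classes (Pavia Center)
logits = mlp_head(feat, w1, b1, w2, b2)
print(logits.shape)  # (9,)
```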

Lion Optimization
Hyperspectral data contain noisy samples, so each training batch includes many confounding data points. If a small training batch is used and the number of noisy samples drawn in the batch exceeds the number of clean samples, the model cannot fully learn the features of clean samples. Therefore, we increase the batch size, and with it the number of learnable samples per batch, during each training phase. However, the currently popular AdamW optimizer is typically applied with small batch sizes.
Compared with AdamW and other adaptive optimizers that must store both first-order and second-order moments, Lion only requires momentum, halving the additional memory footprint, which is beneficial for training large models with large batch sizes [32]. Therefore, this paper introduces the Lion optimizer to simplify parameter updating. Taking the t-th iteration of gradient descent as an example, the Lion optimizer process is shown in Formula (15).
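A minimal sketch of one Lion update, following the published Lion update rule (the learning rate, weight decay, and toy gradient below are illustrative, and Formula (15) in the paper may differ in notation):

```python
import numpy as np

def lion_step(theta, m, g, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update: the update direction is only the sign of an
    interpolated momentum, and a single momentum buffer m is kept
    (no second moment, unlike AdamW)."""
    update = np.sign(beta1 * m + (1 - beta1) * g)  # sign(...) in {-1, 0, +1}
    theta = theta - lr * (update + wd * theta)     # step with decoupled weight decay
    m = beta2 * m + (1 - beta2) * g                # momentum update
    return theta, m

theta = np.zeros(3)
m = np.zeros(3)
g = np.array([0.5, -2.0, 0.0])
theta, m = lion_step(theta, m, g, lr=0.1)
print(theta)  # [-0.1  0.1  0. ]
```

Because the step size is the same for every parameter (only its sign varies), Lion tolerates the larger batch sizes used in this paper better than adaptive per-parameter step sizes.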

When the input value is positive, sign(·) is 1, and when the input value is negative, sign(·) is −1. mt is the momentum term, β1 and β2 are hyperparameters with default values of 0.9 and 0.99, respectively, and gt is the gradient of the loss function on the current sample.

Results
In order to verify the accuracy and efficiency of the proposed model, experiments were conducted on three datasets, and the model was evaluated using three criteria: overall accuracy (OA), average accuracy (AA), and the Kappa coefficient. We also measured the running time of each model to evaluate its efficiency.
The Pavia Center dataset was captured by the ROSIS sensor during a flight campaign over Pavia, northern Italy. It consists of 1096 × 715 pixels with a spatial resolution of 1.3 m. After removing 13 bad bands, it has 102 bands (430-860 nm). The ground truth contains nine classes representing a typical urban site. The WHU-Hi-LongKou dataset covers a simple agricultural area with six kinds of crops and was captured by an 8-mm focal length Headwall Nano-Hyperspec sensor mounted on a DJI Matrice 600 Pro UAV platform. The image size is 550 × 400 pixels, with 270 bands ranging from 400 to 1000 nm. The Hangzhou dataset was obtained by the EO-1 Hyperion hyperspectral sensor; it retains 198 bands after removing 22 bad bands and has 590 × 230 pixels. The false-color images and corresponding ground-truth maps of the three datasets can be seen in Tables 1-3.

Experimental Setting
The GPU server used in this article was manufactured by Finehoo Technology Co., Ltd., Shanghai, China. The Python version was 3.7, and the experimental environment comprised an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10 GHz processor, 128 GB of RAM, and an NVIDIA GeForce RTX 2080Ti GPU; the deep learning frameworks were PyTorch and TensorFlow. The initial learning rate was set to 0.00012, the Lion optimization algorithm was used to train the model, and the number of iterations was set to 6000. To reduce the randomness introduced by training-sample selection, each experiment was repeated 10 times and the average accuracy was taken as the final result. To evaluate the effectiveness of the proposed model, it is compared with five other algorithms: the depthwise separable neural network (DSNN) [36], 2DCNN [37], 3DCNN [38], SSRN [39], and an advanced classification network for noisy samples, the dual channel residual network (DCRN) [30]. For DSNN, 2DCNN, 3DCNN, SSRN, and DCRN, the learning rate was set to 0.001, the optimizer was AdamW, the early-stopping method was used to train the networks, and the number of epochs was set to 4000.

Classification Results of Different Methods
Tables 4–6 show the classification results of the different methods with 24 clean samples per class for the PC, LK, and HZ data, respectively. The OA of WSFL is the highest, at 98.77%, 97.58%, and 80.14%, respectively. Taking the HZ data as an example, WSFL improves on DCRN, SSRN, 3DCNN, 2DCNN, and DSNN by 0.62%, 0.95%, 3.74%, 5.63%, and 3.05%, respectively. In summary, WSFL remains the best-performing model even without added noisy samples.

Results of PC Datasets with Different Numbers of Noise Samples
The results of classifying the PC dataset with different methods are shown in Table 7. From the PC dataset, 24 clean samples were taken from each class, together with four, eight, or 12 noisy samples per class, to verify the processing ability of the different deep learning models. WSFL had the best overall classification results, reaching 98.52%, 97.50%, and 96.77%, respectively. In addition, the training samples selected in this paper account for approximately 0.1944% of the total sample size; compared with the roughly 3% training sample size required by other popular deep models, the required sample size is greatly reduced, which also shows that the proposed model has a reduced dependence on labeled samples. Table 7 shows that with 24 clean samples and four noisy samples, the OA of the proposed WSFL model reached 98.52%, which is 1.4%, 3.1%, and 9.43% higher than DCRN, SSRN, and 3DCNN, respectively. With 24 clean samples and eight noisy samples, the OA reached 97.50%, which is 1.16%, 3.48%, and 17.52% higher than DCRN, SSRN, and 3DCNN, respectively. With 24 clean samples and 12 noisy samples, the OA reached 96.77%, which is 1.33%, 6.57%, and 27.79% higher than DCRN, SSRN, and 3DCNN, respectively. In summary, WSFL significantly improves OA under different noisy-sample sizes. Although the proposed method cannot achieve the best accuracy for every class, it is best in seven classes with 24 clean samples and four noisy samples, and in six classes both with 24 clean samples and eight noisy samples and with 24 clean samples and 12 noisy samples. This also shows that WSFL handles noisy samples better than models such as DCRN, SSRN, and 3DCNN. Moreover, as the number of noisy samples increases, the OA of WSFL decreases by only about 1% per step, which is acceptable given that the number of noisy samples is multiplied. This fully demonstrates the effectiveness of WSFL in HSI classification tasks with noisy labels.
Finally, Figure 7 shows the false-color maps of the classification results of the various methods on the PC dataset. As a subjective evaluation indicator, false-color maps display the classification effect more intuitively. Figure 7 shows that WSFL improves classification performance significantly over DCRN, SSRN, and 3DCNN: the misclassified area is greatly reduced, and the result is closer to the true distribution of ground objects.

Results of LK Datasets with Different Numbers of Noise Samples
The classification results for the LK dataset are shown in Table 8. Each category of the LK dataset has abundant annotated samples, so all methods perform well. Compared with the other models, the proposed WSFL model achieved the best classification performance, with significant improvements in most categories. Although the proposed method cannot achieve the best accuracy for every class, it is best in six classes both with 24 clean samples and four noisy samples and with 24 clean samples and eight noisy samples, and in seven classes with 24 clean samples and 12 noisy samples. This again shows that WSFL handles noisy samples better than models such as DCRN, SSRN, and 3DCNN, fully demonstrating its effectiveness in HSI classification tasks with noisy labels.

Finally, Figure 8 shows the false-color maps of the various classification methods on the LK dataset. Taking 24 clean + four noisy samples as an example, the broad-leaf soybean in the middle part suffers severe classification confusion. Compared with DCRN, SSRN, and 3DCNN, the proposed model shows fewer misclassifications and is closer to the true distribution of ground objects.

Results of HZ Datasets with Different Numbers of Noise Samples
The HZ dataset is characterized by small inter-class differences and large intra-class differences, so the accuracy of all methods is relatively low. WSFL nonetheless achieves the best indicators, with OA reaching 79.44%, 72.90%, and 63.57%, respectively. The classification results for the HZ dataset are shown in Table 9.
Table 9 shows that with 24 clean samples and four noisy samples, the OA of WSFL reached 79.44%, which is 1.93%, 2.37%, and 4.72% higher than DCRN, SSRN, and 3DCNN, respectively. With 24 clean samples and 12 noisy samples, the OA reached 63.57%, which is 1.26%, 2.39%, and 4.38% higher than DCRN, SSRN, and 3DCNN, respectively. For WSFL, one class is best with 24 clean samples and four noisy samples, and two classes are best both with 24 clean samples and eight noisy samples and with 24 clean samples and 12 noisy samples. Finally, Figure 9 shows the false-color maps of the classification results on the HZ dataset. Compared with DCRN, SSRN, and 3DCNN, WSFL has the smallest misclassified area of water.


The Numbers of Clean and Noisy Samples
In classification with noisy labels, the number of clean samples is crucial; even when few clean samples are available, the proposed WSFL framework still classifies well. As shown in Tables 10–12, WSFL retains relatively good performance when the number of clean samples in each category is limited. Compared with DCRN, SSRN, and 3DCNN, the WSFL model has a more robust network structure, which yields a significant improvement in performance. In addition, as the number of noisy samples increases, the performance of WSFL declines only slowly; WSFL is therefore more stable than the other methods.

Tables 10–12 show the impact of different numbers of clean and noisy samples on the PC, LK, and HZ datasets. Taking the PC dataset as an example, the OA of the WSFL model reached 98.52%, 97.50%, and 96.77%, respectively. Although the OA of WSFL also decreases as noise increases, it remains higher than that of the other models. When the number of noisy samples is fixed, OA increases continuously as the number of clean samples increases. Comparing the different combinations of clean and noisy sample numbers shows that WSFL achieves its best indicators with 24 clean samples and four noisy samples per class. In summary, the proposed model handles noisy samples better, which again demonstrates that WSFL has a more robust network structure and stronger feature-learning ability.
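The sampling protocol above (a fixed number of clean samples per class plus a number of injected mislabeled samples) can be sketched as follows; the symmetric-noise model and all names here are our assumptions for illustration, not the paper's released code:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_noisy_train_set(labels, n_clean=24, n_noisy=4, n_classes=9):
    """For each class, take n_clean correctly labeled pixels plus n_noisy
    pixels whose labels are flipped uniformly to another class
    (symmetric label noise)."""
    idx, lab = [], []
    for c in range(n_classes):
        pool = np.flatnonzero(labels == c)
        chosen = rng.choice(pool, n_clean + n_noisy, replace=False)
        idx.extend(chosen)
        lab.extend([c] * n_clean)                       # clean part keeps the true label
        for _ in range(n_noisy):                        # noisy part gets a wrong label
            wrong = rng.choice([k for k in range(n_classes) if k != c])
            lab.append(int(wrong))
    return np.array(idx), np.array(lab)

# toy ground truth: 9 classes, 100 pixels each
gt = np.repeat(np.arange(9), 100)
idx, lab = build_noisy_train_set(gt)
```

With 24 clean and 4 noisy samples per class, exactly 24/28 of each class's training labels are correct, matching the noise ratios varied in Tables 10–12.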

Investigation of Running Time
Table 13 gives the computation time comparison for the three HSI datasets. Compared with complex models, WSFL is 4.47 s faster than DCRN on the PC dataset and 27.4 s slower than SSRN. Compared with single-model neural networks such as DSNN, 2DCNN, and 3DCNN, WSFL is slower but improves performance significantly and has stronger resistance to noisy labels. Among the complex models DCRN and SSRN, WSFL ranks second on the PC and LK datasets and third on the HZ dataset. Considering the balance between accuracy and efficiency, the proposed WSFL is optimal.
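A wall-clock comparison of the kind in Table 13 can be produced with a small timing harness such as the one below. This is a generic sketch, not the paper's benchmark code; `model_fn` stands in for any classifier's forward pass.

```python
import time

def time_inference(model_fn, batch, repeats=5):
    """Average wall-clock time of model_fn over several runs (seconds)."""
    model_fn(batch)  # warm-up run, so one-time setup cost is not counted
    start = time.perf_counter()
    for _ in range(repeats):
        model_fn(batch)
    return (time.perf_counter() - start) / repeats

# toy stand-in for a classifier forward pass
elapsed = time_inference(lambda x: [v * 2 for v in x], list(range(1000)))
```

Averaging over repeats and discarding the warm-up run reduces the variance that makes single-run timings misleading.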

Effectiveness of the Attention Model
To verify the effectiveness of MGRSAM and MRSAM in WSFL, this paper conducted ablation experiments comparing the OA, AA, and Kappa coefficients of MGRSAM, MRSAM, and MGRSAM + MRSAM, as shown in Table 14. The joint use of MGRSAM and MRSAM improves OA, AA, and Kappa on all three datasets. On the WHU-Hi LongKou dataset, however, the AA of MRSAM alone is 0.26% higher than that of MGRSAM + MRSAM, because the accuracy of one class under MRSAM is slightly higher than under the combined method. Even so, judging by OA, Kappa, and the overall results, the combination of MGRSAM and MRSAM remains superior.

Effectiveness of the Number of Groups on the Model
In this section, to verify the impact of the number of groups in MGRSAM on the WSFL model, this paper compares two, three, four, six, and eight groups. As shown in Figure 10, OA, AA, and Kappa reach their highest values with three groups. Although the AA of three groups on the PC dataset is slightly lower than that of two groups, three groups give the best OA and Kappa there and the best values overall on the LK and HZ datasets, which is acceptable. In addition, when the number of groups increases from three to six, the overall indicators decline, because the spectral features become too scattered in the later stages of training: the model's ability to fit noisy labels is reduced, but so is its ability to fit clean samples. Therefore, this paper trains the model with three groups, reducing the fitting of noisy samples while retaining the ability to fit clean samples.
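The grouping step studied above can be illustrated with a minimal sketch that splits the spectral axis of an HSI patch into contiguous band groups. This is a deliberate simplification: the actual MGRSAM additionally applies residual attention within each group, which is not shown here.

```python
import numpy as np

def split_spectral_groups(cube, n_groups=3):
    """Split an HSI patch of shape (H, W, B) into n_groups contiguous
    band groups along the spectral axis."""
    h, w, bands = cube.shape
    # drop trailing bands that do not divide evenly, for simplicity
    usable = bands - bands % n_groups
    groups = np.split(cube[:, :, :usable], n_groups, axis=2)
    return groups  # list of (H, W, usable // n_groups) arrays

patch = np.random.rand(7, 7, 102)  # e.g. a patch with Pavia Centre's 102 bands
groups = split_spectral_groups(patch, n_groups=3)
```

Processing each band group in its own feature space keeps any single layer from concentrating on (and memorizing) a few noisy spectral patterns, which is the motivation for the three-group setting chosen above.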

Conclusions
In this article, we propose WSFL, a novel weakly supervised feature learning architecture whose core goal is to explore the robustness of the model under different noise levels. The uniqueness of WSFL lies in its feature learning strategy for noisy labels: it adaptively learns features through multi-model attention without removing noisy samples, which preserves the diversity of features while reducing the influence of noisy samples on the model.
In addition, different architectures were designed for the spectral, spatial, and spectral-spatial characteristics of hyperspectral data. Compared with other methods, WSFL effectively captures the information in hyperspectral data and transforms it into discriminative feature representations. Specifically, in the spectral dimension, multiple groups of residual spectral attention models differentiate features across several spectral feature spaces, avoiding the excessive concentration of single-layer spectral features and the memorization of noisy samples, while learning more clean spectral features in the spectral attention space. In the spatial dimension, a multi-granularity residual spatial attention model computes the similarity between samples in the spatial attention space to reduce the weight of noisy samples and increase the influence of clean ones; the spatial features are then refined in the multi-granularity space to obtain more discriminative representations and strengthen the model's constraint on noisy samples. Finally, the MLP model is introduced to eliminate the adverse effects of local connectivity, extracting more spatial structure information from the HSI dataset.
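The similarity-based reweighting summarized above can be sketched as follows. This is an illustrative stand-in using plain cosine similarity, not the paper's exact formulation; the variable names and the clipping/normalization choices are assumptions made for the example.

```python
import numpy as np

def similarity_weights(features):
    """Down-weight samples whose features disagree with the rest of the
    batch (illustrative stand-in for similarity-based noise reweighting).

    features: (N, D) array of per-sample feature vectors.
    Returns per-sample weights >= 0 that sum to N.
    """
    # cosine similarity between every pair of samples
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)
    # mean similarity to the other samples; outliers score lower
    score = sim.sum(axis=1) / (len(features) - 1)
    score = np.clip(score, 0.0, None)
    return score * len(features) / score.sum()

rng = np.random.default_rng(0)
clean = rng.normal(loc=1.0, scale=0.1, size=(9, 16))     # tight cluster
outlier = rng.normal(loc=-1.0, scale=0.1, size=(1, 16))  # "noisy" sample
w = similarity_weights(np.vstack([clean, outlier]))      # outlier weight ~ 0
```

Samples that sit inside the dominant feature cluster keep weights near one, while the outlier, a proxy for a mislabeled sample, is suppressed without being removed from training.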
Extensive experimental results indicate that the proposed framework surpasses state-of-the-art algorithms and achieves good accuracy even in the presence of many noisy samples, making it well suited to HSI classification with noisy labels. Future work will apply the framework to hyperspectral images beyond the open-source datasets used here, in order to enhance the model's practical universality.

Figure 1. Framework of the proposed WSFL for HSI classification.

Figure 2. Multiple groups residual spectral attention model.

Table 1. The Number of Samples of the PC Dataset.

Table 2. The Number of Samples of the LK Dataset.

Table 3. The Number of Samples of the HZ Dataset.

Table 4. The classification results of the PC dataset by different methods.

Table 5. The classification results of the LK dataset by different methods.

Table 6. The classification results of the HZ dataset by different methods.

Table 7. The classification results of the PC dataset with 24 clean + 4/8/12 noisy samples.

Table 9. The classification results of the HZ dataset with 24 clean + 4/8/12 noisy samples.

Table 10. The classification results of the PC dataset with different numbers of clean samples and noisy samples.

Table 11. The classification results of the LK dataset with different numbers of clean samples and noisy samples.

Table 12. The classification results of the HZ dataset with different numbers of clean samples and noisy samples.

Table 13. Computation time comparison for the three HSI datasets (s).

Table 14. The classification effectiveness of different attention models.

Effectiveness of the MLP Model
To verify the effectiveness of the MLP model in the proposed method, ablation experiments were conducted, as shown in Table 15. It can be observed that after adding the MLP model, OA, AA, and Kappa all increase significantly.

Table 15. The effectiveness of the MLP model on classification results.