PatchMask: A Data Augmentation Strategy with Gaussian Noise in Hyperspectral Images

Abstract: Data augmentation (DA) is an effective way to increase data diversity and improve a model's generalization ability. It has been widely used in many advanced vision tasks (e.g., classification, recognition, etc.), while it can rarely be seen in hyperspectral image (HSI) tasks. In this paper, we analyze whether existing augmentation methods are suitable for the task of HSI denoising and find that the biggest challenge is to avoid losing the spatial information of the original image and to avoid destroying the correlation between the various bands. Based on this, a new data augmentation method named PatchMask is proposed, which makes the training samples as diverse as possible while preserving the spatial and spectral information. The training data augmented by this method lie somewhere between clear and noisy, which helps the network learn more effectively and generalize better. Experiments demonstrate that our method outperforms other data augmentation methods, such as the benchmark CutBlur, in enhancing HSI denoising. In addition, the given DA method was applied to several popular denoising networks, such as QRNN3D, DnCNN, MPRnet, CBDNet, and HSID-CNN, to verify its effectiveness. The results show that the given DA could increase the PSNR by 0.2 ∼ 0.5 dB in various examples.


Introduction
Hyperspectral images (HSIs) have important applications in many fields, such as remote sensing [1,2], food safety [3,4], astronomy [5], medicine [6], and agriculture [7,8]. However, during the imaging process, due to the interference of complex human factors and the influence of the natural environment, such as illumination, the collected HSIs often contain various kinds of noise (e.g., Gaussian noise). Therefore, there has been much work aimed at improving the quality of hyperspectral remote sensing images, such as through pansharpening [9][10][11], super-resolution [12], and denoising [13].
Most successful traditional HSI denoising methods are based on certain strong prior knowledge, such as low-rank representation [14][15][16][17][18], sparse coding [19][20][21][22][23], global correlation along spectra [24,25], and so on. With the development of deep learning (DL), DL-based methods have drawn more and more attention [13,[26][27][28], such as those using convolutional neural networks (CNNs). In order to achieve a better denoising effect, DL-based methods need a large number of training samples to learn network parameters. However, the currently widely used datasets (ICVL [29], Pavia [30], etc.) have a limited number of training samples because HSIs are more challenging to obtain than RGB images. Therefore, we aim to extend data augmentation methods to the task of HSI denoising, generate new samples that provide more positive feedback for the network, and further improve the network's denoising performance.

Our main contributions are summarized as follows:
1. Few DA methods are currently designed explicitly for HSIs; the proposed method, which combines the characteristics of HSIs, is therefore more advantageous. With PatchMask, a network can learn the difference between clear and noisy samples more precisely and pay more attention to the noisy areas.
2. Our PatchMask method was applied to several HSI denoising models and achieved good performance in the presence of Gaussian noise. Extensive experiments on the ICVL and CAVE datasets show that our method can improve the performance of multiple networks and has a certain universality.
This paper presents our work in the following order. First, Section 2 gives a brief review of HSI denoising methods and DA methods. In Section 3, the new DA method, namely, PatchMask, is described in detail. In Section 4, we describe the extensive experiments that were conducted to demonstrate the effectiveness of our method. Finally, Section 5 concludes the whole paper.

HSI Denoising Methods
HSI denoising methods can be classified into two categories: traditional methods and DL-based methods. The traditional HSI denoising task [14][15][16][17][18][49][50][51][52][53][54][55] is generally based on some strong prior knowledge or the similarities and correlations of some image blocks. For example, Block Matching 3D (BM3D [53]) uses the idea of non-local block matching, that is, finding similar blocks and performing domain transformations on similar blocks. The authors also used collaborative filtering to reduce the noise contained in similar blocks. In addition, expected patch log likelihood (EPLL [54]) forms a learned model of natural image patches for image restoration, and low-rank matrix recovery (LRMR [55]) recovers HSI through a low-rank matrix. These methods have all achieved good results.
With artificial intelligence technology's rapid development and gradual maturity, various DL-based methods have also been widely used in HSI denoising tasks [56][57][58][59][60]. HSI-DeNet [57] was the first fully convolutional neural network (CNN) used for HSI recovery; it simultaneously introduced residual learning, dilated convolution, and multi-channel filtering, and it achieved good results. Zhang et al. [61] took one step forward by investigating the construction of feed-forward denoising convolutional neural networks (DnCNNs) to embrace the progress in very deep architectures, learning algorithms, and regularization methods for image denoising. Based on this, the spatial-spectral gradient network (SSGN [58]) adopted a spatial-spectral gradient learning strategy to effectively extract the deep features of HSIs. Maffei et al. [28] efficiently took into consideration both the spatial and spectral information contained in HSIs and proposed a model called HSI single-denoising CNN (HSI-SDeCNN). Due to the multi-band characteristics of HSIs, 3D convolution has also gradually been used in HSI denoising tasks, such as in 3DADCNN [59] and QRNN3D [60]. QRNN3D [60] proposed an alternating-direction 3D quasi-recurrent neural network that could effectively embed the domain knowledge of structural spectral correlations and global correlations along the spectrum. In CVPR 2021, MPRnet [62] proposed a multi-stage architecture that progressively learns restoration functions for degraded inputs, thereby breaking down the overall recovery process into more manageable steps. However, the limited training samples restrict the effects of DL-based denoising methods.

Data Augmentation Methods
DA is an effective method for improving data diversity and network generalization, and its essence is the process of using existing data to create new data. The whole process not only increases the number of samples, but also makes the dataset "stronger". The newly generated samples have more positive feedback for the network.
When processing traditional RGB images, basic DA methods, such as flipping, rotation, cropping, translation, etc., are generally used the most. For example, Ballester et al. [63] used the RandomResizedCrop strategy. These methods have strong practicality and can be used in many tasks, and not just for high-level tasks (e.g., image recognition, image classification, etc.). They are also widely used in many low-level tasks, such as the image dehazing task, to which Wang et al. [64] applied random cropping and flipping.
In addition, the original samples can be expanded by erasing or occluding part of the image. For example, DeVries et al. [65] occluded the input image with a random Cutout box mask. After that, Zhong et al. [66] replaced the original image area with random values or the average pixels of the training set. It is also possible to add new samples by superimposing pixels [67]. These methods all improve the generalization ability of the network, but they also have certain limitations and are usually applied to high-level tasks.
Recently, Yoo et al. [47] provided a comprehensive analysis of the existing augmentation methods applied to the super-resolution task. CutBlur [47] brings DA methods into the super-resolution field. By clipping image blocks at the same position into each other, it further highlights the regional differences between low resolution and high resolution, making the network pay more attention to the target area and improving the performance of the network. Ghaffar et al. [46] fused auxiliary channels (or custom bands) with each training sample, which helped the model learn useful representations.

Motivation
The simplicity and effectiveness of CutBlur inspired us to consider its idea for HSI problems, for which there has been limited work on DA. However, CutBlur focuses on natural images and only considers a fixed patch covering a large image region, ignoring the randomness of small patches with different image contents. This may weaken the representation and diversity of the data, limiting the performance of DA. For example, a fixed large patch in CutBlur may only contain one type of image context, e.g., seawater, whereas randomly sampling small patches across an image may cover more types of image content, which helps the representativeness of DA. In addition, since we mainly focus on hyperspectral images, where preserving spectral continuity is an important task, it is better to consider patches with complete spectral bands when conducting DA on hyperspectral images. Motivated by these points, we propose a new DA approach for the fundamental HSI denoising problem.

Proposed Method
In this section, we introduce PatchMask, a new DA method designed for the task of HSI denoising.
The denoising task is a process of separating or decoupling noise from an image, and the same is true for the HSI denoising task. The degradation model is described as follows:

y = x + n,

where y refers to the HSI that we obtained. Due to various influences in the acquisition process, y contains various kinds of noise n (e.g., Gaussian noise). After removing the noise n, a clear image x can be obtained. Here, y, x, n ∈ R^(H×W×B), where H and W represent the height and width of the spatial resolution, respectively, and B represents the number of HSI bands.
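As a concrete illustration, the degradation model above can be simulated with a few lines of NumPy. This is a sketch of ours, not the authors' code; the function name and the 0–255 pixel scale are assumptions (the paper's experiments use σ = 50).

```python
import numpy as np

def add_gaussian_noise(x, sigma=50.0, seed=None):
    """Simulate the degradation y = x + n for an HSI cube x of shape (H, W, B).

    Pixel values are assumed to lie on a 0-255 scale; sigma is the standard
    deviation of the additive Gaussian noise n on that same scale.
    """
    rng = np.random.default_rng(seed)
    n = rng.normal(0.0, sigma, size=x.shape)  # i.i.d. Gaussian noise per voxel
    return x + n

# Example: a synthetic 64x64 cube with 31 bands, matching the ICVL setup.
x = np.zeros((64, 64, 31))
y = add_gaussian_noise(x, sigma=50.0, seed=0)
```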

Algorithm
First, the noisy and clear images are each divided into α patches. Then, a certain number of patches in the noisy image are randomly selected and pasted into the corresponding positions in the clean image, and vice versa. This generates two new training samples (x̂_(noise→gt), x̂_(gt→noise)). Our description can be summed up in the following formulas:

x̂_(gt→noise) = M ⊙ x_noise + (1 − M) ⊙ x_gt,
x̂_(noise→gt) = M ⊙ x_gt + (1 − M) ⊙ x_noise,

where M ∈ {0, 1}^(H×W) is a binary mask indicating the swapped patch positions (M = 1 inside a swapped patch and M = 0 elsewhere), ⊙ denotes element-wise multiplication (with M applied to every band), α represents the total number of patches, and eN represents the ratio of swapped patches, so that eN × α patches are exchanged. The above formulas can generally be described as synthesizing a new sample; that is, the patches of the two images at random positions are exchanged with each other.

The entire process is shown in Figure 1. In Figure 1, the clear image and the noise image are first divided into blocks, and the parameter α determines the number of blocks. Then, according to the parameter eN (exchange ratio), patches of the clear image and the noise image at corresponding positions are exchanged to obtain two new samples. In addition, we summarize the whole procedure of data augmentation in Algorithm 1. In Algorithm 1, we first check the input parameters α and prob. If they do not satisfy the conditions of method execution, we train on the original noisy-clean image pair. If they do, we enter the given DA method: first, the noise map x_noise and the clear map x_gt are divided into blocks, and then eN × α patches at the same positions are randomly chosen and exchanged to obtain a new noise map x̂_gt→noise or x̂_noise→gt, which is paired with the corresponding clear map for training. Correspondingly, we also present the entire algorithm flow in Figure 2 for a better understanding of our algorithm.

Algorithm 1 PatchMask implementation process
Require: A noisy image x_noise, a clear image x_gt, the probability of the method being used prob, the total number of patches α, and the ratio of patch swaps eN.
Ensure: The new noise map generated, x̂_gt→noise or x̂_noise→gt.
1: if α ≤ 0 or random probability ≥ prob then
2:    return x_noise
3: else
4:    Divide x_noise into α patches
5:    Divide x_gt into α patches
6:    Randomly select eN × α patch positions
7:    Exchange the selected patches between x_noise and x_gt to obtain x̂_gt→noise and x̂_noise→gt
8:    return x̂_gt→noise or x̂_noise→gt
9: end if
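The procedure of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration of ours, not the authors' implementation: the function and variable names are our own, and α is assumed to be a perfect square so that the spatial plane splits into a regular grid of patches.

```python
import numpy as np

def patch_mask(x_noise, x_gt, prob=0.3, alpha=16, eN=0.3, rng=None):
    """Sketch of PatchMask; parameter names follow the paper (prob, alpha, eN).

    x_noise, x_gt: HSI cubes of shape (H, W, B). The spatial plane is divided
    into alpha patches (alpha is assumed to be a perfect square and H, W are
    assumed divisible by sqrt(alpha)); eN * alpha randomly chosen patches are
    swapped between the two cubes, keeping all B bands of each patch intact.
    """
    rng = np.random.default_rng() if rng is None else rng
    if alpha <= 0 or rng.random() >= prob:
        return x_noise, x_gt  # DA not applied; train on the original pair

    H, W, _ = x_noise.shape
    g = int(np.sqrt(alpha))       # patches per side of the grid
    ph, pw = H // g, W // g       # spatial size of each patch
    n_swaps = int(eN * alpha)     # number of patches to exchange
    chosen = rng.choice(alpha, size=n_swaps, replace=False)

    x_gt_to_noise = x_gt.copy()       # clean cube with noisy patches pasted in
    x_noise_to_gt = x_noise.copy()    # noisy cube with clean patches pasted in
    for idx in chosen:
        r, c = divmod(int(idx), g)
        sl = (slice(r * ph, (r + 1) * ph), slice(c * pw, (c + 1) * pw))
        x_gt_to_noise[sl] = x_noise[sl]
        x_noise_to_gt[sl] = x_gt[sl]
    return x_gt_to_noise, x_noise_to_gt
```

Because the swapped patches keep their original spatial positions and all 31 bands, both the spatial structure and the spectral continuity of each sample are preserved, which is the property the method relies on.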

Principle
For HSI denoising tasks, it should be noted that in the process of DA, information other than that of the original image should not be introduced, and the pixel structure of the original image should not be greatly affected. Yoo et al. in [47] tested methods such as Cutout [65] and CutMix [68] on super-resolution tasks, and they found that excessive use of these methods would lead to a significant decrease in the performance of the results. For low-level vision tasks, the sharp changes in an image during DA increase the difficulty of the image reconstruction process, which also degrades the performance of the network.
At the same time, unlike RGB images, HSIs also have a significant issue to consider: The influence of information between bands is important for the denoising task. We tried copy-pasting the clear image and the noisy image at different positions on each band, but this reduced the network performance. The main reason is that when the network performs feature extraction, the information between different bands loses its original connection, affecting the performance of the network model. Therefore, we suggest allowing the network model to learn the distribution of noise while preserving the overall structure of the original image space and the information connections between the bands. Simultaneously, we want the model to pay more attention to these noises.

Why Use Patches?
We adopt this mechanism of dividing the image into several patches mainly because the network can pay more attention to details after splitting. For example, in Figure 3, we can see that some patches mainly contain high-spatial-frequency regions of the image, while others mainly contain low-spatial-frequency regions. High-frequency parts (e.g., streets) describe fast-changing fine details of the image and are difficult to restore, while low-frequency parts (e.g., squares) describe slowly changing structures that are less difficult to restore and may be excluded in the process of screening patches. Sun et al. [69] used a similar approach that extracted the more important parts of the image to participate in the training of the network, achieving good results in the task of demosaicking. This effect is equivalent to an attention mechanism; in this way, some important parts of the image can be given additional attention.
We divide the whole image into blocks and then randomly sample them. In this way, each patch can be selected with equal probability, whereas if a single large contiguous block were operated upon, some regions would be selected repeatedly with high probability. As shown in the central part of Figure 3b, this part is likely to be repeatedly selected during such an operation.

Why Does Our Method Work for HSI Denoising?
As shown in Figure 4, regardless of the imaging principle or the application field, HSIs are completely different from RGB images.
As they are limited by the capabilities of devices and the requirements of tasks, RGB images have only three bands. However, HSIs have more complex frequency bands and data structures. Therefore, the band information and textural information are equally important for HSI denoising, and ignoring the correlation between adjacent bands will lead to a significantly lower denoising effect. When performing data augmentation operations, we should start with HSIs' spectral and spatial aspects.
In image restoration, the image characteristics, e.g., the changes in pixels, the spatial priors of images, the preservation of information along the spectral direction, etc., will greatly affect the quality of the recovery of images. Therefore, when designing this method, we fully considered the characteristics of HSIs that were mentioned in this section, as well as their spatial resolution, when processing. Instead of using a patch from another image or just clearing it out, we swap it with its original patch. In this way, the entire spatial structure remains intact. In addition, spectral information is also taken into account. The different bands throughout the image are unchanged, thus maximizing the preservation of spectral information.

What Can Models Learn from PatchMask?
For an un-augmented dataset, the data are relatively homogeneous. When a model learns from such data, the convergence speed is faster than when learning with DA, but the value of the loss function is higher, as shown by the experiments described in Section 4.6.
Compared with the original data, the biggest difference in the processed data was that the distribution of noise changed greatly. After this process, we hoped that the model could learn not only the density of the noise, but also the noise distribution, as it was changed by the DA method. In this way, the model could pay more attention to more complex areas, making the entire model more refined for the HSI denoising task, and the performance of the model was improved. It can also be seen in our experimental results that after using this method, the details of some images were better recovered.

Experiments and Results
In this section, we describe our experimental procedure and present the results. To demonstrate the effectiveness of our method, we conducted comparative experiments with other DA methods on a common dataset. In addition, to demonstrate the effect of our method on networks with different parameter counts, we selected several such networks for testing. We also tested different application scales to illustrate the effect of adding samples through DA methods on the network.

Comparisons with Other Methods
To demonstrate the effectiveness of our method, we selected the ICVL dataset and the QRNN3D [60] network for testing, and the test results are shown in Table 1. We can see that both the original CutBlur method and our newly proposed PatchMask method achieved better performance than the baselines. The main reason is that both CutBlur and PatchMask retain the contextual content of the original image, which does not cause an excessive semantic loss. On the contrary, they only change the distribution of noise or add a layer of mask to the original image. This increases the variety of data in the dataset and increases the number of samples between clear and noisy. At the same time, our proposed method enables the high-frequency information in some patches to be sufficiently trained. In this experiment, we tried to make the parameters of the two DA methods consistent. For CutBlur, we used the parameters p = 0.3 and α = 0.7 (following the parameter settings used in the original paper). For our proposed method, we set the parameters to p = 0.3, α = 16, and eN = 0.3 to better compare the impacts of the two methods on the network. Table 1. Comparison of PSNR and SSIM when using different methods; δ represents the increase. The baselines used QRNN3D and the ICVL dataset (with Gaussian noise and σ = 50), and these models were trained from scratch. Best scores are highlighted. As shown in Figure 5, we also tested other DA methods that are commonly used on RGB images, such as Cutout [65]. Unlike in tasks such as classification and recognition, the large number of lost pixels causes difficulties for network recovery, and the obtained results differed significantly from the original image. For Cutout, we set the number of blocks to 1 and the size to 2.
We found that the network results became worse after the loss of the original pixels, mainly because the network received no valid information at the missing pixels; therefore, the missing pixels degraded the network's performance in the image recovery task. In addition, we also conducted experiments on another DA method, i.e., Mixup [67]. In our experiments, we set the parameter α = 0.2, and we found that the network performance was also degraded to some extent, mainly because the method mixes the information of the clear and noisy maps across bands, thus degrading the network's performance.

ICVL Dataset
HSIs are difficult to obtain; therefore, we conducted our experiments on the ICVL [29] dataset. The ICVL dataset is an HSI set that was acquired by using a Specim PS Kappa DX4 hyperspectral camera and a rotating stage for spatial scanning. It includes 200 images with a spatial resolution of 1392 × 1300 and 31 spectral bands. For the accuracy of the experiment, we used 100 images as the training set and another 50 as the test set, and the spatial resolution of the network input was uniformly set to 64 × 64. Figure 6 shows RGB renderings of the HSI dataset.

CAVE Dataset
The CAVE dataset is a database of hyperspectral images that are used to simulate a GAP camera, and the entire dataset contains 32 hyperspectral images of different scenes. As shown in Figure 7, these images cover a wide range of real-world materials and objects, and each image includes full-spectral-resolution reflectance data in 31 bands from 400 to 700 nm with 10 nm steps.
To demonstrate the generalization ability of the proposed method, we tested it on this dataset as well. To ensure the accuracy of the experimental results, the experimental setup here was the same as that described in Section 4.1; we only replaced the ICVL dataset with the CAVE dataset, and the network remained QRNN3D. The improvement here was not as significant as on the ICVL dataset, mainly due to the difference in data volume: The ICVL dataset had 100 training images, while the CAVE dataset only had 26. However, we can see that our approach was equally effective on the CAVE dataset. Please see Figure 8 and Table 2 for detailed experimental results and data. Figure 7. Presentation of the CAVE dataset. The images above were provided by the CAVE dataset [71] and are all RGB images that were rendered under neutral daylight illumination (D65).

Implementation Details
In this section, we will describe the implementation details of the experiments. First, in the experiments, we used two datasets to accomplish the task, i.e., the ICVL [70] and CAVE [71] datasets. As the dataset for most of the experiments, we chose the ICVL dataset due to its large data volume and high image quality. From that dataset, we chose 100 of the images as the training set and 50 images as the test set. The CAVE dataset was only used to demonstrate the generalization ability of our method. Notably, all images in both datasets had 31 bands.
The QRNN3D [60] network was shown in experiments to have good performance as a network dedicated to the task of hyperspectral denoising; thus, we used it as a benchmark network. In our experiments, we added Gaussian noise (σ = 50) to the dataset and, for each network, used the original loss function from its paper. The images input to the networks were uniformly cropped to a size of 64 × 64 for training, and all of the band information was retained. Meanwhile, we used the Adam optimizer with an initial learning rate of 1 × 10^-4, and we used a cosine annealing strategy [72] for training.
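For reference, the cosine annealing strategy [72] can be written as a small helper. This is a generic sketch of the schedule under our stated settings (lr_max = 1 × 10^-4), not the exact scheduler implementation used in the experiments.

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_max=1e-4, lr_min=0.0):
    """Cosine annealing: the learning rate decays smoothly from lr_max at
    epoch 0 to lr_min at total_epochs, following half a cosine period."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))
```

In practice, a framework scheduler such as PyTorch's `torch.optim.lr_scheduler.CosineAnnealingLR` implements the same curve.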
In addition, we also used two commonly used metrics, i.e., the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM), to evaluate the performance. The PSNR describes the ratio of the maximum power of the signal to the corrupted noise and is commonly employed to measure the reconstruction quality of images and videos. Moreover, the SSIM defines structural information as a feature of the scene-wide structure of an object independent of luminance and contrast, and the distortion is modeled as the interaction of these three elements. It is used in most hyperspectral denoising tasks; see, e.g., SDeCNN [26], SSDRN [73], etc. Therefore, to ensure the validity and fairness of the comparisons, we still used the two most commonly used metrics for evaluation. In addition, all experiments were performed on the same NVIDIA GeForce RTX3090 (24G) GPU for fair comparisons.
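As an illustration, the PSNR can be computed directly from its definition. This sketch of ours assumes pixel values on a 0–255 scale; SSIM is more involved and is typically computed with an existing implementation such as `skimage.metrics.structural_similarity`.

```python
import numpy as np

def psnr(x, y, data_range=255.0):
    """Peak signal-to-noise ratio (in dB) between a reference x and a
    restored image y, both arrays of the same shape."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)
```

For HSIs, the PSNR is often averaged over all bands; applying the function above to the full (H, W, B) cube gives the same result as averaging the per-band MSE first.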

Comparison of Different Models
Generally speaking, the more parameters a model has, the more beneficial this is for learning: A larger model has a larger capacity and can learn more. We selected several networks with different parameter counts (DnCNN [61], CBDNet [74], HSID-CNN [28], MPRnet [62], and QRNN3D [60]). To fairly compare the impact of the DA method on these networks, we set the parameters for applying DA to p = 0.3, α = 16, and eN = 0.3.
Below, we briefly introduce the networks mentioned above and describe the adjustments we made in our experiments to adapt them to HSIs. We then show the results of our method on the different models.

DnCNN: The structure is shown in Figure 9. This network was the first to use residual learning for noise reduction. By combining residual learning and batch normalization (BN), the training of the noise reduction model can be greatly improved and accelerated. For a specific noise level, DnCNN can achieve an outstanding level of visual effects and evaluation indicators. For this network, we adjusted the input and output channels uniformly to 31 to adapt to the large number of bands in HSIs. It should be noted that this network does not pay attention to the information between different bands; therefore, it has certain limitations for HSIs.

MPRnet: This is a progressive multi-stage network. As shown in Figure 10, the first two stages of the network adopt the U-Net structure. There are many attention modules embedded in the network. For example, each stage first passes through a channel attention block (CAB), and the skip-connection part of the U-Net also has a CAB module. In addition, there is an attention adjustment module, the supervised attention module (SAM), that introduces supervision information between stages. Between stages, cross-stage feature fusion (CSFF) is also performed in the encoder-decoder part of the U-Net to better preserve contextual information. We also set the input and output channels to 31 to accommodate the number of bands in the dataset.

CBDNet: This is from a paper included in CVPR 2019, and it reached state-of-the-art (SOTA) performance on the DND dataset at that time.
The model is more inclined to remove noise from real environments, and the whole network has two components, i.e., a noise estimation network and a non-blind denoising sub-network that removes noise with unknown noise levels. In this work, synthetic noisy images and real-world noisy images were both utilized to train the network; thus, the network was able to represent the noise in real-world images and improve the denoising performance. The network structure is shown in Figure 11.

HSID-CNN:
Aiming at the high redundancy and correlation of information in HSIs, this network considers joint spatial-spectral information at the input of a convolutional neural network. The end-to-end nonlinear mapping between noisy images and clear images is realized with deep convolutional neural networks, which resolves the inflexibility present in other methods. This network uses multi-scale feature extraction and multi-level representation to obtain multi-scale spatial-spectral features and fuses different features for restoration. In this way, the network achieves better performance. The network structure is shown in Figure 12.

QRNN3D: This network is an alternating-direction 3D quasi-recurrent neural network for HSI denoising, and it effectively exploits the structural spatial-spectral correlation and global correlation information along the spectrum. The alternating-direction structure removes causal dependencies without adding extra computational costs. The model can capture spatial-spectral dependence while maintaining flexibility for HSIs with arbitrary numbers of bands. The network structure is shown in Figure 13.

As shown in Table 3, the performance of all of the above models improved after applying our DA method. However, for models with simpler structures, the improvement was limited, mainly because there was not much information that simple models could learn, and the addition of DA methods made it more difficult for these simple models to adapt to the change. Moreover, 3D convolutions are very useful for hyperspectral denoising tasks: In Table 3, we can see that QRNN3D achieved good performance with fewer parameters because 3D convolution can extract inter-band information more efficiently. The experimental outcomes are shown in Figure 14.

The number of samples generated through DA should also be investigated.
If too many samples are generated through DA, the learning of the network will be biased towards the new samples rather than the original samples. Conversely, if too few samples are generated through DA, the network cannot learn the differences between the new samples and the original samples. Therefore, we designed a set of experiments to verify how different scales of augmented datasets affected the network. For experimental accuracy, we empirically set α = 16 and eN = 0.3 for better performance.
From Table 4, we can see that when the number of new samples was increased to 30% of the original dataset, the performance of the entire network was better. With the increase in the number of new samples, the performance of the network showed a certain level of decline. It is believed that the proportion of the original sample information learned by the network decreased as the proportion of new samples increased; thus, the performance of the network model also decreased. In Figure 15, we can also see that the images were best reconstructed at 30% of the new samples.
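The dataset-scaling experiment above can be sketched as follows. The function names are ours, and `augment_fn` stands in for any DA operation (such as PatchMask) that maps an original noisy-clean pair to a new one; the only point illustrated is capping the augmented samples at a fixed fraction of the original dataset.

```python
import numpy as np

def extend_with_augmented(pairs, augment_fn, ratio=0.3, rng=None):
    """Extend a list of (x_noise, x_gt) training pairs with augmented samples.

    ratio controls how many new samples are added relative to the original
    dataset size (Table 4 suggests ~30% works best); augment_fn is any DA
    operation mapping a (noisy, clean) pair to a new pair.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_new = int(ratio * len(pairs))
    chosen = rng.choice(len(pairs), size=n_new, replace=False)
    return list(pairs) + [augment_fn(*pairs[int(i)]) for i in chosen]
```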

Total Number of Patches-α
The parameter α has an important role in the proposed PatchMask DA method; it refers to the total number of patches into which the image is divided. The greater the total number of patches, the smaller the area of each patch and the finer the regions formed by the patches, making it possible to fit more complex textured areas of the image. In this way, more complex regions are sufficiently trained, and the network performance improves. To verify this conjecture, we conducted a set of experiments on the parameter α, and the experimental results are shown in Table 5 and Figure 16.

The Ratio of Patch Swaps-eN
Another key parameter, eN, refers to the proportion of patches that are swapped. Our DA method generates two new complementary samples. As shown in Table 6, we found that the performance dropped significantly as the exchange ratio approached eN = 0.5. When eN = 0.5, the number of noisy patches decreased, and the probability that noisy patches happened to fall in information-dense regions decreased further. As originally envisaged, areas with complex textures make the denoising task harder, and covering such areas with noise masks then loses its intended effect, resulting in some performance degradation. The visualization results of the experiment are presented in Figure 17, where we can see that the image reconstruction was better in terms of detail at eN = 0.1.
The experimental results are shown in Table 6. This part of the experiment was performed with the other parameters kept the same. We set the proportion of the increased dataset to 30% of the original dataset, and the alpha parameter that we described above was set to 16. In addition, we still chose QRNN3D for the model and ICVL for the dataset. During training, we randomly added Gaussian noise with σ = 50 to the dataset.

Convergence Analysis
To show that our DA method did not cause divergence in the original network, we compare the training loss curves obtained with and without our method. The abscissa in Figure 18 represents the training epoch, and the ordinate represents the training loss. As shown in Figure 18, in the early stage of training (before epoch 15), the red curve (without our method) had consistently lower loss values than the blue curve (with our method) and decreased faster. The main reason is that, without the DA method, the network had less content to learn: It did not have to learn the changes in the noise distribution after DA or the parts requiring extra attention. Therefore, the network converged faster. In the later stage of training (shown in the zoomed-in part of the figure), the loss value of the blue curve was significantly lower than that of the red curve, which was in line with our predictions: After using the DA method, the performance of the network was further improved.

Conclusions
This paper investigated the problem of DA for HSI denoising. Existing DA methods are unsuitable for HSI denoising, so we designed a simple yet effective DA method called PatchMask for the task of HSI denoising. The proposed PatchMask generates new training samples by randomly exchanging some patches between noisy and clean images, and this does not increase the network's computation time. Newly generated samples encourage the model to learn which patches are more critical for noise removal, thereby discarding some unimportant patches. We qualitatively and quantitatively analyzed the parameters of the method and conducted plenty of experiments to verify the effectiveness and advantages of the proposed method on different models. This paper can shed some light on the future work concerning DA in HSI denoising and other computer vision tasks.

Data Availability Statement:
The datasets generated during the study are available from the corresponding author upon reasonable request.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: