Memory Augmentation and Non-Local Spectral Attention for Hyperspectral Denoising

Abstract: In this paper, a novel hyperspectral denoising method is proposed that aims to restore clean images from images corrupted by complex noise. Previous denoising methods have mostly focused on exploring the spatial and spectral correlations of hyperspectral data. Their performance is often limited by the amount of useful information available in the neighboring bands of the image patches, as neighboring bands often suffer from similar noise interference. In contrast, this study designs a cross-band non-local attention module that searches for the most similar band for the input band. To avoid being limited to neighboring bands, this study also sets up a memory library that remembers the detailed information of each input band during denoising training, fully learning the spectral information of the data. In addition, we use a dense connected module to extract multi-scale spatial information from images separately. The proposed network is validated on both synthetic and real data. Compared with other recent hyperspectral denoising methods, the proposed method not only demonstrates good performance but also achieves better generalization.


Introduction
Unlike natural images and traditional grayscale images, hyperspectral images (HSIs) collected through hyperspectral sensors contain over a hundred spectral bands for the same scene [1][2][3]. Benefiting from the richness of spectral information, HSIs play a crucial role in earth observation, such as target detection [4], mineral exploration [5], image classification [6], and more. However, due to the complexity and uncertainty of imaging, HSIs inevitably suffer from noise interference, including Gaussian noise, striping noise, and mixed noise [7]. The presence of image noise reduces image quality and affects the interpretation of target information, which greatly hinders the application of hyperspectral images. Therefore, hyperspectral denoising algorithms have emerged to enhance image quality as much as possible.
In recent years, a large number of denoising algorithms have been proposed for HSIs disturbed by noise. According to the solution method, these can be divided into three categories: filtering-based denoising methods, optimization-based denoising methods, and deep learning-based denoising methods [7][8][9].

Filtering-Based Methods
Filtering-based methods comprise spatial filtering approaches and transform-domain denoising methods [10]. Specifically, spatial filtering methods employ various operators in the spatial domain to eliminate image noise [11][12][13][14][15]. In [13], a multidimensional Wiener filter was designed for hyperspectral denoising, treating an HSI as a third-order tensor and filtering along different directions to remove noise in different dimensions of the image. A Gabor filter was employed to detect stripe patterns in each band [14]. Transform-based denoising methods, in turn, use various projection transformations to recover a clean image from the contaminated image, including the Fourier transform [16], the wavelet transform [15], and Principal Component Analysis (PCA) [17], among others. To be specific, HSSNR [15] uses the wavelet transform to learn the signal variation in the spectral and spatial dimensions of hyperspectral data. PCA was used in [17] together with 2-D bivariate wavelet thresholding and a 1-D dual-tree complex wavelet transform to improve image quality. Additionally, the well-known BM4D [18] implements the group collaborative filtering paradigm, in which similar patches are stacked in a high-dimensional array, and the image quality is improved via joint filtering in the transform domain.

Optimization-Based Methods
Recently, optimization-based denoising methods have been greatly developed. They use a variety of optimization techniques to improve the stability and effectiveness of algorithms, including low-rank [19], sparse representation [20,21], and self-similarity [22,23] methods. Specifically, due to the limited distribution of spatial ground objects and the correlation of multiple adjacent spectra, low-rank attributes widely exist in the spatial and spectral dimensions of hyperspectral images. So far, a large number of methods based on low-rank constraints and subspace learning have been applied to HSI denoising [24]. Meanwhile, a KBR-based tensor sparsity measure was proposed in [25], where a tensor is sparsely represented using Tucker decomposition and CP decomposition. In [26], a strategy utilizing low-rank matrix recovery (LRMR) was proposed, which focuses on learning the features of stripe noise. Considering that each band in an HSI may be affected by different levels of noise interference, an adaptive iterative factor selection strategy combining low-rank matrix factorization (NAILRMA) was proposed in [27]. Its main innovation was to address the inconsistency of noise across different bands. In addition, due to the continuity and universality of the distribution of hyperspectral objects, self-similarity is an inherent property of hyperspectral images, which has been widely used in many popular denoising methods. At present, the development trend of spectral denoising methods is to combine the spatial and spectral similarity of images to improve image quality [28,29]. In [30], the authors proposed a low-rank restoration method (LRTV) that combines spatial and spectral information for image denoising by simultaneously embedding TV regularization, the nuclear norm, and the L1 norm. TV regularization is used to maintain the spatial structural information of the image, while the nuclear norm is used to learn the low-rank attributes of the spectrum. This method is effective at removing multiple types of noise. Unfortunately, optimization-based methods usually require hand-crafted priors and iterative solutions, which hinders the performance of HSI denoising to a certain extent.

Deep Learning-Based Methods
Due to the automatic learning of features, deep learning-based denoising methods have attracted increasing attention [31]. As is well known, there is a strong correlation in the spatial and spectral dimensions of HSIs; hence, various methods have been dedicated to learning the spatial and spectral information. For instance, Chang et al. proposed a denoising method based on convolutional neural networks (HSIDeNet) [32] that aggregates the multi-scale contextual information of images through dilated convolution and multichannel filters. Ref. [33] proposed a spectrally enhanced rectangular transformer (SERT) to explore the non-local similarity and the low-rank properties of HSIs. A model with noise intensity estimation (Partial-DNet) was designed for HSI blind denoising in [34]. The noise intensity of each band is estimated, and the channel attention mechanism is introduced, which is subsequently fused with the observed image to generate a feature map. In [35], a quasi-recurrent neural network (QRNN3D) was used for denoising that can simultaneously explore the spatial-spectral correlation and global correlation of images. Dong et al. proposed a separable 3-D denoising method that can significantly reduce computational costs [36]. Yuan et al. proposed a single-band deep convolutional neural network denoising method (HSI-DCNN) that takes into account the spatial and spectral information of HSIs [37]. HSI-DCNN uses a single band and its adjacent bands as inputs to the network, while utilizing multi-scale features to improve feature-expression capabilities. In addition, Maffei et al. proposed an image denoising method (HSI-SDeCNN) based on a single-model convolutional neural network [38]. HSI-SDeCNN uses noise level mapping to balance the denoising results against the detail information of the raw data. Unfortunately, these methods only consider the relationships among adjacent bands and fail to capture the non-local and global similarity of hyperspectral images in the spectral dimension.
To alleviate the above problems, this paper proposes a novel hyperspectral denoising method based on non-local memory-augmented spectral attention. Specifically, the proposed network consists of two stages. In the first stage, a Dense Connected Module (DCM) is used to extract local spatial information from hyperspectral images. In addition, this method utilizes the multi-layer spatial information of HSIs as much as possible by inputting three different scales. The second stage of the network mainly explores spectral similarity information in the HSIs and includes two modules: a non-local spectral attention module and a global spectral memory-augmented module (MAN). Among them, the non-local attention module aims to extract useful information from adjacent band features to supplement the current band information and remove noise. In this module, similar band features in the neighborhood are queried based on the current band features, which can explore non-local structural information in the spectral dimension. However, this approach ignores the global structural relationship of the spectrum. To address this issue, this study designed a global memory-augmented attention module. This module sets up a global memory library to remember the information of all bands in the data and uses the features of the current band to directly query all useful information in the memory library for restoring the current band. Subsequently, the outputs of the two spectral attention modules are combined, processed further through convolutional layers, and fused with the original features, which helps the model capture different spectral aspects of the input data and integrate them effectively. Finally, the upsampling module is used to fuse the information from three scales and output the denoising result. Experiments on synthetic and real data demonstrate the superiority of the proposed method over state-of-the-art methods. In summary, the contributions of this study can be summarized as follows:

• Using the current band and its K adjacent bands as inputs to the network, the DCM is used to extract spatial information from the inputs, and it is applied on multi-scale spaces to fully learn the spatial structure of the image.
• The non-local memory-augmented spectral attention module is designed to learn the non-local and global correlations among data spectra at each scale.
• A series of ablation experiments were conducted, and the results were compared with those of existing methods on both synthetic and real data, which demonstrate the superiority of the proposed method.
The remainder of this paper is arranged as follows. Section 2 introduces the concepts related to hyperspectral denoising. The proposed network is described in detail in Section 3. The experimental results are discussed in Section 4. A summary of this paper is presented in Section 5.

Related Work
Due to the instability of sensors and the influence of the atmosphere, the collected HSIs often contain various types of noise. The purpose of the hyperspectral denoising task is to restore the clean image X from the noisy image Y, and the noise model can be expressed as Y = X + N, where Y represents a noisy HSI with the size of H × W × B, and B is the number of bands of the hyperspectral data. X is the clean image, and N denotes all random noise in the image. As previously studied [39], N may be stripe noise, Gaussian noise, impulse noise, or various complex mixed noises.
To obtain clean images X, a large number of hyperspectral denoising methods have been proposed. Specifically, deep learning-based methods are widely used to learn the deep features of images. The mainstream idea of deep learning-based denoising methods is to explore the spatial and spectral information of the hyperspectral data, which can compensate for the information loss of the input spectral band and improve the denoising performance [40][41][42][43]. However, most of these methods are unable to fully learn the spectral structure of the data. In particular, when adjacent bands are affected by similar noise interference, the learned neighborhood band information cannot provide effective supplementary information for the input band, resulting in low denoising performance.

Proposed Method
Most of the existing denoising methods do not consider the non-local relationship among spectral bands. When facing complex noise, the performance of most methods drops significantly, which limits the achievable improvement in image quality. To solve the above problems, this study designs a cross-band non-local attention module and sets up a memory library that can remember the detailed information of each input band during denoising training, fully learning the spectral information of the data. In addition, we use a dense connected module to extract multi-scale spatial information from images separately. This method can learn the non-local structural relationship between the space and spectrum to improve the denoising effect.

Overall Network Architecture
As shown in Figure 1, the network mainly includes two stages: spatial information extraction and non-local spectral information augmentation. In the first stage of the network, a Dense Connected Module (DCM) is used to extract the spatial multi-scale features of the input current band and its K adjacent bands, aiming to fully utilize the spatial multi-scale information of the data in the denoising process. The second stage of the network consists of two modules: a non-local spectral attention module and a global spectral memory-augmented module. The non-local spectral module aims to extract useful spectral information from adjacent band features. In this module, the current band feature F_b is used to query neighboring band features. The memory-augmented attention module utilizes a global memory library to remember useful information from adjacent bands in the training set and then uses the current band features to directly query the memory library for useful information when restoring the current band. Subsequently, the two spectral attention outputs X_b and Y_b are passed through a convolution with a kernel size of 1 and fused with the current band feature, which further improves the model's ability to extract relevant information from the input data. Finally, the upsampling module is used to fuse the information from three scales and output the final denoising result.

Spatial Information Extraction Module
Because convolutional neural networks can capture the features of similar objects in different local regions by stacking convolutional layers to extract hierarchical features, we use a dense connected network [44] to extract local features, as shown in Figure 2. According to Figure 2, the output F_l of the l-th layer of the DCM is obtained by applying a convolution with kernel W_l, followed by the ReLU activation function, to the fused outputs of the preceding layers, where l denotes the layer index and fuse(·) is the fusion operation. Then, the features F_l extracted from each layer are fused as the final extracted features. The advantage is that multiple layers of local spatial feature representations can be obtained. The fusion is implemented as a 1 × 1 convolution with weight parameter W_fuse applied to the features of all layers, which yields F_DCM, the output of the spatial information extraction module.

Non-Local Memory-Augmented Attention Module
Given the current band I_b ∈ R^{H×W} of the image, its adjacent bands are {I_{b−τ}, ..., I_{b+τ}}, where H and W are the height and width of the HSI, and T = 2τ + 1 represents the neighboring band span of the current band. In the first stage of our network, each band is passed through a DCM to obtain the feature maps {F_{b−τ}, ..., F_{b+τ}}. To model the non-local spectral correlation of the HSI, we designed a cross-band non-local spectral memory-augmented attention network (MAN), as shown in Figure 3.
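The band neighborhood {I_{b−τ}, ..., I_{b+τ}} can be illustrated with a short sketch. The text does not say how bands close to the first or last band are handled, so the version below assumes that the window is shifted inward so that it always contains T = 2τ + 1 bands; the function name `neighbor_band_indices` is hypothetical and not part of the described method.

```python
def neighbor_band_indices(b, tau, num_bands):
    """Return the T = 2*tau + 1 band indices centered on band b.

    Assumption (not stated in the paper): near the first or last band the
    window is shifted inward so that it still contains exactly T bands.
    """
    span = 2 * tau + 1
    if span > num_bands:
        raise ValueError("a window of 2*tau + 1 bands exceeds the number of bands")
    if not 0 <= b < num_bands:
        raise ValueError("band index out of range")
    # Clamp the window start so the whole window stays inside [0, num_bands).
    start = min(max(b - tau, 0), num_bands - span)
    return list(range(start, start + span))
```

For example, with the 102 bands of the PC data and τ = 17 (that is, T = 35), `neighbor_band_indices(50, 17, 102)` returns the indices 33 to 67.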
The MAN consists of two parts, namely, the non-local spectral attention module and the global spectral memory-augmented attention module.
First, the first-stage output feature F_b ∈ R^{C′×H×W} of the current band is used as the query tensor, while the output features {F_{b−τ}, ..., F_{b+τ}} of the adjacent bands are used as the key tensor and value tensor, where C is the channel dimension of the feature map. In Figure 3, Q, K, and V denote the query, key, and value tensors, respectively, and Γ = QK is the correlation matrix computed from the flattened Q and K. When all T bands are correlated at once, Γ has a size of HW × HWT. The biggest drawback of this strategy is the introduction of large matrix operations, which imposes a heavy burden on computer memory. Therefore, to alleviate this problem, we restrict the focus to the neighborhood of a single band; at this point, the scale of Γ is HW × HW, which greatly reduces the computational cost. The goal of non-local spectral attention in hyperspectral denoising tasks is to query as many bands as possible in the spectral dimension that are most similar to the current band. However, if the bands that are found differ significantly from the current band, this not only leads to excessive network computation but also reduces the denoising performance.
To reduce matching errors in spectral bands, similar to [45], a Gaussian weight G is set here, which is multiplied by the correlation matrix Γ, and the center of the Gaussian map is located at the position of the query band. In short, the Gaussian map of each band in the first dimension of Γ is different. Throughout the entire network learning process, the standard deviation of the Gaussian function is used as a learnable parameter to find the optimal denoising performance. The learnable Gaussian map can maintain a good balance between the non-local and local relationships in the spectral dimension. Ultimately, the output X_b of the cross-band non-local attention module is obtained by applying the Gaussian-weighted correlation matrix G ⊗ Γ to the value tensor V, where ⊗ denotes the Hadamard product. However, because adjacent bands often suffer from similar noise interference, the complementary information they provide is limited. Therefore, we attempt to find the best non-local correlation band for the current band from a global perspective to better supplement spectral information for denoising. For this purpose, we set up a memory-augmented spectral attention network here. This module maintains a global memory S, which is a learnable parameter in the network. All non-local correlated bands of the current band are then queried in the memory S through Γ_S = QS ∈ R^{HW×N}, and the final output Y_b is obtained by using Γ_S to aggregate the contents of the memory S. The outputs X_b and Y_b of the cross-band non-local spectral attention module and the memory-augmented module are then fused through a 1 × 1 convolution and added as residuals to the input feature F_b of the current band. Finally, the upsampling module is used to fuse information from three scales and output the final denoising result.
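The two attention branches described above can be sketched on small matrices in plain Python. This is only an illustration under several assumptions that the text does not confirm: the weighted correlations are normalized with a row-wise softmax, the Gaussian weight depends on the band offset from the current band, its standard deviation `sigma` is passed in as a fixed number rather than learned, the memory S has shape C′ × N with N memory entries, and Y_b is formed as softmax(Γ_S) Sᵀ. All function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gaussian_weighted_attention(q, k, v, hw, center_band, sigma):
    """Cross-band attention sketch.

    q: HW x C' query, k: C' x (T*HW) keys, v: (T*HW) x C' values.
    Column j of k belongs to band j // hw (assumed layout).
    """
    gamma = matmul(q, k)  # correlation matrix, HW x (T*HW)
    weighted = [
        [g * math.exp(-((j // hw - center_band) ** 2) / (2.0 * sigma ** 2))
         for j, g in enumerate(row)]  # Hadamard product G * Gamma
        for row in gamma
    ]
    attn = [softmax(row) for row in weighted]  # assumed normalization
    return matmul(attn, v)  # X_b, HW x C'


def memory_attention(q, s):
    """Memory branch sketch: q is HW x C', s is the C' x N memory."""
    gamma_s = matmul(q, s)  # Gamma_S = Q S, HW x N
    attn = [softmax(row) for row in gamma_s]
    s_t = [list(col) for col in zip(*s)]  # N x C'
    return matmul(attn, s_t)  # Y_b, HW x C'
```

Because each softmax row sums to one, every row of X_b is a weighted average of rows of V, and every row of Y_b is a weighted average of memory entries; in the network described above, both outputs would then pass through the 1 × 1 convolution and the residual connection.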

Implementation Details
First, the three scale sizes are 32, 64, and 128. The τ bands before and the τ bands after the current band are taken as inputs for the network, where K = 2τ. Then, the cropped image patches are randomly rotated by 90, 180, and 270 degrees and flipped in the horizontal direction to enhance the diversity of the training data. Next, the Adam optimization algorithm was used to optimize the network. The initial learning rate was set to 2 × 10^−4, and the number of iterations was set to 200.
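The augmentation step can be sketched as follows. The exact procedure is not specified in detail, so this version assumes a random rotation by a multiple of 90 degrees followed by an optional horizontal flip; the helper names are hypothetical.

```python
import random


def rotate90(patch):
    """Rotate a 2-D patch (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]


def hflip(patch):
    """Mirror a 2-D patch in the horizontal direction."""
    return [row[::-1] for row in patch]


def augment(patch, rng):
    """Apply a random rotation (0, 90, 180, or 270 degrees) and an
    optional horizontal flip. Assumed scheme, not taken from the paper."""
    out = [list(row) for row in patch]  # copy so the input is untouched
    for _ in range(rng.randrange(4)):
        out = rotate90(out)
    if rng.random() < 0.5:
        out = hflip(out)
    return out
```

In practice the same random transform would need to be applied to the current band, its K neighboring bands, and the clean target so that they stay spatially aligned; seeding one `random.Random` per patch stack is one way to achieve that.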

Experiments and Discussions
To demonstrate the effectiveness of the designed network, experiments were conducted on four datasets: the Washington DC (WDC), Pavia Center (PC), Pavia University, and Indian Pines data. Among them, the WDC and PC data were used as the basic data for synthesizing noise, and the Pavia University and Indian Pines data were used as real-scene HSIs to further verify the practicality of all denoising methods. Because the proposed method is an HSI denoising method based on spatial and spectral similarity constraints, five popular comparison methods were selected here, namely, LRMR [26], NAILRMA [27], LRTV [30], Partial-DNet [34], and SERT [33]. It is worth noting that LRMR, NAILRMA, and LRTV are all traditional optimization-based methods that improve the performance of algorithms by linearly characterizing the spatial-spectral structure of HSIs. In contrast, Partial-DNet and SERT learn the inter-relationships within HSIs through deep networks, and they have advantages in hyperspectral denoising tasks due to their ability to learn deep features of the data. This section demonstrates the effectiveness of the proposed method in exploring spatial and spectral structures by comparing the denoising results of the six algorithms on different datasets.
To train the denoising network, the MSE loss between the residual predicted by the network and the true residual is minimized. Here, N denotes the number of training image pairs; φ_i = y_i − x_i is the residual to be learned; x_i, y_i, and k_i denote the ith band of the clean image, the ith noisy band, and the K bands in the neighborhood of the current band, respectively; and θ denotes the trainable parameters of the network. This section also uses three widely used quantitative evaluation indicators to measure the denoising performance of the algorithms, namely, the PSNR, SSIM, and SAM. Specifically, the PSNR and SSIM compare the pixels of HSI images before and after denoising, while the SAM compares the spectral angular distance of images before and after denoising. To make the comparison of all algorithms fairer, all experiments were conducted on the PyTorch platform using an NVIDIA RTX 3080 GPU.
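The residual loss and two of the evaluation indicators can be written down in a few lines. The sketch below operates on flat Python lists; the loss normalization (a plain mean over all pixels) and the peak value of 1.0 in the PSNR are assumptions, and the SAM is returned in radians because the unit used in the tables is not stated. The function names are hypothetical.

```python
import math


def residual_mse_loss(predicted_residuals, noisy_bands, clean_bands):
    """Mean squared error between the predicted residuals and
    phi_i = y_i - x_i, averaged over all pixels of all training pairs."""
    total, count = 0.0, 0
    for r, y, x in zip(predicted_residuals, noisy_bands, clean_bands):
        for r_p, y_p, x_p in zip(r, y, x):
            total += (r_p - (y_p - x_p)) ** 2
            count += 1
    return total / count


def psnr(clean, denoised, peak=1.0):
    """Peak signal-to-noise ratio in dB for one band (flat pixel lists)."""
    mse = sum((c - d) ** 2 for c, d in zip(clean, denoised)) / len(clean)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def sam(clean_spectrum, denoised_spectrum):
    """Spectral angle (radians) between the spectra of one pixel."""
    dot = sum(c * d for c, d in zip(clean_spectrum, denoised_spectrum))
    norm = (math.sqrt(sum(c * c for c in clean_spectrum))
            * math.sqrt(sum(d * d for d in denoised_spectrum)))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))
```

An image-level PSNR or SAM is then typically obtained by averaging over bands or pixels, respectively; whether the reported numbers were averaged this way is not stated in the text.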

The Synthetic Data Experiments and Discussions
The WDC dataset contains hyperspectral data with a size of 1280 × 303 × 191. A randomly sampled region with a size of 1080 × 303 × 191 was used for training, while the rest was used for testing. The original size of the PC dataset was 1096 × 715 × 102, of which 1096 × 480 × 102 was used for training, and the rest was used for testing. When synthesizing data, similar to [46], four different levels of noise were added to the WDC and PC data to test how well the denoising methods handle different levels of noise: (1) Case 1: In each band, zero-mean Gaussian noise with a variance of 0.1 is added; (2) Case 2: In each band, zero-mean Gaussian noise with a variance in the range of 0.1∼0.2 is added; (3) Case 3: On the basis of Case 2, 20 bands are randomly selected and receive impulse noise with a variance of 0.2; (4) Case 4: On the basis of Case 3, deadline noise with a width of 1-3 is added to 20 bands, with 10 bands selected from the bands with added impulse noise.
The quantitative results of all compared algorithms and the proposed method for HSI denoising on the four levels of noise data are shown in Tables 1 and 2. The best denoising result for each metric is displayed in bold. From Tables 1 and 2, it can be seen that the proposed method is significantly superior to the LRMR, NAILRMA, and LRTV methods on the WDC dataset and the PC dataset. Because all three methods are based on low-rank matrices, they lose the spatial structure information of the data during the denoising process. In analyzing the denoising results of different types of noise data, it is evident that deep learning-based HSI denoising methods are generally superior to traditional optimization-based methods. Deep learning-based denoising methods fully consider the spectral and spatial correlations of hyperspectral images. Compared with the deep learning-based Partial-DNet and SERT methods, the proposed method achieves better results in most of the three evaluation indicators. The above results indicate that the proposed method is beneficial in exploring the spatial-spectral relationship of HSIs. Its use of non-local memory-enhanced spectral attention can fully learn the relationship between the spectral dimensions of data, thereby better restoring clean images. In addition, it can be summarized from Tables 1 and 2 that as the types of noise become increasingly complex, the performance of all algorithms decreases, but the proposed method performs more stably when dealing with different levels of noise.
To comprehensively compare the denoising effects, the visualization results of denoising for Case 2 data are shown in Figures 4 and 5. From Figures 4 and 5, it can be seen that there is some residual noise in the denoising results of LRMR, NAILRMA, and LRTV. Partial-DNet and SERT have good denoising effects, but their denoised images still contain some noise and do not fully preserve structural information, losing many details. In contrast, the proposed algorithm has better reconstruction results, especially in some detail and edge areas, which are restored more clearly.
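The four noise cases can be sketched with the standard library alone. Several details are not given in the text and are assumptions here: the value 0.2 for the impulse noise is interpreted as the fraction of corrupted pixels (salt-and-pepper), each affected band receives a single dead line of width 1 to 3 whose pixels are set to zero, and in Case 4 the 10 deadline bands without impulse noise are drawn from the remaining bands. The cube is stored as `cube[band][row][col]` with values in [0, 1], and all function names are hypothetical.

```python
import random


def add_gaussian_noise(cube, var_low, var_high, rng):
    """Add zero-mean Gaussian noise; one variance is drawn per band."""
    noisy = []
    for band in cube:
        sigma = rng.uniform(var_low, var_high) ** 0.5
        noisy.append([[p + rng.gauss(0.0, sigma) for p in row] for row in band])
    return noisy


def add_impulse_noise(cube, bands, density, rng):
    """Salt-and-pepper noise in place; density is an assumed reading of 0.2."""
    for b in bands:
        for row in cube[b]:
            for c in range(len(row)):
                if rng.random() < density:
                    row[c] = rng.choice((0.0, 1.0))


def add_deadlines(cube, bands, rng):
    """Zero out one vertical line of width 1 to 3 in each given band."""
    for b in bands:
        width = rng.randint(1, 3)
        start = rng.randrange(0, len(cube[b][0]) - width + 1)
        for row in cube[b]:
            for c in range(start, start + width):
                row[c] = 0.0


def synthesize(cube, case, seed=0):
    """Return a noisy copy of cube for Case 1, 2, 3, or 4."""
    rng = random.Random(seed)
    if case == 1:
        return add_gaussian_noise(cube, 0.1, 0.1, rng)
    noisy = add_gaussian_noise(cube, 0.1, 0.2, rng)
    if case >= 3:
        impulse_bands = rng.sample(range(len(cube)), 20)
        add_impulse_noise(noisy, impulse_bands, 0.2, rng)
    if case == 4:
        others = [b for b in range(len(cube)) if b not in impulse_bands]
        dead_bands = rng.sample(impulse_bands, 10) + rng.sample(others, 10)
        add_deadlines(noisy, dead_bands, rng)
    return noisy
```

The sketch needs at least 30 bands for Case 4, which both the WDC and PC data satisfy. Pure Python lists are far too slow for full-size cubes; the same logic would normally be vectorized with an array library.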

The Real Data Experiments and Discussions
Similar to [34], the Indian Pines dataset and Pavia University dataset were used as real data to test the performance of the five comparison algorithms and the proposed method. Figure 6a is a pseudocolor image from the original Pavia University HSI affected by noise interference, and Figure 6b-g are the denoised results of LRMR, NAILRMA, LRTV, Partial-DNet, SERT, and the proposed algorithm, respectively. According to the previous literature [34], the Pavia University dataset is only weakly affected by noise interference. Therefore, the results of the traditional denoising algorithms LRMR, NAILRMA, and LRTV are similar to those of the deep learning-based denoising methods Partial-DNet and SERT on the Pavia University dataset. After zooming in on all the denoising results, it can be found that our proposed algorithm maintained better and clearer texture details, whereas some oversmoothing occurred with the Partial-DNet and SERT methods.
As is well known, the Indian Pines dataset is highly affected by noise interference. Therefore, in Figure 7, there are evident differences between the original noisy image, Figure 7a, and the denoising results of the six algorithms, Figure 7b-g. It can be concluded that the three optimization-based comparison methods, LRMR, NAILRMA, and LRTV, cannot completely remove complex noise. After denoising, there is still a large amount of noise in their results. In contrast, the two methods based on deep learning have much better visual effects than the traditional methods, but compared to the proposed method they lose many details in some texture areas, where some oversmoothing occurs, as shown in Figure 7. Overall, the proposed hyperspectral denoising algorithm based on non-local memory-augmented spectral attention outperforms the comparison methods in terms of visual quality on real datasets, with better denoising and detail preservation.

The Ablation Experiments and Discussions
To obtain the best model, we conducted ablation experiments and confirmed the optimal parameter T on Case 2 data from PC data. In the proposed network, a total of T bands, including the current band and its adjacent bands, are used as inputs to the network to improve the spectral efficiency of the denoising algorithm. Therefore, the number of input bands T is a key parameter in the denoising network. We explored the influence of T on denoising results, and Table 3 presents the quantitative evaluation results of the algorithm under different spectral band inputs. It can be clearly seen from Table 3 that the denoising performance of the proposed method first improves as T increases, and the denoising effect is best when T = 35. As T increases further, the algorithm performance gradually decreases. Therefore, T was set to 35 here. The parameter experiments also demonstrate that non-local spectral information is crucial for the proposed method.
In addition, Table 4 provides quantitative evaluation indicators for the ablation experiments on the individual parts of the network, which are used to analyze the effectiveness of the proposed method. The network mainly consists of two parts. The first part is the spatial information extraction module, which aims to use pixel and spatial relationships to assist image restoration. To verify its effectiveness, we removed this module from the overall network framework and denote this variant as w/o in Table 4. From Table 4, we can conclude that its PSNR decreased by about 0.2 dB compared to the original method, the SSIM decreased by 0.0007, and the SAM increased by 0.0612. This indicates that the proposed spatial information extraction module is beneficial for hyperspectral denoising tasks. The second part is the non-local memory-enhanced spectral attention module. To verify its effectiveness, we removed this module from the overall network framework and likewise denote this variant as w/o in the table. It can be seen that its PSNR decreased by about 1.18 dB compared to the original method, the SSIM decreased by 0.005, and the SAM increased by 0.3865, confirming the effectiveness of the non-local memory-enhanced spectral attention module. Overall, the above ablation experiments demonstrated the superiority of the proposed network in denoising tasks.

Conclusions
We proposed a hyperspectral denoising method that has better robustness against complex noise. Unlike general denoising methods that use spatial-spectral correlations, the proposed method uses a designed cross-band non-local module to search for bands that can provide supplementary information for the current band. Due to the introduction of a memory-augmented module, the proposed network can also remember the detailed information of all bands during training, thereby supplementing the information loss of non-local similarity bands in the global spectral dimension. In addition, the proposed method uses a dense connected network to extract multi-scale spatial information from hyperspectral data. Compared with state-of-the-art denoising algorithms, the proposed network performed better on both synthetic and real data. Unfortunately, the method proposed in this paper is based on 2-D images. An original hyperspectral image is a 3-D image, and the proposed memory-augmented module-based denoising method still unfolds the 3-D image into a 2-D image for processing, irrecoverably losing some correlation information. In the future, we will explore methods for directly denoising 3-D images.

Figure 1. Structure of the denoising method proposed in this paper.
In Figure 3, Q ∈ R^{C′×H×W}, K ∈ R^{C′×T×H×W}, and V ∈ R^{C′×T×H×W} represent the query tensor, the key tensor, and the value tensor, respectively, where C′ = C/2. Generally, Q and K are directly reshaped into the matrices Q ∈ R^{HW×C′} and K ∈ R^{C′×HWT}. The correlation matrix Γ is then calculated as Γ = QK.

Figure 3. Structure of the MAN.
The size of the Indian Pines dataset, which was collected by AVIRIS, is 145 × 145 × 206. Meanwhile, the size of the Pavia University dataset is 200 × 200 × 103. When testing on the real datasets, the network trained under Case 2 on the artificially synthesized data was used as the model for the two real datasets. Because real noisy images have no corresponding clean images, a quantitative comparative analysis could not be performed. Therefore, only the visual comparison results of the comparison algorithms and the proposed method are presented here. The visual comparison results of each method on the Indian Pines dataset and Pavia University dataset are shown in Figures 6b-g and 7b-g.

Table 1. Results on the WDC Mall data. Bold represents the optimal result.

Table 3. Results of different T values on Case 3 of the PC data. Bold represents the optimal result.

Table 4. Results of ablation experiments on Case 3 of the PC data. Bold represents the optimal result.