TRQ3DNet: A 3D Quasi-Recurrent and Transformer Based Network for Hyperspectral Image Denoising

Abstract: We propose a new deep neural network termed TRQ3DNet, which combines a convolutional neural network (CNN) and a transformer for hyperspectral image (HSI) denoising. The network consists of two branches. One is built from 3D quasi-recurrent blocks, including convolution and a quasi-recurrent pooling operation. Specifically, the 3D convolution extracts the spatial correlation within a band and the spectral correlation between different bands, while the quasi-recurrent pooling operation exploits the global correlation along the spectrum. The other branch is composed of a series of Uformer blocks. The Uformer block uses the window-based multi-head self-attention (W-MSA) mechanism and the locally enhanced feed-forward network (LeFF) to exploit the global and local spatial features. To fuse the features extracted by the two branches, we develop a bidirectional integration bridge (BI bridge) that better preserves image feature information. Experimental results on synthetic and real HSI data show the superiority of our proposed network. For example, in the case of Gaussian noise with σ = 70, the PSNR of our method increases by about 0.8 dB over other state-of-the-art methods.


Introduction
Hyperspectral sensors capture information in different continuous wavelengths, such as ultraviolet, visible, and near-infrared, simultaneously, and produce hyperspectral images (HSIs) with numerous bands, which contain richer spatial and spectral information than RGB images and better represent real scenes. Thus, HSIs can be applied to multiple remote sensing tasks, including classification [1][2][3][4][5], segmentation [6,7], spectral unmixing [8], etc. However, due to inevitable limitations in sensor sensitivity, photon effects, and other physical mechanisms, raw HSIs are often corrupted by various types of noise, i.e., Gaussian, stripe, deadline, impulse, or a mixture of them, which exerts a negative influence on downstream HSI applications. Therefore, it is crucial to conduct HSI denoising to achieve better performance.
HSI denoising is an important task in the areas of image processing and remote sensing. Various HSI denoising methods have been proposed, and an effective toolbox is provided in previous work [9]. Current HSI denoising methods can be categorized into two classes: model-based methods and deep-learning-based methods. The model-based methods try to exploit prior knowledge of HSIs; typical methods are dictionary-learning-based (i.e., TDL [10]), filtering-based (i.e., BM4D [11]), tensor-based (i.e., ITSReg [12], LLRT [13]), low-rank-matrix-recovery-based (i.e., LRMR [14], LRTV [15], NMoG [16]), and low-rank-tensor-based (i.e., TDTV [17]). We list some typical methods below. BM4D derives from BM3D [18] and contains a hard-thresholding stage and a Wiener-filtering stage, each with three similar steps: grouping, collaborative filtering, and aggregation. Peng et al. [10] propose tensor dictionary learning (TDL) as an extension of singular value decomposition (SVD).
The main contributions of this work are as follows:
• We propose TRQ3DNet, a residual encoder-decoder network for HSI denoising, which consists of two branches: one based on convolution and the other on a transformer. The model can extract both the global correlation along the spectrum and the local-global spatial features.
• We present a bidirectional integration bridge, which aggregates the global features from the convolution layers and the local features from the window-based attention mechanism, so as to obtain a better representation of image features.
• We conduct both synthetic and real HSI denoising experiments. Quantitative evaluation results reveal that our model achieves better performance than other state-of-the-art model-based and deep-learning-based methods.
The rest of the paper is organized as follows: We introduce the proposed method in Section 2, and the experimental results are presented in Section 3. In Section 4, we conduct analysis and discussion, and in Section 5 we draw the conclusion. Code is available at https://github.com/LiPang/TRQ3DNet (accessed on 8 September 2022).

Notations
An HSI Y degraded by various types of noise can be described by the linear model
Y = X + N,
where X is the ideal noise-free image and N is the additive noise, e.g., Gaussian noise. Y, X, N ∈ R^(B×H×W), where H, W, and B represent the height, the width, and the number of bands of the HSI, respectively. The goal of HSI denoising is to recover the clean X from the noisy observation Y.
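The degradation model above can be sketched numerically; this is a toy illustration (band count, spatial size, and noise level are illustrative assumptions, not values from the paper):

```python
import numpy as np

# Minimal sketch of the degradation model Y = X + N for an HSI with
# B bands and an H x W spatial grid (toy values).
B, H, W = 31, 64, 64
rng = np.random.default_rng(0)

X = rng.uniform(0.0, 1.0, size=(B, H, W))    # ideal noise-free image
sigma = 50.0 / 255.0                          # assumed Gaussian noise level
N = rng.normal(0.0, sigma, size=(B, H, W))    # additive Gaussian noise
Y = X + N                                     # observed noisy HSI

# Denoising seeks an estimate of the clean X given only the noisy Y.
print(Y.shape)  # (31, 64, 64)
```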

Overall Architecture
The overall architecture of TRQ3DNet is shown in Figure 1. Our network consists of an extractor, a reconstructor, and four pairs of symmetric TRQ3D units. The basic unit of the residual encoder-decoder network consists of three parts: a 3D quasi-recurrent block (QRU3D block), a Uformer block, and the bidirectional integration bridge (BI bridge).
Figure 1. The overall architecture of the TRQ3DNet.
The input degraded HSIs X are fed into the QRU3D and Uformer extractors separately to extract low-level features. In the QRU3D extractor, the input first goes through a bidirectional QRU3D layer, yielding an output named X_qru. In the Uformer extractor, the input undergoes a dimensional transformation (2D convolution with LeakyReLU [38] activation), yielding X_trans. Next, X_qru and X_trans are fed into a sequence of TRQ3D units, each composed of a QRU3D block and a Uformer block. Before X_trans enters the Uformer block, information from the QRU3D block is integrated: we apply a 3D and then a 2D convolution to X_qru so that it has the same dimensions as X_trans (by setting the proper stride of the convolution kernel), and the weighted element-wise sum of the two parts is fed into the Uformer block. The resulting output serves as the input for the next Uformer block. At the same time, the original X_qru goes through the QRU3D block and is added to the output of the Uformer block (again via a weighted element-wise sum), which forms the input of the next QRU3D block. Eventually, the restored image is obtained via the QRU3D and Uformer reconstructors, which are structured similarly to the extractors. In summary, the degraded HSIs are passed into the extractors to extract low-level features. After being processed by a succession of TRQ3D units, clean HSIs are produced by the reconstructors. Each TRQ3D unit takes two results from the previous unit as inputs, one from the QRU3D block and the other from the Uformer block. The two inputs are separately fed into the two blocks with information exchange, generating two outputs that are fed into the next TRQ3D unit.
We set stride = 2 for the 3D convolution in the first half of the blocks and stride = 1/2 (i.e., a fractionally strided, transposed convolution) for the rest. By adapting the stride of the convolution kernel, we perform downsampling in the encoder part and upsampling in the decoder part. This lessens the computation cost, reduces the risk of overfitting, and makes the network more adaptable to larger datasets. In the following, we introduce the three components of the network in detail.
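The data flow through one TRQ3D unit described above can be sketched with placeholder layers. This is a hedged illustration: the convolutions here stand in for the real QRU3D and Uformer blocks, the channel counts are assumptions, and the learned weighting of the element-wise sums is omitted.

```python
import torch
import torch.nn as nn

# Schematic sketch of one TRQ3D unit's data flow (placeholder layers;
# channel sizes are illustrative assumptions, not the paper's values).
class TRQ3DUnitSketch(nn.Module):
    def __init__(self, c_qru=16, c_trans=32, bands=31):
        super().__init__()
        self.qru3d = nn.Conv3d(c_qru, c_qru, 3, padding=1)       # stands in for the QRU3D block
        self.uformer = nn.Conv2d(c_trans, c_trans, 3, padding=1) # stands in for the Uformer block
        # bridge QRU3D -> Uformer: 3D conv to 1 channel, squeeze, 2D conv
        self.to2d_3d = nn.Conv3d(c_qru, 1, 3, padding=1)
        self.to2d_2d = nn.Conv2d(bands, c_trans, 3, padding=1)
        # bridge Uformer -> QRU3D: 2D conv to `bands` channels, unsqueeze, 3D conv
        self.to3d_2d = nn.Conv2d(c_trans, bands, 3, padding=1)
        self.to3d_3d = nn.Conv3d(1, c_qru, 3, padding=1)

    def forward(self, x_qru, x_trans):
        # x_qru: (N, C_qru, B, H, W), x_trans: (N, C_trans, H, W)
        bridged = self.to2d_2d(self.to2d_3d(x_qru).squeeze(1))
        out_trans = self.uformer(x_trans + bridged)              # Uformer sees QRU3D features
        back = self.to3d_3d(self.to3d_2d(out_trans).unsqueeze(1))
        out_qru = self.qru3d(x_qru) + back                       # QRU3D sees Uformer features
        return out_qru, out_trans

unit = TRQ3DUnitSketch()
q, t = unit(torch.randn(1, 16, 31, 8, 8), torch.randn(1, 32, 8, 8))
print(q.shape, t.shape)
```

Note how each branch keeps its own tensor layout (5D for QRU3D, 4D for Uformer), with the bridge convolutions doing the dimension matching.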

3D Quasi-Recurrent Block
As seen in Figure 2, the 3D quasi-recurrent (QRU3D) block is one of the components of the TRQ3D unit, and is composed of two modules: a 3D convolution module and a quasi-recurrent pooling module.
(1) 3D Convolution Module: We apply two sets of 3D convolution [39,40] to the input, activate the convolution outputs with two different nonlinear functions, and generate two tensors, named the candidate tensor Z and the forget gate tensor F. The process is formulated as
Z = σ(W_Z ∗ X),
F = sigmoid(W_F ∗ X),
where X ∈ R^(C_in×B×H×W) is the input feature map from the last layer, C_in is the number of input channels, and ∗ denotes 3D convolution. σ represents a certain activation function, e.g., tanh, ReLU, or no activation. {Z, F} ∈ R^(C_out×B×H×W), where C_out is the number of output channels, and {W_Z, W_F} ∈ R^(C_out×C_in×3×3×3) are 3D convolution kernels. Notice that we only use the sigmoid for the gate tensor F, in order to map the output to values between 0 and 1. Compared with 2D convolution, 3D convolution can not only aggregate spatial domain information, but also exploit the spectral information of the input.
(2) Quasi-Recurrent Pooling Module: Considering that the 3D convolution can only aggregate the information in adjacent bands, motivated by QRNN3D [22], we introduce the quasi-recurrent pooling operation and a dynamic gating mechanism in order to fully exploit the global correlation along all the bands. We split the candidate tensor and the forget gate tensor along the spectral direction, obtaining sequences Z = {z_1, z_2, . . . , z_B} and F = {f_1, f_2, . . . , f_B}. Then, these sequences undergo the quasi-recurrent pooling operation:
h_i = f_i ⊙ h_{i−1} + (1 − f_i) ⊙ z_i, i = 1, 2, . . . , B,
where h_i is the i-th hidden state (h_0 = 0), and ⊙ represents the Hadamard product.
The value of f_i controls the relative weight of the candidate tensor z_i and the state from the last step, h_{i−1}. The sigmoid is used as the activation function for the forget gate. All h_i are concatenated along the spectral dimension to generate the output. The benefit of quasi-recurrent pooling is that the module automatically preserves the information of each spectral slice z_i through the training process, achieving global correlation along all spectra. Notice that the hidden state depends only on the current band of the input feature map, so the gate tensor relies more on the input as well as the parameters learned during training.
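The pooling recurrence can be written as a short loop over the bands; this is a minimal sketch (toy tensor sizes, no batch dimension), assuming the standard QRNN-style f-pooling recurrence h_i = f_i ⊙ h_{i−1} + (1 − f_i) ⊙ z_i with h_0 = 0:

```python
import torch

def quasi_recurrent_pool(z, f):
    """Quasi-recurrent pooling along the spectral axis.

    z, f: (C, B, H, W) candidate and forget-gate tensors, with f in (0, 1).
    Returns the hidden states h_1..h_B concatenated along the band axis.
    """
    h = torch.zeros_like(z[:, 0])         # h_0 = 0
    states = []
    for i in range(z.shape[1]):           # iterate over the B bands
        h = f[:, i] * h + (1 - f[:, i]) * z[:, i]
        states.append(h)
    return torch.stack(states, dim=1)     # back to (C, B, H, W)

z = torch.randn(4, 31, 8, 8)
f = torch.sigmoid(torch.randn(4, 31, 8, 8))  # sigmoid keeps the gate in (0, 1)
out = quasi_recurrent_pool(z, f)
print(out.shape)
```

Because h_0 = 0, the first output band reduces to (1 − f_1) ⊙ z_1; later bands blend in the running state, which is how information propagates along the whole spectrum.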

Uformer Block
As shown in Figure 3, each Uformer block is a stack of two LeWin transformer blocks [37]. Each LeWin transformer block has two basic modules: window-based multi-head self-attention (W-MSA) and a locally-enhanced feed-forward network (LeFF), which are described as
X̂^(l) = W-MSA(LN(X^(l−1))) + X^(l−1),
X^(l) = LeFF(LN(X̂^(l))) + X̂^(l),
where X̂^(l) and X^(l) represent the output feature maps of the l-th W-MSA and LeFF modules, respectively, and LN represents layer normalization [41].
(1) Window-based Multi-head Self-Attention: The projection layer transforms the band dimension of the input feature maps from X ∈ R^(B×H×W) to X ∈ R^(B_out×H×W). Each band of the feature maps is treated as a 2D map, which is partitioned into non-overlapping windows of size M × M, expressed as X = {X_1, X_2, ..., X_N}, where N = HW/M^2 is the number of windows (also called patches), and X_i ∈ R^(B_out×M×M), i = 1, 2, ..., N. Then, we flatten these image patches to size M^2 × B_out and calculate the multi-head self-attention on each of them. Given the total number of heads K, the dimension of the k-th head is D_k = B_out/K, and the k-th self-attention of X_i is calculated as
Y_k^i = Attention(X_i W_k^Q, X_i W_k^K, X_i W_k^V),
where {W_k^Q, W_k^K, W_k^V} ∈ R^(B_out×D_k) are learnable weight matrices. When calculating the self-attention, we also add a relative position bias B ∈ R^(M^2×M^2) to each head, following the work of [37]. Thus, the attention mechanism is
Attention(Q, K, V) = SoftMax(QK^T/√D_k + B)V,
where Q, K, and V represent the query, key, and value, respectively. The values in B are taken from a smaller bias matrix B̂ ∈ R^((2M−1)×(2M−1)) with learnable parameters [42,43]. For the N patches X_i ∈ R^(M^2×B_out), i = 1, 2, ..., N, we obtain the corresponding N output feature maps Y_k^i. Finally, the outputs of the K heads are concatenated together, followed by a linear transform, to generate the result of the W-MSA module.
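The window partition and per-window attention can be sketched as follows for a single head; the relative position bias is omitted and the learned projections are replaced by random matrices, so this is an illustration of the mechanism rather than the paper's implementation (sizes are toy assumptions):

```python
import torch
import torch.nn.functional as F

B_out, H, W, M = 8, 16, 16, 4            # channels, spatial size, window size
x = torch.randn(B_out, H, W)

# Partition the feature map into N = HW / M^2 non-overlapping windows and
# flatten every window into M^2 tokens of dimension B_out.
windows = (x.reshape(B_out, H // M, M, W // M, M)
             .permute(1, 3, 2, 4, 0)
             .reshape(-1, M * M, B_out))  # (N, M^2, B_out)
print(windows.shape)  # torch.Size([16, 16, 8])

# Single-head scaled dot-product attention inside each window.
Dk = B_out
Wq, Wk, Wv = (torch.randn(B_out, Dk) for _ in range(3))
q, k, v = windows @ Wq, windows @ Wk, windows @ Wv
attn = F.softmax(q @ k.transpose(-2, -1) / Dk ** 0.5, dim=-1)
y = attn @ v                              # (N, M^2, Dk)
print(y.shape)  # torch.Size([16, 16, 8])
```

Attention is computed only within each M × M window, so the cost grows linearly with the number of windows rather than quadratically with HW.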
(2) Locally-enhanced Feed-Forward Network: The feed-forward network (FFN) in the standard transformer provides linear dimensional transformation and nonlinear activation to the tokens from the W-MSA module, which enhances the capacity of the feature representation. Its limitation is that the spatial correlation among neighboring pixels is ignored. To overcome this problem, we replace the FFN with a locally-enhanced feed-forward network (LeFF), which applies a depth-wise convolution to extract the spatial information. The process is as follows. First, the input feature maps are fed into a linear layer and projected to a higher dimension. Next, we reshape the features into 2D feature maps and use a 3 × 3 depth-wise convolution to capture local information. Then, we flatten the features back and shrink the channels via another linear layer to match the input dimension. For each linear and convolution layer, GELU [44] is used as the activation function, since it has been shown to achieve comparable denoising results to other activation functions [36,37].
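The expand / depth-wise-convolve / shrink sequence described above can be sketched as a small module; the hidden width and spatial size are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Sketch of the locally-enhanced feed-forward network (LeFF): linear
# expansion, a 3x3 depth-wise conv on the re-formed 2D map, then a
# linear projection back to the input dimension.
class LeFFSketch(nn.Module):
    def __init__(self, dim=8, hidden=32, hw=16):
        super().__init__()
        self.hw = hw                                # spatial side length (assumed)
        self.expand = nn.Sequential(nn.Linear(dim, hidden), nn.GELU())
        self.dwconv = nn.Sequential(
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),  # depth-wise
            nn.GELU())
        self.shrink = nn.Linear(hidden, dim)        # back to the input dim

    def forward(self, tokens):                      # tokens: (N, HW, dim)
        n, l, _ = tokens.shape
        x = self.expand(tokens)                     # (N, HW, hidden)
        x = x.transpose(1, 2).reshape(n, -1, self.hw, self.hw)
        x = self.dwconv(x)                          # local spatial mixing
        x = x.flatten(2).transpose(1, 2)            # back to token layout
        return self.shrink(x)

leff = LeFFSketch()
out = leff(torch.randn(2, 16 * 16, 8))
print(out.shape)  # torch.Size([2, 256, 8])
```

Setting `groups=hidden` makes the convolution depth-wise: each channel is filtered independently, which is what lets LeFF add spatial locality at low cost.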

Bidirectional Integration Bridge
We propose a bidirectional integration bridge (BI bridge) to combine and enhance the representation of image features. The 3D convolution extracts the local spatial correlation and the spectral correlation between neighboring bands. The quasi-recurrent pooling exploits the global correlation along the spectrum. The Uformer block exploits the global spatial features. The BI bridge combines these three parts to better preserve image details. We point out that the size of the QRU3D output feature map is C_qru × B × H × W, while that of the Uformer output is C_trans × H × W. On the one hand, the input of the next Uformer layer is the weighted sum of the outputs of the current Uformer layer and the QRU3D layer. In order to match the output sizes of the two modules, a dimensional transformation is necessary. We first conduct a 3D convolution, shaping the feature map from C_qru × B × H × W to 1 × B × H × W, and squeeze it to B × H × W. Then, a 2D convolution transforms the feature map from B × H × W to C_trans × H × W. Finally, the element-wise sum of this feature map and the Uformer layer output forms the input of the next Uformer layer. This dimensional transformation, which adapts the QRU3D feature map to the Uformer block, constitutes one direction of the bridge.
On the other hand, the input of the next QRU3D layer is also the weighted sum of the outputs of the current Uformer layer and the QRU3D layer. We use the reverse of the transformation above. First, a 2D convolution transforms the Uformer layer output from C_trans × H × W to B × H × W. Then, we unsqueeze it to 1 × B × H × W and apply a 3D convolution to reach size C_qru × B × H × W. Finally, we perform a 1 × 1 × 1 3D convolution on the element-wise sum of this feature map and the QRU3D layer output, generating the input of the next QRU3D layer. This dimensional transformation, which adapts the Uformer feature map to the QRU3D block, constitutes the other direction of the bridge. Essentially, the BI bridge consists of two directional (from QRU3D to Uformer, and the reverse) dimensional transformations based on 2D and 3D convolutions.

Training Loss Function
In this part, we introduce the loss function used in the training process. The l_1 loss and l_2 loss are often used in order to strike a balance between noise removal and detail preservation [23]. In this paper, we adopt the l_2 loss as the final loss function, defined as
L = (1/N) Σ_{i=1}^{N} ||X_i − X_i^(gt)||_2^2,
where N is the number of training image patches, and X_i and X_i^(gt) represent the output denoised image patch and the ground truth image patch, respectively.

Experimental Setup
In this section, we conduct several experiments to evaluate the model. In the following, we introduce the datasets we use, the competing denoising methods, the training strategies, and the evaluation metrics.
Dataset: We conduct synthetic experiments on the ICVL [45], CAVE [46], Pavia Centre [47], and Pavia University [47] datasets, in which the HSIs can be regarded as clean. In addition, real HSI denoising experiments are conducted on real remotely sensed hyperspectral datasets, e.g., Urban [48] and Indian Pines [49]. The overall information of the involved datasets is given in Table 1. The ICVL dataset, which contains 201 images, is randomly divided into three disjoint parts for training, validation, and testing. Taking the training time into account, we use 100 images to train, 20 images to validate, and the rest to test our model. To enlarge the training dataset, we crop the images into multiple overlapping cubes of spatial size 64 × 64 with 31 bands (the same as the original number of bands) to preserve the integrity of the spectral domain. In addition, transformations such as rotation and scaling are adopted, so we generate about 50 k training HSI patches in total. For testing and validation, the images are cropped to 512 × 512 × 31 from their domain region. To better assess the robustness of the trained model, in addition to 20 samples from the ICVL dataset, we select 10 samples from CAVE, adding various noises (e.g., Gaussian, impulse, mixture) to form the validation set.
The CAVE dataset is made up of 31 hyperspectral images of spatial size 512 × 512 with 31 bands. To better evaluate the performance of different models, we perform threefold cross validation on the CAVE dataset, using 21 images for training and the remaining 11 images for testing in each fold. Similar to the procedure for the ICVL training set, we crop the training images into cubes of size 64 × 64 with 31 bands and apply rotation and scaling transformations, generating 2562 training patches.
Remotely sensed hyperspectral datasets (i.e., Pavia Centre, Pavia University, Indian Pines, and Urban) are used to further verify the stability and adaptability of our model. Pavia Centre has 102 spectral bands and Pavia University has 103, and both datasets were captured by the ROSIS sensor. Indian Pines and Urban were obtained using the 224-band AVIRIS sensor and the 210-band HYDICE hyperspectral system, respectively.
Competing Methods: We conduct synthetic experiments as well as real HSIs denoising experiments. In both sets of experiments, several state-of-the-art model-based methods and deep-learning-based models are compared.
To evaluate an HSI denoising method, two types of synthetic experiments are usually needed. The first is the Gaussian noise experiment, and the second is the complex noise case. For each of the two cases, the state-of-the-art (SOTA) methods are quite different, which has been recognized by existing work [22]. Therefore, in our experiments, we use different SOTA methods for the two cases. In the synthetic experiments, we compare our method with the model-based methods (BM4D [11], TDL [10], ITSReg [12], LLRT [13]) and the deep-learning-based methods (HSID-CNN [20], swinir [36], QRNN3D [22]) in the Gaussian noise case, and with the model-based methods (LRMR [14], LRTV [15], NMoG [16], and TDTV [17]) and the same deep-learning-based methods in the complex noise case. All deep-learning-based approaches are trained and tested under the same conditions to ensure fairness.
Network Training: The training process contains three successive stages of increasing difficulty, from Gaussian noise with a known intensity to the blind case, and from a single noise type to mixture noise. The model is trained by minimizing the mean square error (MSE) between the denoised HSIs and the corresponding ground truth. The parameters are optimized by the Adam optimizer [50]. The whole project is based on the deep learning framework PyTorch, on a machine with a Tesla V100 PCIe GPU, an Intel(R) Xeon(R) CPU E5-2690 v4 at 2.60 GHz, and 503 GB RAM.
Evaluation Metrics: We adopt three common metrics to evaluate the performance of the models in the synthetic experiments, namely PSNR, SSIM [51], and SAM [52], and report the time cost of each model. The first two metrics measure spatial similarity, and the last measures spectral similarity. Given a reference image I and the reconstructed image Î, PSNR is calculated as
PSNR = 10 · log_10((2^n − 1)^2 / MSE), MSE = (1/(HW)) Σ_{i,j} (I_{i,j} − Î_{i,j})^2,
where H and W represent the height and width of the image, and n is the number of bits per pixel, which is usually 8. The core formula of SSIM is
SSIM = ((2 μ_I μ_Î + C_1)(2 σ_IÎ + C_2)) / ((μ_I^2 + μ_Î^2 + C_1)(σ_I^2 + σ_Î^2 + C_2)),
where μ_I and μ_Î are the mean values of images I and Î, σ_I^2 and σ_Î^2 are their respective variances, σ_IÎ is the covariance of I and Î, and C_1 and C_2 are small constants that stabilize the division. The higher the PSNR and SSIM, and the lower the SAM, the better the model performs. Furthermore, we adopt a no-reference quality assessment proposed by Liu et al. [53] for HSI denoising, which evaluates image quality according to the changes in the kurtosis values of noisy images. The lower the score, the higher the quality of the recovered image. As HSIs contain hundreds of bands, we calculate PSNR, SSIM, and the no-reference quality score for each band and take the average value as the final result.

Synthetic Experiments
In the synthetic experiments, we simulate real situations by artificially adding various types of noise to the HSIs. Many experimental works on the HSI denoising task assume that HSIs are contaminated by Gaussian, impulse, dead pixel, deadline, and stripe noises [54]. Therefore, we design five noise combinations, as shown below:
Case 1: Non-i.i.d. Gaussian Noise. Gaussian noise with zero mean and intensities randomly set from 10 to 70 is added to each band.
Case 2: Gaussian + Stripe Noise. On the basis of Case 1, all bands have Gaussian noise added, among which 10 bands are randomly selected to be corrupted by stripe noise. The number of stripes in each band is 5% to 15% of the number of columns by uniform sampling.
Case 3: Gaussian + Deadline Noise. Similar to Case 2, the only difference is that we use deadline noise instead of stripe noise.
Case 4: Gaussian + Impulse Noise. On the basis of Case 1, all bands have Gaussian noise added, among which 10 bands are randomly selected to be corrupted by impulse noise, and the intensity of impulse noise ranges from 10% to 70% by uniform sampling.
Case 5: Mixture Noise. On the basis of Case 1, all bands have Gaussian noise added, among which 10 bands are randomly selected to be corrupted by one of the extra noise mentioned in above cases.
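The noise generation for these cases can be sketched as follows (Case 1 plus the stripe corruption of Case 2). The stripe magnitude here is an illustrative assumption; the text only fixes the stripe count at 5-15% of the columns.

```python
import numpy as np

rng = np.random.default_rng(0)
B, H, W = 31, 64, 64
img = rng.uniform(0, 1, (B, H, W))          # toy clean HSI in [0, 1]

# Case 1: non-i.i.d. Gaussian noise, a different sigma in [10, 70] per band
# (intensities given on an 8-bit scale, hence the /255 rescaling).
sigmas = rng.uniform(10, 70, B) / 255.0
noisy = img + rng.normal(0.0, 1.0, (B, H, W)) * sigmas[:, None, None]

# Case 2 extra: stripe noise on 10 randomly selected bands.
for b in rng.choice(B, size=10, replace=False):
    n_stripes = int(W * rng.uniform(0.05, 0.15))        # 5-15% of the columns
    cols = rng.choice(W, size=n_stripes, replace=False)
    noisy[b, :, cols] -= 0.5                            # assumed stripe offset
print(noisy.shape)  # (31, 64, 64)
```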
We adopt the training and testing strategy mentioned above (see the experimental setup). Specifically, in stage 1, namely the first 30 epochs, the training set consists of HSIs corrupted by Gaussian noise with zero mean and known noise intensity σ = 50, with the batch size set to 16. Stage 2, from epoch 30 to 50, is similar to stage 1, except that the intensity of the Gaussian noise is uniformly sampled from 30 to 70. In stage 3, from epoch 50 to 100, we use noise combinations randomly chosen from Case 1 to Case 4. The batch size in stages 2 and 3 is set to 64 to stabilize the training process. At the end of stages 2 and 3, we test the model on the Gaussian denoising task and the complex denoising task, respectively.
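The three-stage schedule above can be sketched as a small epoch-indexed function (epoch boundaries, sigma ranges, and batch sizes follow the text; the case labels are just tags for illustration):

```python
import random

def noise_setting(epoch):
    """Return the noise configuration used at a given training epoch."""
    if epoch < 30:    # stage 1: known Gaussian intensity, batch size 16
        return {"type": "gaussian", "sigma": 50, "batch": 16}
    if epoch < 50:    # stage 2: blind Gaussian, sigma uniform in [30, 70]
        return {"type": "gaussian", "sigma": random.uniform(30, 70), "batch": 64}
    # stage 3: complex noise, one of Case 1 to Case 4 chosen at random
    return {"type": random.choice(["case1", "case2", "case3", "case4"]),
            "batch": 64}
```

Driving the data loader with such a function gives the incremental easy-to-hard curriculum without changing the training loop itself.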
Gaussian Noise Denoising on ICVL: We compare our model with four model-based methods (BM4D, TDL, ITSReg, and LLRT) and three DL-based methods (HSID-CNN, swinir, and QRNN3D). Figure 4 presents a Gaussian denoising example (with noise intensity σ = 50) on the ICVL dataset. We evaluate the performance of the models under different levels of Gaussian noise intensity (σ = 30, 50, 70, and blind). The quantitative evaluation results and time costs are shown in Table 2. Compared with the other methods, our model achieves better performance in most Gaussian noise cases.
Complex Noise Denoising on ICVL: We compare our model with four model-based methods (LRMR, LRTV, NMoG, and LRTDTV) and three DL-based methods (HSID-CNN [20], swinir [36], and QRNN3D [22]). We evaluate the performance of the models in the complex noise cases from Case 1 to Case 5 mentioned above, select one image from each case, and visualize the restoration results in Figure 5. The quantitative evaluation results and time costs are shown in Table 3. The results show that our model outperforms other state-of-the-art methods in all cases. We plot the PSNR value of each band in the Gaussian and complex noise cases, as seen in Figure 6. We can observe that the PSNR values of most bands obtained by our model are higher than those of the competing methods, indicating that our model outperforms the others.

Complex Noise Denoising on CAVE:
In addition to testing on ICVL, we also perform threefold cross validation on the CAVE dataset in the mixture noise case. Since the CAVE dataset is small, which is prone to cause overfitting, we employ a small amount of samples from the CAVE dataset to fine-tune the model trained on the ICVL dataset instead of training from scratch. All the competing DL-based approaches are trained in the same way for fair comparison. The experimental results are shown in Table 4, from which we can observe that our proposed TRQ3DNet obtains the best denoising performance compared with other methods.

Complex Noise Denoising on Pavia University: To further verify the effectiveness of our proposed method, we also conduct experiments on the Pavia University dataset in the mixture noise case. Considering the similarity between Pavia Centre and Pavia University, we first evaluate the performance of our method trained from scratch on Pavia Centre (denoted as Ours-S). The Ours-S model appears to overfit the data and obtains an undesirable result, owing to the fact that the Pavia Centre training set is relatively small (only 969 patches of size 64 × 64 × 31, compared with 53,000 training patches for the ICVL dataset). Nevertheless, the model trained on the ICVL dataset (denoted as Ours-P) outperforms all the other methods. For a fair comparison, all the DL-based methods are trained on the ICVL dataset, the same as Ours-P. Furthermore, we fine-tuned the Ours-P model for another 50 epochs on the Pavia Centre dataset (denoted as Ours-F). It can be observed from Table 5 that the fine-tuned model significantly boosts the performance, which verifies the adaptability of our proposed TRQ3DNet.

Real HSI Denoising
In addition to the aforementioned synthetic experiments, we also evaluate our model on two real HSI datasets, i.e., Indian Pines and Urban. These two real noisy HSI datasets have no ground truth images for reference, and thus it is difficult to evaluate performance directly. In this paper, we adopt visualization and the no-reference quality assessment proposed in [53] for HSI denoising, which evaluates image quality according to the changes in the kurtosis values of noisy images. The lower the score, the higher the quality of the recovered image. As illustrated in Table 6, TRQ3DNet achieves the best denoising performance in terms of this index. Additionally, we show some visual comparison results in Figures 7 and 8, from which it can be seen that our model achieves better performance than the other methods, since it not only removes the complex noise but also preserves the local details of the HSIs, which is consistent with the quantitative results.
Table 6. Real HSI denoising results comparison using the no-reference quality assessment.

Ablation Study
Our model has three main components: the QRU3D block, the Uformer block, and the BI bridge. To explore the specific function of each part, we design the following ablation experiments. We follow the steps in Section 5 and test the different noise combinations from Case 1 to Case 5. Results from swinir and QRNN3D are listed as references. Our QRU3D block differs slightly from that in QRNN3D, in that our model contains 1 × 1 convolution layers for fusing the outputs from the two blocks.
Investigations of Subcomponents: We split our model into three parts (QRU3D block, Uformer block, and BI bridge) and test each part independently, recorded as QRU3D, TR, and WithoutI, respectively. WithoutI is the combination of the QRU3D block and the Uformer block with the BI bridge removed; its purpose is to explore the improvement the BI bridge brings to the model. QRU3D Block: To investigate the role of QRU3D, we remove this structure as well as the BI bridge, keeping only the Uformer block. The quantitative comparison between TR and WithoutI shows the contribution of the QRU3D block to the model. Nevertheless, the parameters of the model are also greatly reduced; thus, we increase the hidden channels of TR (from 16 to 32), called WTR, which has approximately the same number of parameters as the full model. The results show that WithoutI outperforms TR and WTR in all cases, except that the SAM value in Case 5 is larger than that of WTR, indicating that the QRU3D block has a positive effect and extracts features successfully. One possible reason why TR performs poorly is that, unlike the image classification task, in which general information is more important, the image denoising problem focuses more on the details of the image, such as the value of each pixel.
Compared with transformer, convolutional neural network can better capture the local correlation and preserve image details [55], and thus obtain better denoising performance.
Uformer Block: Similar to the first experiment, we keep the QRU3D block and compare its performance with WithoutI. We also test WQRU3D, obtained by doubling the channels. It can be observed that the Uformer block boosts the performance to some extent.
BI Bridge: We compare the full model (called Ours) with WithoutI. It can be seen that Ours performs best among all the variants, demonstrating that the information interaction between the QRU3D modules and the Uformer modules further improves the performance.
The ablation results for each subcomponent are listed in Table 7.

Investigations of Network Hyperparameters:
The purpose of this experiment is to select hyperparameters beforehand, so as to achieve a trade-off between performance and computation cost. We consider two kinds of hyperparameters: the depth of the model (the total number of TRQ3D blocks) and the width of the model (the number of hidden channels). We search the parameters over a small grid in the Gaussian denoising case (σ = 50), and choose based on the combination of PSNR value, time cost (seconds), and network size. We first fix the width and search for the best depth, then use the best depth to determine the width. The evaluation results for each hyperparameter pair are shown in Table 8. We finally take depth = 8 and width = 16 and apply these settings to the above experiments. Although the model with width = 20 achieves better results than that with width = 16, its computation load and network size increase greatly.

Training Strategy
We adopt a three-stage training strategy in our experiments, an incremental training policy that has already been used to train deep neural networks to prevent them from converging to a poor local minimum [22]. All the other DL-based models are trained in the same way so that they can be fairly compared. We also trained our proposed TRQ3DNet using only the complex noise case. As shown in Figure 9, in contrast to training incrementally, training from scratch makes the optimization converge to a poor local optimum.

Feature Analysis
We also visualize the feature maps to explicitly show the better feature preservation ability of the BI bridge. Specifically, we compare the feature maps of the second encoder layer and the third decoder layer in Figure 10. Our QRU3D and Our Uformer denote the outputs of the TRQ3D block, while QRU3D and Uformer represent the feature maps of a single branch trained without the other subcomponents of TRQ3DNet. As introduced above, TRQ3DNet takes advantage of both local and global features. From Figure 10, we can see that QRU3D enhances local details such as the small bright regions in the middle of the peppers. Our QRU3D benefits from the global representation of the Uformer branch and activates more decentralized areas. For example, QRU3D activates a distinct area inside the peppers in feature map (e), while Our QRU3D activates the entire peppers in feature map (f). Compared with Uformer, Our Uformer retains the details of the local features from the QRU3D branch (e.g., (c,d)). In addition, there appear to be obvious striping artifacts in the features. One possible reason is that the training set has stripe noise added, as described in Section 3.2, and thus the network learns stripe features that are reflected in the outputs of the hidden layers. Therefore, even if the testing data do not contain such stripe noise, we cannot guarantee whether this phenomenon will occur. More effective training methods could be explored to alleviate this problem.

Structure Analysis
We propose the bidirectional integration bridge (BI bridge) to fuse convolutional local features with transformer-based global representations in an interactive fashion. Considering the BI bridge as a short connection, it can be flexibly placed in different positions of the TRQ3D unit. For example, as shown in Figure 11, case (a) and case (b) place the BI bridge in the input and the output of the unit, respectively, and case (c) places one direction of the bridge in the input and the other direction in the output. Case (a), in which the BI bridge is added before each TRQ3D unit, means that the outputs from the previous layer are first processed by the BI bridge (as introduced in Section 2.5) and then fed into the Uformer and QRU3D blocks separately, generating the inputs of the next layer. Similarly, we add the BI bridges after and at both ends of each TRQ3D unit in case (b) and case (c), respectively. Furthermore, in case (d) we compare the results of using a one-way bridge, in which there is only one direction of information exchange in each TRQ3D unit.
In other words, only the QRU3D output or the Uformer output is added to the output of the other block (as shown in Figure 11d). In addition, in case (e) we replace the 3D convolution in each TRQ3D unit with 2D convolution, implemented by changing the convolution kernel size to 3 × 3 × 1. Table 9 shows the denoising performance of the different network structures on ICVL in the case of Gaussian noise. Although the differences are not large, the network structure in case (c) obtains the best denoising performance, outperforming cases (a) and (b); this suggests that splitting the two directions of the bridge between the input and the output of the unit fuses the two branches more effectively than placing both directions at the same position. The single information exchange direction in case (d) reduces computation cost but worsens denoising performance, as the dependency between the QRU3D and Uformer blocks is poorly modeled. The result of case (e) indicates that 3D convolution better models spectral domain knowledge, as spectral distortion is relieved compared with 2D convolution, which was also shown in previous work [22].
Figure 11. Visualization of different network structures. Case (a) and case (b) place the BI bridge in the input and the output of the unit, respectively, and case (c) places one direction of the bridge in the input and the other in the output. In addition, there is only one direction of information exchange in case (d), and case (e) replaces 3D convolution with 2D convolution in each TRQ3D unit.
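The bidirectional versus one-way exchange compared above can be sketched numerically as follows. This is a minimal sketch under stated assumptions: the pointwise projection and additive fusion are illustrative stand-ins for the BI bridge's actual 2D/3D convolutional blocks (Section 2.5), and all function names are hypothetical.

```python
import numpy as np

def pointwise_proj(x, w):
    """1x1 channel projection: x is (C_in, H, W), w is (C_out, C_in)."""
    return np.tensordot(w, x, axes=([1], [0]))

def bi_bridge(f_qru3d, f_uformer, w_q2u, w_u2q):
    """Bidirectional exchange: each branch receives a projected copy of the other
    branch's features, added residually (a simplified stand-in for the BI bridge)."""
    f_uformer_out = f_uformer + pointwise_proj(f_qru3d, w_q2u)
    f_qru3d_out = f_qru3d + pointwise_proj(f_uformer, w_u2q)
    return f_qru3d_out, f_uformer_out

def one_way_bridge(f_qru3d, f_uformer, w_q2u):
    """Case (d): only one direction of exchange; the QRU3D branch is left untouched."""
    return f_qru3d, f_uformer + pointwise_proj(f_qru3d, w_q2u)
```

Under this view, case (a), (b), and (c) differ only in where along the unit `bi_bridge` is invoked, while case (d) replaces it with `one_way_bridge`.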

Practical Implications
HSIs are widely employed to recognize various objects and terrain land cover classes based on spectral features [56]. However, owing to environmental factors and instrument precision, HSIs are inevitably corrupted by various kinds of noise, making model training and prediction more challenging. Therefore, to further verify the significance of our work, we compared classification performance on Pavia University and Indian Pines under different noise levels. Mixture noise is added to each dataset, which is then denoised by several methods. We train the classification network proposed in [56] for 15 epochs on each dataset under each noise situation. The network employs convolutional blocks to learn spectral and spatial characteristics and a multilayer perceptron to predict labels. Seventy-five percent of the pixels are used as the training set and the rest as the testing set. For fairness, the train/test split and the parameter settings are kept the same in all cases. The results are shown in Table 10, from which we can observe that noisy images degrade classification performance. In contrast, the accuracy improves significantly when denoising is performed, and our method achieves the largest improvement in classification accuracy, even better than training on the original clean images, which demonstrates the value of our work.
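The fixed 75/25 pixel split used across all noise situations can be sketched as below; the seeded permutation guarantees that every denoising method is evaluated on exactly the same train/test partition (the helper name and seed are illustrative).

```python
import numpy as np

def split_pixels(n_pixels, train_frac=0.75, seed=0):
    """Randomly split pixel indices into train/test sets. Reusing the same
    seed keeps the partition identical across all compared methods."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_pixels)
    n_train = int(round(train_frac * n_pixels))
    return idx[:n_train], idx[n_train:]
```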

Limitations Analysis
Although our network achieves excellent denoising performance, there are still several limitations. Firstly, since the TRQ3D unit is composed of two blocks (i.e., QRU3D and Uformer), it requires more computational cost. One of our future research directions is to make the network structure lighter so that performance is preserved while running time is reduced. Secondly, the BI bridge is composed of 2D and 3D convolutional blocks, leading to a sharp increase in parameters and computation cost when the number of hidden-layer channels is large. Consequently, more efficient aggregation methods are worth investigating. Last but not least, similar to most deep-learning-based methods, TRQ3DNet can be well trained and tested on a single dataset, while its denoising performance degrades dramatically on a new dataset. In the future, we will focus on this generalization issue and try to alleviate this limitation.

Conclusions and Future Work
In this paper, we design a new network, TRQ3DNet, for HSI denoising. The network is composed of two branches. The first is based on the 3D quasi-recurrent block, including convolution and quasi-recurrent pooling operations, which extract the local spatial correlation and the global correlation along the spectrum. The second branch contains the Uformer block with window-based multi-head self-attention (W-MSA) and the locally enhanced feed-forward network (LeFF) to exploit global and local spatial features. Experimental results on synthetic and real HSI denoising illustrate the superiority of our proposed network over other state-of-the-art methods.
In future work, it is worth investigating a more effective training strategy so that the performance on previous datasets does not degrade when training on a new dataset. The network structure and the aggregation methods can also be further refined to lighten the model and reduce running time. Moreover, more real HSI datasets can be used for training to improve model performance, and better validation methods should be considered to verify the effectiveness of the proposed network.

Data Availability Statement:
The datasets generated during the study are available from the corresponding author on reasonable request.

Conflicts of Interest:
The authors declare no conflict of interest.