Sensors
  • Article
  • Open Access

14 November 2025

Hyperspectral Image Denoising via Quasi-Recursive Spectral Attention and Cross-Layer Feature Fusion

1 School of Information Engineering, Chenzhou Vocation Technical College, Chenzhou 424500, China
2 State Key Laboratory of Medical Neurobiology, MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200433, China
3 College of Computer and Software, Chengdu Jincheng College, Chengdu 611731, China
* Authors to whom correspondence should be addressed.
This article belongs to the Special Issue Multi-Source Image Fusion, Restoration, and Understanding and Its Application in Sensing

Abstract

Hyperspectral images (HSIs) contain rich spatial–spectral information but are highly susceptible to various types of noise during the imaging process, which significantly degrades image quality and undermines the reliability of subsequent applications. To address this issue, we propose a novel end-to-end denoising framework, termed the Quasi-Recursive Spectral Attention Network (QRSAN), whose feature extraction module leverages the intrinsic characteristics of hyperspectral noise while preserving high-quality spatial and spectral information. Specifically, QRSAN introduces a Quasi-Recursive Attention Unit (QRAU) to jointly model inherent spatial–spectral dependencies, where 2D convolutions are employed for spatial feature extraction and frequency pooling is utilized for spectral representation. In addition, we develop a multi-head spectral attention mechanism to effectively capture inter-band correlations and suppress spectrally dependent noise. To further preserve fine-grained spatial structures and spectral fidelity, we design an adaptive cross-layer skip connection strategy that integrates channel-wise concatenation and transition blocks, enabling efficient feature propagation and fusion within an asymmetric encoder–decoder architecture. Extensive experiments on both synthetic and real HSI datasets demonstrate that QRSAN consistently outperforms existing methods in terms of visual quality and objective evaluation metrics, achieving superior denoising performance and generalization ability while maintaining high spatial–spectral fidelity.

1. Introduction

Hyperspectral images (HSIs), which provide both fine-grained spatial details and continuous spectral information, have been widely applied in various fields such as remote sensing [], medicine [], agriculture [], and food inspection []. By simultaneously recording spatial and spectral responses across hundreds of contiguous bands, HSIs enable accurate material identification and discrimination, which makes them uniquely advantageous for vision tasks involving detailed spatial–spectral feature extraction. Consequently, HSIs have been extensively used in image classification [,,], object detection [,], object tracking [,], change detection [,], and anomaly detection [,], demonstrating their irreplaceable value in both academic research and real-world applications. However, during the acquisition process, HSIs are inevitably contaminated by various degradation factors, including insufficient exposure, platform jitter, atmospheric disturbance, photon counting errors, stray light [], and environmental noise. These physical limitations often introduce different noise patterns into HSIs, such as Gaussian noise, impulse noise, stripe noise, and dead-line noise [,,]. Notably, degradations caused by stray light—such as ghost reflections or scattering that deteriorate the performance of optical instruments []—often affect multiple spectral bands simultaneously, thereby leading to complex mixed noise distributions. Conventional hyperspectral imaging systems are generally based on pure amplitude imaging, where only the intensity of the reflected or emitted light is recorded. In contrast, complex-valued hyperspectral imaging techniques [] extend this framework by capturing both amplitude and phase information, enabling more comprehensive spectral–spatial characterization and improved material discrimination. 
Despite their theoretical advantages, complex-valued HSIs remain highly susceptible to severe noise issues, including phase instability, quantization artifacts, and optical interference, which can significantly distort spectral reconstruction. Therefore, robust denoising remains a critical prerequisite for ensuring reliable spectral analysis and high-quality interpretation in both amplitude-only and complex-valued HSI modalities.
Over the past decade, extensive research has been devoted to HSI denoising, and existing methods can generally be grouped into two categories: model-based approaches and learning-based approaches. Model-based approaches typically formulate the denoising task as an inverse problem, regularized by suitable prior constraints to make the ill-posed problem tractable. Representative priors include sparse representation [,,], which exploits the sparsity of images in a specific transform or dictionary domain to separate noise from essential features for effective reconstruction; nonlocal similarity [,,], which leverages the repetition of similar patches across spatial or spectral dimensions and aggregates them to enhance denoising; total variation (TV) regularization [,,], which constrains the gradient magnitude to preserve edges while suppressing noise in smooth regions; and low-rank properties [,,], which model hyperspectral data as low-rank matrices or tensors by exploiting strong spectral correlations, enabling noise removal while maintaining structural information. Representative algorithms such as BM4D [], VBM4D [], CDBM3D [], CCF [], tensor dictionary learning (TDL) [], and low-rank tensor recovery (LLRT) [] have shown effectiveness by exploiting the global spectral correlation (GSC) and spatial nonlocal self-similarity (NSS) of HSIs. These approaches are physically interpretable and well-founded in theory, but they also come with critical limitations: they depend heavily on handcrafted assumptions, usually require iterative solvers, and often suffer from high computational complexity and weak generalization ability when applied to diverse real-world noise.
With the success of deep learning, particularly convolutional neural networks (CNNs), learning-based approaches have attracted significant attention in recent years [,]. Unlike model-driven methods that rely on explicit priors, CNN-based frameworks directly learn implicit feature priors from paired noisy–clean data [,,]. Such end-to-end models offer higher flexibility, generalization, and efficiency, thereby alleviating the dependence on physical degradation models. However, CNNs are inherently limited by their local connectivity and fixed convolutional kernels. Their finite receptive field and weight-sharing property make them less suitable for modeling sequential spectral data, often leading to insufficient robustness against complex noise patterns []. As a result, CNN-based methods may struggle to preserve subtle spatial–spectral structures in large-scale HSIs, especially under strong or mixed noise conditions.
More recently, attention mechanisms and Transformer architectures have been introduced into HSI denoising. Compared with CNNs, self-attention is capable of effectively modeling long-range dependencies and enhancing global feature representations, which has led to superior performance in various computer vision tasks. Moreover, the weaker inductive bias of Transformers allows them to better exploit large-scale data, thereby overcoming the receptive field and weight-sharing constraints of CNNs [,,]. Representative architectures such as Swin Transformer and U-shaped Transformer networks have demonstrated their effectiveness in image restoration and reconstruction and have gradually been extended to HSI denoising. Nevertheless, Transformers still face notable challenges: their ability to capture local features is limited, making it difficult to fully exploit spatial nonlocal similarity; the fully connected attention mechanism is prone to noise interference during feature aggregation; and their quadratic computational complexity with respect to input resolution greatly restricts their applicability to high-resolution HSIs.
In summary, the development of HSI denoising has progressed from traditional model-based methods with handcrafted priors to CNN-based frameworks that automatically learn discriminative features and further to Transformer-based models capable of global dependency modeling. Despite these advances, achieving a balance between denoising accuracy, computational efficiency, and robustness to diverse noise types remains a fundamental challenge in this field.
To address these issues, we propose a novel end-to-end denoising network, termed the Quasi-Recursive Spectral Attention Network (QRSAN). The key idea is to explicitly leverage spatial–spectral correlations while maintaining computational efficiency. Within QRSAN, we introduce the Quasi-Recursive Attention Unit (QRAU), which employs 2D convolutions to extract local spatial features and integrates frequency pooling along the spectral dimension to model inter-band redundancy. In particular, considering the strong noise dependency across adjacent bands, we design a multi-head spectral attention mechanism to strengthen inter-band feature correlation and suppress structured noise. Furthermore, to preserve low-level structural details that are crucial for reconstruction, we propose a cross-layer skip connection strategy with channel concatenation and a transition block, enabling effective multi-level feature propagation and improving both spatial fidelity and spectral consistency. Extensive experiments conducted on multiple benchmark datasets demonstrate that QRSAN achieves superior performance compared with state-of-the-art methods, validating its effectiveness and robustness in practical HSI denoising scenarios.
The main contributions of this work are summarized as follows:
1. A novel QRSAN architecture is proposed, consisting of multiple QRAUs that effectively explore intrinsic spatial–spectral features of HSIs and precisely capture noise dependencies across adjacent bands.
2. A channel concatenation strategy with dedicated transition blocks is introduced to facilitate feature propagation, enabling multi-level feature fusion within an asymmetric encoder–decoder architecture, thus preserving structural consistency and enhancing spatial–spectral fidelity.
3. Comprehensive experiments on diverse synthetic and real HSI denoising tasks demonstrate that QRSAN consistently outperforms existing methods in terms of both denoising performance and generalization ability, validating its effectiveness and superiority.
The remainder of this paper is organized as follows. Section 2 presents the proposed method. Section 3 reports the experimental results, and Section 4 concludes the paper.

2. Methods

2.1. Notations

Let the clean hyperspectral image be $X \in \mathbb{R}^{H \times W \times B}$, where $H$ and $W$ are the spatial dimensions and $B$ is the number of spectral bands, with each pixel represented by a $B$-dimensional spectral vector. In practice, HSIs are inevitably degraded during acquisition due to sensor limitations, environmental interference, or transmission errors, which can be modeled as an additive noise $E$, yielding the observed image $Y = X + E$. The noise may vary across spectral bands and take diverse forms, making denoising challenging. The goal of HSI denoising is thus to recover the clean image $X$ from $Y$ while preserving both spatial structures and spectral fidelity.

2.2. Training Loss Function

The proposed QRSAN aims to learn a mapping function from the degraded image to the clean image, thereby achieving hyperspectral image denoising and reconstruction. The training objective is to minimize the $L_2$ distance between the predicted image $\hat{X}$ and the ground truth $X$, with the loss function defined as follows:
$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left\| \hat{X}_i - X_i \right\|_F^2,$$
Here, $N$ denotes the batch size within each iteration.
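As a minimal NumPy sketch of this objective (not the authors' training code), the loss can be evaluated by summing the squared differences of each sample and averaging over the batch. The function name `batch_l2_loss` and the (N, H, W, B) array layout are assumptions made for illustration.

```python
import numpy as np

def batch_l2_loss(pred, target):
    """Mean over the batch of the squared Frobenius norm of (pred - target).

    pred, target: arrays of shape (N, H, W, B); this layout is an assumption.
    """
    n = pred.shape[0]
    diff = (pred - target).reshape(n, -1)
    return float(np.sum(diff ** 2, axis=1).mean())

pred = np.zeros((2, 4, 4, 3))
target = np.ones((2, 4, 4, 3))
loss = batch_l2_loss(pred, target)   # each sample contributes 4 * 4 * 3 = 48
```

Because the squared norm is summed over all pixels and bands before averaging, the value scales with the patch size; dividing by the number of elements instead would only rescale the gradient.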

2.3. Overall Architecture

The overall architecture of the proposed QRSAN is illustrated in Figure 1. To fully exploit the feature modeling capability of the Quasi-Recursive Attention Unit (QRAU) and to achieve high-fidelity HSI reconstruction, the network adopts an encoder–decoder framework, which has proven effective in balancing representation learning and detail recovery. The backbone of QRSAN is constructed with three pairs of symmetric QRAU layers, forming a hierarchical structure that progressively extracts abstract representations while preserving fine spatial–spectral information. On top of this backbone, several specifically designed modules are integrated to further enhance denoising performance and maintain spatial–spectral consistency.
Figure 1. Architectural Details of QRSAN.
First, let the input HSI feature map be $X \in \mathbb{R}^{C_{in} \times B \times H \times W}$ and the output feature map after 3D convolution be $X' \in \mathbb{R}^{C_{out} \times B \times H \times W}$. The convolution is applied as
$$X'(:, b, :, :) = \sum_{c=1}^{C_{in}} X(c, b, :, :) * K_c, \quad b = 1, 2, \ldots, B,$$
where $K_c \in \mathbb{R}^{k_H \times k_W}$ is the 2D spatial kernel, $*$ denotes 2D convolution, and the stride along the spectral dimension is fixed to 1 with kernel size $k_B = 1$, so that each spectral band is processed independently. This design preserves spectral continuity without introducing band mixing and places no restriction on the number of spectral channels. As a result, the model can be applied directly to hyperspectral datasets with arbitrary numbers of bands, from tens to hundreds, without reconfiguration or retraining.
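The band-wise convolution described above can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' implementation: the kernel tensor is indexed by output channel as well as input channel, zero padding keeps the spatial size, and the operation is written as cross-correlation (the usual deep learning convention); all three choices, and the function names, are assumptions.

```python
import numpy as np

def conv2d_same(image, kernel):
    """2D cross-correlation with zero padding; assumes an odd-sized kernel."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    out = np.zeros_like(image, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + image.shape[0], j:j + image.shape[1]]
    return out

def bandwise_conv(x, kernels):
    """x: (C_in, B, H, W); kernels: (C_out, C_in, kH, kW) -> (C_out, B, H, W)."""
    c_out, c_in = kernels.shape[:2]
    _, bands, h, w = x.shape
    y = np.zeros((c_out, bands, h, w))
    for o in range(c_out):
        for b in range(bands):          # every band is filtered on its own
            for c in range(c_in):
                y[o, b] += conv2d_same(x[c, b], kernels[o, c])
    return y

rng = np.random.default_rng(0)
kernels = rng.standard_normal((4, 1, 3, 3))
x31 = rng.standard_normal((1, 31, 8, 8))     # e.g. a 31-band patch
x210 = rng.standard_normal((1, 210, 8, 8))   # e.g. a 210-band patch
y31 = bandwise_conv(x31, kernels)
y210 = bandwise_conv(x210, kernels)          # same kernels, different band count
```

The same kernels are reused for both band counts because no kernel dimension spans the spectral axis.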
Second, as the network depth increases, the receptive field of feature maps expands, allowing the model to capture long-range dependencies. However, this comes at the cost of gradually losing fine structural details, especially in high-frequency regions. Since reliable denoising requires compensating for the information loss introduced by downsampling and deeper transformations, skip connections play a crucial role in feature preservation. Instead of using standard symmetric skip connections, we design an asymmetric skip connection strategy tailored to hyperspectral noise characteristics. This strategy consists of channelwise concatenation and transition blocks, which fuse multi-scale features from corresponding encoder and decoder stages. In doing so, the proposed design not only helps retain spatial edges and spectral signatures but also alleviates optimization difficulties such as gradient vanishing or explosion, thus stabilizing network training.
Furthermore, the asymmetric skip connections enhance cross-layer feature interaction, allowing low-level structural cues and high-level semantic features to be jointly exploited during reconstruction. This ensures that both local spatial textures and global spectral correlations are preserved, which is particularly important for denoising tasks where over-smoothing or spectral distortion can easily occur.
In summary, beyond the efficient noise modeling provided by QRAUs, the proposed asymmetric QRSAN architecture integrates multi-scale contextual information with carefully designed skip connections, thereby preserving high-resolution structures and richer details during reconstruction. This collaborative design enables QRSAN to achieve superior denoising performance, demonstrating its robustness and generalization ability across diverse HSI noise scenarios.

2.4. Quasi-Recursive Attention Unit

Effective feature extraction is essential for reconstructing clean hyperspectral images (HSIs). Since noise distributions vary across spectral bands and may involve multiple types or intensities, it is necessary to capture not only local spatial features but also long-range spectral dependencies during modeling. To this end, we design the Quasi-Recursive Attention Unit (QRAU), which integrates lightweight convolutional operations with recursive spectral modeling and attention mechanisms. This design allows for efficient joint spatial–spectral feature extraction while maintaining low computational overhead. The structure of the QRAU is illustrated in Figure 2.
Figure 2. Design details of the QRAU. It mainly consists of three components: (a) Local Spatial Feature Modeling, (b) Quasi-Recursive Spectral Pooling, and (c) Spectral Multi-Head Attention.

2.4.1. Local Spatial Feature Modeling

HSIs exhibit strong nonlocal similarity in the spatial domain, and multi-scale contextual information plays a critical role in both denoising and reconstruction. A straightforward solution is to employ multi-scale convolutional kernels to capture different receptive fields. However, this approach significantly increases the number of parameters and computational complexity. To strike a balance between performance and efficiency, we instead introduce multi-resolution inputs through scaling operations in the data augmentation stage, followed by a fixed-scale convolutional backbone for feature extraction. This strategy improves the diversity of training samples while avoiding redundant convolutional operations.
As shown in Figure 2a, we apply independent 2D convolutional kernels in parallel to each spectral band, thereby enabling effective spatial feature extraction without mixing band information. Formally, given an input feature map $X \in \mathbb{R}^{C_{in} \times B \times H \times W}$ (where, in the first layer, $X$ corresponds to the original HSI patch with $C_{in} = 1$), two parallel convolutional branches are constructed to generate a candidate tensor $Z \in \mathbb{R}^{C_{out} \times B \times H \times W}$ and a forget gate $F \in \mathbb{R}^{C_{out} \times B \times H \times W}$:
$$Z = \tanh(W_z * X),$$
$$F = \sigma(W_f * X),$$
where $W_z$ and $W_f$ are convolutional filter banks, each of size $1 \times 3 \times 3$, and $*$ denotes 2D convolution. The tanh activation ensures nonlinearity in candidate features, while the sigmoid gate regulates information flow.

2.4.2. Quasi-Recursive Spectral Pooling

In addition to spatial correlation, HSIs also exhibit strong spectral redundancy, which has often been modeled using low-rank priors in traditional methods. However, low-rank modeling alone tends to oversimplify spectral variations and may lose fine-grained details. To better exploit spectral correlation, we propose a quasi-recursive pooling mechanism along the spectral dimension.
As shown in Figure 2b, the candidate tensor $Z$ and forget gate $F$ are decomposed into band-wise sequences $z_i$ and $f_i$, which are updated sequentially:
$$h_i = f_i \odot h_{i-1} + (1 - f_i) \odot z_i, \quad i \in [1, B],$$
where $\odot$ denotes elementwise multiplication and $h_i$ represents the hidden state of the $i$-th band (initialized to zero). In this formulation, the forget gate $f_i \in [0, 1]$ adaptively balances the current band representation $z_i$ and the historical state $h_{i-1}$.
This recursive update ensures that information flows progressively along the spectral dimension, while the gating mechanism prevents error accumulation. Compared to strict recurrent formulations (e.g., RNN or LSTM), the proposed quasi-recursive pooling avoids full sequential dependency, thereby mitigating gradient vanishing and reducing computational cost while maintaining inter-band continuity []. Unlike 3D CNNs that model spatial–spectral cubes using fixed convolutional kernels, the quasi-recursive mechanism adaptively adjusts the contribution of each band through dynamic gating, leading to more flexible and context-aware spectral modeling []. After processing all bands, the hidden states $h_i$ are concatenated to form the enhanced spectral feature map $F$, which serves as the input to the subsequent spectral attention module. This design ensures that the quasi-recursive pooling stage provides a compact and context-aware spectral representation, which is then refined by the attention mechanism for global spectral dependency modeling.
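The gated update along the spectral axis can be illustrated with a short NumPy sketch. It is a simplified stand-in rather than the authors' code: the pre-activations `pre_z` and `pre_f` are random placeholders for the two convolution outputs, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def quasi_recursive_pooling(z, f):
    """z, f: (C, B, H, W) with f in [0, 1]; returns hidden states of the same shape."""
    h = np.zeros_like(z)
    prev = np.zeros_like(z[:, 0])            # hidden state before the first band is zero
    for i in range(z.shape[1]):              # sweep along the spectral axis
        prev = f[:, i] * prev + (1.0 - f[:, i]) * z[:, i]
        h[:, i] = prev
    return h

rng = np.random.default_rng(0)
pre_z = rng.standard_normal((2, 5, 4, 4))    # placeholder for the candidate branch output
pre_f = rng.standard_normal((2, 5, 4, 4))    # placeholder for the gate branch output
Z, F = np.tanh(pre_z), sigmoid(pre_f)
hidden = quasi_recursive_pooling(Z, F)
```

Only the elementwise recurrence is sequential; the convolutions that produce the candidate and gate tensors can be computed for all bands at once. Since each step is a convex combination of the previous state and a tanh output, the hidden states stay within [-1, 1].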

2.4.3. Spectral Multi-Head Attention

While quasi-recursive pooling captures local and sequential dependencies across adjacent bands, it is insufficient for modeling long-range spectral dependencies, especially when noise patterns exhibit cross-band correlations. To address this, we incorporate a spectral multi-head attention mechanism into QRAU.
In Figure 2c, taking the enhanced spectral feature map F from the quasi-recursive pooling stage as input, each spectral band is projected into three learnable representations: queries (Q), keys (K), and values (V). These are used to compute attention weights across all bands in parallel to capture global inter-band relationships:
$$\hat{F} = W_p \cdot \mathrm{Attention}(\hat{Q}, \hat{K}, \hat{V}) + F,$$
$$\mathrm{Attention}(\hat{Q}, \hat{K}, \hat{V}) = \hat{V} \cdot \mathrm{Softmax}\left( \frac{\hat{Q} \cdot \hat{K}}{\alpha} \right),$$
Here, $\alpha$ is a learnable scaling factor that stabilizes the magnitude of the dot product; $W_p$ denotes a linear projection matrix used to aggregate the attended features; Softmax normalizes across all spectral band indices $i = 1, 2, \ldots, B$ so that the attention weights sum to 1; and Attention represents the resulting weighted output features for each band.
Through this design, the attention mechanism adaptively assigns correlation weights to different bands, thereby reinforcing informative bands while suppressing noisy ones. Moreover, the multi-head formulation enhances nonlinear modeling capacity and ensures that different subspaces of spectral dependencies can be captured simultaneously.
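A compact NumPy sketch of attention across spectral bands is given below. It assumes each band has already been summarized as one D-dimensional token (rows of `feat`), uses random projection matrices in place of learned ones, and fixes the scale `alpha` to a constant; these choices and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(t, heads):
    """(B, D) -> (heads, B, D // heads)."""
    bands, dim = t.shape
    return t.reshape(bands, heads, dim // heads).transpose(1, 0, 2)

def spectral_attention(feat, w_q, w_k, w_v, w_p, alpha, heads):
    """feat: (B, D), one token per band. Returns (output, attention weights)."""
    bands, dim = feat.shape
    q = split_heads(feat @ w_q, heads)
    k = split_heads(feat @ w_k, heads)
    v = split_heads(feat @ w_v, heads)
    weights = softmax(q @ k.transpose(0, 2, 1) / alpha, axis=-1)   # (heads, B, B)
    attended = (weights @ v).transpose(1, 0, 2).reshape(bands, dim)
    return attended @ w_p + feat, weights        # projection plus residual connection

rng = np.random.default_rng(0)
B, D, heads = 6, 8, 2
feat = rng.standard_normal((B, D))
w_q, w_k, w_v, w_p = (rng.standard_normal((D, D)) * 0.1 for _ in range(4))
out, weights = spectral_attention(feat, w_q, w_k, w_v, w_p, alpha=1.0, heads=heads)
```

The attention map is B × B per head, so its cost grows with the number of bands rather than with the number of pixels.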
In summary, the proposed QRAU integrates local spatial convolutions, quasi-recursive spectral pooling, and multi-head spectral attention in a lightweight yet powerful framework. The overall data flow of QRAU follows a sequential structure, where local spatial features are first extracted and modulated by band-wise gating, followed by quasi-recursive pooling for local spectral correlation modeling, and finally by multi-head attention for global spectral refinement. This hierarchical design clarifies the interaction between the quasi-recursive and attention components, ensuring smooth spectral information propagation. This combination allows QRAU to adaptively retain clean band information, suppress noise in corrupted bands, and model both local and global spectral dependencies. As a result, QRAU serves as an effective building block for QRSAN, enabling robust spatial–spectral feature learning and high-fidelity HSI reconstruction.

2.5. Transition Block

The internal structure of the Transition Block, as illustrated in Figure 3, consists of two 1 × 1 convolutional layers and a BN–ReLU–Conv sequence, with residual connections added before and after the 1 × 1 convolutions. Formally, the computation can be expressed as
$$Y = X + \mathrm{Conv}_{1 \times 1}^{(2)} * \mathrm{Conv}_{3 \times 3} * \delta\left( \mathrm{BN}\left( \mathrm{Conv}_{1 \times 1}^{(1)} * X \right) \right),$$
where $X$ denotes the input feature map and $*$ indicates the convolution operation. Here, $\mathrm{Conv}$ represents a convolutional layer used to extract and transform spatial and spectral features; $\mathrm{Conv}_{1 \times 1}^{(1)}$ compresses the channel dimension to reduce computational cost and generate compact representations, while $\mathrm{Conv}_{1 \times 1}^{(2)}$ restores and remaps the channels after feature fusion. The $3 \times 3$ convolution enhances the interaction of low-level spatial details and spectral information. BN (Batch Normalization) standardizes the layer output to improve training stability and convergence, and $\delta$ denotes the ReLU (Rectified Linear Unit) activation function, defined as $\delta(z) = \max(0, z)$, which introduces nonlinearity to enhance model expressiveness. Residual addition ensures stable gradient propagation and preserves consistency between input and output.
Figure 3. Design Details of the Transition Block.
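The data flow of the Transition Block can be sketched in NumPy as below, for a single (C, H, W) feature map. This is a simplified illustration under stated assumptions: the `normalize` function standardizes each channel over its spatial positions as a stand-in for Batch Normalization (no learned scale or shift, no batch statistics), the weights are random, and the channel sizes are arbitrary.

```python
import numpy as np

def conv1x1(x, w):
    """x: (C_in, H, W); w: (C_out, C_in). A 1x1 convolution is a per-pixel channel mix."""
    return np.einsum("oc,chw->ohw", w, x)

def conv3x3(x, w):
    """x: (C_in, H, W); w: (C_out, C_in, 3, 3); zero padding keeps H and W."""
    _, h, wd = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for i in range(3):
        for j in range(3):
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], padded[:, i:i + h, j:j + wd])
    return out

def normalize(x, eps=1e-5):
    """Stand-in for BN: standardize each channel over its spatial positions."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def transition_block(x, w1, w3, w2):
    hidden = conv1x1(x, w1)                       # compress channels
    hidden = np.maximum(normalize(hidden), 0.0)   # normalization, then ReLU
    hidden = conv3x3(hidden, w3)                  # spatial mixing
    return x + conv1x1(hidden, w2)                # restore channels, add residual

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 6, 6))
w1 = rng.standard_normal((4, 8)) * 0.1
w3 = rng.standard_normal((4, 4, 3, 3)) * 0.1
w2 = rng.standard_normal((8, 4)) * 0.1
y = transition_block(x, w1, w3, w2)
```

The residual addition requires the second 1 × 1 convolution to return to the input channel count, which is why `w2` maps 4 channels back to 8 here.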
The proposed algorithm in this paper is shown in Algorithm 1.
Algorithm 1: HSI Denoising with the QRSAN Algorithm.

3. Experiments

To evaluate the effectiveness of the proposed method, we conducted extensive experiments on both synthetic and real-world datasets. Seven representative denoising algorithms were selected for comparison: three model-based approaches (BM4D [], NGMeet [], and LRTFL0 []) and four deep learning-based methods (QRNN3D [], T3SC [], MAC-Net [], and SST []). For fairness, all learning-based baselines were retrained and tested under the same settings. The proposed model was implemented in PyTorch 2.6.0+cu126 and optimized with Adam using a learning rate of $1 \times 10^{-4}$. Training was performed on a single NVIDIA GeForce RTX 3090 GPU (NVIDIA, Santa Clara, CA, USA).

3.1. Simulated HSI Experiments

In this study, we conducted simulation experiments on the ICVL dataset, where the original HSIs were regarded as clean references. The ICVL dataset consists of 201 hyperspectral images captured by a Specim PS Kappa DX4 hyperspectral camera (Specim, Oulu, Finland), with a spatial resolution of pixels and 31 spectral bands covering the wavelength range of 400–700 nm.
To simulate diverse noise conditions, two types of degradation were introduced: Gaussian noise and complex (mixed) noise.
For Gaussian noise, four settings were considered: three fixed noise levels ($\sigma = 30$, $\sigma = 50$, and $\sigma = 70$) and one blind scenario ($\sigma \in [30, 70]$), enabling a comprehensive evaluation of robustness under varying noise intensities.
For complex noise, five representative scenarios were considered:
Case 1 (Non-i.i.d. Gaussian Noise): Zero-mean Gaussian noise with band-dependent intensities randomly sampled from [10%, 70%] is added to each spectral band.
Case 2 (Gaussian + Stripe Noise): Building on Case 1, stripe noise with intensity in [5%, 15%] is added to a randomly selected one-third of the bands.
Case 3 (Gaussian + Deadline Noise): Based on Gaussian noise, deadline noise with intensity in [5%, 15%] is introduced to one-third of the bands randomly.
Case 4 (Gaussian + Impulse Noise): In addition to Gaussian noise, impulse noise with intensity randomly sampled from [10%, 70%] is injected into one-third of the bands.
Case 5 (Mixture Noise): Gaussian noise is added to all bands, while one-third of the bands are randomly corrupted with a combination of the four aforementioned noise types, simulating a more complex and realistic noise environment.
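To make the first two cases concrete, the following NumPy sketch generates band-dependent Gaussian noise and then adds stripes to one-third of the bands. Several details are assumptions, since the text does not fix them: the cube is taken to be scaled to [0, 1], the percentage range for Gaussian noise is read as a per-band standard deviation, the stripe range is read as the share of affected columns, and the stripe `offset` value is arbitrary.

```python
import numpy as np

def add_noniid_gaussian(clean, rng, low=0.10, high=0.70):
    """Case 1 sketch: one noise level per band, drawn uniformly from [low, high]."""
    bands = clean.shape[-1]
    sigmas = rng.uniform(low, high, size=bands)
    noisy = clean + rng.standard_normal(clean.shape) * sigmas   # broadcasts over H, W
    return noisy, sigmas

def add_stripes(noisy, rng, low=0.05, high=0.15, offset=0.25):
    """Case 2 sketch: offset a random share of columns in one-third of the bands."""
    _, w, bands = noisy.shape
    out = noisy.copy()
    chosen = rng.choice(bands, size=bands // 3, replace=False)
    for b in chosen:
        n_cols = max(1, int(round(rng.uniform(low, high) * w)))
        cols = rng.choice(w, size=n_cols, replace=False)
        out[:, cols, b] += offset
    return out, np.sort(chosen)

rng = np.random.default_rng(0)
clean = rng.random((16, 16, 31))                 # toy cube, H x W x B
noisy, sigmas = add_noniid_gaussian(clean, rng)
striped, striped_bands = add_stripes(noisy, rng)
```

Deadline and impulse noise could be added in the same way by zeroing selected columns or replacing random pixels with extreme values.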
Furthermore, we employed three commonly used metrics—PSNR, SSIM, and SAM—to evaluate the performance of each model in the synthetic experiments and also reported the computational time for each method. The first two metrics assess spatial similarity, while SAM quantifies spectral consistency. Given a reference image and a reconstructed image, PSNR is computed as
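$$\mathrm{PSNR} = 10 \log_{10} \frac{(2^n - 1)^2}{\mathrm{MSE}},$$
$$\mathrm{MSE} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \left[ I(i, j) - \hat{I}(i, j) \right]^2,$$
Here, $H$ and $W$ denote the height and width of the image, respectively, $I$ and $\hat{I}$ are the reference and reconstructed images, and $n$ is the bit depth of each pixel (typically 8), so that $2^n - 1$ is the maximum possible pixel value.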
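A small NumPy sketch of the PSNR computation follows, with per-band averaging for an (H, W, B) cube. The function names and the toy data are illustrative assumptions; the formula itself is the one given above.

```python
import numpy as np

def psnr(reference, estimate, bits=8):
    """PSNR in dB for one band; `bits` is the bit depth n, so the peak is 2**n - 1."""
    peak = 2 ** bits - 1
    mse = np.mean((reference.astype(float) - estimate.astype(float)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))

def mean_band_psnr(reference, estimate, bits=8):
    """Average the per-band PSNR of two (H, W, B) cubes."""
    bands = reference.shape[-1]
    return float(np.mean([psnr(reference[..., b], estimate[..., b], bits)
                          for b in range(bands)]))

reference = np.full((4, 4, 3), 100.0)
estimate = reference + 10.0                    # constant error of 10 grey levels, MSE = 100
value = mean_band_psnr(reference, estimate)    # 10 * log10(255**2 / 100), about 28.13 dB
```

For data scaled to [0, 1] the peak value would be 1 rather than 255, which shifts nothing in the ranking of methods but must match the data range actually used.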
The core formulas of SSIM and SAM are given as follows:
$$\mathrm{SSIM} = \frac{(2 \mu_I \mu_{\hat{I}} + c_1)(2 \sigma_{I\hat{I}} + c_2)}{(\mu_I^2 + \mu_{\hat{I}}^2 + c_1)(\sigma_I^2 + \sigma_{\hat{I}}^2 + c_2)},$$
$$\mathrm{SAM} = \arccos \frac{\langle I, \hat{I} \rangle}{\| I \| \, \| \hat{I} \|},$$
Here, $\mu_I$ and $\mu_{\hat{I}}$ denote the mean values of the reference image $I$ and the reconstructed image $\hat{I}$, respectively; $\sigma_I^2$ and $\sigma_{\hat{I}}^2$ represent their variances, and $\sigma_{I\hat{I}}$ is the covariance between $I$ and $\hat{I}$; $c_1$ and $c_2$ are small constants introduced to avoid division by zero; $\langle I, \hat{I} \rangle$ denotes the inner product between the spectral vectors of the reference and reconstructed images; and $\| \cdot \|$ represents the Euclidean norm. Higher PSNR and SSIM values, along with lower SAM values, indicate better model performance. Since HSIs contain hundreds of spectral bands, all metrics are computed for each band and the final results are obtained by averaging across bands.
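The two formulas can be sketched in NumPy as follows. This is a simplified illustration: SSIM is computed from global image statistics rather than over sliding windows, the constants use the common defaults $c_1 = (0.01 L)^2$ and $c_2 = (0.03 L)^2$ for data range $L$ (an assumption, since the text does not give them), and SAM is averaged over pixels and reported in degrees.

```python
import numpy as np

def global_ssim(a, b, data_range=1.0):
    """SSIM of two single-band images from global means, variances, and covariance."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(num / den)

def sam_degrees(reference, estimate, eps=1e-12):
    """Mean spectral angle over pixels; inputs are (H, W, B) cubes."""
    dot = np.sum(reference * estimate, axis=-1)
    norms = np.linalg.norm(reference, axis=-1) * np.linalg.norm(estimate, axis=-1)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)   # clip guards against rounding
    return float(np.degrees(np.arccos(cos)).mean())

rng = np.random.default_rng(0)
cube = rng.random((4, 4, 5)) + 0.1
ssim_same = global_ssim(cube[..., 0], cube[..., 0])   # identical images give 1
sam_scaled = sam_degrees(cube, 2.0 * cube)            # scaling a spectrum leaves its angle near 0
```

The second call shows why SAM complements PSNR: a uniform brightness change alters the pixel values but not the shape of each spectrum.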

3.1.1. Denoising Under Gaussian Noise Conditions

The quantitative comparison of several HSI denoising methods is presented in Table 1. The results demonstrate that QRSAN consistently achieves superior performance across all noise levels as well as in the blind scenario. QRSAN maintains an advantage in all three core metrics—PSNR, SSIM, and SAM—highlighting its effectiveness in restoring image quality while preserving spectral fidelity. Other deep learning-based methods, such as SST and MAC-Net, also perform well, further confirming the significant potential of data-driven approaches in hyperspectral image denoising.
Table 1. Average results of different methods on the ICVL dataset under various Gaussian noise levels. PSNR and SSIM (↑, higher is better) and SAM (↓, lower is better) are reported. The best values in each row are highlighted in bold.
To visually illustrate the effectiveness of QRSAN, Figure 4 presents denoising results under a noise level of $\sigma = 50$, with key regions magnified for detailed comparison. Among traditional model-based methods, NGMeet effectively suppresses noise by leveraging non-local self-similarity priors but tends to over-smooth complex textured regions. LRTFL0 preserves some texture details yet still leaves residual noise. In contrast, deep learning-based approaches, benefiting from strong data-driven capabilities, outperform traditional methods across different noise levels. Notably, QRSAN achieves a superior balance between noise suppression and detail preservation, effectively enhancing both the spectral fidelity and the overall visual quality of the reconstructed images.
Figure 4. Visual comparison of different methods on the ICVL negev_0823-1003 data at bands (28, 18, and 8) with Gaussian noise level $\sigma = 50$.

3.1.2. Denoising Under Complex Noise Conditions

Table 2 presents the quantitative evaluation of various methods under five representative noise types: Non-i.i.d. Gaussian, Stripe, Deadline, Impulse, and Mixture noise.
Table 2. Average results of different methods on the ICVL dataset under five complex noise scenarios. PSNR and SSIM (↑, higher is better) and SAM (↓, lower is better) are reported. The best values in each row are highlighted in bold.
The results indicate that QRSAN achieves the highest or near-highest PSNR and SSIM values across all scenarios, while attaining the best performance in SAM, demonstrating superior image reconstruction quality, spectral fidelity, and robust stability. In comparison, SST excels in structural similarity and detail preservation, particularly showing advantages in SSIM and SAM metrics. QRNN3D, T3SC, and MAC-Net exhibit relatively balanced performance under diverse noise conditions. Traditional model-based methods such as BM4D, NGMeet, and LRTFL0 perform moderately, with noticeable limitations when handling complex noise. Overall, deep learning-based approaches outperform conventional methods in hyperspectral image denoising tasks, with QRSAN standing out due to its superior fidelity and stronger noise suppression capability.
Figure 5 illustrates representative HSI samples under different noise scenarios along with the denoising results of each method. For more intuitive comparison, key regions are magnified to highlight differences in structure preservation and noise suppression among the methods.
Figure 5. Visual comparison of different methods on the ICVL dataset under Case 1–Case 5.

3.2. Real HSI Experiments

3.2.1. Urban

To further evaluate the denoising performance of the QRSAN algorithm, additional experiments were conducted on the Urban dataset, acquired with the HYDICE sensor (The U.S. Army Research Laboratory, Adelphi, MD, USA). This dataset contains 210 spectral bands with a spatial resolution of 307 × 307 pixels and covers the 400–2500 nm spectral range. Several bands are affected by atmospheric interference and exhibit mixed noise, including Gaussian, stripe, and dead-line noise.
Figure 6 shows the real-world denoising results of QRSAN compared with seven benchmark methods on the Urban dataset. QRSAN effectively suppresses mixed noise while retaining fine spatial details, demonstrating superior overall performance. Among the model-based methods, NGMeet utilizes non-local self-similarity priors and reduces noise but tends to over-smooth complex textures, whereas LRTFL0 achieves relatively good results in mixed noise removal. Deep learning-based methods, including QRNN3D, T3SC, and MAC-Net, generally remove noise effectively; however, QRNN3D loses some texture details, and T3SC and MAC-Net exhibit certain spectral distortions. SST struggles to fully remove stripe noise, indicating limited adaptability to challenging imaging conditions. Figure 7 presents the corresponding spectral reflectance curves. Overall, QRSAN achieves a better balance between noise suppression, detail preservation, and spectral fidelity than the compared methods.
Figure 6. Visual comparison of real denoising results on the Urban dataset at bands (208, 101, and 1).
Figure 7. The reflectance of pixel (80, 120) in the Urban HSI.

3.2.2. Realistic Dataset

The Realistic dataset [50] comprises 59 paired noisy and clean HSIs, each with a spatial size of 696 × 520 pixels and 34 spectral bands covering 400–700 nm. It serves as a standard benchmark for evaluating real-world hyperspectral denoising performance.
Figure 8 presents the denoising results of various methods on the Realistic dataset under real-world conditions. As reported in Table 3, QRSAN achieves the highest PSNR and SSIM values along with the lowest SAM, demonstrating superior performance in image quality restoration, structural preservation, and spectral fidelity. SST follows closely, effectively balancing noise reduction and detail retention, indicating its capability in preserving spatial textures and spectral smoothness. Methods such as MAC-Net, T3SC, and QRNN3D show relatively stable performance across different noise types and levels, maintaining a reasonable trade-off between detail preservation and noise suppression. Model-based approaches, including BM4D, NGMeet, and LRTFL0, offer certain spectral consistency advantages but are limited in modeling complex noise, resulting in less effective detail recovery and overall image enhancement. To verify the reliability of the performance improvement, a paired t-test between QRSAN and SST was conducted on the Realistic dataset, and the results confirm that the improvements in PSNR, SSIM, and SAM are statistically significant (p < 0.05).
Figure 8. Real denoising results on bands (30, 18, and 15) of Scene 2 in the Realistic dataset.
Table 3. Average results of different methods on the Realistic dataset. PSNR and SSIM (↑, higher is better) and SAM (↓, lower is better) are reported. The best values in each row are highlighted in bold. The * indicates that the improvement of QRSAN over SST is statistically significant according to a paired t-test (p < 0.05).
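The paired t-test mentioned for the QRSAN versus SST comparison can be sketched as follows. This is a minimal illustration that assumes the test is applied to per-image metric values of the same test images; the function name is hypothetical, and the numbers in the usage lines are made up for illustration and are not results from the paper.

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples,
    e.g. per-image PSNR of two methods on the same images."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally sized samples with at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # unbiased sample variance
    if var == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean / math.sqrt(var / n), n - 1

# Hypothetical per-image PSNR values (dB), not taken from the paper.
method_a = [30.1, 31.0, 29.5, 30.8]
method_b = [29.8, 30.5, 29.4, 30.1]
t_value, dof = paired_t_statistic(method_a, method_b)
```

The p-value then follows from the Student t distribution with `dof` degrees of freedom, for example from a statistical table or from `scipy.stats.ttest_rel`, which computes the statistic and the two-sided p-value in one call.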

3.3. Ablation Study

In this section, we evaluate the effectiveness of each component of QRSAN on the ICVL dataset and explore the optimal trade-off between denoising performance and computational cost. PSNR, SSIM, and the total number of network parameters are used as the evaluation metrics.

3.3.1. Effectiveness of QRAU Components

To assess the contribution of the QRAU module within QRSAN, ablation experiments compared the proposed Quasi-Recursive Attention Unit with four variants: RES2D, QRU2D, QRU3D, and an LSTM-based recurrent unit (denoted as LSTM), as summarized in Table 4. RES2D removes both the gated quasi-recursive pooling and spectral attention, QRU2D combines 2-D convolutions with quasi-recursive pooling, and QRU3D extends QRU2D using 3-D convolutions. The LSTM variant replaces QRAU with a fully recurrent spectral model under the same framework and training settings.
Table 4. Ablation study on the contribution of QRAU and its variants on the ICVL dataset under Gaussian noise. PSNR and SSIM (↑, higher is better), SAM (↓, lower is better).
As shown in Table 4, RES2D exhibits substantially lower performance, highlighting the importance of spectral modeling. QRU3D improves over QRU2D due to 3-D convolutions but lacks spectral attention. The LSTM-based recurrent unit achieves slightly better performance than QRU3D, benefiting from its explicit modeling of long-range spectral dependencies. However, it incurs more parameters and higher computational cost and lacks flexibility in adapting to arbitrary numbers of spectral bands. In contrast, the proposed QRAU integrates lightweight 2-D convolutions with multi-head spectral attention, capturing both spatial and spectral dependencies effectively. This quasi-recursive design provides greater flexibility and robustness, enabling adaptive denoising across diverse noise types and HSIs with varying spectral dimensions, while remaining computationally efficient.
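To make the gated quasi-recursive pooling referred to above more concrete, the following minimal sketch applies the standard f-pooling recurrence from quasi-recurrent neural networks [46] along the spectral axis for a single pixel and channel. It is an assumption that QRAU uses this exact form; the candidate and gate inputs would come from the 2-D convolutions in the network, and the function names here are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def f_pooling(candidates, forget_logits, h0=0.0):
    """Quasi-recurrent f-pooling over bands for one pixel and one channel:
    h_b = f_b * h_{b-1} + (1 - f_b) * z_b,
    with z_b = tanh(candidate_b) and f_b = sigmoid(forget_logit_b)."""
    h = h0
    states = []
    for cand, logit in zip(candidates, forget_logits):
        z = math.tanh(cand)          # candidate feature for this band
        f = sigmoid(logit)           # forget gate in (0, 1)
        h = f * h + (1.0 - f) * z    # blend previous state with new candidate
        states.append(h)
    return states
```

The recurrence has no learned weights tied to a band index, so the same function runs on inputs with any number of bands, which is consistent with the flexibility argument made above for quasi-recursive designs over a fixed-length recurrent model.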

3.3.2. Skip Connections

Table 5 presents a comparison of different skip connection strategies. Specifically, N-net employs no skip connections, V-net uses progressive additive skip connections, and C-net incorporates channel-wise concatenation combined with transition blocks for feature propagation. The results indicate that N-net performs the worst, demonstrating that the absence of skip connections leads to the loss of high-level information. Both V-net and C-net outperform N-net significantly with comparable computational costs.
Table 5. Ablation study on the impact of skip connection strategies on the ICVL dataset under Gaussian noise. PSNR and SSIM (↑, higher is better), SAM (↓, lower is better).
Notably, V-net is more lightweight, whereas C-net achieves superior denoising performance, suggesting that the channel-wise concatenation and transition block design provides an effective alternative to conventional skip connections. Considering the trade-off between performance and computational efficiency, QRSAN adopts C-net within the encoder–decoder framework while employing V-net to bridge shallow feature extraction and final reconstruction. The final configuration achieves state-of-the-art denoising results.
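The difference between the two skip connection strategies can be sketched per pixel as follows. This illustration assumes that the transition block behaves like a single 1 × 1 convolution (a per-pixel linear map) that reduces the concatenated channels back to the original width; the actual transition block in QRSAN may contain more layers, and all function names are hypothetical.

```python
def additive_skip(encoder_feat, decoder_feat):
    """V-net style: element-wise sum of equally sized channel vectors.
    Adds no learnable weights."""
    return [e + d for e, d in zip(encoder_feat, decoder_feat)]

def concat_transition_skip(encoder_feat, decoder_feat, weights, bias):
    """C-net style: concatenate channels, then mix them back to
    len(bias) channels with a 1x1 (per-pixel linear) transition."""
    stacked = encoder_feat + decoder_feat  # list concatenation = channel concat
    return [sum(w * x for w, x in zip(row, stacked)) + b
            for row, b in zip(weights, bias)]

def transition_params(channels):
    """Extra weights of a 1x1 transition mapping 2*channels -> channels,
    including the bias terms."""
    return 2 * channels * channels + channels

# With these fixed weights the transition reproduces the additive skip,
# so concatenation plus transition is a learnable generalization of it.
enc, dec = [1.0, 2.0], [0.5, 0.5]
identity_like = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
fused = concat_transition_skip(enc, dec, identity_like, [0.0, 0.0])
```

Under this assumption the additive variant costs no parameters, while each concatenation skip adds `transition_params(C)` weights, which matches the observation that V-net is lighter and C-net is more expressive.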

3.4. Limitations and Potential Impact

While QRSAN demonstrates strong performance in hyperspectral image denoising across synthetic and real-world datasets, several limitations exist. Its robustness under extremely high noise levels or rare, unseen noise types remains uncertain, and performance may degrade with severely corrupted bands. The generalization to hyperspectral images from sensors with different spectral ranges or imaging conditions is not fully validated, and domain shifts may affect spectral fidelity. Like other deep learning methods, QRSAN relies on sufficient labeled data, and computational costs may limit real-time processing of large datasets. Despite these challenges, the quasi-recursive pooling and spectral attention mechanisms provide a flexible framework for spectral–spatial modeling, with potential extensions to tasks such as anomaly detection, unmixing, and super-resolution. Future work may explore domain adaptation, self-supervised learning, or noise-aware strategies to improve robustness and cross-sensor generalization.

4. Conclusions

This work introduces the Quasi-Recursive Spectral Attention Network (QRSAN) for hyperspectral image (HSI) denoising, designed to exploit HSI noise characteristics for efficient spatial–spectral modeling and high fidelity. The core building block, the Quasi-Recursive Attention Unit (QRAU), employs lightweight 2-D convolutions for spatial feature extraction and integrates quasi-recursive pooling with multi-head spectral attention to capture spectral dependencies, effectively suppressing noise. In parallel, an adaptive cross-layer skip connection strategy, incorporating channel-wise concatenation and transition blocks, is introduced to enable efficient multi-level feature fusion within an asymmetric encoder–decoder architecture, balancing spatial detail preservation and spectral fidelity. Experiments on synthetic and real datasets show that QRSAN consistently outperforms existing methods in visual and quantitative evaluations. Specifically, on synthetic data, QRSAN achieves the highest PSNR and SSIM and the lowest SAM across various noise levels, validating its ability to recover both spatial and spectral information; on real datasets, it maintains superior denoising performance and detail preservation. Ablation studies further confirm that QRAU effectively models spatial–spectral dependencies and enhances generalization, while the cross-layer skip connections facilitate robust feature propagation and integration. Overall, QRSAN exhibits both theoretical novelty and practical efficacy. Future work will explore its application to HSI classification, target detection, and other downstream tasks.

Author Contributions

Writing—original draft, methodology, investigation, and conceptualization, Y.X.; writing—review and editing, and supervision, W.L. and L.Y.; writing—review and editing, H.Z.; validation, K.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Science and Technology Projects of Xizang Autonomous Region (No. XZ202401ZY0008) and Chenzhou Municipal Science and Technology Bureau Project (ZDYF2020212).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The datasets generated and analysed during the current study are available from the corresponding authors on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bioucas-Dias, J.M.; Plaza, A.; Camps-Valls, G.; Scheunders, P.; Nasrabadi, N.; Chanussot, J. Hyperspectral remote sensing data analysis and future challenges. IEEE Geosci. Remote Sens. Mag. 2013, 1, 6–36.
  2. Zhi, L.; Zhang, D.; Yan, J.Q.; Li, Q.L.; Tang, Q.L. Classification of hyperspectral medical tongue images for tongue diagnosis. Comput. Med. Imaging Graph. 2007, 31, 672–678.
  3. Lelong, C.C.; Pinet, P.C.; Poilvé, H. Hyperspectral imaging and stress mapping in agriculture: A case study on wheat in Beauce (France). Remote Sens. Environ. 1998, 66, 179–191.
  4. Saha, D.; Manickavasagan, A. Machine learning techniques for analysis of hyperspectral images to determine quality of food products: A review. Curr. Res. Food Sci. 2021, 4, 28–44.
  5. Ullah, F.; Ullah, I.; Khan, R.U.; Khan, S.; Khan, K.; Pau, G. Conventional to deep ensemble methods for hyperspectral image classification: A comprehensive survey. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 3878–3916.
  6. Wang, Y.; Liu, L.; Xiao, J.; Yu, D.; Tao, Y.; Zhang, W. MambaHSI+: Multidirectional State Propagation for Efficient Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4411414.
  7. Yu, C.; Zhu, Y.; Wang, Y.; Zhao, E.; Zhang, Q.; Lu, X. Concern with Center-pixel Labeling: Center-specific Perception Transformer Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5514614.
  8. Yan, L.; Zhao, M.; Wang, X.; Zhang, Y.; Chen, J. Object detection in hyperspectral images. IEEE Signal Process. Lett. 2021, 28, 508–512.
  9. He, X.; Tang, C.; Liu, X.; Zhang, W.; Sun, K.; Xu, J. Object detection in hyperspectral image via unified spectral–spatial feature aggregation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5521213.
  10. Zhou, J.; Zhang, J.; Dong, Y. Incorporating Prompt Learning and Adaptive Dropping Hyperspectral Information Tracker for Hyperspectral Object Tracking. In Proceedings of the 2024 14th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS), Helsinki, Finland, 9–11 December 2024; pp. 1–5.
  11. Gao, L.; Chen, L.; Liu, P.; Jiang, Y.; Xie, W.; Li, Y. A transformer-based network for hyperspectral object tracking. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5528211.
  12. Liu, S.; Marinelli, D.; Bruzzone, L.; Bovolo, F. A review of change detection in multitemporal hyperspectral images: Current techniques, applications, and challenges. IEEE Geosci. Remote Sens. Mag. 2019, 7, 140–158.
  13. Luo, F.; Zhou, T.; Liu, J.; Guo, T.; Gong, X.; Ren, J. Multiscale diff-changed feature fusion network for hyperspectral image change detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5502713.
  14. Cheng, X.; Zhang, M.; Lin, S.; Li, Y.; Wang, H. Deep self-representation learning framework for hyperspectral anomaly detection. IEEE Trans. Instrum. Meas. 2023, 73, 5002016.
  15. Guo, T.; He, L.; Luo, F.; Gong, X.; Li, Y.; Zhang, L. Anomaly detection of hyperspectral image with hierarchical antinoise mutual-incoherence-induced low-rank representation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5510213.
  16. Clermont, L.; Michel, C.; Stockman, Y. Stray light correction algorithm for high performance optical instruments: The case of Metop-3MI. Remote Sens. 2022, 14, 1354.
  17. Aggarwal, H.K.; Majumdar, A. Hyperspectral image denoising using spatio-spectral total variation. IEEE Geosci. Remote Sens. Lett. 2016, 13, 442–446.
  18. Wang, F.; Li, J.; Yuan, Q.; Zhang, L. Local–global feature-aware transformer based residual network for hyperspectral image denoising. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5546119.
  19. Chen, H.; Yang, G.; Zhang, H. Hider: A hyperspectral image denoising transformer with spatial–spectral constraints for hybrid noise removal. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 8797–8811.
  20. Kulya, M.; Petrov, N.V.; Tsypkin, A.; Egiazarian, K.; Katkovnik, V. Hyperspectral data denoising for terahertz pulse time-domain holography. Opt. Express 2019, 27, 18456–18476.
  21. Chang, Y.; Yan, L.; Fang, H.; Liu, H. Simultaneous destriping and denoising for remote sensing images with unidirectional total variation and sparse representation. IEEE Geosci. Remote Sens. Lett. 2013, 11, 1051–1055.
  22. Lu, T.; Li, S.; Fang, L.; Ma, Y.; Benediktsson, J.A. Spectral–spatial adaptive sparse representation for hyperspectral image denoising. IEEE Trans. Geosci. Remote Sens. 2015, 54, 373–385.
  23. Zhuang, L.; Bioucas-Dias, J.M. Fast hyperspectral image denoising and inpainting based on low-rank and sparse representations. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 730–742.
  24. Maggioni, M.; Katkovnik, V.; Egiazarian, K.; Foi, A. Nonlocal transform-domain filter for volumetric data denoising and reconstruction. IEEE Trans. Image Process. 2012, 22, 119–133.
  25. Dian, R.; Fang, L.; Li, S. Hyperspectral image super-resolution via non-local sparse tensor factorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5344–5353.
  26. He, W.; Yao, Q.; Li, C.; Yokoya, N.; Zhao, Q.; Zhang, H.; Zhang, L. Non-local meets global: An iterative paradigm for hyperspectral image restoration. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2089–2107.
  27. Chang, Y.; Yan, L.; Fang, H.; Luo, C. Anisotropic spectral-spatial total variation model for multispectral remote sensing image destriping. IEEE Trans. Image Process. 2015, 24, 1852–1866.
  28. Peng, J.; Xie, Q.; Zhao, Q.; Wang, Y.; Yee, L.; Meng, D. Enhanced 3DTV regularization and its applications on HSI denoising and compressed sensing. IEEE Trans. Image Process. 2020, 29, 7889–7903.
  29. Yuan, Q.; Zhang, L.; Shen, H. Hyperspectral image denoising employing a spectral–spatial adaptive total variation model. IEEE Trans. Geosci. Remote Sens. 2012, 50, 3660–3677.
  30. Cao, X.; Chen, Y.; Zhao, Q.; Meng, D.; Wang, Y.; Wang, D.; Xu, Z. Low-rank matrix factorization under general mixture noise distributions. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1493–1501.
  31. Chang, Y.; Yan, L.; Zhong, S. Hyper-Laplacian regularized unidirectional low-rank tensor recovery for multispectral image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4260–4268.
  32. Zhang, H.; Huang, T.Z.; Zhao, X.L.; He, W.; Choi, J.K.; Zheng, Y.B. Hyperspectral image denoising: Reconciling sparse and low-tensor-ring-rank priors in the transformed domain. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5502313.
  33. Maggioni, M.; Boracchi, G.; Foi, A.; Egiazarian, K. Video denoising, deblocking, and enhancement through separable 4-D nonlocal spatiotemporal transforms. IEEE Trans. Image Process. 2012, 21, 3952–3966.
  34. Katkovnik, V.; Egiazarian, K. Sparse phase imaging based on complex domain nonlocal BM3D techniques. Digit. Signal Process. 2017, 63, 72–85.
  35. Shevkunov, I.; Katkovnik, V.; Claus, D.; Pedrini, G.; Petrov, N.V.; Egiazarian, K. Hyperspectral phase imaging based on denoising in complex-valued eigensubspace. Opt. Lasers Eng. 2020, 127, 105973.
  36. Peng, Y.; Meng, D.; Xu, Z.; Gao, C.; Yang, Y.; Zhang, B. Decomposable nonlocal tensor dictionary learning for multispectral image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014; pp. 2949–2956.
  37. Liu, W.; Lee, J. A 3-D atrous convolution neural network for hyperspectral image denoising. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5701–5715.
  38. Hu, S.; Gao, F.; Zhou, X.; Dong, J.; Du, Q. Hybrid convolutional and attention network for hyperspectral image denoising. IEEE Geosci. Remote Sens. Lett. 2024, 21, 5504005.
  39. Dong, W.; Wang, H.; Wu, F.; Shi, G.; Li, X. Deep spatial–spectral representation learning for hyperspectral image denoising. IEEE Trans. Comput. Imaging 2019, 5, 635–648.
  40. Chang, Y.; Yan, L.; Fang, H.; Zhong, S.; Liao, W. HSI-DeNet: Hyperspectral image restoration via convolutional neural network. IEEE Trans. Geosci. Remote Sens. 2018, 57, 667–682.
  41. Wei, K.; Fu, Y.; Huang, H. 3-D quasi-recurrent neural network for hyperspectral image denoising. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 363–375.
  42. Zhang, Q.; Dong, Y.; Zheng, Y.; Yu, H.; Song, M.; Zhang, L.; Yuan, Q. Three-dimension spatial-spectral attention transformer for hyperspectral image denoising. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5531213.
  43. Chu, X.; Tian, Z.; Wang, Y.; Zhang, B.; Ren, H.; Wei, X.; Xia, H.; Shen, C. Twins: Revisiting the design of spatial attention in vision transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 9355–9366.
  44. Yuan, L.; Hou, Q.; Jiang, Z.; Feng, J.; Yan, S. Volo: Vision outlooker for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 6575–6586.
  45. Li, M.; Fu, Y.; Zhang, Y. Spatial-spectral transformer for hyperspectral image denoising. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 1368–1376.
  46. Bradbury, J.; Merity, S.; Xiong, C.; Socher, R. Quasi-Recurrent Neural Networks. In Proceedings of the International Conference on Learning Representations (ICLR 2017), Toulon, France, 24–26 April 2017.
  47. Xiong, F.; Zhou, J.; Qian, Y. Hyperspectral restoration via l_0 gradient regularized low-rank tensor factorization. IEEE Trans. Geosci. Remote Sens. 2019, 57, 10410–10425.
  48. Bodrito, T.; Zouaoui, A.; Chanussot, J.; Mairal, J. A trainable spectral-spatial sparse coding model for hyperspectral image restoration. Adv. Neural Inf. Process. Syst. 2021, 34, 5430–5442.
  49. Xiong, F.; Zhou, J.; Zhao, Q.; Lu, J.; Qian, Y. MAC-Net: Model-aided nonlocal neural network for hyperspectral image denoising. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5519414.
  50. Zhang, T.; Fu, Y.; Li, C. Hyperspectral image denoising with realistic data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 2248–2257.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
