Multi-Scale Differentiated Network with Spatial–Spectral Co-Operative Attention for Hyperspectral Image Denoising

Chang, Xueli; Wang, Xiaodong; Huang, Xiaoyu; Yan, Meng; Cheng, Luxiao

doi:10.3390/app15158648

Open Access Article


by Xueli Chang ¹,², Xiaodong Wang ¹, Xiaoyu Huang ³, Meng Yan ¹ and Luxiao Cheng ¹,*

¹ School of Computer Science, Hubei University of Technology, Wuhan 430068, China
² Hubei Provincial Key Laboratory of Green Intelligent Computing Power Network, Wuhan 430068, China
³ China Centre for Resources Satellite Data and Application, Beijing 100095, China
* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(15), 8648; https://doi.org/10.3390/app15158648

Submission received: 18 June 2025 / Revised: 28 July 2025 / Accepted: 3 August 2025 / Published: 5 August 2025

(This article belongs to the Special Issue Remote Sensing Image Processing and Application, 2nd Edition)


Abstract

Hyperspectral image (HSI) denoising is a crucial step in image preprocessing as its effectiveness has a direct impact on the accuracy of subsequent tasks such as land cover classification, target recognition, and change detection. However, existing methods suffer from limitations in effectively integrating multi-scale features and adaptively modeling complex noise distributions, making it difficult to construct effective spatial–spectral joint representations. This often leads to issues like detail loss and spectral distortion, especially when dealing with complex mixed noise. To address these challenges, this paper proposes a multi-scale differentiated denoising network based on spatial–spectral cooperative attention (MDSSANet). The network first constructs a multi-scale image pyramid using three downsampling operations and independently models the features at each scale to better capture noise characteristics at different levels. Additionally, a spatial–spectral cooperative attention module (SSCA) and a differentiated multi-scale feature fusion module (DMF) are introduced. The SSCA module effectively captures cross-spectral dependencies and spatial feature interactions through parallel spectral channel and spatial attention mechanisms. The DMF module adopts a multi-branch parallel structure with differentiated processing to dynamically fuse multi-scale spatial–spectral features and incorporates a cross-scale feature compensation strategy to improve feature representation and mitigate information loss. The experimental results show that the proposed method outperforms state-of-the-art methods across several public datasets, exhibiting greater robustness and superior visual performance in tasks such as handling complex noise and recovering small targets.

Keywords: deep learning; hyperspectral image denoising; convolutional neural networks; spatial–spectral attention; multi-scale feature fusion

1. Introduction

A hyperspectral image (HSI) captures continuous spectral bands, resulting in high-dimensional data with dozens to hundreds of channels, which enables precise characterization of subtle land surface features [1,2]. Compared to traditional grayscale or RGB images, HSI provides a higher density of spectral information and a more detailed representation of spectral features [3]. This unique advantage in spectral resolution makes HSI indispensable in various fields, including environmental monitoring [4,5], precision agriculture [6], mineral resource exploration [7,8,9], and medical image diagnostics [10,11]. However, due to the physical limitations of sensors and the influence of complex imaging conditions, HSI data often suffer from mixed noise (such as Gaussian noise, impulse noise, and striping) and other degradation effects [12,13]. These factors not only degrade image quality but also significantly reduce the accuracy of subsequent hyperspectral interpretation tasks, such as land cover classification [14,15], target recognition [16,17], and change detection [18]. As a critical preprocessing step, HSI denoising directly affects the reliability of these subsequent tasks [19].

Hyperspectral image denoising methods can primarily be categorized into two types: prior-based methods and deep-learning-based methods [20,21,22]. Most prior-based methods denoise HSIs by utilizing prior knowledge such as total variation [23], sparse representation [24,25], and low-rank matrices [26,27,28]. Donoho et al. [29] proposed a wavelet thresholding denoising method that employs adaptive thresholds in the frequency domain to separate noise. Maggioni et al. [30] introduced the block-matching 4D filtering (BM4D) method, which extends two-dimensional denoising to three-dimensional space, enhancing the performance of spatial–spectral joint denoising. Chang et al. [31] proposed the hyper-Laplacian regularized unidirectional low-rank tensor recovery model (LLRT), which improves restoration performance for multispectral data by exploiting tensor low-rank properties. However, prior-based methods are often limited in their ability to jointly model complex noise in hyperspectral images, and handcrafted prior constraints struggle to accommodate the dynamic changes in complex scenarios.

In recent years, deep learning has achieved remarkable progress in hyperspectral denoising. Hu et al. [32] were the first to introduce three-dimensional convolutional neural networks (3D-CNNs) for this task, overcoming the performance limitations of traditional methods by jointly exploiting spatial–spectral features. Zhang et al. [33] designed the HSID-CNN deep residual network, utilizing residual connections to mitigate the vanishing gradient problem. Wei et al. [34] proposed the QRNN3D model, which incorporates an attention mechanism to enhance noise suppression through dynamic feature selection. Tai et al. [35] developed MemNet, a memory network that strengthens long-range dependency modeling through recursive structures. Recent studies have also explored the integration of Transformer architectures into hyperspectral image denoising. For instance, Hong et al. [36] proposed SpectralFormer, which leverages multi-head attention to model the nonlinear correlations among spectral bands. This approach enhances the global modeling capability of deep learning methods and shows great potential in advancing HSI denoising performance.

Although existing methods have made significant progress in hyperspectral image denoising, there are still limitations in the feature transfer process. Most network architectures fail to sufficiently account for the joint representation mechanism of spectral and spatial features, typically combining these features through simple concatenation or addition, without deeply exploring the intrinsic relationships between the two. Recently, Pan et al. [37] introduced the Spatial–Spectral Attention Recursive Network (SQAD), which models the relationships between spatial and spectral features to a certain degree by incorporating an attention mechanism. The experimental results demonstrate that this approach can effectively enhance feature fusion efficiency. However, this high-level feature interaction method still fails to fully capture the spatial–spectral interdependencies unique to hyperspectral data, limiting the network’s generalization ability in complex scenarios. Additionally, traditional feature transfer paths often follow fixed patterns, hindering the adaptive adjustment of fusion weights for spectral and spatial features, further constraining the model’s representational capability.

In multi-scale feature processing, although the multi-scale adaptive fusion network (MAFNet) proposed by Pan et al. [38] increases the network’s receptive field by constructing a multi-level feature pyramid, the traditional downsampling techniques it employs inevitably result in information loss. This is particularly problematic when processing high-frequency details and fine spectral characteristics, as conventional pooling or convolutional downsampling often cause feature blurring and spectral distortion. Additionally, most current multi-scale fusion strategies rely on simple feature concatenation or addition, which makes it difficult to achieve truly effective complementarity between features across scales. To some extent, this limitation restricts the potential for improving the model’s denoising performance.

To address the challenges of insufficient collaborative representation between spectral and spatial features and the loss of information during multi-scale fusion, this study proposes a hyperspectral image denoising network (MDSSANet). The network incorporates an adaptive spatial–spectral collaborative attention mechanism, which utilizes a parallel structure with two branches for spectral and spatial features, enabling joint optimization of multi-dimensional features. For multi-scale feature processing, this study introduces a differentiated strategy for multi-scale feature fusion. This strategy constructs parallel processing branches at large, medium, and small scales and employs dynamic weight-based feature fusion, thereby achieving hierarchical modeling of spatial–spectral features and mitigating the loss of feature information during multi-scale upsampling and downsampling.

The contributions of this work can be summarized as follows:

(1): We propose a hyperspectral image denoising network (MDSSANet), which constructs a multi-scale differentiated denoising network based on spatial–spectral cooperative attention, achieving the collaborative optimization of spatial–spectral features and the effective fusion of multi-scale information. Experiments on multiple standard datasets demonstrate the effectiveness and rationality of the proposed MDSSANet.
(2): We design a spatial–spectral cooperative attention (SSCA) module that runs spectral channel attention and spatial attention in parallel and adjusts the fusion weights of the two types of features in combination with 3D convolution. This approach enables adaptive optimization of cross-band features and precise modeling of long-range spatial dependencies, thereby enhancing the network’s ability to extract features of multi-scale objects, especially small targets.
(3): We design a differentiated multi-scale feature fusion module (DMF) that applies cross-scale skip connections and a differentiated processing strategy in the multi-scale fusion process. A dynamic fusion mechanism is employed to capture spatial–spectral features at different scales using dynamic weighting, which helps alleviate the issue of information loss commonly encountered in traditional multi-scale fusion.

2. Methods

2.1. Network Overview

This paper presents a denoising network for hyperspectral images. The network first builds a four-level feature pyramid via successive downsampling operations. Then, the multi-scale features are normalized via instance normalization with learnable affine parameters, guided by an adaptive spatial–spectral attention mechanism (SSCA). Finally, the normalized multi-scale features are fused through DMF, resulting in the reconstruction of a clean image. Figure 1 illustrates the network architecture of the proposed method. The stride and output channels ($C_{out}$) for each layer are provided, with other configurations (e.g., padding) inferred from the context.

In the process of feature normalization, the lower-scale feature $h'$, after preliminary feature extraction through $\mathrm{Depw}_{conv2d}$, is first transformed by transposed convolution to match the dimensions of the upper-scale feature $h$. Then, the transformed feature $h'$ is input into the SSCA module for spatial–spectral joint feature extraction. After the extracted features are obtained, affine transformation parameters (shift β and scale γ) are generated for each pixel in the feature map to perform adaptive normalization. Finally, the normalized multi-scale features are input into the DMF module for differentiated dynamic fusion, resulting in the denoised feature map.
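The adaptive normalization step can be illustrated with a minimal NumPy sketch. The text does not give the exact modulation formula, so the form below (instance normalization followed by a per-pixel scale γ and shift β) is an assumption, and the constant `gamma` and `beta` arrays are hypothetical stand-ins for the values that would be predicted from the SSCA output.

```python
import numpy as np

def adaptive_instance_norm(h, gamma, beta, eps=1e-5):
    """Instance-normalize h (C, H, W) per channel, then modulate per pixel.

    gamma and beta have the same shape as h. In the network they would be
    predicted from the SSCA output of the lower-scale feature h'
    (assumed form, not confirmed by the text).
    """
    mean = h.mean(axis=(1, 2), keepdims=True)
    var = h.var(axis=(1, 2), keepdims=True)
    h_hat = (h - mean) / np.sqrt(var + eps)
    return gamma * h_hat + beta

rng = np.random.default_rng(0)
h = rng.normal(loc=3.0, scale=2.0, size=(4, 8, 8))
# Hypothetical constant parameters standing in for the predicted ones.
gamma = np.full_like(h, 1.5)
beta = np.full_like(h, -0.5)
out = adaptive_instance_norm(h, gamma, beta)
```

With constant parameters, each output channel ends up with mean β and standard deviation close to γ, which makes the effect of the two parameters easy to inspect.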

2.2. Spatial–Spectral Cooperative Attention Module (SSCA)

As shown in Figure 2, the spatial–spectral cooperative attention module uses a dual-branch parallel structure composed of spectral and spatial attention branches. The input feature maps are processed by the spectral attention branch and the spatial attention branch, respectively, to obtain the spectral and spatial feature maps. Then, the two feature maps are concatenated to form a spatial–spectral enhanced feature map that fuses spatial and spectral information. At the same time, the spectral attention branch and the spatial attention branch generate the spectral weight map $M_r(X)$ and the spatial weight map $M_s(X)$, respectively. The two attention maps are multiplied element-wise to obtain the final spatial–spectral attention heatmap. Finally, the concatenated spatial–spectral enhanced feature map is fused with the feature map weighted by the attention heatmap to output the final feature representation.

As shown in the following equation, this module enables the collaborative enhancement of spatial and spectral features.

$$\mathrm{Out} = \mathrm{cat}\big(M_r(X) \odot Y_1 + M_s(X) \odot Y_2\big) \odot \big(M_r(X) \odot M_s(X)\big), \tag{1}$$

where $M_r(X)$ represents the spectral dimension weight matrix learned by the spectral attention branch, $Y_1$ denotes the feature representation extracted by the spectral attention branch, $M_s(X)$ represents the spatial attention map generated by the spatial attention mechanism, and $Y_2$ denotes the spatial features extracted by the spatial attention branch.

2.2.1. Spectral Attention Branches

In the spectral attention branch, this paper adopts a dual-channel processing structure. The first channel constructs the spectral weight matrix $M_r(X)$ through global average pooling followed by a fully connected layer, focusing on modeling global dependencies across spectral bands. The second channel introduces a spectral enhancement module that nonlinearly transforms the original spectral features to generate the enhanced spectral features $Y_1$:

$$Y_1 = \mathrm{Avg3d}(X) + f^{3 \times 3 \times 3}\big(X - \mathrm{Avg3d}(X)\big), \tag{2}$$

$$M_r(X) = \sigma\big(F_c(\mathrm{Avg3d}(f^{3 \times 3 \times 3}(X)))\big), \tag{3}$$

where $f^{3 \times 3 \times 3}$ represents a 3D convolution operation used to extract spectral–spatial features from the input spectral features, $\mathrm{Avg3d}$ performs global average pooling on the extracted spectral–spatial feature map, and $F_c$ is a fully connected layer that further processes the pooled features. The Sigmoid activation function $\sigma$ is applied to the final feature map to generate the spectral attention weight $M_r(X)$ in the $[0, 1]$ range.

In the spectral weight learning branch, the input feature map first undergoes two consecutive 3D convolution layers. After global average pooling (GAP), the spectral weight map $M_r(X)$ is generated by a convolution layer followed by the Sigmoid activation function.

In the parallel spectral information extraction branch, the input feature map first undergoes 3D average pooling ($\mathrm{Avg3d}$) to extract the low-frequency information $X_1$. The high-frequency information $X_2$ is obtained through residual computation, as the difference between the input feature map and $X_1$; $X_2$ is then processed through a 3D convolution and added to the low-frequency information $X_1$ to obtain the spectral-enhanced feature map $X_3$, which corresponds to $Y_1$ in Equation (2).

After the spectral weight extraction is completed, the learned spectral attention weight map $M_r(X)$ is element-wise multiplied with the feature map $X_3$ output from the spectral enhancement branch, achieving adaptive weighting across different spectral bands.
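The low-frequency/high-frequency split of Equation (2) and the subsequent band weighting can be sketched in NumPy as follows. This is an illustration under stated assumptions: a same-size 3 × 3 × 3 mean filter with edge padding stands in for the 3D average pooling, the `conv` argument is a placeholder for the learned 3D convolution, and the weights are taken to be one value per band along the first axis.

```python
import numpy as np

def box_avg3d(x, k=3):
    """Same-size k x k x k mean filter over a (bands, H, W) cube, edge-padded."""
    p = k // 2
    padded = np.pad(x, p, mode="edge")
    out = np.zeros_like(x, dtype=float)
    b, h, w = x.shape
    for db in range(k):
        for dh in range(k):
            for dw in range(k):
                out += padded[db:db + b, dh:dh + h, dw:dw + w]
    return out / k**3

def spectral_enhance(x, conv=lambda z: z):
    """Y1 = Avg3d(X) + f(X - Avg3d(X)); `conv` stands in for the learned f."""
    low = box_avg3d(x)        # X1: low-frequency part
    high = x - low            # X2: high-frequency residual
    return low + conv(high)   # X3, i.e. Y1

def apply_spectral_weights(y1, logits):
    """Scale each band by a sigmoid weight (assumed: one weight per band)."""
    m_r = 1.0 / (1.0 + np.exp(-logits))
    return m_r[:, None, None] * y1

rng = np.random.default_rng(0)
x = rng.random((6, 8, 8))
y_identity = spectral_enhance(x)                         # f = identity
y_damped = spectral_enhance(x, conv=lambda z: 0.5 * z)   # f halves the residual
weighted = apply_spectral_weights(y_identity, np.zeros(6))
```

With an identity `conv` the decomposition reconstructs the input exactly, so any change in the output comes from how the learned convolution reshapes the high-frequency residual.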

2.2.2. Spatial Attention Branches

The spatial attention branch adopts a dual-branch structure with a spatial weight branch and a spatial feature extraction branch. The main path generates the spatial attention map $M_s(X)$ by applying channel compression and spatial convolution, capturing long-range spatial dependencies. The auxiliary path applies multi-scale convolutions to capture spatial features at different receptive fields, producing a detail-enhanced feature map $Y_2$:

$$Y_2 = \mathrm{Depw}_{conv2d}\big(\mathrm{cat}(f^{3 \times 3}(X) + f^{5 \times 5}(X) + f^{7 \times 7}(X))\big), \tag{4}$$

$$M_s(X) = \sigma\big(f^{3 \times 3}([\mathrm{Conv}_{mean}(\mathrm{Depw}_{conv2d}(X)); \mathrm{Conv}_{max}(\mathrm{Depw}_{conv2d}(X))])\big) = \sigma\big(f^{3 \times 3}([X^{s}_{mean}; X^{s}_{max}])\big), \tag{5}$$

The $\mathrm{Depw}_{conv2d}$ block combines a depthwise convolution with a 3 × 3 kernel and a 1 × 1 pointwise convolution to process the feature map. In the spatial information extraction branch, the feature map is processed through convolutions with kernel sizes of 3 × 3, 5 × 5, and 7 × 7, denoted $f^{3 \times 3}$, $f^{5 \times 5}$, and $f^{7 \times 7}$.

After obtaining the enhanced spatial features along with spatial attention weights, the spatial weight map $M_s(X)$ is multiplied element by element with the feature map $Y_2$, achieving adaptive weighting across spatial locations to enhance the network’s spatial information extraction capability.

2.3. Differentiated Multi-Scale Feature Fusion Module (DMF)

During the multi-scale downsampling process of the image, high-dimensional features often lose fine structural details of small targets, while low-dimensional features have difficulty capturing sufficient global context. To mitigate this problem, this study proposes the differentiated multi-scale feature fusion module (DMF), with its detailed structure shown in Figure 3. The module uses a parallel multi-branch structure to dynamically fuse spatial–spectral features at different scales and applies a feature compensation strategy to achieve complementary enhancement of cross-scale information.

The multi-scale fusion network begins with the multi-scale feature maps $\{X_i, i = 1, 2, 3\}$, where $i$ denotes the spatial resolution index. The fused output is computed as follows:

$$X_{Output} = \mathrm{Depw}_{conv2d}\big(\mathrm{Dyfusion}(H_1(X_1) + H_2(X_2) + H_3(X_3))\big), \tag{6}$$

$$H_1(X_1) = f^{3 \times 3}\big(\mathrm{Pool}_{mean}(\mathrm{Depw}_{conv2d}(X_1)) + \mathrm{Pool}_{max}(\mathrm{Depw}_{conv2d}(X_1))\big), \tag{7}$$

where $H_1$ denotes the branch for the large-sized feature map, in which features are first extracted using the depthwise convolution operation $\mathrm{Depw}_{conv2d}$, followed by dimensionality reduction through a combination of max pooling and average pooling.

$$H_2(X_2) = f^{1 \times 3 \times 3}\big(f^{3 \times 1 \times 3}(f^{3 \times 3 \times 1}(X_2))\big), \tag{8}$$

For medium-sized feature maps, feature extraction is achieved by using three different convolutional kernels in parallel (3 × 1 × 3, 3 × 3 × 1, and 1 × 3 × 3), effectively enabling joint modeling and preservation of both spectral and spatial information. Specifically, to improve the model’s ability to express nonlinear relationships, the GELU (Gaussian Error Linear Unit) activation function is introduced after each convolutional layer. Its smooth gradient characteristics, compared to the conventional ReLU function, allow for better retention of detailed information in the feature map.

$$H_3(X_3) = \mathrm{Nearest}\big(\mathrm{Depw}_{conv3d}(X_3)\big), \tag{9}$$

$$\mathrm{Depw}_{conv3d}(X) = \mathrm{Reshape}\big(\mathrm{Spcconv}(\mathrm{Flatten}(f^{3 \times 3}(X)))\big), \tag{10}$$

$$\mathrm{Spcconv}(X) = df^{3}\big(f^{5}(f^{3}(X))\big), \tag{11}$$

For small feature maps, the module combines depthwise separable 3D convolution ($\mathrm{Depw}_{conv3d}$) with spatial–spectral joint modeling. Spectral convolution kernels of different sizes ($k_{spectral} = [3, 5]$) are used to fuse local and global spectral features, and dilated convolutions (dilation = 1) in the spectral domain are applied to expand the receptive field, enabling the network to better capture long-range dependencies across spectral channels. Finally, the dynamic fusion module takes the feature maps from the large, medium, and small scales, processes and spatially aligns them, and then computes their average values to form the fusion_base.

The Dyfusion module utilizes a lightweight attention branch, which consists of an AdaptiveAvgPool2d layer followed by a 1 × 1 convolution, to extract global information from fusion_base. This process generates three weight coefficients that correspond to the response strength of the features at the three scales. The Sigmoid activation function is applied to constrain the weights, and each scale feature is weighted accordingly, enabling dynamic selection and integration of multi-scale information.
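The dynamic fusion step described above can be sketched in NumPy. This is a simplified illustration, not the authors' implementation: the three inputs are assumed to be already spatially aligned, and the matrix `w` and bias `b` are hypothetical placeholders for the learned 1 × 1 convolution that maps the pooled channels to three scale weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dynamic_fusion(feats, w, b):
    """Fuse three spatially aligned (C, H, W) maps with per-scale weights.

    w (3, C) and b (3,) play the role of the 1 x 1 convolution applied
    after global average pooling; their values here are placeholders.
    """
    stacked = np.stack(feats)                # (3, C, H, W)
    fusion_base = stacked.mean(axis=0)       # average of the three scales
    pooled = fusion_base.mean(axis=(1, 2))   # global average pool -> (C,)
    weights = sigmoid(w @ pooled + b)        # one coefficient per scale, in (0, 1)
    fused = (weights[:, None, None, None] * stacked).sum(axis=0)
    return fused, weights

feats = [np.full((2, 3, 3), v) for v in (1.0, 2.0, 3.0)]
# Zero parameters give every scale the neutral weight 0.5.
fused_even, weights_even = dynamic_fusion(feats, np.zeros((3, 2)), np.zeros(3))
# A strongly biased branch selects the first scale almost exclusively.
fused_first, weights_first = dynamic_fusion(
    feats, np.zeros((3, 2)), np.array([10.0, -10.0, -10.0])
)
```

Because the Sigmoid constrains each weight independently, the three coefficients need not sum to one; a softmax would be the alternative if a convex combination were required.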

2.4. Loss Function

To obtain accurate noisy features and improve network training performance, an $L_1$ loss $L_{rec}$ and a global gradient regularizer loss $L_{grad}$ are used to optimize our network. The overall loss $L$ is as follows:

$$L = L_{rec} + \lambda L_{grad}, \tag{12}$$

where $L_{rec}$ is defined as follows:

$$L_{rec} = \| \hat{I} - I \|_{1}, \tag{13}$$

where $\hat{I}$ represents the estimated noise-free HSI, and $I$ refers to the real noise-free HSI. Additionally, the global gradient regularizer is introduced to constrain the details of $\hat{I}$, which can be defined as follows:

$$L_{grad} = \| \nabla_{h} \hat{I} - \nabla_{h} I \|_{2} + \| \nabla_{v} \hat{I} - \nabla_{v} I \|_{2} + \| \nabla_{s} \hat{I} - \nabla_{s} I \|_{2}, \tag{14}$$

where $\nabla_{h}$, $\nabla_{v}$, and $\nabla_{s}$ represent the gradient operators along the horizontal, vertical, and spectral directions, respectively, and $\lambda$ is the weight parameter of $L_{grad}$, which is set to 0.01 to balance the loss terms.
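Equations (12)–(14) can be checked numerically with a short NumPy sketch. Two choices here are assumptions rather than details from the paper: forward differences (`np.diff`) stand in for the gradient operators, and the norms are taken over the whole cube without averaging, whereas training code often uses mean-reduced losses.

```python
import numpy as np

def denoising_loss(est, ref, lam=0.01):
    """L = ||est - ref||_1 + lam * sum of L2 norms of gradient differences.

    Cubes are laid out as (bands, height, width); forward differences via
    np.diff stand in for the gradient operators (an assumption).
    """
    diff = est - ref
    l_rec = np.abs(diff).sum()
    # The operators are linear, so grad(est) - grad(ref) == grad(est - ref).
    # Axis 2 is horizontal, axis 1 vertical, axis 0 spectral.
    l_grad = sum(np.linalg.norm(np.diff(diff, axis=ax)) for ax in (2, 1, 0))
    return l_rec + lam * l_grad

ref = np.zeros((2, 2, 2))
ref[0, 0, 0] = 1.0
est = np.zeros((2, 2, 2))
loss = denoising_loss(est, ref)
```

In this toy case the single wrong voxel contributes 1 to the L1 term and 1 to each of the three gradient norms, so the total is 1 + 0.01 × 3 = 1.03.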

3. Experimental Results and Analysis

3.1. Experiment Setup

3.1.1. Benchmark Datasets

In this study, the ICVL [39] hyperspectral dataset serves as the benchmark for experiments. The ICVL dataset contains 31 spectral bands and was captured by the Specim PS Kappa DX4 hyperspectral camera (manufactured by Specim, Spectral Imaging Ltd., Oulu, Finland) with a spatial resolution of 1392 × 1300, comprising a total of 201 images. To construct the training data, overlapping patches are extracted from the original images, with each patch treated as an independent training sample. Each sample has a spatial size of 128 × 128 and a spectral dimension of 31. For data splitting, 100 images are allocated for training, 5 for validation, and the remaining images for testing. To improve the network’s generalization capability, data augmentation techniques, including rotation and scaling, are applied, generating approximately 10,000 training samples in total. During testing, 128 × 128 × 31 patches are cropped from the main regions of each test image to balance computational cost with evaluation performance.

Additionally, to thoroughly evaluate the robustness and generalization capability of the network, systematic experiments were conducted on several representative remote sensing hyperspectral datasets. These datasets span various sensor types, spatial resolutions, and land cover categories, enabling comprehensive validation of the network’s performance across different real-world scenarios. Four representative hyperspectral datasets were selected for the experiments: the Pavia University dataset [40], the Washington DC Mall dataset [41], the Urban dataset [42], and the Indian Pines dataset [43].

3.1.2. Noise Setting

In real-world hyperspectral imaging tasks, data are frequently affected by various types of noise, such as Gaussian noise, impulse noise, dead pixels, stripe noise, and others. To simulate a complex noise environment, this study defines five composite noise types, referred to as Case 1 to Case 5, as outlined below:

Case 1: non-i.i.d. Gaussian noise. Each spectral band is contaminated by Gaussian noise with zero mean and varying intensity, where the noise standard deviation ranges from 10 to 70, simulating uneven responses across sensor bands.

Case 2: Gaussian + stripe noise. In addition to the non-i.i.d. Gaussian noise in Case 1, stripe noise is introduced into certain spectral bands. In these affected bands, about 5–15% of pixel columns are contaminated, creating stripes along the scanning direction to simulate sensor calibration errors.

Case 3: Gaussian + dead pixel noise. Each band still suffers from the non-i.i.d. Gaussian noise as in Case 1. Additionally, dead pixel noise is randomly introduced into one-third of the spectral bands. In the affected bands, about 5–15% of columns exhibit dead pixel interference, forming noticeable vertical stripes.

Case 4: Gaussian + impulse noise. Building on Case 1, impulse noise (salt-and-pepper noise) is further added to simulate discrete noise generated during sensor operation or data transmission. Impulse noise affects one-third of the bands randomly, with 10–70% of pixels contaminated, appearing as random white or black dots in the image.

Case 5: mixed noise. This case represents the most complex noise scenario, where each spectral band is contaminated by at least one type of noise from Cases 1–4, comprehensively simulating various noise disturbances that may occur in real hyperspectral images.
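As an illustration of how such degradations might be simulated, the NumPy sketch below covers Case 1 and the stripe component of Case 2. Several details are assumptions because the text does not specify them: the cube is scaled to [0, 1] with σ expressed on a 0–255 scale, stripes are constant column offsets with a hypothetical amplitude `amp`, and stripes affect one-third of the bands (a fraction borrowed from Cases 3 and 4).

```python
import numpy as np

def add_noniid_gaussian(cube, rng, sigma_range=(10, 70)):
    """Case 1: zero-mean Gaussian noise with a different sigma for each band.

    cube is (bands, H, W) scaled to [0, 1]; sigmas are on a 0-255 scale.
    """
    sigmas = rng.uniform(*sigma_range, size=cube.shape[0]) / 255.0
    noise = rng.normal(size=cube.shape) * sigmas[:, None, None]
    return cube + noise, sigmas

def add_stripes(cube, rng, band_fraction=1 / 3, col_range=(0.05, 0.15), amp=0.25):
    """Case 2 extra: constant offsets on 5-15% of the columns of some bands."""
    out = cube.copy()
    n_bands, _, width = cube.shape
    mask = np.zeros((n_bands, width), dtype=bool)
    n_hit = max(1, int(round(band_fraction * n_bands)))
    for b in rng.choice(n_bands, size=n_hit, replace=False):
        n_cols = max(1, int(round(float(rng.uniform(*col_range)) * width)))
        cols = rng.choice(width, size=n_cols, replace=False)
        # One offset per selected column, broadcast down all rows of band b.
        out[b][:, cols] += rng.uniform(-amp, amp, size=n_cols)
        mask[b, cols] = True
    return out, mask

rng = np.random.default_rng(0)
clean = np.full((9, 64, 40), 0.5)
noisy, sigmas = add_noniid_gaussian(clean, rng)
striped, stripe_mask = add_stripes(noisy, rng)
```

Returning the per-band σ values and the stripe mask alongside the noisy cube makes it straightforward to verify that a generated sample matches the intended case.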

3.1.3. Competing Methods and Quantitative Metrics

This method is compared with five leading methods, including traditional and deep-learning-based approaches. In terms of traditional methods, the low-rank method LLRT [31] is considered. In terms of deep-learning-based methods, it is compared with MAFNet [38], HSID-CNN [33], QRNN3D [34], and SQAD [37]. To objectively evaluate the denoising performance of the algorithm, three complementary quantitative evaluation metrics are used: Peak Signal-to-Noise Ratio (PSNR) [44] and Structural Similarity Index (SSIM) [45] in the spatial domain, and Spectral Angle Mapper (SAM) [46] in the spectral domain. Specifically, PSNR evaluates pixel-level reconstruction accuracy by calculating the mean square error between the denoised image and the ground truth image. Higher values indicate better denoising performance. SSIM measures the structural similarity of the image in three dimensions: luminance, contrast, and structure. The value is in the [0, 1] range, with values closer to 1 indicating higher structural fidelity. SAM evaluates the preservation of spectral features by calculating the angle between the original spectral vector and the denoised spectral vector. A smaller value (with an ideal value of 0) indicates less spectral distortion.
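For reference, PSNR and SAM can be computed with a few lines of NumPy. The sketch below assumes cubes laid out as (bands, H, W) with intensities in [0, 1], computes PSNR over the whole cube rather than band by band, and reports SAM in radians averaged over pixels; the evaluation code behind the reported tables may differ in these conventions.

```python
import numpy as np

def psnr(est, ref, peak=1.0):
    """Peak signal-to-noise ratio in dB from the mean squared error."""
    mse = np.mean((est - ref) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak**2 / mse))

def sam(est, ref, eps=1e-12):
    """Mean spectral angle in radians between per-pixel spectra.

    Cubes are (bands, H, W); each pixel's spectrum runs along axis 0.
    """
    dot = (est * ref).sum(axis=0)
    norms = np.linalg.norm(est, axis=0) * np.linalg.norm(ref, axis=0)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return float(np.arccos(cos).mean())

ref = np.zeros((3, 4, 4))
est = np.full((3, 4, 4), 0.1)
psnr_value = psnr(est, ref)        # MSE = 0.01 with peak 1

spectra = np.random.default_rng(0).random((5, 4, 4)) + 0.1
angle_scaled = sam(2.0 * spectra, spectra)   # scaling leaves the angle unchanged
```

The scaled example shows why SAM complements PSNR: a uniform brightness change alters the pixel error but not the spectral shape, so the angle stays near zero.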

3.1.4. Incremental Learning Policy

This study uses the Progressive Incremental Learning (PIL) strategy to optimize the network’s training process. This strategy divides the training into three stages, following a “from easy to difficult” curriculum learning paradigm, gradually enhancing the network’s ability to handle different types of noise [47]. In the first stage, fixed-parameter Gaussian noise ($\sigma$ = 30, 50, and 70) is used to construct the training sets. A step-by-step training approach is adopted to optimize the network parameters, and the chain parameter initialization strategy is employed to save and transfer the training weights from each sub-stage. In the second stage, more challenging blind Gaussian noise ($\sigma \sim U(30, 70)$) is introduced. The optimal weights from the previous stage are used as initialization, enabling the network to adapt to dynamically changing noise levels. The final stage employs a progressive mixed noise training strategy, where the network is trained sequentially on single noise cases from Case 1 to Case 4, followed by optimizing on complex mixed noise (Case 5), which includes all types of noise, thus systematically improving the network’s robustness. As shown in Figure 4, the model trained with incremental learning exhibits lower and more stable training errors compared to conventional training without incremental learning.

The proposed network is implemented using PyTorch version 1.9.0+cu111 and trained on a system equipped with an NVIDIA GeForce RTX 4090 GPU (manufactured by NVIDIA Corporation, Santa Clara, CA, USA). For optimization, the Adam optimizer is used to update the network parameters [48], with a batch size of 20. The training epochs are differentiated for each stage: for the Gaussian noise removal task, the training lasts for 100 epochs, while for more complex noise removal tasks, the training is extended to 200 epochs to ensure convergence.

The learning rate follows a stage-wise decay strategy, starting at 5 × 10⁻⁴, and is reduced by 2 × 10⁻⁴ at the 50th and 70th epochs. To further improve the network’s performance, the gradient loss weight λ is dynamically adjusted during training. Specifically, an adaptive weight adjustment function based on the training epoch is designed. This function ensures that during the early stages of training, the network focuses on basic feature learning with a smaller λ value, while in later stages, the network gradually increases the constraint on fine detail features with a growing λ value.
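One possible reading of the learning-rate schedule is sketched below in plain Python. It interprets "reduced by 2 × 10⁻⁴" as a subtraction and assumes 1-based epoch numbering with each drop taking effect from the stated epoch onward; under that reading the rate steps from 5 × 10⁻⁴ to 3 × 10⁻⁴ and then to 1 × 10⁻⁴. The function name and arguments are illustrative only, and the adaptive λ function is not reproduced because its form is not given.

```python
def learning_rate(epoch, base_lr=5e-4, step=2e-4, milestones=(50, 70)):
    """Learning rate for a 1-based epoch index under subtractive step decay."""
    drops = sum(epoch >= m for m in milestones)
    return base_lr - drops * step

# Sample the schedule around the two milestones.
schedule = {e: learning_rate(e) for e in (1, 49, 50, 69, 70, 100)}
```

If the decay were multiplicative instead, the same structure would apply with the subtraction replaced by a power of the decay factor.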

3.2. Experiments on Gaussian Noise Cases

This paper employs a single unified network to handle different levels of Gaussian noise. Specifically, additive Gaussian noise with varying variances is introduced to create a set of noisy hyperspectral images. The average evaluation metrics are presented in Table 1, with the best performances for each metric highlighted in bold. Figure 5 presents the denoising results for a noise level of $\sigma$ = 50, while Figure 6 displays the results for blind noise, offering a detailed comparison of the outcomes.

Experimental comparisons clearly show that the network proposed in this paper demonstrates superior denoising performance with respect to Gaussian noise, as evidenced by improvements in key metrics such as PSNR, SSIM, and SAM. This performance advantage is primarily due to the network’s enhanced integration of multi-scale contextual information and the synergistic effect of the spatial–spectral attention mechanism, which strengthens its feature extraction capability. Compared to existing methods like MAFNet, HSID-CNN, SQAD, and QRNN3D, the proposed network not only suppresses noise more effectively but also retains texture details in the image better. The denoising results, visualized in Figure 5 and Figure 6, show that traditional methods such as LLRT perform poorly when dealing with strong Gaussian noise, while other deep learning methods, although effective at noise suppression, still suffer loss of texture detail. In contrast, the proposed network more effectively preserves details in textures like tree branches and stone columns. Quantitative analysis further verifies the superiority of the proposed network, with its PSNR and SSIM scores outperforming those of the other methods.

3.3. Experiments of Complex Noise Removal on ICVL Dataset

As previously described, in the final training stage, the network is trained to handle five complex noise types simultaneously: non-i.i.d. Gaussian noise, Gaussian + stripe noise, Gaussian + dead pixel noise, Gaussian + impulse noise, and mixed noise. Figure 7 illustrates the denoising results of the network under these complex noise conditions. A comparison of the denoising performance metrics in Table 2 shows that the proposed network outperforms LLRT, which, being based on low-rank matrix methods, loses some essential structural information during the denoising process. Additionally, the proposed network performs better than other deep-learning-based methods, such as MAFNet, SQAD, and QRNN3D. As shown in Figure 7, under Case 2 noise, the proposed network effectively preserves the clarity of the text, while under Case 4 noise, where the interleaving of power lines and building windows leads to detail loss in the results of other networks, the proposed network can effectively distinguish between different features.

3.4. Experiments on Remote-Sensing HSI Datasets

To comprehensively evaluate the robustness and denoising performance of the proposed network, systematic experiments were carried out on four representative hyperspectral datasets. These datasets cover diverse sensor types, spatial resolutions, and land cover characteristics.

The Pavia University dataset [40] was collected by the ROSIS-3 airborne sensor (German Aerospace Center, DLR, Oberpfaffenhofen, Germany). It includes 103 valid spectral bands and has a spatial size of 610 × 340 pixels, covering typical urban land cover in the city of Pavia, Italy. The Washington DC Mall dataset [41] was acquired using the HYDICE imaging spectrometer (Naval Research Laboratory, Washington, DC, USA) and has a spatial size of 1208 × 307 pixels with 191 spectral bands. It provides a detailed record of the complex urban features of the National Mall and its surroundings. For testing, we clip regions of size 304 × 304 × 191 and 344 × 344 × 103 from the Washington DC Mall and Pavia University datasets, respectively. The remaining parts of the two datasets are divided into overlapping patches with a spatial size of 128 × 128. This yields 800 patches for the Washington DC Mall dataset and 400 patches for the Pavia University dataset, which are used to fine-tune the model pretrained on the ICVL dataset.
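The patch preparation described above can be sketched as a sliding window over the two spatial axes. The stride of 64 pixels and the function name `extract_patches` are assumptions for illustration: the text states only that the 128 × 128 patches overlap, so this sketch is not expected to reproduce the reported patch counts.

```python
import numpy as np

def extract_patches(cube, patch=128, stride=64):
    """Slide a patch x patch window over the spatial axes of an (H, W, B) cube."""
    height, width, _ = cube.shape
    patches = [
        cube[r:r + patch, c:c + patch, :]
        for r in range(0, height - patch + 1, stride)
        for c in range(0, width - patch + 1, stride)
    ]
    return np.stack(patches)

# Toy cube: 3 window positions along the rows, 4 along the columns.
cube = np.zeros((256, 320, 16), dtype=np.float32)
print(extract_patches(cube).shape)  # (12, 128, 128, 16)
```

A stride smaller than the patch size produces the overlap; windows that would cross the image border are skipped rather than padded.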

To further assess the network’s performance in real-world noise conditions, two datasets of practical significance were chosen: the Urban dataset [42] and the Indian Pines dataset [43]. The Urban dataset contains 210 contiguous spectral bands, spanning a spectral range of 400 to 2500 nm, providing a comprehensive representation of the spectral characteristics of urban environments. The Indian Pines dataset, acquired using the AVIRIS sensor (NASA Jet Propulsion Laboratory, Pasadena, CA, USA), includes 220 spectral bands with a spatial size of 145 × 145 pixels. Some bands are significantly affected by atmospheric absorption and water vapor, leading to complex mixed noise interference. Because these real-world datasets lack noise-free reference images, the network trained on the ICVL dataset is transferred to them directly.

3.4.1. Pavia University

Mixed noise was applied to the Pavia University dataset in this study, and the experimental results are summarized in Table 3. The results show that the proposed network achieves the best performance in terms of the quantitative evaluation metrics. The corresponding visual results are presented in Figure 8. The proposed method effectively removes complex noise while preserving high-frequency texture details in the image. The denoised image retains spectral information more effectively and better preserves the texture structure of the metal plate ground and surrounding small buildings.

Figure 9 shows the PSNR and SSIM values of the restored hyperspectral images for each band in the dataset. The proposed method achieves the highest PSNR values across nearly all spectral bands. Nearly all methods perform differently from band to band, for two reasons: the intrinsic noise level varies between bands, which makes some bands harder to denoise than others, and, owing to the characteristics of hyperspectral sensors, the spatial detail also varies across bands.

Figure 10 shows the average normalized digital number (DN) curves for different methods. It is evident that the 30th band is significantly affected by noise. Although most methods can restore the denoised image to the original trend, the traditional low-rank tensor recovery model (LLRT) shows a large deviation. In contrast, the proposed network more accurately reconstructs the DN values corrupted by noise and minimizes the denoising deviation. Furthermore, compared to recent deep learning networks such as SQAD and MAFNet, the proposed method is more stable in preserving spectral information, resulting in denoised DN values that are closer to the original data distribution. This further demonstrates that the proposed network effectively denoises without causing significant spectral distortion.
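A mean DN curve of this kind can be computed in a few lines. The sketch below assumes that the curve is the mean DN of each image row within a single band and that the data are already normalized; the helper names `row_mean_curve` and `curve_deviation` are hypothetical, and the exact averaging direction used for Figure 10 is not stated in the text.

```python
import numpy as np

def row_mean_curve(cube, band):
    """Mean DN of every image row in the chosen band of an (H, W, B) cube."""
    return cube[:, :, band].mean(axis=1)

def curve_deviation(reference, restored, band):
    """Mean absolute gap between the row-mean curves of two cubes."""
    gap = row_mean_curve(reference, band) - row_mean_curve(restored, band)
    return float(np.mean(np.abs(gap)))

rng = np.random.default_rng(0)
reference = rng.uniform(size=(64, 48, 40))
# Row-wise offsets mimic structured noise that shifts whole rows of a band.
noisy = reference + rng.normal(0.0, 0.1, size=(64, 1, 40))
band = 29  # the 30th band, using a zero-based index
print(curve_deviation(reference, noisy, band))
```

A restored image whose curve stays close to the reference curve has a small deviation, which is the behaviour the comparison in Figure 10 is meant to reveal.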

3.4.2. Washington DC Mall

To further assess the network’s denoising effectiveness, severe noise was introduced into the Washington DC Mall dataset, with the experimental results detailed in Table 4. As shown in the table, the proposed network attains the best performance across all quantitative evaluation metrics. The relevant visual outcomes are presented in Figure 11. Compared with HSID-CNN, whose results show spectral distortion after denoising, the images denoised by the proposed network not only retain the spectral information more effectively but also exhibit texture structures of building roofs and road edges that are closer to those of the real images.

As Figure 12 shows, performance varies between spectral bands. In some bands, the denoised images still exhibit distortion, which can be attributed to several factors. For example, the noise characteristics vary across spectral bands, with some bands experiencing stronger noise and others being less affected, leading to differences in denoising results. Furthermore, sensor sensitivity differs between bands, and high-frequency information in hyperspectral images is more susceptible to noise, resulting in varying denoising effectiveness. However, in most spectral bands, the proposed network outperforms other methods, effectively removing noise while preserving original details. The zoomed-in visualizations show that after denoising with the proposed network, the land cover details on building rooftops are better preserved.

As illustrated in Figure 13, the 171st band is heavily contaminated by noise, causing significant distortion in some pixels of the image. While most networks can restore the noisy image to a DN value curve that roughly follows the original, the results indicate that LLRT fails to accurately reconstruct the global curve. Although deep-learning-based methods recover the general trend of the DN values, the proposed network not only reconstructs the true shape of the curve more accurately but also preserves details more effectively, resulting in restored data that are closer to the ground truth.

3.4.3. Real Noise Image

Figure 14 and Figure 15 compare the different denoising methods on the Urban and Indian Pines datasets. The quality of the original hyperspectral images is significantly degraded by severe noise contamination from factors such as atmospheric absorption and water vapor interference, which greatly hinders the effective extraction of land cover information. Because these real-world datasets lack noise-free reference images, the network trained on the ICVL dataset is applied to them directly. Despite this, the proposed method shows effective denoising performance. As shown in the figures, the denoising results of LLRT and HSID-CNN show spectral distortion, while the proposed network not only maintains the integrity of the spectral information but also more accurately preserves the contours of buildings and roads. Furthermore, on the Indian Pines dataset, compared to the denoising results of MAFNet and QRNN3D, the proposed method is better at distinguishing between “Corn-mintill” and “Corn-notill” and is also more effective at retaining road information.

3.5. Ablation Study

The SSCA and DMF modules are the core components of the proposed method. To evaluate their effectiveness, three network variants were tested on the Pavia University dataset: (a) basic network with SSCA module, (b) basic network with DMF module, and (c) the full network with both the SSCA and DMF modules (i.e., the proposed method). The results in Table 5 show that the network with both SSCA and DMF modules achieves the best denoising performance, further validating the effectiveness and complementarity of these two modules in hyperspectral image denoising tasks.

As shown in Figure 16, the spatial–spectral attention module guides the distribution of feature weights via the spatial–spectral heatmap. Spatial attention directs the spatial features, while spectral attention directs the spectral features. Through these parallel branches, the network effectively learns spatial–spectral information from the feature map. The figure shows the input features, the spatial features, the spectral features, and the features obtained after multiplication by the spatial–spectral heatmap.

To validate the sensitivity of the proposed network to the number of layers (i.e., the number of downsampling operations), ablation experiments with network depths of two, three, and four layers were designed. The four-layer network serves as the baseline structure used in this study. The three-layer network was constructed by retaining the core modules of this paper, namely, the SSCA module and the DMF module. The two-layer network only retained the SSCA module. As shown in Table 6 and Figure 17, the experimental results demonstrate a direct correlation between the number of network layers and denoising performance: the four-layer network achieves the best performance. Moreover, when the proposed modules are removed, the network performance significantly drops, further confirming the effectiveness of the modules.

4. Conclusions

This paper proposes a multi-scale differentiated network with spatial–spectral co-operative attention for hyperspectral image denoising (MDSSANet). The network includes two core modules: the spatial–spectral cooperative attention (SSCA) module, which adaptively enhances cross-band features and precisely models long-range spatial dependencies through parallel spectral and spatial attention mechanisms; and the differentiated multi-scale feature fusion (DMF) module, which uses a three-branch parallel architecture to alleviate the loss of information in multi-scale feature transmission by dynamically weighting and fusing large-scale feature extraction, medium-scale spatial–spectral joint modeling, and small-scale fine feature enhancement. Experiments with Case 5 complex noise on several synthetic and real hyperspectral datasets (ICVL, Pavia University, and Washington DC Mall) show that the proposed method outperforms existing approaches across multiple objective evaluation metrics: on average, PSNR improves by 0.45 dB, SSIM by 0.8%, and SAM by 0.7%. In particular, the method demonstrates superior performance in complex noise environments, such as Gaussian–impulse mixed noise, and in small object restoration tasks, such as agricultural field boundaries and urban buildings, confirming its effectiveness in real-world application scenarios.
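As a sanity check, the averages quoted above can be recomputed from Tables 2–4. The sketch assumes that the gains are measured against MAFNet, the strongest competing method in each table, and that the percentages for SSIM and SAM are absolute differences multiplied by 100; the text does not state either convention explicitly, so this is one reading that happens to match the reported figures.

```python
# (MAFNet, proposed) scores copied from Table 2 (ICVL, Case 5),
# Table 3 (Pavia University) and Table 4 (Washington DC Mall).
scores = {
    "PSNR": [(39.67, 40.28), (33.73, 34.14), (26.34, 26.68)],
    "SSIM": [(0.945, 0.950), (0.908, 0.917), (0.881, 0.891)],
    "SAM": [(0.079, 0.071), (0.113, 0.104), (0.095, 0.092)],
}

def mean_gap(pairs, lower_is_better=False):
    """Average improvement of the proposed method over the baseline."""
    gaps = [(base - ours) if lower_is_better else (ours - base)
            for base, ours in pairs]
    return sum(gaps) / len(gaps)

psnr_gain = mean_gap(scores["PSNR"])
ssim_gain = mean_gap(scores["SSIM"])
sam_drop = mean_gap(scores["SAM"], lower_is_better=True)
print(f"PSNR +{psnr_gain:.2f} dB, SSIM +{ssim_gain:.3f}, SAM -{sam_drop:.3f}")
# prints: PSNR +0.45 dB, SSIM +0.008, SAM -0.007
```

Because lower SAM is better, the SAM figure corresponds to a decrease of about 0.007 rather than an increase.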

Considering the limitations of the current research, future work will focus on the following two optimization areas: first, to address the computational complexity of the multi-scale architecture, lightweight network designs using depthwise separable convolution and knowledge distillation will be explored; second, research will investigate a noise distribution adaptation mechanism based on meta-learning, along with dynamic network pruning strategies, to enhance the model’s generalization ability for heterogeneous sensor data.

Author Contributions

Conceptualization, X.W. and L.C.; methodology, X.W. and X.C.; validation, X.C. and X.H.; formal analysis, M.Y.; investigation, X.H. and M.Y.; data curation, X.C.; writing—original draft preparation, X.W.; writing—review and editing, L.C. and X.C.; visualization, M.Y.; supervision, L.C.; project administration, M.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Hubei Province under grant No. 2022CFB447.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used and/or analyzed during the current study are publicly available and can be accessed without any restrictions. If needed, they can also be obtained by contacting the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Pan, J.; Wang, S.; Li, H.; Yuan, Z.; Yuan, B.; Peng, J.; Liu, Y. Integrated Fusion Network for Hyperspectral, Multispectral and Panchromatic Data Fusion. Appl. Sci. 2025, 15, 2217.
2. Rasti, B.; Chang, Y.; Dalsasso, E.; Denis, L.; Ghamisi, P. Image restoration for remote sensing: Overview and toolbox. IEEE Geosci. Remote Sens. Mag. 2021, 10, 201–230.
3. Liang, Y.; Xu, M.; Dong, W.; Zhang, Q. ShadeNet: Innovating Shade House Detection via High-Resolution Remote Sensing and Semantic Segmentation. Appl. Sci. 2025, 15, 3735.
4. Rajabi, R.; Zehtabian, A.; Singh, K.D.; Tabatabaeenejad, A.; Ghamisi, P.; Homayouni, S. Hyperspectral Imaging in Environmental Monitoring and Analysis. Front. Environ. Sci. 2024, 11, 1353447.
5. Cheng, M.F.; Mukundan, A.; Karmakar, R.; Valappil, M.A.E.; Jouhar, J.; Wang, H.-C. Modern Trends and Recent Applications of Hyperspectral Imaging: A Review. Technologies 2025, 13, 170.
6. Sethy, P.K.; Pandey, C.; Sahu, Y.K.; Behera, S.K. Hyperspectral Imagery Applications for Precision Agriculture—A Systemic Survey. Multimed. Tools Appl. 2022, 81, 3005–3038.
7. Krupnik, D.; Khan, S. Close-Range, Ground-Based Hyperspectral Imaging for Mining Applications at Various Scales: Review and Case Studies. Earth Sci. Rev. 2019, 198, 102952.
8. Moharram, M.A.; Sundaram, D.M. Enhancing Exploration-Exploitation in Harmony Search for Airborne Hyperspectral Imaging Band Selection (E3HS). Turk. J. Electr. Eng. Comput. Sci. 2023, 31, 969–991.
9. Peyghambari, S.; Zhang, Y. Hyperspectral Remote Sensing in Lithological Mapping, Mineral Exploration, and Environmental Geology: An Updated Review. J. Appl. Remote Sens. 2021, 15, 31501.
10. Karim, S.; Qadir, A.; Farooq, U.; Shakir, M.; Laghari, A.A. Hyperspectral Imaging: A Review and Trends Towards Medical Imaging. Curr. Med. Imaging Rev. 2023, 19, 417–427.
11. Kruse, F.A.; Boardman, J.W.; Huntington, J.F. Comparison of airborne hyperspectral data and EO-1 Hyperion for mineral mapping. IEEE Trans. Geosci. Remote Sens. 2003, 41, 1388–1400.
12. Wang, M.; Hong, D.; Han, Z.; Li, J.; Yao, J.; Gao, L.; Zhang, B.; Chanussot, J. Tensor decompositions for hyperspectral data processing in remote sensing: A comprehensive review. IEEE Geosci. Remote Sens. Mag. 2023, 11, 26–72.
13. Huang, H.; Qureshi, J.U.; Liu, S.; Sun, Z.; Zhang, C.; Wang, H. Hyperspectral Imaging as a Potential Online Detection Method of Microplastics. Bull. Environ. Contam. Toxicol. 2021, 107, 754–763.
14. Zhao, S.; Zhu, X.; Liu, D.; Xu, F.; Wang, Y.; Lin, L.; Chen, X.; Yuan, Q. A Hyperspectral Image Denoising Method Based on Land Cover Spectral Autocorrelation. Int. J. Appl. Earth Obs. Geoinf. 2023, 123, 103481.
15. Liu, P.; Wang, C.; Ye, M.; Han, R. Coastal Zone Classification Based on U-Net and Remote Sensing. Appl. Sci. 2024, 14, 7050.
16. Liu, X.; Bourennane, S.; Fossati, C. Reduction of Signal-Dependent Noise from Hyperspectral Images for Target Detection. IEEE Trans. Geosci. Remote Sens. 2013, 52, 5396–5411.
17. Sun, Q.; Liu, X.; Bourennane, S.; Liu, B. Multiscale Denoising Autoencoder for Improvement of Target Detection. Int. J. Remote Sens. 2021, 42, 3002–3016.
18. Yang, J.; Zhao, Y.Q.; Chan, J.C.W. Learning and transferring deep joint spectral–spatial features for hyperspectral classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 4729–4742.
19. Wang, S.; Zhu, Z.; Liu, Y.; Zhang, B. Weighted Group Sparse Regularized Tensor Decomposition for Hyperspectral Image Denoising. Appl. Sci. 2023, 13, 10363.
20. Zhu, Y.; Yuan, K.; Zhong, W.; Xu, L. Spatial–spectral ConvNeXt for hyperspectral image classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 5453–5463.
21. Yuan, Y.; Ma, H.; Liu, G. Partial-DNet: A Novel Blind Denoising Model with Noise Intensity Estimation for HSI. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5505913.
22. Zhu, Y.; Abdalla, A.; Tang, Z.; Cen, H. Improving Rice Nitrogen Stress Diagnosis by Denoising Strips in Hyperspectral Images via Deep Learning. Biosyst. Eng. 2022, 219, 165–176.
23. Chambolle, A. An algorithm for total variation minimization and applications. J. Math. Imaging Vis. 2004, 20, 89–97.
24. Elad, M.; Aharon, M. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. Image Process. 2006, 15, 3736–3745.
25. Dian, R.; Li, S.; Fang, L.; Wei, Q. Multispectral and Hyperspectral Image Fusion with Spatial-Spectral Sparse Representation. Inf. Fusion 2019, 49, 262–270.
26. Lu, C.; Tang, J.; Yan, S.; Lin, Z. Generalized Nonconvex Nonsmooth Low-Rank Minimization. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2014, 4130–4137.
27. Liu, J.; Musialski, P.; Wonka, P.; Ye, J. Tensor Completion for Estimating Missing Values in Visual Data. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 208–220.
28. Zhang, H.; He, W.; Zhang, L.; Shen, H.; Yuan, Q. Hyperspectral image restoration using low-rank matrix recovery. IEEE Trans. Geosci. Remote Sens. 2013, 52, 4729–4743.
29. Donoho, D.L.; Johnstone, I.M. Ideal spatial adaptation by wavelet shrinkage. Biometrika 1994, 81, 425–455.
30. Maggioni, M.; Katkovnik, V.; Egiazarian, K.; Foi, A. Nonlocal transform-domain filter for volumetric data denoising and reconstruction. IEEE Trans. Image Process. 2012, 22, 119–133.
31. Chang, Y.; Yan, L.; Zhong, S. Hyper-laplacian regularized unidirectional low-rank tensor recovery for multispectral image denoising. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2017, 4260–4268.
32. Hu, L.; Luo, X.; Wei, Y. Hyperspectral image classification of convolutional neural network combined with valuable samples. J. Phys.: Conf. Ser. 2020, 1549, 52011.
33. Yuan, Q.; Zhang, Q.; Li, J.; Shen, H.; Zhang, L. Hyperspectral image denoising employing a spatial–spectral deep residual convolutional neural network. IEEE Trans. Geosci. Remote Sens. 2018, 57, 1205–1218.
34. Wei, K.; Fu, Y.; Huang, H. 3-D quasi-recurrent neural network for hyperspectral image denoising. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 363–375.
35. Tai, Y.; Yang, J.; Liu, X.; Xu, C. MemNet: A persistent memory network for image restoration. Proc. IEEE Int. Conf. Comput. Vis. 2017, 4539–4547.
36. Hong, D.; Han, Z.; Yao, J.; Gao, L.; Zhang, B.; Plaza, A.; Chanussot, J. SpectralFormer: Rethinking hyperspectral image classification with transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5518615.
37. Pan, E.; Ma, Y.; Mei, X.; Fan, F.; Huang, J.; Ma, J. SQAD: Spatial–Spectral Quasi-Attention Recurrent Network for Hyperspectral Image Denoising. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5524814.
38. Pan, H.; Gao, F.; Dong, J.; Du, Q. Multiscale adaptive fusion network for hyperspectral image denoising. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 3045–3059.
39. Arad, B.; Ben-Shahar, O. Sparse recovery of hyperspectral signal from natural RGB images. Comput. Vis.–ECCV 2016, 9911, 19–34.
40. Zhang, Y.; Li, W.; Zhang, M.; Wang, S.; Tao, R.; Du, Q. Graph information aggregation cross-domain few-shot learning for hyperspectral image classification. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 1912–1925.
41. Yin, H.; Chen, H. Multibranch separable 3-D convolutional neural network for hyperspectral image denoising. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 8034–8048.
42. Jeong, J.; Cho, Y.; Shin, Y.S.; Kim, J.; Lee, J.H. Complex Urban Dataset with Multi-Level Sensors from Highly Diverse Urban Environments. Int. J. Robot. Res. 2019, 38, 642–657.
43. He, W.; Zhang, H.; Zhang, L.; Shen, H. Total-variation-regularized low-rank matrix factorization for hyperspectral image restoration. IEEE Trans. Geosci. Remote Sens. 2015, 54, 178–188.
44. Singh, A.K.; Kumar, H.V.; Kadambi, G.R.; Kishore, J.K.; Shuttleworth, J.; Manikandan, J. Quality metrics evaluation of hyperspectral images. Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci. 2014, 40, 1221–1226.
45. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
46. Yuhas, R.H.; Boardman, J.W.; Goetz, A.F.H. Determination of semi-arid landscape endmembers and seasonal trends using convex geometry spectral unmixing techniques. In Proceedings of the Summaries of the 4th Annual JPL Airborne Geoscience Workshop, Washington, DC, USA, 25–29 October 1993; Volume 1.
47. Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, New York, NY, USA, 14–18 June 2009; pp. 41–48.
48. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference for Learning Representations, San Diego, CA, USA, 7–9 May 2015; Volume 5.

Figure 1. The framework of MDSSANet, comprising the SSCA module (performs spatial–spectral joint extraction and guides affine parameters for adaptive normalization) and the DMF module (fuses normalized multi-scale features to output denoised maps for clean image reconstruction).

Figure 2. Illustration of the spatial–spectral co-attention module.

Figure 3. Illustration of the differentiated multi-scale feature fusion module.

Figure 4. The training errors with/without incremental learning. The blue curve denotes training on Gaussian noise data without incremental learning. The red curve denotes training on Gaussian noise data with incremental learning.

Figure 5. The Gaussian noise removing results of SSIM at 13th band of the HSI under noise level σ = 50 (ICVL dataset).

Figure 6. The Gaussian noise removing results of SSIM at 13th band of the HSI under noise level σ = Blind (ICVL dataset).

Figure 7. Complex noise removal results on the ICVL dataset. Examples for non-i.i.d Gaussian noise (Case 1), Gaussian + stripes (Case 2), Gaussian + deadline (Case 3), Gaussian + impulse (Case 4) and mixture noise (Case 5) removal are illustrated, respectively.

Figure 8. Simulated noise removal results of SSIM at 3rd, 13th and 23rd bands of mixture noise on the Pavia University dataset. (a) Ground Truth, (b) Noise, (c) LLRT, (d) BM4D, (e) MemNet, (f) HSID-CNN, (g) SQAD, (h) QRNN3D, (i) MAFNet, (j) Ours.

Figure 9. PSNR and SSIM results of different denoising methods for each spectral band in the simulated mixture noise experiments on the Pavia University dataset. (a) SSIM performance across spectral bands. (b) PSNR performance across spectral bands.

Figure 10. Horizontal mean digital number curves at 30th band of real noise by different methods on the Pavia University dataset. (a) Ground Truth, (b) Noise, (c) LLRT, (d) BM4D, (e) MemNet, (f) HSID-CNN, (g) SQAD, (h) QRNN3D, (i) MAFNet, (j) Ours.

Figure 11. Simulated noise removal results of SSIM at 16th, 26th and 59th bands of mixture noise on the Washington DC Mall dataset. (a) Ground Truth, (b) Noise, (c) LLRT, (d) BM4D, (e) MemNet, (f) HSID-CNN, (g) SQAD, (h) QRNN3D, (i) MAFNet, (j) Ours.

Figure 12. PSNR and SSIM results of different denoising methods for each spectral band in the simulated mixture noise experiments on the Washington DC Mall dataset. (a) SSIM performance across spectral bands. (b) PSNR performance across spectral bands.

Figure 13. Horizontal mean digital number curves at 171st band of real noise by different methods on the Washington DC Mall dataset. (a) Ground Truth, (b) Noise, (c) LLRT, (d) BM4D, (e) MemNet, (f) HSID-CNN, (g) SQAD, (h) QRNN3D, (i) MAFNet, (j) Ours.

Figure 14. Visual results of noise removal on false-color images for bands 206, 94, and 1 of the Urban dataset compared with competing methods. (a) Noise, (b) LLRT, (c) BM4D, (d) MemNet, (e) HSID-CNN, (f) SQAD, (g) QRNN3D, (h) MAFNet, (i) Ours.

Figure 15. Visual results of noise removal on false-color images for bands 30, 110, and 189 of the Indian Pines dataset compared with competing methods. (a) Noise, (b) LLRT, (c) BM4D, (d) MemNet, (e) HSID-CNN, (f) SQAD, (g) QRNN3D, (h) MAFNet, (i) Ours.

Figure 16. Input features, spatial features, spectral features, and feature maps after the multiplication of the spatial–spectral feature map.

Figure 17. Simulated noise removal results of SSIM at 1st, 71st and 101st band of mixture noise on the Pavia University dataset. Bold values indicate the best results for the corresponding metric.

Table 1. Performance comparison of different methods on the ICVL dataset. Note: “Blind” indicates that each image is corrupted by Gaussian noise with unknown σ. Bold values indicate the best results for the corresponding metric.

Case	Index	Noisy	LLRT	BM4D	MemNet	HSID-CNN	SQAD	QRNN3D	MAFNet	Ours
$σ$ = 30	PSNR	18.51	40.72	37.16	40.52	39.65	43.92	43.97	44.56	45.31
	SSIM	0.130	0.946	0.908	0.943	0.948	0.966	0.968	0.975	0.978
	SAM	0.751	0.072	0.093	0.085	0.092	0.059	0.056	0.044	0.038
$σ$ = 50	PSNR	16.43	37.94	35.86	39.79	38.89	41.62	41.78	42.12	43.19
	SSIM	0.053	0.921	0.867	0.936	0.929	0.957	0.956	0.959	0.965
	SAM	0.847	0.098	0.118	0.084	0.091	0.061	0.059	0.053	0.044
$σ$ = 70	PSNR	14.84	35.33	32.49	36.46	38.67	40.71	40.49	41.39	42.07
	SSIM	0.036	0.893	0.842	0.898	0.922	0.947	0.945	0.953	0.957
	SAM	0.993	0.137	0.158	0.099	0.121	0.059	0.065	0.052	0.046
Blind	PSNR	17.47	34.28	35.25	37.80	39.69	41.94	42.25	42.83	43.41
	SSIM	0.084	0.887	0.862	0.932	0.941	0.954	0.957	0.963	0.966
	SAM	0.862	0.114	0.131	0.101	0.116	0.052	0.057	0.046	0.043

Table 2. Performance comparison of various methods under five complex noise scenarios on the ICVL dataset. Bold values indicate the best results for the corresponding metric.

Case	Index	Noisy	LLRT	BM4D	MemNet	HSID-CNN	SQAD	QRNN3D	MAFNet	Ours
Case1	PSNR	17.97	32.71	35.86	36.75	38.75	42.32	42.71	42.91	43.45
	SSIM	0.162	0.823	0.867	0.932	0.948	0.954	0.958	0.963	0.967
	SAM	0.864	0.182	0.121	0.102	0.090	0.051	0.047	0.045	0.041
Case2	PSNR	17.58	30.79	34.53	37.11	38.98	41.92	42.35	42.65	43.17
	SSIM	0.151	0.785	0.845	0.935	0.949	0.949	0.955	0.962	0.965
	SAM	0.878	0.208	0.144	0.098	0.084	0.051	0.053	0.048	0.044
Case3	PSNR	17.44	28.76	31.97	38.34	39.02	40.86	40.92	42.16	42.55
	SSIM	0.146	0.722	0.812	0.935	0.942	0.945	0.954	0.961	0.963
	SAM	0.894	0.218	0.178	0.111	0.088	0.061	0.063	0.050	0.045
Case4	PSNR	14.92	26.41	28.71	35.10	35.34	39.39	39.48	39.92	40.50
	SSIM	0.121	0.656	0.714	0.858	0.901	0.939	0.942	0.947	0.951
	SAM	0.905	0.287	0.241	0.203	0.173	0.087	0.091	0.084	0.074
Case5	PSNR	14.09	23.26	27.54	34.82	35.17	38.71	39.07	39.67	40.28
	SSIM	0.093	0.569	0.692	0.841	0.898	0.934	0.941	0.945	0.950
	SAM	0.925	0.319	0.259	0.184	0.177	0.081	0.085	0.079	0.071

Table 3. Performance comparison of different methods on the Pavia University dataset. Bold values indicate the best results for the corresponding metric.

Index	Noisy	LLRT	BM4D	MemNet	HSID-CNN	SQAD	QRNN3D	MAFNet	Ours
PSNR	17.67	21.09	25.97	30.78	29.61	32.91	33.26	33.73	34.14
SSIM	0.139	0.578	0.658	0.811	0.823	0.882	0.889	0.908	0.917
SAM	0.872	0.321	0.275	0.150	0.161	0.121	0.126	0.113	0.104

Table 4. Performance comparison of various methods on the Washington DC Mall dataset. Bold values indicate the best results for the corresponding metric.

Index	Noisy	LLRT	BM4D	MemNet	HSID-CNN	SQAD	QRNN3D	MAFNet	Ours
PSNR	14.43	19.06	21.97	23.51	22.30	24.51	25.91	26.34	26.68
SSIM	0.127	0.537	0.628	0.787	0.804	0.864	0.868	0.881	0.891
SAM	0.902	0.351	0.255	0.128	0.135	0.098	0.101	0.095	0.092

Table 5. Ablation study of SSCA and DMF. Bold values indicate the best results for the corresponding metric.

Method	SSIM	PSNR	SAM
SSCA	0.914	34.01	0.105
DMF	0.911	33.87	0.108
SSCA + DMF	0.917	34.14	0.104

Table 6. Ablation on the number of downsampling operations.

Method	Noisy	Ours-Two	MAFNet	Ours-Three	Ours
PSNR	17.67	32.12	33.73	33.85	34.14
SSIM	0.139	0.893	0.908	0.910	0.917
SAM	0.872	0.121	0.113	0.109	0.104

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

