Modulated Diffusion with Spatial–Spectral Disentangled Guidance for Hyperspectral Image Super-Resolution

Xu, Xinlan; Qiao, Jiaqing; Zhou, Jialin; Yuan, Kuo; Feng, Lei

doi:10.3390/rs18101582


by Xinlan Xu, Jiaqing Qiao, Jialin Zhou, Kuo Yuan and Lei Feng *

School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin 150001, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(10), 1582; https://doi.org/10.3390/rs18101582

Submission received: 9 April 2026 / Revised: 8 May 2026 / Accepted: 12 May 2026 / Published: 15 May 2026

(This article belongs to the Special Issue Artificial Intelligence Algorithm for Remote Sensing Imagery Processing (5th Edition))


Highlights

What are the main findings?

A Dynamic Modulated Residual Network (DMRN) with time-aware Feature-wise Linear Modulation (FiLM) achieves adaptive spatial–spectral conditional guidance throughout the diffusion denoising process. This design effectively prevents feature dilution and hallucination and works well with the side-branch injection architecture, yielding a 0.65 dB PSNR improvement over static guidance even with only 5 sampling timesteps.
A training-free Spatial–Spectral Disentangled Guidance (SSDG) strategy explicitly decouples spatial and spectral guidance during sampling, enabling flexible control over modality trade-offs, and achieves the best overall performance under blind Gaussian noise ($\sigma = 0.01$, approximately 23 dB SNR) across three hyperspectral benchmark datasets, demonstrating its generalization ability.

What are the implications of the main findings?

The proposed framework demonstrates that dynamic, timestep-aware conditioning is essential for diffusion-based multi-modal fusion, offering a generalizable design principle applicable beyond hyperspectral imaging to other sensor fusion tasks requiring noise robustness.
The SSDG mechanism provides a plug-and-play, retraining-free knob for downstream users to balance spatial fidelity against spectral consistency, making the method readily adaptable to diverse remote sensing applications such as land monitoring, ecological detection, and agricultural analysis.

Abstract

Fusion-based hyperspectral image super-resolution (HSI-SR) built on diffusion models exhibits promising performance in generating high-quality, realistic features. However, existing methods face two limitations: (1) static conditional guidance is discordant with the dynamic denoising process, and (2) modality conflicts are inadequately addressed by concatenation. To address these challenges, we propose a novel Modulated Diffusion Framework with Spatial–Spectral Disentangled Guidance (SSDG). Specifically, it introduces a Dynamic Modulated Residual Network (DMRN), which leverages a time-aware mechanism to dynamically adjust conditional feature injection, ensuring adaptive guidance throughout all denoising stages. Furthermore, we design a training-free SSDG strategy to explicitly decouple spatial and spectral guidance during sampling, allowing for flexible control over the fusion process to mitigate modality conflicts. Extensive experiments on three public datasets demonstrate that the proposed method achieves state-of-the-art performance, exhibiting superior robustness, particularly in challenging noisy scenarios.

Keywords:

hyperspectral image fusion; diffusion models; dynamic modulation; disentangled guidance; frequency-domain optimization

1. Introduction

Hyperspectral images (HSIs) carry rich spectral information across hundreds of contiguous bands and are widely used in land monitoring [1], ecological detection [2], and agricultural analysis [3]. However, due to imaging hardware limitations, HSIs usually have comparatively low spatial resolution, which severely hinders their practical use. To mitigate this deficiency, researchers have extensively investigated Hyperspectral Image Super-Resolution (HSI-SR).

As illustrated in Figure 1, multispectral images (MSIs) possess high spatial resolution (HrMSI) but low spectral resolution, serving as a perfect complement to low-resolution HSIs (LrHSI). Consequently, fusion-based HSI-SR, especially Hyperspectral–Multispectral (HS–MS) fusion, has become a mainstream solution. Nevertheless, effectively fusing these two modalities is severely hindered by inherent modality discrepancies. Specifically, the highly ill-posed nature of HS–MS fusion makes it exceptionally challenging to align and integrate their distinct feature distributions, as enhancing spatial high-frequency textures often comes at the expense of spectral fidelity.

Existing methods can be broadly categorized into model-based [4,5,6], CNN-based [7,8,9], and Transformer-based [10,11,12] approaches. Despite their advances, these deterministic methods rely on pixel-wise objective functions, which inherently produce over-smoothed results and fail to recover high-frequency details, especially under real-world noisy conditions. Diffusion models [13,14], with their stable training dynamics and strong generative capacity, have recently been extended to HSI-SR tasks [15,16,17]. However, two critical limitations persist. First, existing methods rely on static conditional guidance that is fundamentally misaligned with the dynamic, multi-stage nature of the denoising process, thereby causing feature dilution and hallucination in high-frequency details. Second, the inherent modality conflict between HrMSI and LrHSI is inadequately resolved by naive concatenation, ultimately leading to spectral distortion and spatial artifacts.

To address these challenges, we propose a Modulated Diffusion Framework with Spatial–Spectral Disentangled Guidance (SSDG), comprising two core components. First, we introduce a Dynamic Modulated Residual Network (DMRN) that leverages a time-aware mechanism to dynamically adjust conditional feature injection throughout all denoising stages, thereby preventing feature dilution and ensuring adaptive guidance from coarse-grained to fine-grained reconstruction. Unlike the static affine transformation of standard Feature-wise Linear Modulation (FiLM) [18] or the heavy replicated encoder branch of ControlNet [19], DMRN explicitly modulates external conditions using the global diffusion timestep, introducing minimal additional computational overhead.

Second, we design a training-free SSDG strategy that explicitly decouples spatial and spectral guidance during sampling, offering flexible control over the spatial–spectral trade-off to mitigate modality conflicts without retraining. Complemented by a Wavelet-Based Frequency-Aware Optimization objective tailored for residual learning, the proposed framework achieves state-of-the-art performance with superior robustness, even in challenging noisy scenarios.

The main contributions of our work are as follows:

We propose a time-aware diffusion framework for HS–MS fusion, comprising a Dynamic Modulated Residual Network (DMRN) for adaptive conditional denoising and a Spatial–Spectral Disentangled Guidance (SSDG) mechanism for flexible sampling.
DMRN introduces a time-aware side-injection mechanism with FiLM modulation, dynamically adapting conditional guidance to each denoising stage while incorporating frequency-domain constraints to align with the residual learning objective.
SSDG extends Classifier-Free Guidance to multi-modal fusion, enabling training-free, explicit control over spatial and spectral guidance weights to mitigate modality conflicts and suit diverse downstream tasks.
Extensive experiments on three public datasets (Chikusei, Houston, and KSC) demonstrate state-of-the-art performance and superior robustness to noise, while maintaining high computational efficiency.

2. Related Work

2.1. Hyperspectral Image Fusion

HSI-SR has garnered significant attention in recent years due to its importance in HSI applications. Various methods have been proposed to address the low spatial resolution of HSIs. Among these, fusion methods, which exploit high-resolution spatial information from other images, primarily HrMSIs, have undergone significant development. HS–MS image fusion can generally be divided into two categories: model-based approaches and DL-based approaches.

Derived from pan-sharpening, early research on HSI fusion focused heavily on model-based methods, which incorporate physical priors to regularize the reconstruction process. Though inferior to deep learning (DL)-based methods due to their limited ability to model high-dimensional features, model-based methods continue to progress. NLLR [20] incorporated nonlocal spatial similarity and a low-rank prior, whereas SSLRR [21] combined low-rank representation, sparse structures, and the Alternating Direction Method of Multipliers (ADMM) algorithm to formulate a fusion pathway for HSI and MSI. Despite their interpretability, model-based methods struggle to capture the complexities of real-world data and generalize poorly across different types of landscapes.

Recently, DL-based methods have superseded model-based approaches, evolving through two major paradigms. Convolutional Neural Networks (CNN)-based methods, such as ResTFNet [8], SSR-Net [9], and LAGConv [22], achieve superior pixel-wise accuracy but struggle to capture long-range dependencies due to limited receptive fields. To address this limitation, Vision Transformers (ViTs) [11,23,24,25] were introduced to model global contexts via self-attention, albeit at the cost of heavy computational burdens. Despite their respective advances, however, both CNN- and Transformer-based approaches share a fundamental shortcoming: Their reliance on pixel-wise objective functions inherently leads to over-smoothed outputs and an inability to recover high-frequency details. Such a problem is further exacerbated under real-world noisy conditions, where the physical correlation between HrMSI and LrHSI is not fully exploited and fusion quality declines sharply. As a result, the HSI-SR task requires generative models to restore more realistic textures in real-world scenarios.

2.2. Diffusion Models for HSI-SR

Generative models, particularly Diffusion Models (DMs) [13,14], have shown great promise, offering stable training and high-quality generation compared to Generative Adversarial Networks (GANs) [26,27]. While early DM-based fusion methods focused on unsupervised Pansharpening [28,29,30,31], the availability of synthetic paired data for HSI-SR [32] has enabled the development of supervised approaches.

Recent methods like HSR-Diff [17] and S2CycleDiff [33] have adapted diffusion models for HS–MS fusion, often employing dual-stream architectures or wavelet constraints [34,35] to manage multimodal information. Additionally, diffusion concepts have been integrated into Deep Unfolding Networks (DUNs) as iterative denoisers, as seen in ISP-Diff [36]. However, a primary limitation of these models is their reliance on static conditional injection, which is misaligned with the dynamic, multi-stage nature of the diffusion process. This static guidance can lead to conditional redundancy and suboptimal performance as the model transitions from coarse structure reconstruction to fine detail refinement.

2.3. Conditional Guidance and Disentanglement

A central challenge in conditional diffusion is ensuring faithful adherence to external conditions, typically addressed through two paradigms: architectural modulation and sampling strategies.

From an architectural perspective, conditional information is injected into the U-Net backbone at various stages. While methods like ControlNet [19], Dif-PAN [35], and DDRF [16] offer robust conditional control, they share two key drawbacks. First, the replicated encoder structure of ControlNet incurs significant computational overhead. Second, and more critically, these methods fail to explicitly account for the evolving demands of the denoising process, which progressively transitions from coarse structure construction to fine texture refinement across the reverse diffusion trajectory. In contrast, the proposed approach extracts a unified conditional feature map that is compactly adapted to the decoder and dynamically modulated according to the current timestep, addressing both efficiency and adaptability simultaneously.

From a sampling perspective, strategies are employed to enhance the conditional signal. While zero-shot or unsupervised methods like DDFM [29] and ARGS-Diff [30] exist, Classifier-Free Guidance (CFG) [37] has become a standard for training-free enhancement. By randomly dropping conditions during training, CFG enables score extrapolation at inference time, thereby yielding more realistic textures. It has been adapted to HSI tasks in models like SDM [38] and SCDM [39]. However, a critical oversight in standard CFG is its treatment of heterogeneous conditions as a monolithic block. Such a limitation can exacerbate modality conflicts between the spatial guidance from HrMSI and the spectral guidance from LrHSI.

3. Materials and Methods

In this section, we first formulate the hyperspectral image super-resolution (HSI-SR) problem and briefly review the fundamentals of diffusion probabilistic models. We then present the framework of our model in detail, focusing on two core components tailored for HSI-SR: the Dynamic Modulated Residual Network (DMRN) and the Spatial–Spectral Disentangled Guidance (SSDG). Specifically, we elaborate on how DMRN dynamically modulates feature injection during denoising and how SSDG resolves modality conflicts during sampling. Finally, we introduce the optimization objective incorporating a frequency-aware constraint and discuss the experimental settings employed in our method.

3.1. Problem Formulation

The goal of HSI super-resolution is to reconstruct a high-fidelity, high-resolution HSI (HrHSI), denoted as $Z \in \mathbb{R}^{C \times HW}$ (in its matrix-unfolded form), from two observed inputs: the low-resolution HSI (LrHSI) $X \in \mathbb{R}^{C \times hw}$ and the high-resolution multispectral image (HrMSI) $Y \in \mathbb{R}^{c \times HW}$. Here, $C$ and $c$ denote the numbers of spectral bands of the HSI and the MSI, respectively, and $(H, W)$ and $(h, w)$ denote the high- and low-resolution spatial dimensions ($c < C$, $h < H$, $w < W$).

The observation model characterizing the degradation processes is defined as

$$X = Z D + N_{x}, \tag{1}$$

$$Y = R Z + N_{y}, \tag{2}$$

where $R \in \mathbb{R}^{c \times C}$ denotes the spectral response function (SRF) of the multispectral sensor, and the spatial degradation matrix $D \in \mathbb{R}^{HW \times hw}$ can be decomposed into a blurring and a down-sampling operation. $N_{x}$ and $N_{y}$ are independent additive Gaussian noise terms.
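To make the matrix shapes in Equations (1) and (2) concrete, the following minimal NumPy sketch builds a toy instance of the observation model. The toy sizes, the box-average form of $D$, and the random row-normalized SRF $R$ are illustrative assumptions, not the paper's actual degradation operators.

```python
import numpy as np

def box_downsample_matrix(H, W, scale):
    """Build D (HW x hw): each low-res pixel averages one scale x scale block.

    Assumption: a box blur plus decimation stands in for the paper's
    blur-and-down-sampling decomposition of D.
    """
    h, w = H // scale, W // scale
    D = np.zeros((H * W, h * w))
    for i in range(h):
        for j in range(w):
            for di in range(scale):
                for dj in range(scale):
                    hr_index = (i * scale + di) * W + (j * scale + dj)
                    D[hr_index, i * w + j] = 1.0 / scale**2
    return D

def observe(Z, D, R, sigma_x=0.0, sigma_y=0.0, seed=0):
    """Apply Eqs. (1)-(2): X = Z D + N_x and Y = R Z + N_y."""
    rng = np.random.default_rng(seed)
    X = Z @ D + sigma_x * rng.standard_normal((Z.shape[0], D.shape[1]))
    Y = R @ Z + sigma_y * rng.standard_normal((R.shape[0], Z.shape[1]))
    return X, Y

# Toy sizes (hypothetical): C=8 HSI bands, c=3 MSI bands, 8x8 pixels, scale 4.
C, c, H, W, scale = 8, 3, 8, 8, 4
rng = np.random.default_rng(0)
Z = rng.random((C, H * W))                 # HrHSI in matrix-unfolded form
D = box_downsample_matrix(H, W, scale)     # shape (HW, hw)
R = rng.random((c, C))
R /= R.sum(axis=1, keepdims=True)          # each SRF row sums to 1
X, Y = observe(Z, D, R)                    # noise-free observations
print(X.shape, Y.shape)
```

Right-multiplying by $D$ acts on the pixel axis and left-multiplying by $R$ acts on the band axis, which is why the two degradations can be applied independently.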

3.2. Diffusion Preliminaries

Standard diffusion models typically predict the added noise $\epsilon$. However, for HSI-SR tasks, predicting the clean image $x_{0}$ directly has been shown to be more effective [16]. Furthermore, generating a precise high-dimensional HSI cube from scratch is challenging, as minor deviations in the predicted mean can accumulate, leading to significant spectral distortions.

To address this, we simplify the optimization landscape by formulating the task as a residual learning problem. Let $X_{up} = X U$ denote the LrHSI upsampled to the target resolution via bilinear interpolation $U$. The target residual $R_{s}$ is defined as

$$R_{s} = Z - X_{up}. \tag{3}$$

Consequently, our method aims to learn the conditional distribution $p(R_{s} \mid X, Y)$. Once the predicted $\hat{R}_{s}$ is generated, the final HrHSI is obtained by $\hat{Z} = \hat{R}_{s} + X_{up}$. In the following formulation, $x_{0}$ refers to the residual $R_{s}$.
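The residual formulation in Equation (3) can be sketched as follows. This is a minimal illustration on image cubes of shape (bands, height, width) rather than the matrix-unfolded form. The hand-written `bilinear_resize` helper (half-pixel sampling convention) and the box-average stand-in for the LrHSI are assumptions made for the example.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Bilinear resize of a (C, H, W) array, using half-pixel centers."""
    _, in_h, in_w = img.shape
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[None, :, None]
    wx = (xs - x0)[None, None, :]
    top = img[:, y0][:, :, x0] * (1 - wx) + img[:, y0][:, :, x1] * wx
    bottom = img[:, y1][:, :, x0] * (1 - wx) + img[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bottom * wy

def to_residual(Z, X):
    """Eq. (3): R_s = Z - X_up, with X_up the bilinearly upsampled LrHSI."""
    X_up = bilinear_resize(X, Z.shape[1], Z.shape[2])
    return Z - X_up, X_up

def from_residual(R_hat, X_up):
    """Reconstruction: Z_hat = R_hat + X_up."""
    return R_hat + X_up

rng = np.random.default_rng(0)
Z = rng.random((6, 16, 16))                        # toy HrHSI cube
X = Z.reshape(6, 4, 4, 4, 4).mean(axis=(2, 4))     # crude 4x box downsample
R_s, X_up = to_residual(Z, X)
Z_rec = from_residual(R_s, X_up)
```

Because the upsampled LrHSI is added back at the end, the network only has to account for what interpolation misses.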

Built upon Denoising Diffusion Probabilistic Models (DDPM) [13], the forward process of our framework is a fixed Markov chain that gradually adds Gaussian noise to the original $x_{0}$ over $T$ steps:

$$q(x_{t} \mid x_{t-1}) = \mathcal{N}\left(x_{t}; \sqrt{1 - \beta_{t}}\, x_{t-1}, \beta_{t} I\right), \tag{4}$$

where $t \in \{1, 2, \dots, T\}$, and the variances $\beta_{t} \in \{\beta_{1}, \beta_{2}, \dots, \beta_{T}\}$ follow a fixed schedule. As $T \to \infty$, $x_{T}$ approximates a standard Gaussian $\mathcal{N}(0, I)$.
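As a sanity check on Equation (4), the sketch below runs the Markov chain step by step and compares it with the standard closed-form jump $x_{t} = \sqrt{\bar{\alpha}_{t}}\, x_{0} + \sqrt{1 - \bar{\alpha}_{t}}\, \epsilon$, where $\bar{\alpha}_{t} = \prod_{i \le t} (1 - \beta_{i})$. The linear schedule endpoints follow Section 3.6. The all-ones test signal and the function names are illustrative.

```python
import numpy as np

def q_step(x_prev, beta_t, rng):
    """One Markov transition of Eq. (4)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

def q_sample(x0, t, alpha_bar, rng):
    """Jump straight to step t (1-indexed) via the closed form of Eq. (4)."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

T = 1000
betas = np.linspace(1e-4, 0.02, T)           # linear schedule (Section 3.6)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = np.ones(20000)                          # stand-in for a flattened residual
x_chain = x0.copy()
for beta_t in betas:                         # explicit step-by-step chain
    x_chain = q_step(x_chain, beta_t, rng)
x_jump, _ = q_sample(x0, T, alpha_bar, rng)  # closed-form equivalent
```

Both routes end close to a standard Gaussian because $\bar{\alpha}_{T}$ is nearly zero for this schedule.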

In the reverse process, we train a neural network to approximate the posterior $p_{\theta}(x_{t-1} \mid x_{t}, X, Y)$. This process is parameterized as a Gaussian transition:

$$p_{\theta}(x_{t-1} \mid x_{t}, X, Y) = \mathcal{N}\left(x_{t-1}; \mu_{\theta}(x_{t}, X, Y, t), \Sigma_{\theta}(x_{t}, t)\right), \tag{5}$$

where the mean $\mu_{\theta}$ is derived from the network’s prediction of $x_{0}$. We adopt the fixed variance strategy $\Sigma_{\theta} = \beta_{t} I$ following [13] to ensure stable training. This reverse process is implemented by our proposed Dynamic Modulated Residual Network (DMRN), which is detailed in Section 3.3.

3.3. Dynamic Modulated Residual Network (DMRN)

We propose the Dynamic Modulated Residual Network (DMRN) to parameterize the reverse diffusion process $p_{\theta}(x_{t-1} \mid x_{t}, X, Y)$. Tailored for multi-modal HSI-SR, DMRN is designed to effectively fuse the spectral information from LrHSI and the spatial details from HrMSI.

3.3.1. Overall Architecture

The DMRN is built upon a U-Net-based encoder–decoder backbone, incorporating residual blocks, skip connections, and spatial self-attention layers at its two lowest-resolution levels. To preserve the fine-grained spatial structure of the conditional images, we modify the standard input configuration. Following SR3 [15], instead of injecting conditions at the bottleneck, we concatenate the upsampled LrHSI $X_{up}$, the HrMSI $Y$, and the noisy residual $x_{t}$ along the channel dimension at the first layer. This early fusion strategy ensures that high-resolution spatial cues are available throughout the encoding process.

Additionally, the global timestep information $t$ is embedded via linear projection and injected into each residual block, enabling the global time-awareness of our network. The network output is the predicted clean residual $\hat{R}_{s}$. Finally, the HrHSI is reconstructed as $\hat{Z} = \hat{R}_{s} + X_{up}$.

3.3.2. Side-Injection Mechanism

While input concatenation provides basic guidance, deep semantic features of the condition may be diluted during the encoding process. Inspired by ControlNet [19], we introduce a parallel side-injection branch to enforce robust structural constraints on the decoder.

This branch first processes the concatenated conditions ($X_{up}$, $Y$) through a specialized Conditional Module (detailed in Section 3.3.3) to extract multiscale condition embeddings $E_{c}$. These embeddings are then injected into the U-Net’s decoder blocks via addition.

Unlike the heavy dual-branch design in ControlNet, we propose a simpler feature-matching strategy to reduce computational overhead. Figure 2 illustrates the mechanism in detail.

For side injection, the Conditional Module generates a unified feature map $E_{c}$ with the same spatial resolution as the target image and channels matching the bottleneck width. To adapt $E_{c}$ to different decoder layers, we apply bilinear interpolation to downsample $E_{c}$ to match the spatial resolution of the current decoder block. Along the channel dimension, we employ a channel slicing strategy, taking the first $C_{i}$ channels of $E_{c}$ to match the channel dimension $C_{i}$ of the $i$-th decoder layer. This design significantly reduces the parameter count and computational overhead while maintaining strong structural guidance.
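The feature-matching step described above (bilinear downsampling followed by channel slicing and additive injection) might look like the following NumPy sketch. The channel widths, spatial sizes, and the `bilinear_resize` helper are hypothetical choices for illustration. The actual DMRN layer configuration is not given in this section.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Bilinear resize of a (C, H, W) array, using half-pixel centers."""
    _, in_h, in_w = img.shape
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[None, :, None]
    wx = (xs - x0)[None, None, :]
    top = img[:, y0][:, :, x0] * (1 - wx) + img[:, y0][:, :, x1] * wx
    bottom = img[:, y1][:, :, x0] * (1 - wx) + img[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bottom * wy

def adapt_condition(E_c, decoder_feat):
    """Keep the first C_i channels of E_c and resize them to the block's size."""
    C_i, h, w = decoder_feat.shape
    return bilinear_resize(E_c[:C_i], h, w)

def side_inject(decoder_feat, E_c):
    """Additive injection of the adapted condition into one decoder block."""
    return decoder_feat + adapt_condition(E_c, decoder_feat)

rng = np.random.default_rng(0)
E_c = rng.random((256, 32, 32))              # full resolution, bottleneck width
decoder_feats = [rng.random((256, 8, 8)),    # hypothetical decoder pyramid
                 rng.random((128, 16, 16)),
                 rng.random((64, 32, 32))]
outputs = [side_inject(f, E_c) for f in decoder_feats]
```

Slicing and interpolation add no learnable parameters, which is consistent with the stated goal of keeping the side branch lightweight.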

3.3.3. Time-Aware Dynamic Modulation

Since the comprehensive conditions $X_{up}$ and $Y$ are already provided at the network input, the side-injected features $E_{c}$ should provide flexible, adaptive guidance rather than merely replicating redundant information. A static injection strategy, however, treats all timesteps equally.

To address this, we propose a time-aware modulation mechanism using Feature-wise Linear Modulation (FiLM) [18]. Specifically, the concatenated conditions are first encoded into a feature map $F_{c}$, which is then modulated by the global timestep $t$:

$$\mathrm{FiLM}(F_{c}, t) = \gamma(e_{t}) \odot F_{c} + \beta(e_{t}), \tag{6}$$

where $e_{t}$ is the timestep embedding, and $\gamma(\cdot)$ and $\beta(\cdot)$ are learnable Multi-layer Perceptrons (MLPs) that generate the scale and shift parameters, respectively. The resulting features are then passed through a zero-convolution layer to produce the final conditional embedding $E_{c}$, ensuring training stability.
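A minimal NumPy sketch of Equation (6) followed by a zero-initialized 1x1 convolution is given below. The sinusoidal form of $e_{t}$, the one-hidden-layer MLPs, and all layer sizes are assumptions for illustration, since the section does not specify them. The class and function names are hypothetical.

```python
import numpy as np

def timestep_embedding(t, dim=16):
    """Sinusoidal embedding e_t (assumed form, not specified in the text)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])

class TimeAwareFiLM:
    """FiLM over condition features, then a zero-initialized 1x1 convolution."""

    def __init__(self, channels, emb_dim=16, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.emb_dim = emb_dim

        def make_mlp():
            # One hidden layer; the sizes are illustrative only.
            return (rng.normal(0.0, 0.1, (hidden, emb_dim)),
                    rng.normal(0.0, 0.1, (channels, hidden)))

        self.gamma_mlp = make_mlp()
        self.beta_mlp = make_mlp()
        self.zero_w = np.zeros((channels, channels))   # zero-conv weight
        self.zero_b = np.zeros(channels)               # zero-conv bias

    @staticmethod
    def _mlp(params, e_t):
        w1, w2 = params
        return w2 @ np.maximum(w1 @ e_t, 0.0)          # ReLU hidden layer

    def modulate(self, F_c, t):
        """Eq. (6): per-channel scale and shift driven by the timestep."""
        e_t = timestep_embedding(t, self.emb_dim)
        gamma = self._mlp(self.gamma_mlp, e_t)[:, None, None]
        beta = self._mlp(self.beta_mlp, e_t)[:, None, None]
        return gamma * F_c + beta

    def forward(self, F_c, t):
        """Modulated features passed through the zero convolution."""
        mod = self.modulate(F_c, t)
        out = np.einsum("oc,chw->ohw", self.zero_w, mod)
        return out + self.zero_b[:, None, None]

film = TimeAwareFiLM(channels=8)
F_c = np.random.default_rng(1).random((8, 4, 4))
early = film.modulate(F_c, t=1000)   # large t: early denoising stage
late = film.modulate(F_c, t=1)       # small t: late denoising stage
E_c = film.forward(F_c, t=500)       # all zeros at initialization
```

At initialization the zero convolution outputs exactly zero, so the side branch contributes nothing until training moves its weights away from zero.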

As visualized in Figure 3, $E_{c}$ evolves as the denoising steps progress from $t = 1000$ to $t = 1$, demonstrating the dynamic transition from coarse structures to fine-grained details.

Figure 2 depicts the structure of the module. The FiLM operation helps direct more representational capacity towards high-frequency features over time. At early stages (large $t$), the model focuses on low-frequency background structures. As the process continues (with decreasing $t$), the injected guidance progressively emphasizes high-frequency boundary details.

Following the modulation, the features pass through a projection layer and a zero-convolution layer (initialized with zeros) to yield the final conditional embedding $E_{c}$. The zero-initialization strategy ensures that the side branch does not disrupt training stability in the early stages.

3.4. Spatial–Spectral Disentangled Guidance (SSDG)

While DMRN ensures robust conditioning during the denoising process, it alone cannot adequately balance the spatial–spectral trade-off at the sampling stage, as the modality-specific disparities that lie in the concatenated inputs remain unaddressed. As noted by SCDM [39], such disparities often cause artifacts that are difficult to rectify after training, owing to the monolithic nature of standard DDPM.

Classifier-Free Guidance (CFG) [37] offers a training-free solution by randomly dropping conditions during training, enabling score extrapolation during sampling:

$$\hat{r}_{\theta}(x_{t}, c) = w \cdot r_{\theta}(x_{t}, c) + (1 - w) \cdot r_{\theta}(x_{t}, \varnothing), \tag{7}$$

where $r_{\theta}$ denotes the residual denoising network DMRN, $\varnothing$ represents the null condition, and $w$ is the guidance scale. According to Bayes’ rule, the difference between these predictions approximates the conditional score $\nabla_{x_{t}} \log p(c \mid x_{t})$:

$$\begin{aligned} \nabla_{x_{t}} \log p(c \mid x_{t}) &\propto \nabla_{x_{t}} \log p(x_{t} \mid c) - \nabla_{x_{t}} \log p(x_{t} \mid \varnothing) \\ &\propto r_{\theta}(x_{t}, c) - r_{\theta}(x_{t}, \varnothing). \end{aligned} \tag{8}$$

For HSI-SR, the condition $c = \{c_{spa}, c_{spe}\}$ comprises two physically independent components: spatial (HrMSI) and spectral (LrHSI) inputs, as defined by the observation model in Equations (1) and (2). Crucially, this physical independence implies that the joint posterior $p(c_{spa}, c_{spe} \mid x_{t})$ factorizes into modality-specific terms, enabling us to decompose the unified guidance signal into separate spatial and spectral components:

$$\begin{aligned} \hat{r}_{\theta}(x_{t}, c) = r_{\theta}(x_{t}, c) &+ s_{1} \cdot \Sigma_{\theta}(x_{t} \mid c) \cdot \nabla_{x_{t}} \log p(c_{spa} \mid x_{t}) \\ &+ s_{2} \cdot \Sigma_{\theta}(x_{t} \mid c) \cdot \nabla_{x_{t}} \log p(c_{spe} \mid x_{t}), \end{aligned} \tag{9}$$

where $s_{1}$ and $s_{2}$ denote the conditional guidance scales for $c_{spa}$ and $c_{spe}$, respectively, and $\Sigma_{\theta}(x_{t} \mid c)$ represents the variance.

Similar to Equation (8), these decomposed guidance terms correlate with the prediction dissimilarities:

$$\nabla_{x_{t}} \log p(c_{spe} \mid x_{t}) \propto r_{\theta}(x_{t}, c) - r_{\theta}(x_{t}, c_{spa}), \tag{10}$$

$$\nabla_{x_{t}} \log p(c_{spa} \mid x_{t}) \propto r_{\theta}(x_{t}, c) - r_{\theta}(x_{t}, c_{spe}). \tag{11}$$

To intuitively control the fusion process, we introduce the overall guidance strength $w$ and the spatial–spectral proportion ratio $s$:

$$w = 1 + s_{1} + s_{2}, \quad s = \frac{s_{2}}{s_{1} + s_{2}}, \tag{12}$$

where $w$ represents the guidance strength and $s$ sets the share of the spectral scale $s_{2}$ in the total guidance. Substituting Equations (10) and (11) into Equation (9), with the proportionality constants and the variance term absorbed into $s_{1}$ and $s_{2}$, simplifies Equation (9) into the following formula:

$$\begin{aligned} \hat{r}_{\theta}(x_{t}, c) = w \cdot r_{\theta}(x_{t}, c) + (1 - w) \cdot \big[ &s \cdot r_{\theta}(x_{t}, c_{spa}) \\ &+ (1 - s) \cdot r_{\theta}(x_{t}, c_{spe}) \big]. \end{aligned} \tag{13}$$

Crucially, since we typically set the guidance strength $w > 1$ to perform score extrapolation, the coefficient $(1 - w)$ becomes negative. According to Equation (13), increasing the parameter $s$ increases the magnitude of the subtracted spatial term $r_{\theta}(x_{t}, c_{spa})$.

Based on the Bayesian relation in Equation (10), this subtraction is mathematically equivalent to reinforcing the spectral guidance. Therefore, a larger $s$ indicates a stronger preference for spectral fidelity. In our formulation, $s \in [0, 1]$ serves as a regularizer to mediate modality conflicts.

Equation (13) is the overall guidance function of our proposed Spatial–Spectral Disentangled Guidance (SSDG), which can flexibly adjust both the strength and proportion of spatial and spectral guidance.
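The equivalence between the two parameterizations can be checked numerically. The sketch below implements Equation (13) alongside the form obtained by plugging Equations (10) and (11) into Equation (9), with the proportionality constants absorbed into $s_{1}$ and $s_{2}$ (an assumption needed for the identity to hold exactly). Random arrays stand in for the three network predictions, and the function names are illustrative.

```python
import numpy as np

def cfg_combine(r_full, r_null, w):
    """Standard classifier-free guidance, Eq. (7)."""
    return w * r_full + (1.0 - w) * r_null

def ssdg_combine(r_full, r_spa, r_spe, w, s):
    """Spatial-spectral disentangled guidance, Eq. (13)."""
    return w * r_full + (1.0 - w) * (s * r_spa + (1.0 - s) * r_spe)

def ssdg_from_scales(r_full, r_spa, r_spe, s1, s2):
    """Eq. (9) after substituting Eqs. (10)-(11), constants absorbed in s1, s2."""
    spatial_term = r_full - r_spe      # proportional to the spatial score
    spectral_term = r_full - r_spa     # proportional to the spectral score
    return r_full + s1 * spatial_term + s2 * spectral_term

rng = np.random.default_rng(0)
r_full, r_spa, r_spe = (rng.standard_normal((4, 8, 8)) for _ in range(3))

s1, s2 = 0.6, 0.4                                  # illustrative scales
w, s = 1.0 + s1 + s2, s2 / (s1 + s2)               # Eq. (12)
guided = ssdg_combine(r_full, r_spa, r_spe, w, s)
reference = ssdg_from_scales(r_full, r_spa, r_spe, s1, s2)
print(np.allclose(guided, reference))
```

Setting $w = 1$ recovers the plain conditional prediction, and when both single-modality predictions coincide the expression reduces to standard CFG with that prediction in place of the null branch.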

In the guidance function, $c$ denotes the concatenation of $Y$ and $X_{up}$, which represents the full condition. The disentangled conditions $c_{spa}$ and $c_{spe}$ are obtained via selective masking:

$$c_{spa} = \mathrm{Concat}(Y, \varnothing_{spe}), \tag{14}$$

$$c_{spe} = \mathrm{Concat}(\varnothing_{spa}, X_{up}), \tag{15}$$

where $\varnothing_{spa}$ and $\varnothing_{spe}$ denote zero tensors with the same shapes as $Y$ and $X_{up}$, respectively.
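The selective masking of Equations (14) and (15) amounts to zeroing one modality before concatenation. A minimal sketch with channel-first arrays follows. The band counts and patch size are hypothetical.

```python
import numpy as np

def build_conditions(Y, X_up):
    """Return (c, c_spa, c_spe) for channel-first arrays of equal spatial size."""
    c_full = np.concatenate([Y, X_up], axis=0)                 # full condition
    c_spa = np.concatenate([Y, np.zeros_like(X_up)], axis=0)   # Eq. (14)
    c_spe = np.concatenate([np.zeros_like(Y), X_up], axis=0)   # Eq. (15)
    return c_full, c_spa, c_spe

rng = np.random.default_rng(0)
Y = rng.random((4, 16, 16))        # hypothetical 4-band HrMSI patch
X_up = rng.random((32, 16, 16))    # hypothetical 32-band upsampled LrHSI patch
c_full, c_spa, c_spe = build_conditions(Y, X_up)
```

Under this reading, each guided sampling step evaluates the network three times (with $c$, $c_{spa}$, and $c_{spe}$), which is the cost of controlling the two modalities separately at inference.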

The selective multi-modal masking described above is exclusively performed during inference. As for the training phase formulated in Algorithm 1, we conduct dropout operations strictly following the standard CFG protocol.

No independent or per-modality masking is required during training. Instead, the disentanglement of SSDG is grounded in the Bayesian factorization of the joint posterior at a purely mathematical level, maintaining a completely training-free paradigm for disentangled guidance.

3.5. Training Objective and Efficient Sampling

This section details the training objectives incorporating frequency-domain constraints and the acceleration strategy employed for efficient inference. The complete training and sampling procedures are then summarized in Algorithms 1 and 2, respectively.

Algorithm 1: Training Algorithm

Algorithm 2: Sampling Algorithm

3.5.1. Frequency-Aware Optimization Objective

Given that hyperspectral datasets typically contain limited training samples for predicting a high-dimensional image cube, predicting the added noise $\epsilon$ can lead to unstable convergence. Therefore, the proposed method directly predicts the clean residual $R_{s}$ (referred to as $x_{0}$ in diffusion terminology), accelerating convergence and improving reconstruction performance on HSI-SR tasks.

Since the residual $R_{s}$ represents the difference between the target HrHSI $Z$ and the upsampled LrHSI $X_{up}$, it is inherently dominated by high-frequency components such as edges and textures, with only minor low-frequency discrepancies. While pixel-wise losses enforce spatial consistency, they often fail to capture the abundant high-frequency details present in $R_{s}$. We use the Mean Squared Error (MSE) loss as the spatial constraint, which is known for its stability and effectiveness:

$$L_{MSE} = \left\| R_{s} - \hat{R}_{s} \right\|_{2}^{2}, \tag{16}$$

where $\hat{R}_{s}$ is the predicted residual.

To explicitly address the high-frequency dominance of $R_{s}$, we introduce a constraint in the frequency domain. We employ the Discrete Wavelet Transform (DWT) [40], a well-established tool for remote sensing image processing, to decompose the residual $R_{s}$. DWT effectively reorganizes the information, separating the low-frequency approximation ($R_{LL}$) from the high-frequency details ($R_{LH}, R_{HL}, R_{HH}$), i.e., one low-frequency sub-band to three high-frequency sub-bands (a ratio of 1:3):

$$\begin{aligned} R_{freq} &= \mathrm{Concat}(\mathrm{DWT}(R_{s})) \\ &= \mathrm{Concat}(R_{LL}, R_{LH}, R_{HL}, R_{HH}), \end{aligned} \tag{17}$$

and this decomposition aligns well with the high-frequency-dominated characteristics of $R_{s}$.

Consequently, we impose a frequency consistency loss to penalize discrepancies in the wavelet domain:

$$L_{freq} = \left\| R_{freq} - \hat{R}_{freq} \right\|_{1}, \tag{18}$$

where $\hat{R}_{freq}$ denotes the DWT decomposition of the predicted residual $\hat{R}_{s}$. This term enforces consistency of high-frequency features between the generated image and the ground truth.

The final optimization objective of the proposed method is a weighted combination of the spatial and frequency loss terms:

$$L = L_{MSE} + \lambda \cdot L_{freq}, \tag{19}$$

where $\lambda$ controls the weight of the frequency constraint. In our experiments, $\lambda$ is empirically set to 0.1 across all three benchmark datasets to consistently encourage the recovery of sharp details.
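The combined objective of Equations (16)-(19) can be sketched with a single-level Haar transform written directly in NumPy. The choice of the Haar wavelet, the sub-band naming, and the use of means instead of raw norms (a normalization choice) are assumptions for this illustration. The text does not state which wavelet or reduction is used.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level orthonormal 2D Haar DWT of a (C, H, W) array (even H, W).

    Returns Concat(LL, LH, HL, HH) along the channel axis, as in Eq. (17).
    """
    a = x[:, 0::2, 0::2]
    b = x[:, 0::2, 1::2]
    c = x[:, 1::2, 0::2]
    d = x[:, 1::2, 1::2]
    ll = (a + b + c + d) / 2.0     # low-frequency approximation
    lh = (a + b - c - d) / 2.0     # detail across rows
    hl = (a - b + c - d) / 2.0     # detail across columns
    hh = (a - b - c + d) / 2.0     # diagonal detail
    return np.concatenate([ll, lh, hl, hh], axis=0)

def total_loss(R_s, R_hat, lam=0.1):
    """L = L_MSE + lambda * L_freq, using mean reductions (Eqs. (16)-(19))."""
    mse = np.mean((R_s - R_hat) ** 2)
    freq = np.mean(np.abs(haar_dwt2(R_s) - haar_dwt2(R_hat)))
    return float(mse + lam * freq)

rng = np.random.default_rng(0)
R_s = rng.standard_normal((3, 8, 8))                 # toy target residual
R_hat = R_s + 0.1 * rng.standard_normal((3, 8, 8))   # toy prediction
bands = haar_dwt2(R_s)
loss = total_loss(R_s, R_hat, lam=0.1)
```

Three of the four sub-bands carry detail coefficients, so the L1 term places most of its weight on edges and textures, matching the high-frequency character of the residual.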

3.5.2. Accelerated Inference via DDIM

The original DDPM sampler relies on a Markovian process for reverse generation, requiring up to a thousand steps to restore the target image ($T = 1000$ in our training setup), which severely impedes inference speed. To enhance the efficiency of our model, we adopt the Denoising Diffusion Implicit Models (DDIM) [41] sampling strategy. DDIM generalizes the reverse process into a non-Markovian deterministic setting, enabling high-quality sampling with significantly fewer timesteps. Without compromising performance, we set the number of sampling steps to $T = 25$ for all comparisons with state-of-the-art (SOTA) methods. The impact of different timesteps on our time-aware DMRN is analyzed in Section 5.1.

Unlike the stochastic sampling in DDPM, DDIM converts the backward process into an Ordinary Differential Equation (ODE) solver by controlling the variance parameter $\eta$. Given the residual $\hat{r}_{\theta}(x_{t}, c)$ from Equation (13), we first estimate the effective noise $\epsilon_{\theta}$:

$$\epsilon_{\theta}(x_{t}, c) = \frac{x_{t} - \sqrt{\bar{\alpha}_{t}}\, \hat{r}_{\theta}(x_{t}, c)}{\sqrt{1 - \bar{\alpha}_{t}}}, \tag{20}$$

where $\bar{\alpha}_{t} = \prod_{i=1}^{t} (1 - \beta_{i})$ is the cumulative product of the forward-process coefficients.

Subsequently, the denoised state $x_{t-1}$ at the previous timestep is derived by combining $\epsilon_{\theta}(x_{t}, c)$ and $\hat{r}_{\theta}(x_{t}, c)$:

$$\begin{aligned} x_{t-1} = {} & \sqrt{\bar{\alpha}_{t-1}}\, \hat{r}_{\theta}(x_{t}, c) \\ & + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_{t}^{2}} \cdot \epsilon_{\theta}(x_{t}, c) + \sigma_{t} \cdot \epsilon, \end{aligned} \tag{21}$$

where $\sigma_{t} = \eta \sqrt{(1 - \bar{\alpha}_{t-1}) / (1 - \bar{\alpha}_{t})} \sqrt{1 - \bar{\alpha}_{t} / \bar{\alpha}_{t-1}}$. Here, $\epsilon$ represents the stochastic noise component sampled from a Gaussian distribution. For the HSI-SR task, we set $\eta = 0$ (implying $\sigma_{t} = 0$) to achieve a fully deterministic reconstruction procedure, ensuring stability and consistency.
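Equations (20) and (21) translate almost line by line into code. The sketch below implements one DDIM update and a 25-step loop driven by an oracle predictor that always returns the true $x_{0}$, a stand-in for the guided network output. The uniform spacing of the sampling timesteps and the convention $\bar{\alpha}_{0} = 1$ at the final step are assumptions.

```python
import numpy as np

def ddim_step(x_t, r_hat, abar_t, abar_prev, eta=0.0, rng=None):
    """One reverse update: Eq. (20) for eps_theta, then Eq. (21)."""
    eps_theta = (x_t - np.sqrt(abar_t) * r_hat) / np.sqrt(1.0 - abar_t)
    sigma_t = (eta * np.sqrt((1.0 - abar_prev) / (1.0 - abar_t))
               * np.sqrt(1.0 - abar_t / abar_prev))
    noise = rng.standard_normal(x_t.shape) if eta > 0 else np.zeros_like(x_t)
    return (np.sqrt(abar_prev) * r_hat
            + np.sqrt(1.0 - abar_prev - sigma_t**2) * eps_theta
            + sigma_t * noise)

def ddim_sample(predict_x0, x_T, alpha_bar, num_steps=25):
    """Deterministic (eta = 0) sampling over an evenly spaced timestep subset."""
    T = len(alpha_bar)
    steps = np.linspace(T, 1, num_steps).round().astype(int)   # assumed spacing
    x = x_T
    for k, t in enumerate(steps):
        abar_t = alpha_bar[t - 1]
        # After the last step no noise remains, so alpha_bar is taken as 1.
        abar_prev = alpha_bar[steps[k + 1] - 1] if k + 1 < len(steps) else 1.0
        x = ddim_step(x, predict_x0(x, t), abar_t, abar_prev)
    return x

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
x0_true = rng.standard_normal((3, 8, 8))        # toy clean residual
x_T = rng.standard_normal((3, 8, 8))            # pure-noise starting point
x_rec = ddim_sample(lambda x, t: x0_true, x_T, alpha_bar, num_steps=25)
```

With a perfect predictor the loop returns the clean residual exactly, which shows that the update only redistributes the predicted signal and the inferred noise at each step.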

3.6. Implementation Details

All experiments were implemented in PyTorch (v2.7.0) with CUDA (v12.8) and trained on a single NVIDIA GeForce RTX 5090 GPU (NVIDIA Corporation, Santa Clara, CA, USA). The AdamW optimizer with a weight decay of $1 \times 10^{-4}$ was employed, and models were trained with a batch size of 4 until convergence. The initial learning rate was set to $1 \times 10^{-4}$ for the U-Net backbone, while the Conditional Module used a scaled rate of $2 \times 10^{-5}$ to ensure stable feature injection. A ReduceLROnPlateau scheduler was used to dynamically adjust the learning rate based on validation loss, with a decay factor of 0.5, a patience of 20 epochs, and a lower bound of $5 \times 10^{-6}$. Regarding the diffusion configuration, the default number of timesteps $T$ was set to 1000 in training and 25 in DDIM sampling, and $\beta_{t}$ followed a linear noise schedule ranging from $10^{-4}$ to $0.02$, multiplied by the scale factor $1000 / T$.
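One plausible reading of the scaled linear schedule is sketched below: both endpoints are multiplied by $1000 / T$ so that the total amount of injected noise stays roughly constant when the number of diffusion steps changes. How the factor is applied in the authors' code is not stated here, so the function is an assumption.

```python
import numpy as np

def scaled_linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule with endpoints scaled by 1000 / T (assumed reading)."""
    scale = 1000.0 / T
    return np.linspace(scale * beta_start, scale * beta_end, T)

def final_alpha_bar(betas):
    """Cumulative signal level remaining after the last forward step."""
    return float(np.prod(1.0 - betas))

for T in (1000, 500, 250):
    betas = scaled_linear_betas(T)
    print(T, betas[0], betas[-1], final_alpha_bar(betas))
```

With $T = 1000$ the scale factor is 1 and the schedule reduces to the plain $10^{-4}$ to $0.02$ range. For smaller $T$ the per-step variances grow while the terminal noise level stays in the same range.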

3.7. Datasets

Our method was evaluated on three public hyperspectral datasets: Chikusei [42], Houston 2018 (Houston) [43], and Kennedy Space Center (KSC) [44]. These datasets cover diverse scenes from urban to natural landscapes. For all experiments, the top 70% of the image area was used for training with randomly cropped

128 \times 128

patches, and the remaining 30% was used for testing with a cropping stride of 64 pixels.
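
The split-and-crop scheme can be illustrated as follows. The scene size and helper names are hypothetical, and the sketch assumes that the "top 70%" refers to image rows and that partial windows at the border are discarded; neither detail is specified above.

```python
import random


def split_rows(height, train_fraction=0.7):
    """Return (train_rows, test_rows) as half-open row ranges."""
    cut = round(height * train_fraction)
    return (0, cut), (cut, height)


def eval_patch_origins(row_range, width, patch=128, stride=64):
    """Top-left corners of sliding-window test patches."""
    r0, r1 = row_range
    return [(r, c)
            for r in range(r0, r1 - patch + 1, stride)
            for c in range(0, width - patch + 1, stride)]


def random_train_origin(row_range, width, patch=128, rng=random):
    """Top-left corner of one randomly cropped training patch."""
    r0, r1 = row_range
    return (rng.randint(r0, r1 - patch), rng.randint(0, width - patch))


train_rows, test_rows = split_rows(1000)      # hypothetical 1000 x 512 scene
origins = eval_patch_origins(test_rows, 512)
print(train_rows, test_rows, len(origins))
```

With a stride of 64 and a patch size of 128, neighbouring test patches overlap by half in each direction.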

Following Wald’s protocol, LrHSI and HrMSI pairs were generated with a spatial downsampling scale factor of 4. LrHSIs were produced by applying a

5 \times 5

Gaussian blurring kernel (

σ = 2

) followed by downsampling. HrMSIs were synthesized using sensor-specific spectral response functions (SRFs).
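
A single-band sketch of this degradation is given below. It assumes a normalized 5 × 5 Gaussian kernel, edge-replicating padding, and decimation that keeps every fourth pixel starting at index 0; the padding mode and sampling offset are not stated above and are chosen only for illustration. The spectral degradation is shown as a weighted sum over bands with a placeholder SRF matrix.

```python
import math


def gaussian_kernel(size=5, sigma=2.0):
    """Normalized size x size Gaussian kernel."""
    half = size // 2
    k = [[math.exp(-((i - half) ** 2 + (j - half) ** 2) / (2.0 * sigma ** 2))
          for j in range(size)] for i in range(size)]
    total = sum(map(sum, k))
    return [[v / total for v in row] for row in k]


def blur_and_downsample(band, kernel, scale=4):
    """Blur one band (list of rows) and keep every `scale`-th pixel."""
    h, w, half = len(band), len(band[0]), len(kernel) // 2

    def clamp(v, size):
        # Edge replication (assumed padding mode).
        return min(max(v, 0), size - 1)

    out = []
    for r in range(0, h, scale):
        row = []
        for c in range(0, w, scale):
            acc = 0.0
            for i, krow in enumerate(kernel):
                for j, kv in enumerate(krow):
                    acc += kv * band[clamp(r + i - half, h)][clamp(c + j - half, w)]
            row.append(acc)
        out.append(row)
    return out


def apply_srf(pixel_spectrum, srf):
    """Project one hyperspectral pixel onto multispectral bands; srf is n_ms x n_hs."""
    return [sum(w * v for w, v in zip(row, pixel_spectrum)) for row in srf]
```

A 128 × 128 patch processed this way yields a 32 × 32 low-resolution band, matching the scale factor of 4.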

3.8. Comparison Methods and Metrics

We compared the proposed method against seven state-of-the-art (SOTA) supervised HS–MS fusion methods, categorized as follows:

CNN-based: LAGConv [22].
Transformer-based: Fusformer [11], PSRT [24], and RAMoE [25].
Diffusion-based: Dif-PAN [35], S2CycleDiff [33], and ISPDiff [36].

Model performance was evaluated under both noise-free and blind Gaussian noisy conditions (

σ

= 0.01, approximately 23 dB SNR). Five standard metrics were used: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), Spectral Angle Mapper (SAM), Error Relative Global Adimensionnelle de Synthèse (ERGAS), and Root Mean Square Error (RMSE).
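
For orientation, the quoted SNR can be related to the noise level under the common definition of SNR as the ratio of the mean signal power P_s to the noise variance σ^2 (an assumption; the exact definition used is not stated here):

\mathrm{SNR} = 10 \log_{10} \frac{P_{s}}{σ^{2}} \quad \Rightarrow \quad P_{s} = σ^{2} \cdot 10^{\mathrm{SNR} / 10} = 10^{- 4} \cdot 10^{2.3} \approx 0.02

Under that definition, σ = 0.01 corresponds to about 23 dB when the mean squared pixel value of the normalized image is close to 0.02, i.e., an RMS value of about 0.14.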

All metrics were computed on results normalized to the range

[0, 1]

. To rigorously assess the stability and reliability of the models, each experiment was repeated five times with different random seeds.
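
For reference, four of the five metrics can be sketched with their commonly used definitions for data in [0, 1]. The exact variants used in the experiments (for example, whether PSNR is averaged per band) are not specified here, so the formulas below are assumptions and the function names are illustrative. SSIM is omitted for brevity. Images are represented as lists of per-pixel spectra.

```python
import math


def rmse(ref, est):
    """Root mean square error over all pixels and bands."""
    n = sum(len(p) for p in ref)
    se = sum((a - b) ** 2 for p, q in zip(ref, est) for a, b in zip(p, q))
    return math.sqrt(se / n)


def psnr(ref, est, peak=1.0):
    """Global PSNR in dB (undefined when ref and est are identical)."""
    return 20.0 * math.log10(peak / rmse(ref, est))


def sam_degrees(ref, est, eps=1e-12):
    """Mean spectral angle in degrees between corresponding pixel spectra."""
    angles = []
    for p, q in zip(ref, est):
        dot = sum(a * b for a, b in zip(p, q))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
        cos = max(-1.0, min(1.0, dot / (norm + eps)))
        angles.append(math.degrees(math.acos(cos)))
    return sum(angles) / len(angles)


def ergas(ref, est, scale=4):
    """ERGAS = (100 / scale) * sqrt(mean over bands of RMSE_k^2 / mean_k^2)."""
    n_pix, n_bands = len(ref), len(ref[0])
    total = 0.0
    for k in range(n_bands):
        mse_k = sum((p[k] - q[k]) ** 2 for p, q in zip(ref, est)) / n_pix
        mean_k = sum(p[k] for p in ref) / n_pix
        total += mse_k / (mean_k ** 2)
    return 100.0 / scale * math.sqrt(total / n_bands)
```

Higher PSNR is better, whereas lower SAM, ERGAS, and RMSE indicate better reconstruction.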

We report the mean of each metric alongside their 95% confidence intervals, which were computed via 1000 bootstrap resamplings over the test set patches. Specifically, for deterministic methods under noise-free conditions, the intervals exclusively reflect performance fluctuations across different spatial patches. In contrast, for deterministic methods under noisy conditions and for all diffusion-based models, the intervals capture both patch-level data variance and algorithmic stochasticity.
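
A percentile bootstrap over per-patch scores is one common way to obtain such intervals. The sketch below assumes that variant; the function name, seed handling, and example scores are hypothetical.

```python
import random


def bootstrap_ci(scores, n_boot=1000, level=0.95, seed=0):
    """Mean and percentile-bootstrap confidence interval of the mean."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample patches with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_boot))
    lo_idx = int((1.0 - level) / 2.0 * n_boot)
    hi_idx = int((1.0 + level) / 2.0 * n_boot) - 1
    return sum(scores) / n, means[lo_idx], means[hi_idx]


patch_psnr = [44.1, 45.3, 43.8, 46.0, 44.7, 45.1, 44.9, 43.5]  # made-up values
mean, lo, hi = bootstrap_ci(patch_psnr)
print(f"{mean:.2f} [{lo:.2f}, {hi:.2f}]")
```

For stochastic methods, per-patch scores from all repeated runs could be pooled before resampling, so that the interval reflects both sources of variation.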

4. Results

In this section, we first perform a holistic evaluation of cross-dataset performance, followed by a comprehensive quantitative and qualitative comparison between the proposed method and seven state-of-the-art (SOTA) methods.

For quantitative evaluation, we use PSNR and RMSE to assess pixel-wise accuracy, SSIM for structural similarity, SAM for spectral consistency, and ERGAS for overall performance.

Regarding the qualitative comparison, pseudo-color images of the fusion results and the average pixel error per band are displayed to demonstrate the visual fidelity of each method. Furthermore, to better evaluate the trade-off between computational cost and fusion performance, the number of parameters (Params), floating-point operations (FLOPs), peak Video Random Access Memory (VRAM) usage, and average running time per batch (batch size of 4) are reported alongside the evaluation metrics.

By conducting a targeted analysis tailored to the specific characteristics of each dataset, we systematically examine the strengths and limitations of each competing method. Our findings demonstrate that the proposed method represents the most robust approach, delivering consistently strong performance across all datasets, even in noisy scenarios.

4.1. Overall Generalization and Efficiency Comparison

Before detailing individual datasets, we provide a holistic assessment of the proposed method against the compared methods to demonstrate its generalization capability and robustness. As illustrated in Figure 4, the proposed method exhibits the most favorable overall performance profile, consistently ranking among the top methods across all conditions rather than trading off performance on certain datasets for gains on others.

For real-world deployment, inference speed and computational cost are also crucial aspects in addition to fusion quality. For the three hyperspectral datasets with band numbers ranging from 48 to 176, we comprehensively evaluate model efficiency.

Specifically, alongside the average PSNR, we report the Params, FLOPs, peak VRAM usage, and average run time required during the inference procedure under a batch size of 4 and patch size of

128 \times 128

.

As shown in Table 1, although deterministic methods exhibit lower Params and FLOPs than diffusion-based methods, their deterministic nature leads to limited robustness and a reduced ability to capture more realistic high-frequency details. In contrast, generative diffusion models intrinsically require more iterative mathematical operations to achieve superior generative priors and robustness.

This distinction highlights the advantages of deterministic fusion methods in real-time and lightweight applications, while Transformer-based methods and DDIM-accelerated diffusion models are better suited for near-real-time tasks with higher demands on fusion accuracy. Notably, owing to its cyclic two-stage structure, S2CycleDiff exhibits substantially larger model size and computational cost compared with standard diffusion models, making it less suitable for typical fusion tasks while potentially being more advantageous in low-signal scenarios (e.g., water bodies, shadows and dimly lit areas), as evidenced by its higher ranking in noisy scenarios than in clean ones.

Nevertheless, among generative approaches, the proposed method maintains highly competitive computational efficiency, with peak VRAM consumption being even lower than that of some deterministic methods, while achieving the highest average PSNR. Ultimately, the proposed method strikes a strong balance between fusion quality, noise robustness, and deployment feasibility. The cross-dataset advantages underscore the practical superiority and reliability of our approach for real-world deployment.

4.2. Quantitative Comparison

The Chikusei dataset contains 128 spectral bands with a ground sampling distance (GSD) of 2.5 m. The scenes primarily consist of agricultural areas with scattered urban and water components, making it a relatively mild dataset for fusion tasks.

As shown in Table 2, under the noise-free scenario, the proposed method, LAGConv, and RAMoE demonstrate superior performance. The proposed method achieves the best spectral consistency, outperforming the second-best LAGConv by a significant margin of 0.08 in SAM and 7.7% in RMSE. Meanwhile, it also retains competitive spatial fidelity, with the PSNR gap being less than 0.01 dB compared to the top-ranked RAMoE, while keeping ERGAS within 4.2% of the best-performing LAGConv.

Under the 1% noisy scenario, while deterministic methods suffer severe performance degradation, diffusion-based methods exhibit remarkable robustness. Although S2CycleDiff ranks first, it does so at a substantially higher computational cost; the proposed method and ISPDiff follow closely with lighter model architectures, indicating their superior robustness against noise distortion. Dif-PAN, though not ranking among the top three, also demonstrates competitive performance compared with deterministic methods.

The Houston dataset comprises 48 spectral bands capturing urban landscapes, where accurately restoring colorful rooftop information poses the primary challenge.

As shown in Table 3, competition in the noise-free scenario is intense. While the proposed method achieves the best PSNR and RMSE, Dif-PAN and LAGConv demonstrate their strengths in capturing coarse-grained structural information and fine-grained spectral details, respectively. In addition, RAMoE also performs well, ranking second in several metrics, but showing relatively poor spectral consistency.

The success of Dif-PAN and LAGConv, as well as the comparatively weaker spectral performance of RAMoE, may be attributed to the relatively regular structural patterns in urban scenes. Specifically, the architectural design of Dif-PAN, which emphasizes structural representation, and that of LAGConv, which focuses on local receptive fields, appear to be particularly effective in such scenarios. In contrast, the mixture-of-experts module in RAMoE may be less effective in discriminating structurally similar scene patterns.

This interpretation is further supported by the performance under the 1% noisy scenario, as the proposed method takes the lead with a significant margin, while the performance of Dif-PAN and LAGConv drops dramatically, and RAMoE still fails to capture spectral consistency. Additionally, the improved ranking of ISPDiff demonstrates the advantage of its theoretically grounded unfolding design in structurally homogeneous scenes.

The KSC dataset comprises 176 spectral bands capturing complex natural landscapes (e.g., marshes and rivers), where accurately restoring diverse spectral vectors constitutes the primary challenge.

As shown in Table 4, LAGConv excels at both spectral fidelity and spatial consistency, while Fusformer and Dif-PAN follow closely behind with only minor gaps. The proposed method, though ranking behind the former three, maintains competitive fusion quality.

As the quantitative evaluation employs bootstrap resampling over individual test patches, we found that diffusion-based methods depend strongly on informative, high-signal conditional guidance to generate high-quality results, and the presence of low-signal regions, such as river areas with weak recorded radiance, contributes to the collective performance degradation of generative methods.

This disadvantage becomes more pronounced under the 1% noisy scenario, and the methods that remain competitive exhibit distinct robustness mechanisms. RAMoE, benefiting from its Invariant-weight Expert module that preserves overall patch energy, achieves the best performance. Meanwhile, Fusformer and S2CycleDiff, with their computationally intensive architectures, perform well at noise suppression. The proposed method, Dif-PAN, and LAGConv also demonstrate strong robustness through the flexible SSDG strategy, DWT-enhanced structural modeling, and local adaptive convolution mechanisms, respectively.

4.3. Qualitative Comparison

Beyond quantitative evaluation, the visual quality of fusion results is equally important. Given their strong generative priors, diffusion-based methods are generally expected to produce perceptually superior results in qualitative comparison.

The visual comparison on the Chikusei dataset in Figure 5 illustrates the ability of each method to generate sharp local edges and fine textures. The noisy case additionally tests the ability to filter Gaussian noise in spectrally uniform regions.

According to the pseudo-color images and error maps, diffusion-based methods succeed in preserving sharp farmland boundaries and faithful textures, with the proposed method capturing the richest high-frequency details. RAMoE also performs well in the noise-free scenario with the lowest-intensity error map, although this comes at the cost of slightly over-smoothed boundaries and visually noticeable spectral distortion.

Under the 1% noisy scenario, the four diffusion-based methods demonstrate distinct yet consistently distributed error maps, with the proposed method and ISPDiff better at spectral fidelity, while S2CycleDiff achieves the best spatial consistency. Notably, the error map of PSRT exhibits clear spatially correlated residual patterns, possibly indicating limitations in its shuffle-and-reshuffle mechanism.

On the Houston dataset with repetitive urban buildings, the primary challenge is the restoration of colorful rooftops and complex shadowing and illumination textures. As shown in Figure 6, the proposed method consistently exhibits the lowest reconstruction error. In the noise-free case, while other competitors struggle to restore complex rooftop textures, the proposed method accurately recovers these high-frequency details.

Under noise distortion, the proposed method and RAMoE yield the cleanest error maps, and the proposed method also successfully preserves the rooftop textures with the best spectral fidelity. For Fusformer and S2CycleDiff, their computationally intensive architectures appear less effective in modeling subtle illumination variations and surface reflection details.

As shown in Figure 7 for the KSC dataset, the proposed method faithfully restores challenging high-frequency details such as submerged terrain boundaries, whereas LAGConv and RAMoE fail to preserve sharp boundaries despite achieving relatively low pixel-wise errors. Under noise distortion, it is noteworthy that S2CycleDiff produces visually cleaner error maps on restoration of water bodies. However, this improvement appears to be achieved through patch-wise radiometric lifting, which introduces an optimization-induced spectral bias. This is evidenced by the structured residual band in water–land transition regions, together with progressively larger reconstruction errors from low-signal water regions toward brighter land regions. In contrast, the proposed method maintains better spectral vector consistency, effectively preserving structural details while avoiding optimization-induced radiometric bias, albeit with a slight visual disadvantage in patch-wise comparison.

5. Discussion

5.1. Ablation Studies

To validate the effectiveness of the proposed components, ablation studies were conducted on the Time-Aware Conditional Diffusion Framework, the Spatial–Spectral Disentangled Guidance, and the Frequency-Aware Optimization. Additionally, the sensitivity of hyperparameters w and s, and the time-dependent behaviour of the modulated conditional output are discussed in this section. Due to space constraints, Section 5.1.1, Section 5.1.3 and Section 5.3 present the results on the representative Chikusei dataset. Comprising hundreds of test patches, the Chikusei dataset is sufficient to eliminate the influence of random factors and ensure statistical reliability.

5.1.1. Dynamic Modulated Residual Network

The Dynamic Modulated Residual Network (DMRN), which incorporates residual learning, side-branch injection into the decoder, and a FiLM-modulated conditional module, serves as the foundation of the denoising procedure of the proposed method. To validate the effectiveness of DMRN, experiments were conducted on the full model and three variants: (a) w/o Residual Learning, (b) w/o Side Injection, and (c) w/o Time FiLM. To better isolate and demonstrate the performance of these backbones, the SSDG sampling strategy was disabled, and the training objective was uniformly set to

L = L_{M S E} + λ \cdot L_{f r e q}

, with

λ = 0.1

.

As shown in Table 5, removing residual learning drastically degrades performance, confirming its necessity for HSI-SR. Introducing a simple side-injection branch yields marginal improvements over the baseline. Crucially, incorporating the time-aware FiLM operation provides a significant boost in PSNR and notable ERGAS reduction. This validates the effectiveness of dynamically adjusting conditional weights according to the specific denoising stage rather than relying on static feature injection.
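
The contrast between static injection and time-aware modulation can be made concrete with the FiLM operation itself, which applies a per-channel scale and shift. In the sketch below, only that affine form is taken from FiLM; the mapping from the timestep to the scale and shift (`toy_time_params`) is a hypothetical placeholder for a learned network and does not reproduce the DMRN design.

```python
import math


def film(features, gamma, beta):
    """FiLM: features[c] is a flat list of activations for channel c."""
    return [[g * x + b for x in chan]
            for chan, g, b in zip(features, gamma, beta)]


def toy_time_params(t, T=1000, n_channels=2):
    """Hypothetical timestep-dependent (gamma, beta); stands in for a learned MLP."""
    phase = math.pi * t / T
    gamma = [1.0 + 0.5 * math.cos(phase + c) for c in range(n_channels)]
    beta = [0.1 * math.sin(phase + c) for c in range(n_channels)]
    return gamma, beta


cond = [[1.0, 2.0], [3.0, 4.0]]                   # two channels of conditional features

static = film(cond, [1.0, 1.0], [0.0, 0.0])       # static injection: identical at every t
early = film(cond, *toy_time_params(t=900))       # noisy stage
late = film(cond, *toy_time_params(t=50))         # near-clean stage
```

In the static case the conditional features pass through unchanged at every timestep, whereas the modulated features differ between early and late denoising stages.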

Although the results above use 25-step DDIM sampling, our time-aware condition module also supports larger numbers of sampling steps, as it is trained with

T = 1000

. Several experiments were conducted with different sampling steps to identify

T = 25

as the optimal trade-off between performance and computational cost. Experimental results and analysis are detailed in Section 5.3, which investigates the contributions of dynamic modulation at different timesteps, demonstrating the necessity of iterative diffusion sampling.

5.1.2. Spatial–Spectral Disentangled Guidance

The proposed Spatial–Spectral Disentangled Guidance (SSDG), though conceptually simple, yields comprehensive improvements over standard CFG by providing a flexible mechanism to navigate performance trade-offs.

To evaluate this, we selected the Houston dataset, which is characterized by greater spatial complexity. The CFG mechanism inherently introduces a trade-off between generative quality and quantitative metrics. As shown in Table 6, standard CFG improves the comprehensive ERGAS score but at the cost of degrading PSNR and SAM. In contrast, our SSDG, when configured for ‘Best ERGAS’, achieves the same ERGAS improvement with significantly less degradation.

This is because SSDG offers explicit, training-free control over the spatial–spectral balance. The results demonstrate that it can be configured to selectively prioritize pixel-wise accuracy (Best PSNR), spectral fidelity (Best SAM), or overall performance (Best ERGAS). This adaptability not only yields a better overall trade-off but also confers robustness for datasets with spatial–spectral imbalances. Notably, standard CFG required a more aggressive guidance scale w to reach its optimum, indicating that SSDG provides higher guidance quality, a phenomenon further explored in Section 5.2.

5.1.3. Impact of Optimization Strategy

Furthermore, to evaluate the effectiveness of the proposed DWT frequency constraint, an ablation study was conducted comparing the DWT constraint with a three-level Laplacian Pyramid (LP) decomposition constraint, alongside a pure MSE loss baseline (i.e.,

λ

= 0).

As shown in Table 7, the incorporation of frequency-domain constraints consistently enhances fusion performance. Specifically, the LP variant yields a moderate PSNR improvement of 0.4640 dB over the pure MSE baseline. The proposed DWT strategy, however, achieves the most substantial gains across all metrics, with a PSNR improvement of 1.4737 dB, more than three times that of the LP variant. Similar superiority is observed across spectral fidelity and spatial error metrics, with the proposed DWT strategy yielding the best SAM of 2.3028 and the lowest RMSE of 0.0048.

While the LP variant improves upon the baseline, it falls short of the proposed DWT strategy. This performance discrepancy validates that DWT provides an orthogonal and non-redundant decomposition, which is highly compatible with the residual learning objective. In contrast, the overcomplete representation of LP may hinder precise high-frequency residual learning in diffusion models.
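
The "orthogonal and non-redundant" property can be checked numerically on a toy example. The sketch assumes a single-level 2D Haar wavelet purely for illustration (the wavelet family used for the DWT constraint is not restated in this section), and the function names are hypothetical.

```python
def haar_dwt2(img):
    """Single-level orthonormal 2D Haar transform of an even-sized image."""
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, len(img), 2):
        row_ll, row_lh, row_hl, row_hh = [], [], [], []
        for c in range(0, len(img[0]), 2):
            a, b = img[r][c], img[r][c + 1]
            d, e = img[r + 1][c], img[r + 1][c + 1]
            row_ll.append((a + b + d + e) / 2.0)   # low-pass average
            row_lh.append((a - b + d - e) / 2.0)   # left-right difference
            row_hl.append((a + b - d - e) / 2.0)   # top-bottom difference
            row_hh.append((a - b - d + e) / 2.0)   # diagonal difference
        ll.append(row_ll)
        lh.append(row_lh)
        hl.append(row_hl)
        hh.append(row_hh)
    return ll, lh, hl, hh


def energy(band):
    """Sum of squared coefficients."""
    return sum(v * v for row in band for v in row)


img = [[0.1, 0.4, 0.3, 0.9],
       [0.2, 0.5, 0.7, 0.8],
       [0.6, 0.1, 0.0, 0.2],
       [0.3, 0.3, 0.5, 0.4]]
subbands = haar_dwt2(img)

n_coeffs = sum(len(row) for band in subbands for row in band)
print(n_coeffs, energy(img), sum(energy(b) for b in subbands))
```

The coefficient count equals the pixel count and the total energy is preserved, whereas a Laplacian pyramid stores more coefficients than pixels.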

5.2. Analysis of Hyperparameters

Classifier-Free Guidance (CFG) provides the diffusion model with an external mechanism to adjust conditional strength, encouraging the generation of more realistic and texturally rich outputs. However, an inappropriate guidance scale can lead to severe spectral distortion. In this section, we analyze the sensitivity of fusion results to the hyperparameters w and s, and quantitatively explore the relationship between evaluation metrics and these parameters.

Through a fine-grained grid search over the performance space, we identify specific trajectories along which the metrics remain optimal. According to Equation (12), the optimal guidance scales

s_{1}

and

s_{2}

can be expressed in terms of w and s as follows:

\begin{matrix} s_{1} & = (1 - s) \cdot (w - 1), \end{matrix}

(22)

\begin{matrix} s_{2} & = s \cdot (w - 1) . \end{matrix}

(23)

Consequently, the relationships between s and w for a fixed optimal guidance scale

s_{1}

or

s_{2}

are

\begin{matrix} s & = 1 - \frac{s_{1}}{w - 1}, \end{matrix}

(24)

\begin{matrix} s & = \frac{s_{2}}{w - 1} . \end{matrix}

(25)
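
Rearranging Equations (22)–(25) makes the shape of these relationships explicit; this is a direct algebraic restatement rather than an additional result:

\begin{matrix} (w - 1) (s - 1) & = - s_{1}, \\ (w - 1) s & = s_{2}, \\ s_{1} + s_{2} & = w - 1 . \end{matrix}

For a fixed s_1, the first relation describes a rectangular hyperbola in the (w, s) plane with asymptotes w = 1 and s = 1; for a fixed s_2, the second has asymptotes w = 1 and s = 0.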

To investigate the relationship between the guidance strength w and the spatial–spectral proportion s, we conducted a fine-grained grid search. As illustrated in Figure 8 and Figure 9, metrics such as PSNR, SAM, and ERGAS exhibit distinct optimal trajectories. Specifically, the optimal coordinates

(w, s)

fit rectangular hyperbolic curves centered around

(1, 1)

, yielding the theoretical relationships expressed in Equations (24) and (25). This confirms that explicit spatial–spectral disentanglement effectively manages modality conflicts, with negative

s_{1}

values prioritizing spectral purity in irregular landscapes and positive values sharpening spatial structures in urban scenes.

The cooperative optimization process leads to a consistent negative sign for their fitting curves, further confirming the necessity of spatial–spectral disentanglement. By revealing the underlying properties of the trained DMRN, the relationship between optimal w and s is simplified for determining hyperparameters: with the maximum magnitude

| s_{1} |

, one can trace through Equation (24) to find the optimal s and w, where

w > 1

yields better visual quality and

w < 1

ensures more conservative consistency.
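
The tracing procedure can be sketched as follows: fix s_1, sweep w, and read s from Equation (24). The value of s_1 and the sweep grid are placeholders, not tuned settings from the experiments, and the function names are hypothetical.

```python
def guidance_scales(w, s):
    """Equations (22)-(23): map (w, s) to (s1, s2)."""
    return (1.0 - s) * (w - 1.0), s * (w - 1.0)


def s_for_fixed_s1(w, s1):
    """Equation (24); undefined at w = 1."""
    if w == 1.0:
        raise ValueError("Equation (24) is singular at w = 1")
    return 1.0 - s1 / (w - 1.0)


s1_fixed = -0.5                                   # placeholder value
for w in (0.5, 1.5, 2.0, 3.0):
    s = s_for_fixed_s1(w, s1_fixed)
    s1, s2 = guidance_scales(w, s)
    print(f"w={w:.1f}  s={s:.3f}  s1={s1:.3f}  s2={s2:.3f}")
```

Every (w, s) pair produced this way maps back to the same s_1 through Equation (22), so the sweep only changes s_2 = w - 1 - s_1.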

5.3. Analysis of Time-Aware Condition Module

To further validate the stability and efficiency of the proposed framework, we analyzed the impact of the time-aware module across different sampling steps.

As shown in Table 8, the full model with the time-aware FiLM module consistently outperforms the simple condition module across all sampling step counts (T). Crucially, instead of exhibiting severe degradation at extremely low step counts, which is a common issue in standard diffusion models, the proposed framework achieves remarkably stable metrics from T = 5 to T = 1000, confirming the efficiency of the DMRN architecture and its Time-Aware Condition Module. While the numerical gain of incorporating the timestep index beyond T = 25 remains marginal, T = 25 was selected as the default configuration to align with standard DDIM practices, providing a reliable balance between visual fidelity and low inference latency.
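
One common way to run a model trained with 1000 timesteps at a smaller step count is to sample along an evenly spaced subsequence of the training timesteps. The sketch below assumes that spacing rule, which is not specified above; the function name is hypothetical.

```python
def sampling_timesteps(n_steps, T_train=1000):
    """Evenly spaced training-timestep indices, in descending (sampling) order."""
    if not 1 <= n_steps <= T_train:
        raise ValueError("n_steps must lie in [1, T_train]")
    ascending = [i * T_train // n_steps for i in range(n_steps)]
    return ascending[::-1]


for n in (5, 25, 1000):
    steps = sampling_timesteps(n)
    print(n, steps[:3], "...", steps[-1])
```

Because the time-aware module receives the actual timestep index, the same trained weights can be queried along any such subsequence without retraining.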

6. Conclusions

In this paper, we propose a novel modulated diffusion framework tailored for robust hyperspectral image super-resolution. To address the key limitation that existing diffusion-based methods rely on static conditional injection, leading to constraints that are either overly rigid or insufficiently tight, we design the Dynamic Modulated Residual Network (DMRN) that leverages time-aware FiLM to adaptively regulate conditional strength throughout the denoising process. To better align with the residual learning objective, a DWT-based decomposition is incorporated to impose explicit constraints on high-frequency restoration. Furthermore, inspired by CFG, we introduce the Spatial–Spectral Disentangled Guidance (SSDG) to explicitly manage the modality conflict between HrMSI and LrHSI in a training-free manner, providing flexibility to navigate the trade-off between spatial consistency and spectral fidelity.

Extensive experiments conducted on three public datasets demonstrate the competitive fusion quality and strong blind-noise robustness of the proposed method against state-of-the-art approaches, while maintaining a moderate computational cost. Beyond quantitative gains, our analysis further suggests that adaptive conditional modulation plays a critical role in balancing generative priors and modality consistency, particularly under noisy or low-signal conditions where static guidance often becomes unreliable. Although the iterative sampling process renders it slower than CNN-based methods, its combination of noise robustness, adaptive modality disentanglement, and controllable guidance highlights its practical potential for near-real-time remote sensing applications.

Future work will focus on accelerating the sampling process and developing scene-adaptive estimation algorithms for the constant spatial guidance scale

s_{1}

, leveraging the analytically derived guidance coupling law to enable self-tuning SSDG across diverse sensing scenarios.

Author Contributions

Conceptualization, X.X., J.Q., J.Z. and L.F.; Methodology, X.X., J.Q., J.Z., K.Y. and L.F.; Software, X.X. and J.Z.; Validation, X.X.; Formal analysis, X.X., J.Z. and K.Y.; Investigation, X.X.; Resources, J.Q. and L.F.; Data curation, X.X. and K.Y.; Writing—original draft, X.X.; Writing—review and editing, J.Q.; Visualization, X.X.; Supervision, J.Q. and L.F.; Project administration, J.Q. and L.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Publicly available datasets were analyzed in this study. The Chikusei dataset is available at https://naotoyokoya.com/Download.html (accessed on 1 May 2026). The Houston 2018 dataset is available at the 2018 IEEE GRSS Data Fusion Challenge website (accessed on 1 May 2026). The Kennedy Space Center (KSC) dataset is available at https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes (accessed on 1 May 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Goetz, A.F.H.; Vane, G.; Solomon, J.E.; Rock, B.N. Imaging Spectrometry for Earth Remote Sensing. Science 1985, 228, 1147–1153.
2. Asner, G. Hyperspectral Remote Sensing of Canopy Chemistry, Physiology, and Biodiversity in Tropical Rainforests. In Hyperspectral Remote Sensing of Tropical and Sub-Tropical Forests; CRC Press: Boca Raton, FL, USA, 2008; pp. 261–296.
3. Thenkabail, P.S.; Smith, R.B.; De Pauw, E. Hyperspectral Vegetation Indices and Their Relationships with Agricultural Crop Characteristics. Remote Sens. Environ. 2000, 71, 158–182.
4. Wei, Q.; Bioucas-Dias, J.; Dobigeon, N.; Tourneret, J.Y. Hyperspectral and Multispectral Image Fusion Based on a Sparse Representation. IEEE Trans. Geosci. Remote Sens. 2015, 53, 3658–3668.
5. Akhtar, N.; Shafait, F.; Mian, A. Sparse Spatio-spectral Representation for Hyperspectral Image Super-resolution. In Proceedings of the Computer Vision—ECCV 2014; Springer: Cham, Switzerland, 2014; pp. 63–78.
6. Yokoya, N.; Yairi, T.; Iwasaki, A. Coupled Nonnegative Matrix Factorization Unmixing for Hyperspectral and Multispectral Data Fusion. IEEE Trans. Geosci. Remote Sens. 2012, 50, 528–537.
7. Zhu, Z.; Hou, J.; Chen, J.; Zeng, H.; Zhou, J. Hyperspectral Image Super-Resolution via Deep Progressive Zero-Centric Residual Learning. IEEE Trans. Image Process. 2021, 30, 1423–1438.
8. Liu, X.; Liu, Q.; Wang, Y. Remote sensing image fusion based on two-stream fusion network. Inf. Fusion 2020, 55, 1–15.
9. Zhang, X.; Huang, W.; Wang, Q.; Li, X. SSR-NET: Spatial–Spectral Reconstruction Network for Hyperspectral and Multispectral Image Fusion. IEEE Trans. Geosci. Remote Sens. 2021, 59, 5953–5965.
10. Gao, Y.; Zhang, M.; Wang, J.; Li, W. Cross-Scale Mixing Attention for Multisource Remote Sensing Data Fusion and Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15.
11. Hu, J.F.; Huang, T.Z.; Deng, L.J.; Dou, H.X.; Hong, D.; Vivone, G. Fusformer: A Transformer-Based Fusion Network for Hyperspectral Image Super-Resolution. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5.
12. Chen, L.; Vivone, G.; Qin, J.; Chanussot, J.; Yang, X. Spectral–Spatial Transformer for Hyperspectral Image Sharpening. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 16733–16747.
13. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 6840–6851.
14. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-Based Generative Modeling through Stochastic Differential Equations. In Proceedings of the International Conference on Learning Representations, Virtual Event, Austria, 3–7 May 2021.
15. Saharia, C.; Ho, J.; Chan, W.; Salimans, T.; Fleet, D.J.; Norouzi, M. Image Super-Resolution Via Iterative Refinement. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4713–4726.
16. Cao, Z.; Cao, S.; Wu, X.; Hou, J.; Ran, R.; Deng, L.J. DDRF: Denoising Diffusion Model for Remote Sensing Image Fusion. arXiv 2023, arXiv:2304.04774.
17. Wu, C.; Wang, D.; Bai, Y.; Mao, H.; Li, Y.; Shen, Q. HSR-Diff: Hyperspectral Image Super-Resolution via Conditional Diffusion Models. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023; pp. 7060–7070.
18. Perez, E.; Strub, F.; De Vries, H.; Dumoulin, V.; Courville, A. FiLM: Visual Reasoning with a General Conditioning Layer. Proc. AAAI Conf. Artif. Intell. 2018, 32, 3942–3951.
19. Zhang, L.; Rao, A.; Agrawala, M. Adding Conditional Control to Text-to-Image Diffusion Models. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 3813–3824.
20. Cao, M.; Bao, W.; Qu, K.; Zhang, X.; Ma, X. Nonlocal Low-Rank Regularization for Hyperspectral and High-Resolution Remote Sensing Image Fusion. In Proceedings of the IGARSS 2022—2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 1444–1447.
21. Xue, J.; Zhao, Y.Q.; Bu, Y.; Liao, W.; Chan, J.C.W.; Philips, W. Spatial-Spectral Structured Sparse Low-Rank Representation for Hyperspectral Image Super-Resolution. IEEE Trans. Image Process. 2021, 30, 3084–3097.
22. Jin, Z.R.; Zhang, T.J.; Jiang, T.X.; Vivone, G.; Deng, L.J. LAGConv: Local-Context Adaptive Convolution Kernels with Global Harmonic Bias for Pansharpening. Proc. AAAI Conf. Artif. Intell. 2022, 36, 1113–1121.
23. Hu, J.F.; Huang, T.Z.; Deng, L.J.; Jiang, T.X.; Vivone, G.; Chanussot, J. Hyperspectral Image Super-Resolution via Deep Spatiospectral Attention Convolutional Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 7251–7265.
24. Deng, S.Q.; Deng, L.J.; Wu, X.; Ran, R.; Hong, D.; Vivone, G. PSRT: Pyramid Shuffle-and-Reshuffle Transformer for Multispectral and Hyperspectral Image Fusion. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5503715.
25. Xiao, N.; Fu, X.; Ren, Q.; He, W.; Wei, S.; Jia, S. Region-Aware MoE Network for Hyperspectral and Multispectral Image Fusion. IEEE Trans. Geosci. Remote Sens. 2026, 64, 5510015.
26. Xiao, J.; Li, J.; Yuan, Q.; Jiang, M.; Zhang, L. Physics-Based GAN with Iterative Refinement Unit for Hyperspectral and Multispectral Image Fusion. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 6827–6841.
27. Zhu, C.; Deng, S.; Zhou, Y.; Deng, L.J.; Wu, Q. QIS-GAN: A Lightweight Adversarial Network with Quadtree Implicit Sampling for Multispectral and Hyperspectral Image Fusion. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5531115.
28. Liu, J.; Wu, Z.; Xiao, L. A Spectral Diffusion Prior for Unsupervised Hyperspectral Image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5528613.
29. Zhao, Z.; Bai, H.; Zhu, Y.; Zhang, J.; Xu, S.; Zhang, Y.; Zhang, K.; Meng, D.; Timofte, R.; Van Gool, L. DDFM: Denoising Diffusion Model for Multi-Modality Image Fusion. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 8048–8059.
30. Zhu, J.; Wang, H.; Xu, Y.; Wu, Z.; Wei, Z. Self-Learning Hyperspectral and Multispectral Image Fusion via Adaptive Residual Guided Subspace Diffusion Model. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 17862–17871.
31. Shi, Y.; Liu, Y.; Cheng, J.; Wang, Z.J.; Chen, X. VDMUFusion: A Versatile Diffusion Model-Based Unsupervised Framework for Image Fusion. IEEE Trans. Image Process. 2025, 34, 441–454.
32. Zeng, Y.; Huang, W.; Liu, M.; Zhang, H.; Zou, B. Fusion of satellite images in urban area: Assessing the quality of resulting images. In Proceedings of the 2010 18th International Conference on Geoinformatics, Beijing, China, 18–20 June 2010; pp. 1–4.
33. Qu, J.; He, J.; Dong, W.; Zhao, J. S2CycleDiff: Spatial-Spectral-Bilateral Cycle-Diffusion Framework for Hyperspectral Image Super-resolution. Proc. AAAI Conf. Artif. Intell. 2024, 38, 4623–4631.
34. Phung, H.; Dao, Q.; Tran, A. Wavelet Diffusion Models are fast and scalable Image Generators. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 10199–10208.
35. Cao, Z.; Cao, S.; Deng, L.J.; Wu, X.; Hou, J.; Vivone, G. Diffusion model with disentangled modulations for sharpening multispectral and hyperspectral images. Inf. Fusion 2024, 104, 102158.
36. Dong, W.; Liu, S.; Xiao, S.; Qu, J.; Li, Y. ISPDiff: Interpretable Scale-Propelled Diffusion Model for Hyperspectral Image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5519614.
37. Ho, J.; Salimans, T. Classifier-Free Diffusion Guidance. arXiv 2022, arXiv:2207.12598.
38. Zhou, W.; Wang, W.; Bao, J.; Chen, D.; Chen, D.; Yuan, L.; Li, H. Semantic Image Synthesis via Diffusion Models. arXiv 2022, arXiv:2207.00050.
39. Chen, B.; Liu, L.; Liu, C.; Zou, Z.; Shi, Z. Spectral-Cascaded Diffusion Model for Remote Sensing Image Spectral Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5528414.
40. Sturm, B.L. Stéphane Mallat: A Wavelet Tour of Signal Processing, 2nd Edition. Comput. Music J. 2007, 31, 83–85.
41. Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. In Proceedings of the International Conference on Learning Representations, Virtual Event, Austria, 3–7 May 2021.
42. Yokoya, N.; Iwasaki, A. Airborne Hyperspectral Data over Chikusei; Technical Report SAL-2016-05-27; Space Application Laboratory, University of Tokyo: Tokyo, Japan, 2016.
43. Le Saux, B.; Yokoya, N.; Hansch, R.; Prasad, S. 2018 IEEE GRSS Data Fusion Contest: Multimodal Land Use Classification [Technical Committees]. IEEE Geosci. Remote Sens. Mag. 2018, 6, 52–54.
44. Ham, J.; Chen, Y.; Crawford, M.; Ghosh, J. Investigation of the random forest framework for classification of hyperspectral data. IEEE Trans. Geosci. Remote Sens. 2005, 43, 492–501.

Figure 1. Conceptual illustration of the fusion-based HSI-SR task. The high-resolution multispectral image (HrMSI) provides spatial details, while the upsampled low-resolution hyperspectral image (LrHSI) offers spectral information. The fusion model integrates these complementary sources to predict the final high-resolution HSI (HrHSI).

Figure 2. Detailed architecture of the proposed method, comprising the Dynamic Modulated Residual Network (DMRN) and the Condition Module. The time-aware Condition Module (bottom left) uses FiLM to modulate features according to the timestep t. These adaptive features are then injected into the U-Net decoder through a side-injection mechanism, which applies channel slicing and bilinear interpolation before adding the conditions to the decoder features. In the diagram, color coding represents functional stages: orange for the encoder and downsampling blocks, green for the decoder and upsampling blocks, and yellow for the middle block. Deep cyan indicates the condition-related components and their data flows, while blue and white denote standard and zero-initialized convolutional layers, respectively. In the channel slicing strategy, colored arrows correspond to the specific channel groups being processed.
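The timestep-dependent FiLM modulation and the additive, channel-sliced injection named in this caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the sinusoidal embedding, the random projection matrices `w_gamma` and `w_beta`, and all function names are assumptions made for illustration, and the bilinear resizing step is omitted by keeping condition and decoder features at the same spatial size.

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal embedding of an integer timestep (a common choice, assumed here)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def film(features, t_emb, w_gamma, w_beta):
    """FiLM: per-channel scale and shift predicted from the timestep embedding.

    features: (C, H, W); t_emb: (D,); w_gamma, w_beta: (C, D) hypothetical projections.
    """
    gamma = w_gamma @ t_emb  # (C,) per-channel scale offset
    beta = w_beta @ t_emb    # (C,) per-channel shift
    return (1.0 + gamma)[:, None, None] * features + beta[:, None, None]

def side_inject(decoder_feat, cond_feat, start, stop):
    """Additively inject one channel slice of the condition into decoder features."""
    return decoder_feat + cond_feat[start:stop]

rng = np.random.default_rng(0)
C, H, W, D = 8, 4, 4, 16
cond = rng.standard_normal((C, H, W))
w_gamma = 0.1 * rng.standard_normal((C, D))
w_beta = 0.1 * rng.standard_normal((C, D))

# The same condition features are modulated differently at different timesteps.
cond_early = film(cond, timestep_embedding(900, D), w_gamma, w_beta)
cond_late = film(cond, timestep_embedding(10, D), w_gamma, w_beta)

# Inject only the first group of four channels into a 4-channel decoder stage.
decoder = np.zeros((4, H, W))
out = side_inject(decoder, cond_late, 0, 4)
```

With zero projection matrices the modulation reduces to the identity, which corresponds to a static, time-independent condition.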

Figure 3. Principal Component Analysis (PCA) visualization of the feature maps generated by our Time-Aware Dynamic Modulation at different timesteps. The R, G, and B channels correspond to the first, second, and third principal components of the features, respectively. Similar colors indicate similar feature representations, while color shifts across different timesteps visualize the dynamic evolution of the features during the denoising process.
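A visualization of this kind can be produced by projecting each pixel's feature vector onto the first three principal components and mapping them to R, G, and B. The sketch below is one plausible way to do this, assuming a feature map of shape (C, H, W) and per-component min-max rescaling; the function name `pca_to_rgb` and the random input are illustrative only.

```python
import numpy as np

def pca_to_rgb(features):
    """Project a (C, H, W) feature map onto its first three principal components
    and rescale each component to [0, 1] so it can be displayed as RGB."""
    c, h, w = features.shape
    x = features.reshape(c, h * w).T        # (H*W, C): one row per pixel
    x = x - x.mean(axis=0, keepdims=True)   # centre each channel
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    pcs = x @ vt[:3].T                      # (H*W, 3)
    lo = pcs.min(axis=0, keepdims=True)
    hi = pcs.max(axis=0, keepdims=True)
    rgb = (pcs - lo) / np.maximum(hi - lo, 1e-12)
    return rgb.reshape(h, w, 3)

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 8, 8))      # stand-in for a real feature map
rgb = pca_to_rgb(feat)
```

For colors to be comparable across timesteps, the principal directions would need to be fitted once and reused for every timestep; whether the figure does this is not stated in the caption.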

Figure 4. Radar chart comparison of PSNR performance across various datasets and their 1% noisy cases. Each axis represents a specific scenario. To ensure clarity, comparison methods in the legend are grouped by their architectural categories (CNN, Transformer, and Diffusion-based models) and represented by distinct colors. While the performance of other methods contracts significantly under specific cases, the proposed method (red line) maintains the largest and most balanced coverage, highlighting its stability and ability to perform well in challenging environments.

Figure 5. Visual comparison on the Chikusei dataset under noise-free (top 2 rows) and 1% noisy (bottom 2 rows) conditions. Our proposed method in the last column consistently generates sharper boundaries and more faithful textures compared to state-of-the-art methods. Notably, as highlighted in the zoomed-in regions, the proposed method excels at preserving farmland boundaries.

Figure 6. Visual comparison on the Houston dataset under noise-free (top 2 rows) and 1% noisy (bottom 2 rows) conditions. Our proposed method in the last column consistently generates sharper boundaries and more faithful textures compared to state-of-the-art methods. Notably, as highlighted in the zoomed-in regions, our method excels at preserving rooftop colors.

Figure 7. Visual comparison on the KSC dataset under noise-free (top 2 rows) and 1% noisy (bottom 2 rows) conditions. Our proposed method in the last column consistently generates sharper boundaries and more faithful textures compared to state-of-the-art methods. Notably, as highlighted in the zoomed-in regions, our method excels at preserving river textures and marsh boundaries.

Figure 8. Three-dimensional scatter plots showing the sensitivity of five evaluation metrics to the hyperparameters w and s, together with their projections onto the two vertical planes. The left column shows PSNR, ERGAS, and SAM from top to bottom; the right column shows SSIM and RMSE. For visual clarity, only the median of the optimal points is marked, as a red dot.

Figure 9. Fitted hyperbolic curves for the experimentally determined optimal guidance points $(w, s)$. The close fit between the curves and the points validates our theoretical hypothesis of a constant governing spatial guidance scale $s_{1}$. Furthermore, whether the fitted curves for different metrics are aligned or opposed visually reveals the intrinsic modality characteristics of a dataset: consistent or conflicting.
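As a rough illustration of how a hyperbola can be fitted to a set of optimal guidance points, the sketch below assumes the simplest form, s = c / w, and estimates the constant c by least squares. The exact parameterization of the fitted curves is given in the main text rather than in this caption, so both the functional form and the synthetic points are assumptions made for illustration.

```python
def fit_hyperbola_constant(ws, ss):
    """Least-squares estimate of c in the assumed model s = c / w.

    Minimising sum_i (s_i - c / w_i)^2 over c gives
    c = sum_i (s_i / w_i) / sum_i (1 / w_i^2).
    """
    numerator = sum(s / w for w, s in zip(ws, ss))
    denominator = sum(1.0 / (w * w) for w in ws)
    return numerator / denominator

# Synthetic points that lie exactly on s = 2 / w (not values from the paper).
ws = [0.5, 1.0, 2.0, 4.0]
ss = [4.0, 2.0, 1.0, 0.5]
c_hat = fit_hyperbola_constant(ws, ss)

# Residuals show how closely the points follow the fitted curve.
residuals = [s - c_hat / w for w, s in zip(ws, ss)]
```

Small residuals across the experimentally found optima would be consistent with a single constant governing the trade-off between the two scales.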


Table 1. Quantitative comparison of computational efficiency (Params, FLOPs, VRAM and run time) and fusion quality (PSNR in a noise-free scenario). The results represent the average performance across three hyperspectral datasets during the inference procedure. The best, second best, and third best results are highlighted in Bold, Underline and Italic, respectively.

Method	Params (M)	FLOPs (G)	VRAM (GB)	Run time (s)	PSNR (dB)
LAGConv	0.25	2.671	0.96	0.0138	45.77
Fusformer	0.35	10.02	28.32	0.306	45.02
PSRT	0.30	17.17	3.28	0.0195	42.56
RAMoE	29.69	1944	12.69	0.0696	46.16
Dif-PAN	14.09	3057	0.48	0.663	44.66
S2CycleDiff	65.94	307,800	0.90	29.9	42.77
ISPDiff	5.37	30,220	0.70	22.7	43.63
Proposed	19.07	29,540	0.57	0.349	46.41

Table 2. Quantitative comparison on the Chikusei dataset. Bold indicates the best performance, Underline indicates the second best, and Italic indicates the third best. The upward arrows (↑) indicate that higher values correspond to better performance, while downward arrows (↓) indicate that lower values are better. The results are reported as mean ± 95% confidence interval across five runs with different random seeds. The confidence intervals are displayed in grey and a smaller font size to visually distinguish them from the primary mean values.

Metric	LAGConv	Fusformer	PSRT	RAMoE	Dif-PAN	S2CycleDiff	ISPDiff	Proposed
Noise-free Case
PSNR ↑	46.4992±0.0931	46.4238±0.0872	42.7353±0.0878	47.7667±0.1512	45.2549±0.0823	45.1923±0.1143	46.3544±0.0948	47.7571±0.0873
SSIM ↑	0.9932±0.0001	0.9929±0.0001	0.9910±0.0002	0.9943±0.0001	0.9916±0.0001	0.9916±0.0002	0.9928±0.0002	0.9935±0.0001
SAM ↓	2.3819±0.2736	2.3901±0.2753	2.5689±0.2589	2.4268±0.3503	2.4111±0.2620	2.4301±0.2918	2.3721±0.2963	2.3030±0.2940
ERGAS ↓	1.4937±0.0123	1.7143±0.0173	1.9250±0.0174	1.6702±0.0192	1.9000±0.0179	1.8931±0.0185	1.6293±0.0133	1.5568±0.0137
RMSE ↓	0.0052±0.0001	0.0053±0.0001	0.0065±0.0001	0.0048±0.0001	0.0055±0.0001	0.0056±0.0001	0.0052±0.0001	0.0048±0.0000
1% Noisy Case
PSNR ↑	37.5546±0.1314	37.8283±0.1146	36.9311±0.1126	37.5346±0.0922	38.5874±0.1154	40.6467±0.1115	39.5276±0.1018	39.0820±0.1181
SSIM ↑	0.9814±0.0006	0.9827±0.0005	0.9835±0.0004	0.9808±0.0005	0.9850±0.0004	0.9888±0.0003	0.9856±0.0004	0.9857±0.0005
SAM ↓	3.0820±0.2905	3.0194±0.2876	3.0625±0.2720	3.0987±0.2372	2.9006±0.3040	2.6426±0.2884	2.8233±0.2794	2.8453±0.2771
ERGAS ↓	2.6029±0.0294	2.6470±0.0277	2.6514±0.0287	2.5191±0.0205	2.5727±0.0267	2.1504±0.0211	2.2957±0.0230	2.3583±0.0266
RMSE ↓	0.0092±0.0001	0.0090±0.0001	0.0101±0.0001	0.0098±0.0000	0.0083±0.0001	0.0072±0.0001	0.0080±0.0001	0.0079±0.0001
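The "mean ± 95% confidence interval across five runs" entries can be reproduced from per-seed scores along the following lines. This sketch assumes a Student's t interval with 4 degrees of freedom; the caption does not state which interval construction was used, and the five scores below are hypothetical.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom (n = 5).
T_CRIT_DF4 = 2.776

def mean_and_ci95_five_runs(values):
    """Return (mean, half_width) of a 95% t-interval for exactly five runs."""
    if len(values) != 5:
        raise ValueError("the hard-coded critical value is only valid for five runs")
    mean = statistics.fmean(values)
    # Standard error of the mean from the sample standard deviation (n - 1).
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, T_CRIT_DF4 * sem

# Hypothetical PSNR scores from five random seeds (not values from the paper).
runs = [47.70, 47.80, 47.75, 47.72, 47.78]
mean, half_width = mean_and_ci95_five_runs(runs)
summary = f"{mean:.4f}±{half_width:.4f}"
```

When two methods' intervals overlap, as for the top two noise-free PSNR entries in this table, the ranking between them should be read with caution.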

Table 3. Quantitative comparison on the Houston dataset. Bold indicates the best performance, Underline indicates the second best, and Italic indicates the third best. The upward arrows (↑) indicate that higher values correspond to better performance, while downward arrows (↓) indicate that lower values are better. The results are reported as mean ± 95% confidence interval across five runs with different random seeds. The confidence intervals are displayed in grey and a smaller font size to visually distinguish them from the primary mean values.

Metric	LAGConv	Fusformer	PSRT	RAMoE	Dif-PAN	S2CycleDiff	ISPDiff	Proposed
Noise-free Case
PSNR ↑	48.1339±0.3342	47.2223±0.0722	44.3320±1.1600	49.4940±0.5549	48.0839±0.4030	43.8755±0.5578	48.6642±0.8049	50.1413±0.3632
SSIM ↑	0.9971±0.0003	0.9964±0.0003	0.9951±0.0008	0.9969±0.0002	0.9974±0.0002	0.9931±0.0007	0.9952±0.0003	0.9964±0.0004
SAM ↓	0.8398±0.0113	0.9191±0.0115	0.8536±0.0127	1.0532±0.0152	0.9080±0.0258	1.1498±0.0175	0.9129±0.0152	0.9293±0.0179
ERGAS ↓	0.7257±0.0424	0.7874±0.0404	1.0220±0.1225	0.7293±0.0228	0.6893±0.0356	1.1026±0.0924	0.8380±0.0384	0.7479±0.0445
RMSE ↓	0.0025±0.0001	0.0026±0.0001	0.0042±0.0005	0.0024±0.0000	0.0024±0.0001	0.0037±0.0003	0.0024±0.0001	0.0023±0.0000
1% Noisy Case
PSNR ↑	38.7893±0.1153	39.4085±0.1227	37.7499±0.2045	40.5530±0.1002	39.5046±0.1700	40.1866±0.2786	41.6502±0.5375	42.7386±0.4648
SSIM ↑	0.9868±0.0008	0.9878±0.0008	0.9853±0.0014	0.9916±0.0007	0.9883±0.0005	0.9903±0.0011	0.9919±0.0010	0.9936±0.0008
SAM ↓	2.9955±0.1504	2.7075±0.1322	2.9880±0.1611	2.4665±0.1388	2.8116±0.1343	2.1043±0.1146	2.1176±0.2085	1.8671±0.1794
ERGAS ↓	1.6415±0.0814	1.5533±0.0806	1.7995±0.1407	1.3274±0.0702	1.5252±0.0707	1.4009±0.1112	1.2477±0.0991	1.1002±0.0988
RMSE ↓	0.0062±0.0002	0.0057±0.0002	0.0072±0.0005	0.0052±0.0002	0.0057±0.0002	0.0054±0.0003	0.0047±0.0004	0.0042±0.0003
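For readers unfamiliar with the metrics in these tables, the sketch below gives commonly used definitions of RMSE, PSNR, SAM, and ERGAS for images stored as (bands, H, W) arrays in [0, 1]. Conventions differ between papers (peak value, per-band versus global averaging, degrees versus radians), so these are assumed forms rather than the exact evaluation code behind the reported numbers; the `scale` argument of `ergas` is the spatial resolution ratio and is supplied by the caller.

```python
import numpy as np

def rmse(ref, est):
    """Root-mean-square error over all bands and pixels."""
    return float(np.sqrt(np.mean((ref - est) ** 2)))

def psnr(ref, est, peak=1.0):
    """Peak signal-to-noise ratio in dB for data with the given peak value."""
    return float(20.0 * np.log10(peak / rmse(ref, est)))

def sam_degrees(ref, est, eps=1e-12):
    """Mean spectral angle in degrees between per-pixel spectra."""
    dot = np.sum(ref * est, axis=0)
    norm = np.linalg.norm(ref, axis=0) * np.linalg.norm(est, axis=0)
    cos = np.clip(dot / (norm + eps), -1.0, 1.0)
    return float(np.degrees(np.mean(np.arccos(cos))))

def ergas(ref, est, scale):
    """ERGAS = (100 / scale) * sqrt(mean over bands of (RMSE_b / mean_b)^2)."""
    band_rmse = np.sqrt(np.mean((ref - est) ** 2, axis=(1, 2)))
    band_mean = np.mean(ref, axis=(1, 2))
    return float(100.0 / scale * np.sqrt(np.mean((band_rmse / band_mean) ** 2)))

rng = np.random.default_rng(0)
ref = rng.uniform(0.1, 1.0, size=(4, 8, 8))  # synthetic reference cube
est = ref + 0.01                             # constant offset of 0.01
```

Higher PSNR is better, while lower SAM, ERGAS, and RMSE are better, matching the arrows in the table. SAM is insensitive to a per-pixel rescaling of the spectrum, which is why it can rank methods differently from PSNR.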

Table 4. Quantitative comparison on the Kennedy Space Center (KSC) dataset. Bold indicates the best performance, Underline indicates the second best, and Italic indicates the third best. The upward arrows (↑) indicate that higher values correspond to better performance, while downward arrows (↓) indicate that lower values are better. The results are reported as mean ± 95% confidence interval across five runs with different random seeds. The confidence intervals are displayed in grey and a smaller font size to visually distinguish them from the primary mean values.

Metric	LAGConv	Fusformer	PSRT	RAMoE	Dif-PAN	S2CycleDiff	ISPDiff	Proposed
Noise-free Case
PSNR ↑	42.6676±1.4041	41.4131±1.2097	40.6046±1.1391	41.2137±1.4014	40.6493±1.1596	39.2428±1.0877	35.8717±1.3498	41.3284±1.2553
SSIM ↑	0.9882±0.0029	0.9874±0.0029	0.9867±0.0028	0.9874±0.0032	0.9880±0.0027	0.9834±0.0029	0.9723±0.0051	0.9874±0.0028
SAM ↓	2.2087±0.3354	2.2772±0.3162	2.3404±0.3248	2.3593±0.3732	2.2340±0.3211	2.6673±0.3439	3.4728±0.4921	2.2983±0.3197
ERGAS ↓	1.5219±0.1976	1.5963±0.2109	1.6720±0.1862	1.6139±0.2068	1.5926±0.1821	1.9161±0.1724	2.3258±0.2357	1.6351±0.2034
RMSE ↓	0.0019±0.0002	0.0020±0.0002	0.0022±0.0003	0.0021±0.0002	0.0022±0.0003	0.0027±0.0004	0.0033±0.0003	0.0021±0.0003
1% Noisy Case
PSNR ↑	36.6895±1.0841	36.9831±1.0223	36.1476±1.1148	37.4090±0.9888	36.3646±1.0019	37.1089±0.9399	33.3714±1.8274	36.6688±1.6512
SSIM ↑	0.9781±0.0038	0.9794±0.0035	0.9770±0.0038	0.9809±0.0041	0.9788±0.0038	0.9802±0.0031	0.9488±0.0159	0.9757±0.0066
SAM ↓	3.6181±0.5110	3.3878±0.4647	3.5412±0.4704	3.3208±0.4582	3.5331±0.4637	3.1641±0.3886	4.9393±0.9754	3.4321±0.5218
ERGAS ↓	2.0571±0.2249	2.0187±0.2378	2.1515±0.2080	1.9839±0.2394	2.0724±0.2202	2.0708±0.1945	3.0253±0.4749	2.1087±0.2995
RMSE ↓	0.0030±0.0004	0.0029±0.0003	0.0032±0.0004	0.0028±0.0004	0.0032±0.0004	0.0031±0.0005	0.0041±0.0002	0.0029±0.0002

Table 5. Ablation study of the proposed Dynamic Modulated Residual Network (DMRN) on the Chikusei dataset. “w/o” denotes the removal of a specific component. Bold indicates the best result.

Variant	PSNR ↑	SSIM ↑	SAM ↓	ERGAS ↓	RMSE ↓
w/o Residual	41.4714	0.9875	2.7304	2.1009	0.0074
w/o Side-Inject	47.0979	0.9931	2.3294	1.6060	0.0050
w/o Time FiLM	47.1247	0.9932	2.3355	1.6050	0.0050
DMRN (Full)	47.7679	0.9935	2.3028	1.5569	0.0048

Table 6. Ablation study of the proposed Spatial–Spectral Disentangled Guidance (SSDG). Bold indicates the best result.

Sampling Strategy	PSNR ↑	SSIM ↑	SAM ↓	ERGAS ↓	RMSE ↓
Baseline (No CFG)	50.1664	0.9964	0.9279	0.7485	0.0023
Standard CFG	50.1294	0.9964	0.9301	0.7478	0.0023
SSDG (Best PSNR)	50.1664	0.9964	0.9277	0.7486	0.0023
SSDG (Best ERGAS)	50.1306	0.9964	0.9296	0.7478	0.0023
SSDG (Best SAM)	50.0980	0.9964	0.9268	0.7506	0.0023
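The difference between standard classifier-free guidance (CFG) and a two-scale, disentangled variant can be sketched as follows. The `cfg` function is the usual guidance-scale form of Ho and Salimans. The `disentangled_guidance` function is an assumed composition with separate spatial and spectral scales, written only to illustrate the idea; the paper's actual SSDG formula is given in Section 3.4 and may differ.

```python
import numpy as np

def cfg(eps_uncond, eps_cond, scale):
    """Standard CFG: extrapolate from the unconditional to the conditional prediction.
    scale = 1 returns the conditional prediction unchanged."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def disentangled_guidance(eps_none, eps_spatial, eps_full, scale_spatial, scale_spectral):
    """Assumed two-scale composition (illustrative, not the paper's formula).

    eps_none:    prediction with both conditions dropped
    eps_spatial: prediction with only the spatial condition
    eps_full:    prediction with spatial and spectral conditions
    """
    return (eps_none
            + scale_spatial * (eps_spatial - eps_none)
            + scale_spectral * (eps_full - eps_spatial))

rng = np.random.default_rng(0)
shape = (4, 8, 8)
eps_none = rng.standard_normal(shape)
eps_spatial = rng.standard_normal(shape)
eps_full = rng.standard_normal(shape)

no_guidance = disentangled_guidance(eps_none, eps_spatial, eps_full, 1.0, 1.0)
equal_scales = disentangled_guidance(eps_none, eps_spatial, eps_full, 1.5, 1.5)
```

Under this assumed form, unit scales recover the fully conditioned prediction and equal scales collapse to standard CFG, so the two-scale version strictly generalizes both baselines and needs no retraining beyond condition dropout.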

Table 7. Impact of the Frequency-Aware Optimization strategy. The baseline uses only the pixel-wise MSE loss, while the LP and DWT variants adopt Laplacian Pyramid decomposition and the Discrete Wavelet Transform, respectively, for the frequency-aware optimization. Bold indicates the best result.

Objective	PSNR ↑	SSIM ↑	SAM ↓	ERGAS ↓	RMSE ↓
pure MSE baseline	46.2942	0.9925	2.3813	1.6154	0.0059
LP constraint	46.7582	0.9930	2.3395	1.5926	0.0051
DWT constraint (Proposed)	47.7679	0.9935	2.3028	1.5569	0.0048
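A DWT-based frequency constraint of the kind compared in this table can be sketched with a one-level Haar transform. The sketch assumes an L1 penalty on the three detail subbands added to the pixel-wise MSE with a hypothetical weight `lam`; the wavelet family, the number of levels, and the weighting actually used are specified in Section 3.5.1 and may differ.

```python
import numpy as np

def haar_dwt2(x):
    """One-level orthonormal 2D Haar DWT of an (..., H, W) array with even H and W.
    Returns one approximation and three detail subbands (naming conventions vary)."""
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2.0  # low-pass in both directions
    lh = (a - b + c - d) / 2.0  # differences between neighbouring columns
    hl = (a + b - c - d) / 2.0  # differences between neighbouring rows
    hh = (a - b - c + d) / 2.0  # diagonal differences
    return ll, lh, hl, hh

def frequency_aware_loss(pred, target, lam=0.1):
    """Pixel-wise MSE plus an L1 penalty on the high-frequency Haar subbands."""
    mse = np.mean((pred - target) ** 2)
    sub_pred = haar_dwt2(pred)
    sub_target = haar_dwt2(target)
    # Skip index 0 (the approximation band) and penalise only the detail bands.
    detail = sum(np.mean(np.abs(p - t)) for p, t in zip(sub_pred[1:], sub_target[1:]))
    return float(mse + lam * detail)

rng = np.random.default_rng(0)
target = rng.uniform(0.0, 1.0, size=(4, 8, 8))
pred = target + 0.05 * rng.standard_normal(target.shape)
loss = frequency_aware_loss(pred, target)
```

Because the transform is orthonormal, the subbands together carry the same energy as the input; the extra term simply reweights errors toward edges and textures.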

Table 8. Quantitative comparison of DDIM sampling with different step counts (T) between the proposed Time-Aware Condition Module and Simple Condition Module without time-aware FiLM. Bold indicates the best result in each metric.

Method	Steps (T)	Time (s)	PSNR ↑	SSIM ↑	SAM ↓	ERGAS ↓	RMSE ↓
Time-Aware Condition Module (Full)	5	0.1952	47.7672	0.9935	2.3028	1.5568	0.0048
	25	0.8605	47.7679	0.9935	2.3028	1.5569	0.0048
	100	3.4225	47.7681	0.9935	2.3028	1.5569	0.0048
	500	16.9945	47.7681	0.9935	2.3028	1.5569	0.0048
	1000	33.8550	47.7681	0.9935	2.3028	1.5569	0.0048
Simple Condition Module (w/o Time-Aware)	5	0.1952	47.1182	0.9932	2.3356	1.6052	0.0050
	25	0.8605	47.1247	0.9932	2.3355	1.6050	0.0050
	100	3.4225	47.1259	0.9932	2.3355	1.6050	0.0050
	500	16.9945	47.1263	0.9932	2.3355	1.6050	0.0050
	1000	33.8550	47.1263	0.9932	2.3355	1.6050	0.0050
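The step counts in this table correspond to the number of timesteps visited by the DDIM sampler. A minimal sketch of deterministic DDIM sampling (eta = 0) is given below, assuming a noise-prediction parameterization and an evenly spaced timestep subsequence; the paper's network may use a different prediction target, and the function names are illustrative.

```python
import numpy as np

def ddim_timesteps(num_train_steps, num_sample_steps):
    """Evenly spaced subsequence of training timesteps, in descending order."""
    steps = np.linspace(0, num_train_steps - 1, num_sample_steps).round().astype(int)
    return steps[::-1]

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update (eta = 0) from the current to the previous timestep.

    alpha_bar_* are cumulative products of (1 - beta) from the noise schedule.
    """
    # Estimate the clean image implied by the current noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Re-noise that estimate to the noise level of the previous timestep.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred

schedule = ddim_timesteps(1000, 5)  # five sampling steps out of 1000 training steps

rng = np.random.default_rng(0)
x0 = rng.uniform(0.0, 1.0, size=(4, 8, 8))
eps = rng.standard_normal(x0.shape)
alpha_bar_t = 0.3
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
# With a perfect noise prediction, a single step to alpha_bar = 1 recovers x0.
x_prev = ddim_step(x_t, eps, alpha_bar_t, 1.0)
```

Sampling cost grows linearly with the number of steps, consistent with the Time (s) column, whereas the reported quality metrics are nearly flat from 5 to 1000 steps.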

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Xu, X.; Qiao, J.; Zhou, J.; Yuan, K.; Feng, L. Modulated Diffusion with Spatial–Spectral Disentangled Guidance for Hyperspectral Image Super-Resolution. Remote Sens. 2026, 18, 1582. https://doi.org/10.3390/rs18101582

