Hierarchical Scale-Adaptive Diffusion Priors for Efficient Remote Sensing Dehazing

Ju, Wei; Liang, Zheng; Chen, Huan; Shen, Jie

doi:10.3390/rs18121907


¹ School of Mechanical and Electrical Engineering, Chizhou University, Chizhou 247000, China

² School of Internet, Anhui University, Hefei 230039, China

³ School of Big Data and Artificial Intelligence, Chizhou University, Chizhou 247000, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(12), 1907; https://doi.org/10.3390/rs18121907 (registering DOI)

Submission received: 23 April 2026 / Revised: 21 May 2026 / Accepted: 25 May 2026 / Published: 9 June 2026

(This article belongs to the Special Issue Hyperspectral Remote Sensing Image Analysis via Advanced Deep Learning and Computer Vision)


Highlights

What are the main findings?

Hierarchical Diffusion Prior Representation is developed to decompose global diffusion latents into multi-scale embeddings, enabling fine-grained and scale-aware restoration.
A Scale-Adaptive Prior Injection mechanism is introduced to dynamically modulate prior contributions across feature levels, improving feature utilization and robustness.

What are the implications of the main findings?

The proposed method significantly improves dehazing performance under heavy and spatially variant haze, achieving superior quantitative metrics and visual quality.
It provides an efficient and robust solution for remote sensing image restoration, enhancing the reliability of downstream Earth observation applications.

Abstract

Remote sensing image dehazing remains a formidable challenge due to complex atmospheric scattering and large-scale spatially varying degradation, which severely compromise fine-grained surface details. While recent diffusion-based restoration frameworks, such as DiffIR, have achieved remarkable efficiency by injecting compact diffusion priors into deterministic networks, they typically rely on a monolithic global Image Prior Representation (IPR). However, such a global design is suboptimal for remote sensing dehazing, where the haze distribution exhibits strong spatial heterogeneity and scale dependency. To address this limitation, this paper presents the Hierarchical and Scale-Adaptive Diffusion Prior (HS-DiffIR) framework. Specifically, the Hierarchical Image Prior Representation decomposes the holistic diffusion latent into multi-scale priors aligned with the hierarchical stages of the restoration network. Such a design facilitates fine-grained, scale-aware guidance by projecting the compact global latent into layer-specific representations, thereby bypassing the computational burden of high-dimensional generative modeling. Complementing this, the Scale-Adaptive Injection mechanism utilizes lightweight learnable coefficients to dynamically modulate the influence of diffusion priors across different feature scales, allowing the network to adaptively balance global semantic consistency and local detail recovery under dense-haze conditions. Evaluations on remote sensing benchmarks confirm that HS-DiffIR generally outperforms the DiffIR baseline. The method yields superior quantitative metrics (particularly PSNR) at a marginal computational cost while demonstrating robust detail restoration in regions subject to severe, spatially variant haze.

Keywords:

remote sensing image dehazing; diffusion models; generative priors; hierarchical representation; scale-adaptive modulation

1. Introduction

1.1. Background and Challenge

Remote sensing imagery serves as a cornerstone for numerous earth observation applications, ranging from environmental monitoring and urban planning to object detection and land-cover mapping [1,2,3,4,5]. However, these images are frequently compromised by atmospheric haze, a phenomenon caused by the scattering and absorption of light by aerosols [6,7]. Unlike natural scene images, remote sensing data typically cover vast geographical areas, where haze exhibits severe spatial heterogeneity and scale dependency. The degradation manifests not only as contrast reduction and color shift but also as a significant loss of high-frequency surface details, rendering the inverse problem of dehazing highly ill-posed. Consequently, recovering clear contents from such complex, non-uniform degradation remains a formidable challenge in the remote sensing community.

1.2. Existing Methods and Limitations

Early dehazing methodologies primarily relied on physical scattering models and handcrafted priors, such as the Dark Channel Prior (DCP) [7]. The core assumption of the DCP is that in local regions of clear, haze-free images, at least one color channel has very low intensity. While theoretically sound, these prior-based methods often rely on idealized assumptions that rarely hold in complex remote sensing scenes, leading to artifacts or color distortion [8,9]. With the advent of deep learning, Convolutional Neural Networks (CNNs) and Transformers [10,11,12,13] have achieved substantial progress by learning data-driven mappings from hazy to clear images. Despite their improved metrics (e.g., PSNR), purely regression-based methods often struggle with the fidelity–realism trade-off [14,15]. They tend to produce overly smooth results with “plastic” textures, as they lack the generative capability to hallucinate plausible high-frequency details lost in dense-haze regions.

1.3. The Rise of Diffusion Models and the Gap

To address the lack of fine-grained details, Denoising Diffusion Probabilistic Models (DDPMs) [16,17] have emerged as a powerful paradigm for image restoration. By modeling the data distribution explicitly, diffusion models serve as potent generative priors. However, the iterative sampling process of standard diffusion models is computationally prohibitive. Recently, DiffIR [18] was proposed as an efficient alternative: extracting a compact Image Prior Representation (IPR) from a lightweight diffusion model and injecting it into a deterministic restoration network. This “prior-guidance” strategy achieves a favorable balance between efficiency and restoration quality.

However, directly applying DiffIR to remote sensing dehazing reveals a critical limitation. DiffIR employs a monolithic global IPR that is uniformly injected across all stages of the restoration network, which does not explicitly address the scale-variant nature of atmospheric degradation. Haze acts as a low-frequency veil affecting global structures while simultaneously suppressing high-frequency local textures. A single, scale-agnostic global prior fails to capture this structural hierarchy, often leading to either over-smoothed textures or inconsistent global artifacts.

1.4. Proposed Solution

Motivated by these insights, incorporating a mechanism that is both hierarchical and adaptive is deemed essential to high-quality remote sensing dehazing. To address the scale mismatch, this study develops a novel framework that adapts the diffusion prior to the multi-scale intricacies of remote sensing imagery. First, distinct from the use of a flat global vector, a Hierarchical Image Prior Representation is introduced. The latent space is decomposed to align with the hierarchical encoder–decoder structure of the restoration network, ensuring that each feature scale receives structurally aligned guidance. Second, given that the necessity for prior guidance exhibits strong scale dependency (e.g., semantic features versus texture details), a Scale-Adaptive Injection mechanism is developed. Utilizing lightweight learnable coefficients, the framework dynamically recalibrates the intensity of prior injection at each stage. This adaptive modulation harmonizes global semantic consistency with local detail recovery.

The proposed approach preserves the architectural efficiency of DiffIR while significantly enhancing its representational capability. The main contributions are summarized as follows:

Hierarchical Diffusion Prior Representation: The global diffusion latent is decomposed into multi-scale representations. This strategy aligns the generative priors with the hierarchical feature maps of the restoration network, enabling fine-grained guidance without increasing latent dimensionality.
Scale-Adaptive Injection Mechanism: A learnable modulation module is introduced to adaptively re-weight the influence of diffusion priors across different feature scales. This mechanism enables the network to autonomously optimize the utilization of priors, enhancing robustness against non-uniform haze.
Highly Competitive Performance on Remote Sensing Benchmarks: Extensive experiments demonstrate that the proposed method generally outperforms the DiffIR baseline and other competing methods. The framework yields robust visual quality and strong quantitative metrics (particularly PSNR), demonstrating reliability in challenging scenarios with heavy and spatially variant haze.

2. Related Work

This section reviews the literature relevant to the proposed method, categorizing it into three progressive domains: remote sensing image dehazing, diffusion models for image restoration, and prior injection with multi-scale feature modeling.

2.1. Remote Sensing Image Dehazing

Image dehazing approaches have fundamentally transitioned from physical prior-based paradigms to data-driven learning models. Early methodologies heavily relied on physical scattering models and handcrafted statistical assumptions, notably the Dark Channel Prior (DCP) [7] and its variants. While computationally efficient, these handcrafted priors frequently fail in remote sensing scenarios where idealized assumptions are violated by specific ground objects (e.g., bright deserts and urban rooftops) or large-scale heterogeneous haze distributions, often resulting in severe color distortion and artifacts.

Subsequently, deep learning models, particularly Convolutional Neural Networks (CNNs) [19,20,21,22,23,24,25] like AOD-Net [13], superseded physical models by learning direct mappings from hazy to clear images. Recently, Transformer-based architectures [11,26,27,28,29,30,31] (e.g., DehazeFormer and RSDformer [10,32,33]) have been introduced to capture long-range dependencies, which is crucial to processing large-scale remote sensing imagery. Moreover, to mitigate the heavy reliance on large-scale paired datasets, which are often unavailable in real-world scenes, researchers have investigated unpaired and semi-supervised learning paradigms using Generative Adversarial Networks (GANs) and domain adaptation [34,35,36].

Despite achieving substantial improvements in quantitative metrics (e.g., PSNR), most existing CNN and Transformer models operate as deterministic regression frameworks optimized via pixel-wise losses (e.g., L1/L2 loss). In cases of heavy and non-uniform atmospheric scattering, these models tend to output the “average” of all possible solutions, leading to overly smoothed results with “plastic” textures. They inherently lack the generative capability to hallucinate plausible high-frequency surface details lost in dense haze.

2.2. Diffusion Models for Image Restoration

To overcome the inherent lack of fine-grained detail recovery in deterministic networks, Denoising Diffusion Probabilistic Models (DDPMs) [16,37,38] have emerged as a powerful generative paradigm, outperforming GANs [39,40] in training stability and mode coverage. By explicitly modeling complex data distributions, methods like SR3 [15] and WeatherDiffusion [41] have demonstrated unprecedented performance in solving ill-posed inverse problems. However, the standard iterative sampling process of diffusion models requires hundreds of steps, making them computationally prohibitive for high-resolution remote sensing data.

To mitigate this computational bottleneck, hybrid prior-guided frameworks have gained traction. DiffIR [18] pioneered an efficient alternative by training a lightweight diffusion model to compress the clean image distribution into a compact latent space, termed the Image Prior Representation (IPR). A separate deterministic network then leverages this IPR as a condition, successfully decoupling the heavy generative process from the restoration pipeline. While DiffIR achieves a remarkable balance between efficiency and quality, its direct application to remote sensing dehazing reveals a critical flaw: it treats the generative prior as a monolithic, scale-agnostic global condition. This design fails to accommodate the highly scale-variant nature of atmospheric degradation in remote sensing imagery, limiting its spatial adaptability.

2.3. Prior Injection and Multi-Scale Feature Modeling

Handling large-scale variations is universally recognized as essential in remote sensing and computer vision. Consequently, multi-scale feature modeling, often implemented via U-Net architectures or Feature Pyramid Networks (FPNs) [42,43], is heavily utilized to process semantic context at low resolutions and local details at high resolutions. Simultaneously, the integration of external priors (e.g., semantic maps or latent codes) into such networks is typically achieved through Spatial Feature Transform (SFT) [44] or Cross-Attention mechanisms [14].

However, integrating current generative prior methods (like DiffIR) into hierarchical restoration networks introduces a significant structural mismatch. Existing approaches typically compress the generative prior into a low-dimensional, scale-agnostic global vector. When this monolithic prior is uniformly injected into a hierarchical restoration network, it forces the same semantic guidance onto both deep layers (which process abstract structures) and shallow layers (which process fine textures). In the context of remote sensing imagery, haze degradation is inherently scale-dependent: global atmospheric light dominates low-frequency semantics, while localized scattering severely blurs high-frequency surface details. Forcing a single static prior across all layers is therefore suboptimal. To fully unlock the potential of generative priors for remote sensing dehazing, there is an urgent need to develop prior representations that are structurally aligned with the multi-scale hierarchy of the restoration network, complemented by a mechanism to adaptively modulate the prior’s influence across different spatial scales. This precisely motivates the design of the proposed HS-DiffIR framework.

3. Methodology

This section presents the proposed framework, HS-DiffIR, designed for effective remote sensing image dehazing. The discussion begins by formulating the problem and revisiting the preliminaries of the DiffIR framework. Subsequently, the two core contributions are detailed: Hierarchical Image Prior Representation (H-IPR) and Scale-Adaptive Prior Injection (S-API). Finally, the training strategy is described.

Figure 1 illustrates the overall architecture of the proposed HS-DiffIR framework. While building upon the computational efficiency of DiffIR, the framework fundamentally redesigns the prior injection mechanism to explicitly accommodate the multi-scale nature of remote sensing degradation.

As illustrated in Figure 1a, the core distinction lies in the pathway between the prior extraction (CPEN) and the restoration network (DIRformer). Instead of a direct connection, a Hierarchical Decomposition stage is inserted to split the latent Z into multi-scale priors $z_{k}$. These priors are then passed through a Scale-Adaptive Injection module (depicted by the gating nodes $α_{k}$) to selectively guide different feature levels. Figure 1b shows the second stage, where a diffusion model learns to predict the root latent Z from hazy inputs.

3.1. Problem Formulation

Let $I \in \mathbb{R}^{H \times W \times 3}$ denote a hazy remote sensing image and $J \in \mathbb{R}^{H \times W \times 3}$ be its corresponding clean ground truth. The physical formation of haze is classically modeled as [6,7]

$I(x) = J(x) \cdot t(x) + A \cdot (1 - t(x))$

(1)

where x represents pixel coordinates, $t(x)$ is the transmission map, and A is the global atmospheric light. While physical models offer theoretical grounding, estimating $t(x)$ and A from a single image is an ill-posed problem, particularly in remote sensing scenes with heterogeneous haze distributions. Consequently, a probabilistic framework is adopted, aiming to model the conditional distribution $P(J | I)$ by leveraging generative diffusion priors.
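To make Equation (1) concrete, the following minimal sketch applies the scattering model to a toy single-channel image and inverts it when $t(x)$ and A are known. The function names, the toy values, and the lower clamp `t_min` on the transmission are illustrative assumptions and are not taken from the paper.

```python
def apply_haze(J, t, A):
    """Eq. (1): I(x) = J(x) * t(x) + A * (1 - t(x)), applied per pixel."""
    return [[j * tx + A * (1.0 - tx) for j, tx in zip(j_row, t_row)]
            for j_row, t_row in zip(J, t)]


def invert_haze(I, t, A, t_min=0.1):
    """Invert Eq. (1) for known t(x) and A.

    t_min is an assumed safeguard against division by very small transmission.
    """
    return [[(i - A) / max(tx, t_min) + A for i, tx in zip(i_row, t_row)]
            for i_row, t_row in zip(I, t)]


# Toy 2x2 single-channel scene (hypothetical values in [0, 1]).
J = [[0.2, 0.8], [0.5, 0.1]]
t = [[0.9, 0.9], [0.3, 0.3]]  # top row: thin haze, bottom row: dense haze
A = 1.0

I = apply_haze(J, t, A)
J_rec = invert_haze(I, t, A)
```

With known $t(x)$ and A the inversion is exact; the difficulty discussed above arises because neither quantity is observed in practice.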

3.2. Preliminaries: DiffIR Framework

DiffIR [18] proposes a “divide-and-conquer” strategy to address the computational inefficiency of standard diffusion models. It decouples the image restoration problem into two distinct tasks: (1) Generative Prior Extraction via a lightweight diffusion model, and (2) Deterministic Restoration via a Transformer-based network (DIRformer).

Compact Latent Representation: Unlike standard diffusion methods that operate in the high-dimensional pixel space (or substantial latent space like Latent Diffusion), DiffIR first learns a highly compact latent space to encapsulate the global structural and semantic information of ground-truth (GT) images. Let $I_{LQ}$ denote the hazy image, and $I_{GT}$ denote the clean GT image. An encoder E maps $I_{LQ}$ and $I_{GT}$ to a low-dimensional latent vector $z_{0} \in \mathbb{R}^{C}$:

$z_{0} = E(I_{LQ}, I_{GT})$

(2)

This $z_{0}$ serves as the target for the diffusion model. Since the dimension C is small, the diffusion process becomes extremely efficient.
Conditional Diffusion for Prior Generation: DiffIR trains a conditional diffusion model to estimate this latent code $z_{0}$ from the input hazy image $I_{LQ}$.
Forward Process: A Markov chain gradually adds Gaussian noise to $z_{0}$ over T steps:

$q(z_{t} | z_{t-1}) = \mathcal{N}(z_{t}; \sqrt{1 - β_{t}} z_{t-1}, β_{t} I)$

(3)

where $β_{t}$ represents the predefined variance schedule and I denotes the identity matrix.
Reverse Process: The diffusion model $ϵ_{θ}$ learns to denoise the latent variable, conditioned on the degradation features extracted from the hazy image $I_{LQ}$. The training objective is to minimize the noise prediction error:

$L_{diff} = \mathbb{E}_{z_{0}, ϵ, t, I_{LQ}} [∥ϵ - ϵ_{θ}(z_{t}, t, \mathrm{Cond}(I_{LQ}))∥_{2}^{2}]$

(4)

where $\mathrm{Cond}(I_{LQ})$ represents the condition features extracted from $I_{LQ}$. During inference, the model samples random noise $z_{T} \sim \mathcal{N}(0, I)$ and iteratively denoises it to obtain the estimated latent code $\hat{z}_{0}$. This $\hat{z}_{0}$ is termed the Image Prior Representation (IPR).
IPR-Guided Restoration: The estimated IPR ($\hat{z}_{0}$) contains rich global priors but lacks high-frequency spatial details due to its compactness. Therefore, it is injected into a deterministic restoration network, DIRformer, which performs the final pixel-wise reconstruction:

$J_{rec} = \mathrm{DIRformer}(I_{LQ}, \hat{z}_{0})$

(5)

While DiffIR achieves efficiency, it treats the IPR as a holistic, scale-agnostic vector. The same vector $\hat{z}_{0}$ guides both the shallow layers (processing fine textures) and deep layers (processing semantic shapes) of the DIRformer uniformly. As discussed in Section 1, this design is suboptimal for remote sensing dehazing, where degradation is intrinsically hierarchical. This limitation motivates our proposed HS-DiffIR framework.
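The forward process in Equation (3) can be illustrated with a short sketch. Composing the Gaussian transitions gives the standard closed form $q(z_{t} | z_{0}) = \mathcal{N}(\sqrt{\bar{α}_{t}} z_{0}, (1 - \bar{α}_{t}) I)$ with $\bar{α}_{t} = \prod_{s \le t} (1 - β_{s})$, which lets a noisy latent be drawn in one step. The linear schedule, its endpoints, and all function names below are assumptions for illustration; the paper does not specify them.

```python
import math
import random


def linear_beta_schedule(T, beta_start=1e-4, beta_end=2e-2):
    """Hypothetical linear variance schedule (not specified in the paper)."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]


def alpha_bars(betas):
    """Cumulative products abar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(z0, t, abar, rng):
    """Draw z_t ~ N(sqrt(abar_t) z0, (1 - abar_t) I); t is 1-indexed."""
    a = abar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]
    return zt, eps


rng = random.Random(0)
betas = linear_beta_schedule(100)
abar = alpha_bars(betas)
z0 = [0.5, -1.0, 2.0, 0.0]  # toy latent; the real latent is a compact vector of dimension C
zt, eps = q_sample(z0, 50, abar, rng)
```

The returned `eps` is the regression target of the noise-prediction loss in Equation (4).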

3.3. Hierarchical Image Prior Decomposition (H-IPR)

As discussed in Section 3.2, the original DiffIR utilizes a monolithic IPR $\hat{z}_{0} \in \mathbb{R}^{C}$ to guide the entire restoration process. While it successfully encodes global context, directly broadcasting this compact vector to all layers of the DIRformer is suboptimal. The restoration network is hierarchical: shallow layers process high-frequency textures (requiring local detail guidance), while deep layers process low-frequency semantics (requiring global structural guidance) [25,44]. A single static $\hat{z}_{0}$ cannot optimally satisfy these conflicting multi-scale demands simultaneously.

To address this, the Hierarchical Image Prior Decomposition module is proposed. Instead of modifying the pretrained diffusion model (which would be computationally expensive), a lightweight post-projection is performed on the generated IPR.

Formally, let the DIRformer have K hierarchical stages (encoder scales $E_{1} \dots E_{K}$ and decoder scales $D_{1} \dots D_{K}$). We define a set of learnable scale-specific projectors $\{P_{k}\}_{k=1}^{K}$, implemented as lightweight Multi-Layer Perceptrons (MLPs). Subsequently, the global prior $\hat{z}_{0}$ is decomposed into scale-aware priors $\{z_{k}\}_{k=1}^{K}$:

$z_{k} = P_{k}(\hat{z}_{0})$

(6)

Through this decomposition, the network learns to disentangle the compact information encoded in $\hat{z}_{0}$. During joint training, the shallowest projector $P_{1}$ learns to filter and project texture-related codes, while the deepest projector $P_{K}$ extracts structural layout codes. This ensures that each stage of the restoration network receives structurally aligned guidance without increasing the dimensionality of the diffusion generation process.
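The decomposition in Equation (6) can be sketched as K independent small MLPs applied to the same global vector. This is only an illustration under assumptions: the two-layer structure, the ReLU activation, the hidden width, the random initialization, and the choice that every $z_{k}$ keeps dimension C are hypothetical, since the paper only states that the projectors are lightweight MLPs.

```python
import math
import random


def make_mlp(in_dim, hidden_dim, out_dim, rng):
    """Randomly initialized two-layer MLP (assumed structure), zero biases."""
    def mat(rows, cols):
        s = 1.0 / math.sqrt(cols)
        return [[rng.uniform(-s, s) for _ in range(cols)] for _ in range(rows)]
    return {"W1": mat(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
            "W2": mat(out_dim, hidden_dim), "b2": [0.0] * out_dim}


def linear(W, b, x):
    """Affine map W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def mlp_forward(p, x):
    hidden = [max(0.0, v) for v in linear(p["W1"], p["b1"], x)]  # ReLU (assumed)
    return linear(p["W2"], p["b2"], hidden)


def decompose_prior(z_hat0, projectors):
    """Eq. (6): z_k = P_k(z_hat0) for k = 1..K."""
    return [mlp_forward(p, z_hat0) for p in projectors]


rng = random.Random(0)
C, K = 8, 4  # toy sizes chosen for readability
projectors = [make_mlp(C, 16, C, rng) for _ in range(K)]
z_hat0 = [rng.gauss(0.0, 1.0) for _ in range(C)]
priors = decompose_prior(z_hat0, projectors)
```

Because each projector has its own weights, the K outputs differ even though they share one input, and the diffusion model still only has to generate the single C-dimensional vector.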

3.4. Scale-Adaptive Prior Injection (S-API)

Even with hierarchically aligned priors, the necessity of generative guidance varies. In remote sensing images, regions with light haze may only need simple contrast adjustment (where original features are reliable), while regions with dense haze require strong hallucination from the prior (where original features are corrupted). Indiscriminate injection can lead to artifacts or loss of fidelity.

To facilitate dynamic modulation, the Scale-Adaptive Injection mechanism is devised. This module functions by adaptively recalibrating the influence of the prior across distinct scales.

Learnable Injection Strength: For each scale k, we define a learnable gating parameter $α_{k}$. We obtain the modulated prior $\hat{z}_{k}$ as:

$\hat{z}_{k} = \mathrm{Sigmoid}(α_{k}) \cdot z_{k}$

(7)

The Sigmoid function ensures the modulation strength is bounded in $(0, 1)$. Crucially, we initialize $α_{k}$ to a small negative value, ensuring that the network starts with minimal prior influence and gradually learns to incorporate the generative guidance where necessary.
Injection to DIRformer: The DIRformer module (Dynamic IR Transformer) from the original DiffIR framework is leveraged. Let $F_{k}$ denote the input feature map of the k-th stage. In the proposed framework, instead of using the shared $\hat{z}_{0}$, the stage-specific modulated prior $\hat{z}_{k}$ is employed to modulate the image features $F_{k}$. The injection process at stage k is formulated as

$F_{k}^{'} = W_{γ}^{k}(\hat{z}_{k}) ⊙ \mathrm{Norm}(F_{k}) + W_{β}^{k}(\hat{z}_{k})$

(8)

where $W_{γ}^{k}$ and $W_{β}^{k}$ represent linear projection layers tailored to scale k for generating scale and shift parameters. The replacement of the global input with the hierarchical $\hat{z}_{k}$ guarantees that the guidance received by the k-th stage is structurally aligned (via $P_{k}$) and intensity-calibrated (via $α_{k}$).
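Equations (7) and (8) can be sketched for a single feature vector as follows. This is an illustration under assumptions, not the authors' implementation: Norm is taken to be layer normalization over channels, the projections carry bias terms, and the bias values (1 for the scale branch, 0 for the shift branch) as well as the gate value are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gate_prior(z_k, alpha_k):
    """Eq. (7): z_hat_k = Sigmoid(alpha_k) * z_k."""
    g = sigmoid(alpha_k)
    return [g * v for v in z_k]


def layer_norm(f, eps=1e-5):
    """Normalize one feature vector over its channels (assumed form of Norm)."""
    mean = sum(f) / len(f)
    var = sum((v - mean) ** 2 for v in f) / len(f)
    return [(v - mean) / math.sqrt(var + eps) for v in f]


def linear(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def inject(F, z_hat_k, W_gamma, b_gamma, W_beta, b_beta):
    """Eq. (8): F' = W_gamma(z_hat_k) * Norm(F) + W_beta(z_hat_k), per channel."""
    gamma = linear(W_gamma, b_gamma, z_hat_k)
    beta = linear(W_beta, b_beta, z_hat_k)
    return [g * n + b for g, n, b in zip(gamma, layer_norm(F), beta)]


F = [1.0, 2.0, 3.0]                          # toy feature vector with 3 channels
z_k = [1.0, -1.0]                            # toy scale-specific prior
W_gamma = [[0.5, 0.0], [0.0, 0.5], [0.2, 0.2]]
W_beta = [[0.1, 0.1], [0.0, 0.3], [0.3, 0.0]]
b_gamma, b_beta = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]  # hypothetical bias values

z_hat_k = gate_prior(z_k, alpha_k=-2.0)      # hypothetical "small negative" gate
F_out = inject(F, z_hat_k, W_gamma, b_gamma, W_beta, b_beta)
F_closed = inject(F, [0.0, 0.0], W_gamma, b_gamma, W_beta, b_beta)
```

Under these assumed biases, a fully suppressed prior leaves the normalized features unchanged (`F_closed`), which is one way a gate near zero can fall back to the unguided features.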

3.5. Training Strategy

To ensure training stability and efficiency, a two-stage strategy is employed, adhering to the DiffIR paradigm. This effectively decouples the optimization of the restoration capability from the generative prior modeling:

Stage 1: Pretraining for Hierarchical Prior-Guided Restoration:
In the first stage, training aims to optimize the restoration network (DIRformer) for effective prior utilization. An optimal latent code is extracted from ground-truth (GT) images to serve as supervision, facilitating the learning of an accurate restoration mapping.
Process: A compact prior extraction network, denoted by $\mathrm{CPEN}_{S1}$, is employed. This network takes the concatenation of the ground-truth image $I_{GT}$ and the hazy image $I_{LQ}$ as input to produce a reference global latent $Z_{gt}$:

$Z_{gt} = \mathrm{CPEN}_{S1}(\mathrm{PixelUnshuffle}(\mathrm{Concat}(I_{GT}, I_{LQ})))$

(9)

Crucially, in this stage, the proposed Hierarchical Decomposition and Scale-Adaptive Injection modules are integrated into the framework.
Optimization: The parameters of $\mathrm{CPEN}_{S1}$, the DIRformer, and our proposed modules (projectors $\{P_{k}\}$ and scalars $\{α_{k}\}$) are jointly optimized. The loss function minimizes the $L_{1}$ distance between the restored image $\hat{I}_{HQ}$ and the ground truth:

$L_{stage1} = ∥I_{GT} - \mathrm{DIRformer}(I_{LQ}, \{z_{k}\}_{k=1}^{K})∥_{1}$

(10)

Note: By the end of Stage 1, the DIRformer has learned to effectively utilize hierarchically decomposed priors, and $\mathrm{CPEN}_{S1}$ has learned to encode the essential image manifold into a compact space.
Stage 2: Training the Diffusion Model for Prior Estimation:
In the second stage, the diffusion model is trained to estimate the target latent $Z_{gt}$ solely from degraded inputs, as $I_{GT}$ is unavailable during inference.
Configuration: With the parameters of $\mathrm{CPEN}_{S1}$ from Stage 1 frozen, a second extraction network, $\mathrm{CPEN}_{S2}$ (receiving solely $I_{LQ}$), and a denoising network $ϵ_{θ}$ are employed.
Forward Process: The frozen $\mathrm{CPEN}_{S1}$ is used to extract the target latent $Z_{gt}$ from the training pair. $Z_{gt}$ is then diffused into noise $Z_{T}$ via the standard Gaussian transition $q(Z_{T} | Z_{gt})$.
Reverse Process (Training): The denoising network $ϵ_{θ}$ is trained to predict the noise $ϵ$ added to the latent. It is conditioned on a vector D extracted from the hazy image via $\mathrm{CPEN}_{S2}$:

$D = \mathrm{CPEN}_{S2}(\mathrm{PixelUnshuffle}(I_{LQ}))$

(11)

The optimization objective follows the standard diffusion loss:

$L_{stage2} = ∥ϵ - ϵ_{θ}(Z_{t}, t, D)∥_{2}^{2}$

(12)

where t is the time step and $Z_{t}$ is the noisy latent.
Inference: During the inference phase, the trained diffusion model ($\mathrm{CPEN}_{S2}$ and $ϵ_{θ}$) is utilized to generate a predicted latent $\hat{Z}$. This $\hat{Z}$ is then passed through the frozen hierarchical projectors and injection modules (learned in Stage 1) to guide the DIRformer.
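Equations (9) and (11) both apply PixelUnshuffle before the CPEN encoders. As a sketch of what that rearrangement does, the pure-Python version below moves each r x r spatial block into the channel dimension, following the channel ordering documented for `torch.nn.PixelUnshuffle`. The downscale factor `r = 2`, the toy 4 x 4 resolution, and the helper names are assumptions for illustration only.

```python
def pixel_unshuffle(x, r):
    """Rearrange a C x H x W nested list into (C*r*r) x (H/r) x (W/r).

    Ordering: out[c*r*r + i*r + j][h][w] = x[c][h*r + i][w*r + j].
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    assert H % r == 0 and W % r == 0, "H and W must be divisible by r"
    out = []
    for c in range(C):
        for i in range(r):
            for j in range(r):
                out.append([[x[c][h * r + i][w * r + j] for w in range(W // r)]
                            for h in range(H // r)])
    return out


def concat_channels(a, b):
    """Channel-wise concatenation of two C x H x W nested lists."""
    return a + b


def toy_image(offset, C=3, H=4, W=4):
    """Hypothetical image whose pixel values encode their own position."""
    return [[[offset + c * 100 + h * 10 + w for w in range(W)] for h in range(H)]
            for c in range(C)]


I_GT, I_LQ = toy_image(0), toy_image(1000)
stage1_input = pixel_unshuffle(concat_channels(I_GT, I_LQ), r=2)  # input of Eq. (9)
stage2_input = pixel_unshuffle(I_LQ, r=2)                          # input of Eq. (11)
```

The Stage 1 encoder therefore sees twice as many channels as the Stage 2 encoder, which matches the text: only the former has access to $I_{GT}$.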

4. Experiments

4.1. Datasets and Implementation Details

Datasets. To verify the robustness of the model against large-scale atmospheric degradation, we employed two complementary datasets: SateHaze1k [45] and RICE [46]. The SateHaze1k dataset is utilized to assess performance across varying haze densities (Thin, Moderate, and Thick), ensuring the model’s adaptability to different degradation levels. Furthermore, the RICE dataset, collected from Google Earth, serves as a real-world benchmark featuring diverse land covers, such as urban areas, forests, and farmlands. Its inclusion allows for the evaluation of the proposed method on spatially heterogeneous haze distributions, ensuring the generalization capability in complex remote sensing scenarios.

Evaluation Metrics. Following standard protocols, restoration performance is quantitatively evaluated using the Peak Signal-to-Noise Ratio (PSNR [47]) and the Structural Similarity Index Measure (SSIM [48]). The PSNR measures pixel-wise fidelity, while the SSIM reflects perceptual structural consistency, which is particularly important for assessing the recovery of high-frequency edges (e.g., building outlines and roads) in remote sensing targets. Additionally, the Learned Perceptual Image Patch Similarity (LPIPS [49]) is employed to assess perceptual quality. Unlike the PSNR and the SSIM, which focus on pixel-level fidelity, LPIPS measures the distance between image features extracted from deep neural networks, aligning more closely with human visual perception. This metric is crucial to evaluating the naturalness of the recovered textures and ensuring that the generative priors do not introduce unrealistic artifacts.
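For reference, PSNR follows directly from the mean squared error (MSE) as $10 \log_{10}(\mathrm{MAX}^{2} / \mathrm{MSE})$. The sketch below computes it for single-channel images stored as nested lists; the assumption that pixel values lie in [0, 1] (so MAX = 1) and the toy data are illustrative and do not describe the paper's evaluation code.

```python
import math


def psnr(ref, test, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized 2D images."""
    ref_flat = [v for row in ref for v in row]
    test_flat = [v for row in test for v in row]
    mse = sum((a - b) ** 2 for a, b in zip(ref_flat, test_flat)) / len(ref_flat)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


clean = [[0.5, 0.5], [0.5, 0.5]]
restored = [[0.6, 0.6], [0.6, 0.6]]  # uniform error of 0.1, so MSE = 0.01
score = psnr(clean, restored)
```

Because the scale is logarithmic, a fixed gain in dB corresponds to a fixed multiplicative reduction in MSE regardless of the starting quality.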

Implementation Details. The proposed model is implemented in PyTorch 2.1.2+cu121 and trained on a single NVIDIA RTX 4090 GPU. During training, data augmentation is performed using random cropping to generate $256 \times 256$ patches, combined with random horizontal/vertical flipping and random $90^{\circ}$ rotations.

Following the two-stage paradigm of DiffIR, the training process is decoupled. In Stage 1 (prior-guided restoration), the restoration network is trained for 500k iterations with a batch size of 8. The Adam optimizer is employed with $β_{1} = 0.9$ and $β_{2} = 0.99$. The initial learning rate is set to $2 \times 10^{-4}$ and gradually decayed to $1 \times 10^{-6}$ using the cosine annealing strategy. The scale-adaptive gating coefficients $α_{k}$ are initialized to 0 to ensure training stability in the early stages.
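The learning-rate decay described above can be sketched as a single cosine cycle from $2 \times 10^{-4}$ to $1 \times 10^{-6}$ over the 500k Stage 1 iterations. Whether the authors use one cycle or periodic restarts is not stated, so the single-cycle form and the function name are assumptions; the formula itself is the usual cosine annealing rule, as also used by `torch.optim.lr_scheduler.CosineAnnealingLR` with `eta_min`.

```python
import math


def cosine_lr(step, total_steps, lr_max=2e-4, lr_min=1e-6):
    """Cosine annealing: lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * step / total))."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cos_term


TOTAL_STEPS = 500_000  # Stage 1 iterations reported in the text
lr_start = cosine_lr(0, TOTAL_STEPS)
lr_mid = cosine_lr(TOTAL_STEPS // 2, TOTAL_STEPS)
lr_end = cosine_lr(TOTAL_STEPS, TOTAL_STEPS)
```

The schedule decays slowly at the beginning and end of training and fastest around the midpoint, where the rate equals the average of the two endpoints.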

In Stage 2 (diffusion prior estimation), the pretrained restoration network is frozen, and the lightweight diffusion model is trained for an additional 400k iterations with the same batch size and optimizer settings. To maintain computational efficiency, the dimension of the compact latent prior Z is set to $C = 256$. The total number of diffusion time steps T is set to 100 during training, and we employ the DDIM sampler with four steps to accelerate the reverse process during inference.
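To illustrate how four sampling steps can traverse a schedule trained with T = 100, the sketch below implements a deterministic DDIM update (η = 0) on a toy latent. The linear variance schedule, the evenly spaced timesteps 100, 75, 50, 25, 0, and the oracle noise predictor that stands in for the trained $ϵ_{θ}(Z_{t}, t, D)$ are all assumptions made for illustration; the paper does not give these details.

```python
import math


def make_abar(T, beta_start=1e-4, beta_end=2e-2):
    """Cumulative alpha products abar[t] for t = 0..T (abar[0] = 1, clean latent)."""
    abar, prod = [1.0], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)  # assumed schedule
        prod *= 1.0 - beta
        abar.append(prod)
    return abar


def ddim_sample(z_T, eps_model, abar, timesteps):
    """Deterministic DDIM (eta = 0) over a decreasing list of timesteps ending at 0."""
    z = list(z_T)
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        eps = eps_model(z, t)
        a_t, a_prev = abar[t], abar[t_prev]
        # Predict the clean latent, then re-noise it to the earlier timestep.
        z0_pred = [(zi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
                   for zi, e in zip(z, eps)]
        z = [math.sqrt(a_prev) * z0 + math.sqrt(1.0 - a_prev) * e
             for z0, e in zip(z0_pred, eps)]
    return z


abar = make_abar(100)
z0_true = [0.3, -1.2, 0.7]   # toy latent (the paper uses C = 256)
eps_true = [0.5, -0.4, 1.1]  # toy noise realization
z_T = [math.sqrt(abar[100]) * z + math.sqrt(1.0 - abar[100]) * e
       for z, e in zip(z0_true, eps_true)]


def oracle_eps(z, t):
    """Stand-in for the trained denoiser: returns the exact noise (checking only)."""
    return [(zi - math.sqrt(abar[t]) * z0) / math.sqrt(1.0 - abar[t])
            for zi, z0 in zip(z, z0_true)]


z_hat = ddim_sample(z_T, oracle_eps, abar, [100, 75, 50, 25, 0])
```

With a perfect noise predictor the four-step trajectory recovers the clean latent, so in practice the quality of the short sampler depends on how accurate the learned $ϵ_{θ}$ is at the selected timesteps.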

4.2. Comparison with Existing Methods

The proposed HS-DiffIR framework is benchmarked against representative dehazing approaches, categorized into the following:

(1) Prior-based methods: DCP [7];

(2) CNN-based methods: AOD-Net [13] and GridDehazeNet [12];

(3) Transformer-based methods: Uformer [28], AIDNet [50], and DehazeFormer [11];

(4) Diffusion-based frameworks: DiffIR (baseline) [18].

To ensure a fair comparison, all competing methods are retrained and evaluated on our custom dataset under identical experimental settings.

Quantitative Results. As shown in Table 1, the proposed HS-DiffIR framework demonstrates superior performance across both synthetic and real-world datasets. Specifically, compared with the state-of-the-art baseline, DiffIR, our method achieves notable overall gains in reconstruction fidelity. On the synthetic “Thin” dataset, HS-DiffIR improves the PSNR by 0.58 dB (25.55 dB vs. 24.97 dB) and reduces LPIPS by roughly 10% (0.0548 vs. 0.0607), indicating better perceptual quality.

More importantly, the proposed method exhibits robust generalization on real-world data. On the Rice1 dataset, HS-DiffIR outperforms DiffIR by a significant margin of 0.93 dB in PSNR. While the SSIM score on the “Moderate” subset is closely comparable to that of DiffIR, our method achieves the highest PSNR across all datasets, which supports the effectiveness of our approach in high-fidelity restoration.

Crucially, the proposed framework maintains comparable efficiency to the original DiffIR with negligible computational overhead. This confirms that the performance boost is derived from the architectural innovations, specifically H-IPR and S-API, rather than a brute-force increase in model capacity.

Qualitative Comparison. Figure 2 presents a visual comparison on a SateHaze1k Thick remote sensing scene characterized by a highly non-uniform haze distribution, where the upper-left region is covered by dense fog while the lower-right region remains relatively clear. This scenario poses a significant challenge for scale-agnostic methods.

(1): Failure of Traditional and CNN Methods

Prior-based methods like DCP (Figure 2c) fail to estimate the transmission map accurately in the dense-haze region, leaving significant residual fog and artifacts. AOD-Net (Figure 2d) suffers from severe color distortion, introducing an unnatural cyan cast across the urban structures. Although GridDehazeNet (Figure 2e) improves visibility, it yields a low-contrast result with washed-out colors.

(2): Limitations of Transformer and Baseline

The Transformer-based DehazeFormer model (Figure 2h) recovers global semantics but struggles with high-frequency fidelity, resulting in blurred edges around the building rooftops in the heavy-haze area. The diffusion baseline, DiffIR (Figure 2i), while effective in removing haze, tends to over-smooth local textures due to its reliance on a monolithic global prior that lacks spatial adaptivity.

(3): Superiority of HS-DiffIR

In contrast, HS-DiffIR (Figure 2j) achieves the best perceptual quality. Benefiting from the proposed Hierarchical Scale-Adaptive Diffusion Prior, the method successfully reconstructs plausible fine-grained details (e.g., the sharp boundaries of blue factory roofs and road markings) even in the severely degraded upper-left corner. The color restoration is vivid and natural, closely aligning with the ground truth (Figure 2b), demonstrating the effectiveness of the adaptive injection mechanism in handling spatially variant atmospheric degradation.

Figures 2–4 present a visual evaluation on the synthetic SateHaze1k benchmark at varying haze densities. In these synthetic scenes, HS-DiffIR achieves better overall visual quality than the competing methods, with stronger haze removal and more accurate color restoration.

Figure 5 and Figure 6 provide visual comparisons on the complex real-world RICE dataset. Experimental results reveal that our method possesses a favorable generalization ability in practical scenarios. Compared with the baseline, DiffIR, which tends to cause excessive texture smoothing and unnatural artifacts under uneven haze distributions, HS-DiffIR adopts a hierarchical, scale-adaptive structure. It can better recover fine-grained details while keeping global visual coherence, and achieves competitive results against existing mainstream dehazing approaches.

4.3. Ablation Study

To validate the effectiveness of the core contributions, a component-wise ablation study is conducted based on the DiffIR baseline. The impact of Hierarchical Image Prior Representation (H-IPR) and Scale-Adaptive Prior Injection (S-API) is analyzed, with results reported in Table 2.

Effectiveness of Hierarchical Decomposition (H-IPR). As shown in row (b), replacing the monolithic global prior with the proposed H-IPR yields a notable performance boost, increasing the PSNR from 22.8874 dB to 23.1091 dB compared with the baseline, DiffIR (row a). This improvement (+0.22 dB) validates the hypothesis that decomposing the latent space into scale-aware representations enables the restoration network to better capture fine-grained texture details that are often lost when using a single compressed vector.

Effectiveness of Scale-Adaptive Injection (S-API). Row (c) investigates the effectiveness of the S-API mechanism in isolation (applied to the global prior). This configuration achieves a PSNR of 22.9922 dB and increases the SSIM to 0.8830. The gain in structural similarity indicates that dynamically modulating the injection strength prevents the generative prior from introducing inconsistent artifacts, thereby preserving the structural integrity of remote sensing images.

Synergistic Effect. Finally, row (d) demonstrates the performance of the full HS-DiffIR framework. By integrating both H-IPR and S-API, the model achieves peak performance with a PSNR of 23.2747 dB and an SSIM of 0.8837. The total improvement of 0.39 dB over the baseline substantiates the complementarity of the two modules: H-IPR provides scale-aligned feature guidance, while S-API optimizes the utilization of these priors across different network stages.
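The complementarity argument can be made concrete by comparing the gain of the full model with the sum of the two single-module gains. The snippet below is a small arithmetic sketch on the PSNR column of Table 2. The variable names are illustrative, and because the table reports a single value per configuration, the interaction term should be read as indicative only.

```python
# PSNR values (dB) copied from Table 2; the key names are illustrative.
ABLATION_PSNR = {
    "baseline": 22.8874,
    "h_ipr_only": 23.1091,
    "s_api_only": 22.9922,
    "full": 23.2747,
}


def gain_over_baseline(variant):
    """PSNR improvement of a variant over the DiffIR baseline, in dB."""
    return ABLATION_PSNR[variant] - ABLATION_PSNR["baseline"]


h_gain = gain_over_baseline("h_ipr_only")
s_gain = gain_over_baseline("s_api_only")
full_gain = gain_over_baseline("full")

# A positive value means the combined model gains more than the two modules do separately.
interaction = full_gain - (h_gain + s_gain)

print(f"H-IPR only: {h_gain:+.4f} dB")
print(f"S-API only: {s_gain:+.4f} dB")
print(f"Full model: {full_gain:+.4f} dB")
print(f"Interaction term: {interaction:+.4f} dB")
```

The individual gains (0.2217 dB and 0.1048 dB) sum to 0.3265 dB, whereas the full model gains 0.3873 dB, leaving a positive interaction term of about 0.06 dB.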

4.4. Analysis of Hierarchical Disentanglement

To provide a deeper understanding of the mechanism behind the superior performance, intermediate feature maps are visualized alongside the hazy input and ground truth (GT) in Figure 7.

Reference Alignment (Col 1): The hazy input (top left) exhibits severe visibility loss and noise, while the GT (bottom left) serves as the ideal structural reference.

Texture Recovery (Enc 1-2): In the shallow stages, the proposed HS-DiffIR framework (top row) exhibits high-contrast activations that sharply trace object boundaries, effectively filtering out the haze noise observed in the input. Notably, the extracted edges closely resemble the clean textures present in the GT. In contrast, the baseline, DiffIR (bottom row), produces noisy and blurred features, indicating that feature representations remain “entangled” with haze degradation due to the lack of scale-specific guidance.

Semantic Reconstruction (Enc 3): In the deep stage, HS-DiffIR generates spatially coherent activation blocks (e.g., the uniform red area) that align closely with the semantic layout of the GT (e.g., the forest region). Conversely, DiffIR (bottom right) presents a fragmented response in which semantically continuous regions are broken into disjointed patches.

These visual comparisons support the view that HS-DiffIR guides the network to process information hierarchically: recovering texture fidelity in shallow layers and enforcing semantic consistency in deep layers.

4.5. Efficiency Analysis

To quantitatively validate the efficiency of the proposed framework, we compare the model complexity and inference speed of the baseline, DiffIR, and our HS-DiffIR. As reported in Table 3, all metrics are measured on an input size of 512 × 512 using a single NVIDIA RTX 4090 GPU. Given that Hierarchical Image Prior Representation (H-IPR) is realized through lightweight MLPs operating on a highly compact latent space and that Scale-Adaptive Prior Injection (S-API) introduces merely layer-wise scalar gating parameters, our method introduces negligible computational overhead. Specifically, the total parameter count increases by only 0.2 M (about 0.7%), and the reported FLOPs are identical to those of the baseline (451.64 G). Furthermore, since the latent dimensionality of the diffusion model remains unchanged, the inference time of HS-DiffIR (249.8 ms) is practically equivalent to that of DiffIR (248.2 ms), a gap of only 1.6 ms. These empirical results substantiate that the proposed method achieves its performance gains in remote sensing dehazing without compromising computational efficiency.
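The relative overheads implied by Table 3 can be verified with a few lines of arithmetic. This is a minimal check on the reported numbers only, and the dictionary keys and the helper name `relative_overhead_percent` are assumptions made for illustration.

```python
# Values copied from Table 3; key names are illustrative.
EFFICIENCY = {
    "DiffIR": {"params_m": 26.91, "flops_g": 451.64, "time_ms": 248.2},
    "HS-DiffIR": {"params_m": 27.11, "flops_g": 451.64, "time_ms": 249.8},
}


def relative_overhead_percent(metric):
    """Increase of HS-DiffIR over DiffIR for one metric, as a percentage of the baseline."""
    base = EFFICIENCY["DiffIR"][metric]
    ours = EFFICIENCY["HS-DiffIR"][metric]
    return 100.0 * (ours - base) / base


for metric in ("params_m", "flops_g", "time_ms"):
    print(f"{metric}: {relative_overhead_percent(metric):+.2f}%")
```

Both the parameter count and the inference time grow by less than 1% relative to the baseline, while the reported FLOPs do not change.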

5. Discussion

The proposed HS-DiffIR framework, characterized by Hierarchical Image Prior Representation (H-IPR) and Scale-Adaptive Prior Injection (S-API), introduces minimal architectural overhead into the baseline, DiffIR, yet yields substantial overall performance gains in remote sensing image dehazing. This section delves into the underlying mechanisms driving these improvements, interprets the learned modulation dynamics, and discusses the rationale behind the training strategy.

5.1. Mechanism of Hierarchical Disentanglement

A critical insight of this work is that the demand for generative guidance in remote sensing dehazing is inherently scale-dependent. In the frequency domain, atmospheric veil manifests as a low-frequency bias affecting global contrast, while scattering-induced degradation acts as a high-frequency filter suppressing local textures. A monolithic global prior, as employed in DiffIR, forces the restoration network to utilize the same latent vector for conflicting objectives: recovering global structure in deep layers and hallucinating fine details in shallow layers.
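The scale argument above can be related to the atmospheric scattering model that is widely used in the dehazing literature. The form below is the standard one, and its symbols are assumptions that may differ from the notation adopted in Section 3.1.

```latex
I(x) = J(x)\,t(x) + A\,\bigl(1 - t(x)\bigr)
```

Here I(x) is the hazy observation, J(x) the haze-free scene, t(x) the transmission map with values in [0, 1], and A the global atmospheric light. The additive term A(1 - t(x)) varies as smoothly as the transmission does, which corresponds to the low-frequency veil, whereas multiplying J(x) by t(x) < 1 shrinks local intensity differences and thereby suppresses fine texture contrast.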

The H-IPR module addresses this structural mismatch through latent disentanglement. Although the dimensionality of the diffusion latent remains unchanged, the layer-specific projectors {P_k} learn to extract distinct sub-manifolds from the global latent code Z. Specifically, projectors for shallow layers tend to filter out semantic layout information to focus on high-frequency texture codes, while deep-layer projectors emphasize semantic consistency. This hierarchical alignment allows the deterministic network to resolve the “fidelity–realism” trade-off more effectively at each resolution level, preventing the over-smoothing often observed when global priors are applied indiscriminately to local features.
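To make the idea of layer-specific projection concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each projector P_k is a two-layer MLP with a ReLU, and that the latent size (8), hidden width (16), and per-stage prior widths (4, 8, 16) take arbitrary illustrative values. None of these settings are taken from the paper, and the random weights merely stand in for learned parameters.

```python
import random


def make_linear(d_in, d_out, rng):
    """Return (weights, bias); random weights stand in for learned parameters."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out
    return weights, bias


def linear(x, layer):
    """Apply a fully connected layer to a vector given as a list of floats."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def make_projector(d_latent, d_hidden, d_out, rng):
    """Hypothetical P_k: a two-layer MLP mapping the global latent to one stage's prior."""
    return make_linear(d_latent, d_hidden, rng), make_linear(d_hidden, d_out, rng)


def project(z, projector):
    first, second = projector
    return linear(relu(linear(z, first)), second)


rng = random.Random(0)
D_LATENT, D_HIDDEN = 8, 16      # illustrative sizes, not taken from the paper
STAGE_WIDTHS = [4, 8, 16]       # one assumed prior width per network stage

z = [rng.gauss(0.0, 1.0) for _ in range(D_LATENT)]  # stand-in for the global latent Z
projectors = [make_projector(D_LATENT, D_HIDDEN, w, rng) for w in STAGE_WIDTHS]
stage_priors = [project(z, p) for p in projectors]  # one prior P_k(Z) per stage

for k, prior in enumerate(stage_priors, start=1):
    print(f"stage {k}: prior of length {len(prior)}")
```

The same latent Z feeds every stage, so its dimensionality is unchanged. Only the projectors differ, which is the property the paragraph above relies on.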

5.2. Interpreting Scale-Adaptive Calibration

Examination of the S-API mechanism demonstrates that the network autonomously optimizes the gating coefficients α_k for each scale. This suggests that effective restoration relies on balancing the prior influence, thereby highlighting the critical importance of fine-grained intensity calibration.

S-API functions as a dynamic re-weighting mechanism rather than a binary gate. In the context of generative restoration, the network must balance the reliable features extracted from the input (fidelity) and the hallucinated features provided by the diffusion model (prior). In clear regions, overly strong prior injection might introduce “hallucinated” artifacts or inconsistent textures. The learned coefficients α_k allow the network to subtly attenuate the prior’s influence in sensitive layers while maintaining strong guidance elsewhere. Compared with heavy uncertainty-estimation modules [39,51] that require pixel-wise supervision, the proposed scalar-based modulation offers a robust, lightweight alternative that achieves stability through implicit regularization.
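The difference between a binary gate and a continuous re-weighting can be illustrated with a toy example. The sketch below is written under stated assumptions: the prior acts on the features through an element-wise scale and shift, and the scalar α_k interpolates between the unmodified and the fully modulated features. The exact injection formula of S-API is not reproduced here, and the function names, the feature values, and the per-stage coefficients are all hypothetical.

```python
def modulate(features, scale, shift):
    """Assumed prior-driven affine modulation: features * scale + shift, element-wise."""
    return [f * s + b for f, s, b in zip(features, scale, shift)]


def scale_adaptive_inject(features, scale, shift, alpha):
    """Blend input features with prior-modulated features using a scalar gate alpha.

    alpha = 0 keeps the input features untouched, alpha = 1 applies the full
    modulation, and intermediate values attenuate the influence of the prior.
    """
    modulated = modulate(features, scale, shift)
    return [f + alpha * (m - f) for f, m in zip(features, modulated)]


# Toy values; the same feature vector is reused for every stage for simplicity.
features = [1.0, 2.0, 3.0]
scale = [1.5, 0.5, 1.0]     # would be derived from the stage prior in practice
shift = [0.1, 0.0, -0.2]
stage_alphas = {"enc1": 0.9, "enc2": 0.5, "enc3": 0.2}  # hypothetical learned gates

for stage, alpha in stage_alphas.items():
    out = scale_adaptive_inject(features, scale, shift, alpha)
    print(stage, [round(v, 3) for v in out])
```

Because α_k is a single scalar per stage, this form adds one parameter per injection point, which is consistent with the lightweight design described above.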

5.3. Limitations and Future Prospects

Despite its effectiveness, HS-DiffIR operates on a compact global latent space. While H-IPR mitigates the scale mismatch, it does not explicitly model spatial heterogeneity. For instance, in a large-scale scene containing both a clear forest and a hazy city, a single global vector is inevitably an average representation, potentially leading to suboptimal guidance in transition regions.

To quantitatively analyze the impact of this limitation, we conducted a sub-region evaluation on the challenging Sate1K Thick dataset. We partitioned the test scenes into “Homogeneous Regions” (areas with uniform haze distribution) and “Heterogeneous Regions” (transition areas with sharp haze density variations) based on local image variance. The evaluation reveals that while HS-DiffIR achieves 23.95 dB (PSNR) in Homogeneous Regions, its performance drops to 22.15 dB in Heterogeneous Regions. The baseline, DiffIR, exhibits an even steeper drop of 2.20 dB (from 23.50 dB to 21.30 dB). The 1.80 dB gap for our method is consistent with the fundamental limitation of using a 1D global vector: it forces an averaged prior onto spatially diverse pixels. However, the fact that our method still outperforms DiffIR by 0.85 dB in these heterogeneous zones indicates that the hierarchical and scale-adaptive mechanisms alleviate this limitation, even if they do not completely eliminate it.
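One way such a sub-region protocol could be realized is sketched below. It is an illustration under several assumptions that are not specified in the text: the variance is computed on the hazy input over non-overlapping square patches, a fixed threshold separates the two region types, and intensities lie in [0, 1]. The patch size, the threshold, and all function names are hypothetical.

```python
import math


def patch_variance(image, top, left, size):
    """Variance of the pixel intensities inside one size x size patch."""
    values = [image[r][c] for r in range(top, top + size) for c in range(left, left + size)]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def partition_patches(hazy, size, threshold):
    """Label each non-overlapping patch by comparing its variance with the threshold."""
    labels = {}
    for top in range(0, len(hazy) - size + 1, size):
        for left in range(0, len(hazy[0]) - size + 1, size):
            variance = patch_variance(hazy, top, left, size)
            labels[(top, left)] = "heterogeneous" if variance > threshold else "homogeneous"
    return labels


def region_psnr(restored, target, labels, size, region, peak=1.0):
    """PSNR computed only over the pixels of patches that carry the given label."""
    squared_error, count = 0.0, 0
    for (top, left), label in labels.items():
        if label != region:
            continue
        for r in range(top, top + size):
            for c in range(left, left + size):
                squared_error += (restored[r][c] - target[r][c]) ** 2
                count += 1
    if count == 0:
        return None  # no patch of this type in the image
    mse = squared_error / count
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


# Toy 4 x 4 example: only the top-right patch of the hazy input varies strongly.
hazy = [[0.8, 0.8, 0.2, 0.9],
        [0.8, 0.8, 0.9, 0.2],
        [0.8, 0.8, 0.8, 0.8],
        [0.8, 0.8, 0.8, 0.8]]
target = [[0.5] * 4 for _ in range(4)]
restored = [[0.55, 0.55, 0.6, 0.6],
            [0.55, 0.55, 0.6, 0.6],
            [0.55, 0.55, 0.55, 0.55],
            [0.55, 0.55, 0.55, 0.55]]

labels = partition_patches(hazy, size=2, threshold=0.01)
homo_psnr = region_psnr(restored, target, labels, 2, "homogeneous")
hetero_psnr = region_psnr(restored, target, labels, 2, "heterogeneous")
print(f"homogeneous: {homo_psnr:.2f} dB, heterogeneous: {hetero_psnr:.2f} dB")
```

In this toy case the restoration error is deliberately larger inside the high-variance patch, so the heterogeneous PSNR is lower, mirroring the direction of the gap reported above.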

Future research could explore Spatial-Adaptive Latent Diffusion, where the prior is modeled as a 2D feature map rather than a 1D vector. Although this would increase computational cost, it could offer precise spatial control for spatially variant degradation tasks such as cloud removal [2] and uneven atmospheric correction. Furthermore, extending the hierarchical injection paradigm to other generative backbones, such as GANs or consistency models, remains a promising avenue.

6. Conclusions

This work addresses the limitations of monolithic global priors in remote sensing image dehazing by introducing HS-DiffIR, a framework designed to explicitly align generative diffusion priors with the hierarchical structure of restoration networks. Through the integration of Hierarchical Image Prior Representation (H-IPR) and Scale-Adaptive Prior Injection (S-API), the compact diffusion latent space is structurally synchronized with the multi-scale features of the deterministic network.

The findings underscore that the efficacy of diffusion priors is governed not merely by generative capacity but also by the precision of the injection strategy. By leveraging lightweight latent disentanglement and dynamic intensity calibration, HS-DiffIR effectively reconciles global semantic consistency with local texture recovery. Quantitative and qualitative evaluations on the SateHaze1k and RICE benchmarks indicate that the proposed method surpasses the DiffIR baseline and the other compared methods in PSNR with negligible computational overhead, supporting its applicability in practical Earth observation tasks.

Ultimately, this study advocates for a paradigm shift in generative image restoration, moving beyond model capacity expansion toward the optimization of prior–feature interaction. Future avenues include extending this hierarchical paradigm to spatially adaptive priors to address increasingly complex atmospheric degradation. Furthermore, while this study demonstrates the effectiveness of optimizing the prior injection mechanism while keeping the base architecture (e.g., CPEN and DMTA) frozen for strict variable control, future research could explore integrating remote sensing-specific designs (e.g., multi-spectral attention or large-kernel spatial aggregators) into the backbone blocks to further strengthen the overall feature extraction capability.

Author Contributions

Methodology, W.J. and Z.L.; software, J.S.; validation, J.S. and H.C.; formal analysis, Z.L.; investigation, J.S.; visualization, J.S., Z.L. and H.C.; writing—original draft preparation, W.J. and J.S.; writing—review and editing, Z.L. and H.C.; funding acquisition, W.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research study was funded by the “Natural Science Foundation of Anhui Higher Education Institutions of China (2025AHGXZK20128)” and the “Scientific Research Startup Fund Project of Chizhou University (CZ2025YJRC118)”.

Data Availability Statement

The original data presented in the study are openly available from the following repositories: the SateHaze1k dataset (https://www.dropbox.com/s/k2i3p7puuwl2g59/Haze1k.zip?dl=0, accessed on 10 February 2026) and the RICE dataset (https://github.com/BUPTLdy/RICE_DATASET, accessed on 10 February 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.-H. Learning enriched features for fast image restoration and enhancement. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 1934–1948.
2. Zou, X.; Li, K.; Xing, J.L.; Zhang, Y.; Wang, S.Y.; Jin, L. DiffCR: A Fast Conditional Diffusion Framework for Cloud Removal From Optical Satellite Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5612014.
3. Valanarasu, J.; Yasarla, R.; Patel, V.M. TransWeather: Transformer-based restoration of images degraded by adverse weather conditions. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 18–24 June 2022; pp. 2353–2363.
4. Jiang, B.; Chen, G.; Wang, J.; Ma, H.; Wang, L.; Wang, Y.; Chen, X. Deep Dehazing Network for Remote Sensing Image with Non-Uniform Haze. Remote Sens. 2021, 13, 4443.
5. Xia, G.-S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981.
6. Yu, J.; Liang, D.; Hang, B.; Gao, H. Aerial image dehazing using reinforcement learning. Remote Sens. 2022, 14, 5998.
7. He, K.; Sun, J.; Tang, X. Single image haze removal using dark channel prior. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 2341–2353.
8. Gu, Z.; Zhan, Z.; Yuan, Q.; Yan, L. Single Remote Sensing Image Dehazing Using a Prior-Based Dense Attentive Network. Remote Sens. 2019, 11, 3008.
9. Hu, A.; Xie, Z.; Xu, Y.; Xie, M.; Wu, L.; Qiu, Q. Unsupervised Haze Removal for High-Resolution Optical Remote-Sensing Images Based on Improved Generative Adversarial Networks. Remote Sens. 2020, 12, 4162.
10. Song, T.; Fan, S.; Li, J.; Jin, J.; Jin, G.; Fan, L. Learning an Effective Transformer for Remote Sensing Satellite Image Dehazing. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5.
11. Song, Y.; He, Z.; Qian, H.; Du, X. Vision transformers for single image dehazing. IEEE Trans. Image Process. 2023, 32, 1927–1941.
12. Liu, X.; Ma, Y.; Shi, Z.; Chen, J. GridDehazeNet: Attention-based multi-scale network for image dehazing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7314–7323.
13. Li, B.; Peng, X.; Wang, Z.; Xu, J.; Feng, D. AOD-Net: All-in-one dehazing network. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4770–4778.
14. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 10684–10695.
15. Saharia, C.; Ho, J.; Chan, W.; Salimans, T.; Fleet, D.J.; Norouzi, M. Image super-resolution via iterative refinement. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 4713–4726.
16. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. (NeurIPS) 2020, 33, 6840–6851.
17. Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 3–7 May 2021.
18. Xia, B.; Zhang, Y.; Wang, S.; Wang, Y.; Wu, X.; Tian, Y.; Yang, W.; Gool, L.V. DiffIR: Efficient diffusion model for image restoration. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 13095–13105.
19. Wei, J.; Cao, Y.; Yang, K.; Chen, L.; Wu, Y. Self-Supervised Remote Sensing Image Dehazing Network Based on Zero-Shot Learning. Remote Sens. 2023, 15, 2732.
20. Chen, L.; Chu, X.; Zhang, X.; Sun, J. Simple baselines for image restoration. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 17–33.
21. Ren, W.; Pan, J.; Zhang, H.; Pan, J.; Cao, X.; Yang, M.-H. Single image dehazing via multi-scale convolutional neural networks with holistic edges. Int. J. Comput. Vis. 2020, 128, 240–259.
22. Zheng, Z.; Ren, W.; Cao, X.; Hu, X.; Wang, T.; Song, F.; Jia, X. Ultra-high-definition image dehazing via multi-guided bilateral learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR), Virtual Conference, 19–25 June 2021; pp. 16180–16189.
23. Ren, W.; Ma, L.; Zhang, J.; Pan, J.; Cao, X.; Liu, W.; Yang, M.-H. Gated fusion network for single image dehazing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3253–3261.
24. Qin, X.; Wang, Z.; Bai, Y.; Xie, X.; Jia, H. FFA-Net: Feature fusion attention network for single image dehazing. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11908–11915. Available online: https://arxiv.org/pdf/1911.07559v2 (accessed on 24 May 2026).
25. Wu, H.; Qu, Y.; Lin, S.; Zhou, J.; Qiao, R.; Zhang, Z.; Xie, Y.; Ma, L. Contrastive learning for compact single image dehazing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 10551–10560.
26. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.-H. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–20 June 2022; pp. 5728–5739.
27. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Montreal, QC, Canada, 11–17 October 2021; pp. 1833–1844.
28. Wang, Z.; Cun, X.; Bao, J.; Zhou, W.; Liu, J.; Li, H. Uformer: A general U-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–20 June 2022; pp. 17683–17693.
29. Yang, G.; Zhou, M.; Yan, K.; Liu, A.; Fu, X.; Wang, F. Memory-Augmented deep conditional unfolding network for pan-sharpening. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–20 June 2022; pp. 1788–1797.
30. Li, Z.; He, J.; Yuan, Q.; Jin, X.; Xiao, Y.; Zhang, L. PhDnet: A novel physic-aware dehazing network for remote sensing images. Inf. Fusion 2024, 107, 102277.
31. Guo, C.; Yan, Q.; Anwar, S.; Cong, R.; Ren, W.; Li, C. Image dehazing transformer with transmission-aware 3D position embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–20 June 2022; pp. 5812–5820.
32. Chi, K.; Yuan, Y.; Wang, Q. Trinity-Net: Gradient-Guided swin transformer-based remote sensing image dehazing and beyond. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–14.
33. Qin, Y.; Wang, J.; Cao, S.; Zhu, M.; Sun, J.; Hao, Z.; Jiang, X. SRBPSwin: Single-Image Super-Resolution for Remote Sensing Images Using a Global Residual Multi-Attention Hybrid Back-Projection Network Based on the Swin Transformer. Remote Sens. 2024, 16, 2252.
34. Shao, Y.; Li, L.; Ren, W.; Gao, C.; Sang, N. Domain adaptation for image dehazing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2808–2817.
35. Zheng, Y.; Su, J.; Zhang, S.; Tao, M.; Wang, L. Dehaze-TGGAN: Transformer-Guide Generative Adversarial Networks With Spatial-Spectrum Attention for Unpaired Remote Sensing Dehazing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–20.
36. Ma, L.; Mao, K.; Guo, Z. Defogging remote sensing images method based on a hybrid attention-based generative adversarial network. Smart Agric. 2025, 7, 172–182.
37. Chung, H.; Sim, B.; Ye, J.C. Come-closer-diffuse-faster: Accelerating conditional diffusion models for inverse problems through stochastic contraction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–20 June 2022; pp. 12413–12422.
38. Saharia, C.; Chan, W.; Chang, H.; Lee, C.; Ho, J.; Salimans, T.; Fleet, D.; Norouzi, M. Palette: Image-to-image diffusion models. In Proceedings of the ACM Special Interest Group on Computer Graphics and Interactive Techniques Conference, Vancouver, BC, Canada, 7–11 August 2022; pp. 1–10.
39. Li, R.; Pan, J.; Li, Z.; Tang, J. Single image dehazing via conditional generative adversarial network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8202–8211.
40. Xiao, Z.; Kreis, K.; Vahdat, A. Tackling the generative learning trilemma with denoising diffusion GANs. In Proceedings of the International Conference on Learning Representations (ICLR), Online, 25–29 April 2022.
41. Ozdenizci, O.; Legenstein, R. Restoring vision in adverse weather conditions with patch-based denoising diffusion models. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10346–10357.
42. Wang, S.; Zhang, L. Dynamic Mutual Enhancement Network for Single Remote Sensing Image Dehazing. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 3336–3340.
43. Ren, W.; Liu, S.; Zhang, H.; Pan, J.; Cao, X.; Yang, M.-H. Single image dehazing via multi-scale convolutional neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 154–169.
44. Wang, X.; Yu, K.; Dong, C.; Loy, C.C. Recovering realistic texture in image super-resolution by deep spatial feature transform. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 606–615.
45. Huang, B.; Li, Z.; Yang, C.; Sun, F.; Song, Y. Single Satellite Optical Imagery Dehazing using SAR Image Prior Based on Conditional Generative Adversarial Networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 2–5 March 2020; pp. 1806–1815.
46. Lin, D.; Xu, G.; Wang, X.; Wang, Y.; Sun, X.; Fu, K. A Remote Sensing Image Dataset for Cloud Removal. arXiv 2019, arXiv:1901.00600.
47. Hore, A.; Ziou, D. Image quality metrics: PSNR vs. SSIM. In Proceedings of the International Conference on Pattern Recognition (ICPR), Istanbul, Turkey, 23–26 August 2010; pp. 2366–2369.
48. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
49. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595.
50. Kulkarni, A.; Murala, S. Aerial image dehazing with attentive deformable transformers. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–7 January 2023; pp. 6305–6314.
51. Hong, M.; Liu, J.; Li, C.; Qu, Y. Uncertainty-driven dehazing network. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 735–743.

Figure 1. The overall architecture of the proposed HS-DiffIR framework. The framework consists of two stages: (a) Pretraining for prior-guided restoration, where a compact global latent Z is extracted from the ground truth. Unlike the baseline, we introduce H-IPR to decompose Z into multi-scale priors and S-API to dynamically modulate their injection intensity across hierarchical Transformer blocks. (b) Diffusion model training, where a conditional diffusion model learns to estimate the target latent Z solely from the hazy input. During inference, the estimated prior guides the restoration network via the learned hierarchical representations.

Figure 2. Visual comparison of a texture-rich scene with zoom-in details on a Sate1K Thick remote sensing scene. For each method, the top image shows the global restoration result, and the bottom image highlights the fine-grained details in the region of interest (red box). While the baseline, DiffIR (i), effectively removes haze, it over-smooths the texture of the vegetation. In contrast, HS-DiffIR (j) preserves sharp edges and natural textures, closely matching the ground truth (b).

Figure 3. Visual comparison of a texture-rich scene with zoom-in details on a Sate1K Moderate remote sensing scene. For each method, the top image shows the global restoration result, and the bottom image highlights the fine-grained details in the region of interest (red box).

Figure 4. Visual comparison of a texture-rich scene with zoom-in details on a Sate1K Thin remote sensing scene. For each method, the top image shows the global restoration result, and the bottom image highlights the fine-grained details in the region of interest (red box).

Figure 5. Visual comparison of a texture-rich scene with zoom-in details on a Rice1 remote sensing scene. For each method, the top image shows the global restoration result, and the bottom image highlights the fine-grained details in the region of interest (red box).

Figure 6. Visual comparison of a texture-rich scene with zoom-in details on a Rice2 remote sensing scene. For each method, the top image shows the global restoration result, and the bottom image highlights the fine-grained details in the region of interest (red box).

Figure 7. Visualization of hierarchical feature evolution. Top row: The hazy input and feature maps from HS-DiffIR. Bottom row: The ground truth and feature maps from the baseline, DiffIR. Comparing (a,b), the hazy input suffers from low contrast and noise. In the shallow stage (Enc 1), HS-DiffIR extracts sharp edges similar to the structural details in GT, whereas DiffIR is corrupted by haze noise. In the deep stage (Enc 3), HS-DiffIR recovers consistent semantic blocks (e.g., the vegetation area) matching the GT layout, while DiffIR remains fragmented. This confirms that HS-DiffIR effectively disentangles features across scales.

Table 1. Quantitative comparison of different dehazing methods. Bold indicates the best result.

Method	Sate1K Thin			Sate1K Moderate			Sate1K Thick			Rice1			Rice2
Method	PSNR	SSIM	LPIPS	PSNR	SSIM	LPIPS	PSNR	SSIM	LPIPS	PSNR	SSIM	LPIPS	PSNR	SSIM	LPIPS
DCP	17.6711	0.8674	0.1255	18.3316	0.9000	0.1284	9.3920	0.5715	0.4287	18.2124	0.8183	0.1845	16.6889	0.5762	0.5169
AOD-Net	12.9109	0.6882	0.2073	13.1130	0.6656	0.3116	13.2382	0.6792	0.2418	13.6038	0.4314	0.4211	12.2970	0.2497	0.5578
GridDehazeNet	22.7932	0.8983	0.0700	25.0822	0.9325	0.0651	20.3605	0.8267	0.1547	30.4821	0.9402	0.0516	32.3731	0.8697	0.1839
AIDNet	23.1221	0.9052	0.0603	25.0894	0.9124	0.0675	20.5650	0.8325	0.1281	29.9344	0.9402	0.0485	-	-	-
Uformer	24.7021	0.9193	0.0696	25.9305	0.9431	0.0634	22.3350	0.8541	0.1676	30.6672	0.9383	0.0624	33.6961	0.8759	0.2123
DehazeFormer	24.2275	0.9149	0.0591	25.7707	0.9418	0.0708	21.5320	0.8414	0.1605	31.6247	0.9370	0.0612	32.8495	0.8602	0.2234
DiffIR	24.9692	0.9259	0.0607	26.6415	0.9473	0.0638	22.8874	0.8784	0.1477	31.1950	0.9461	0.0702	34.1305	0.8823	0.1980
HS-DiffIR (Ours)	25.5510	0.9307	0.0548	27.2444	0.9458	0.0625	23.2747	0.8837	0.1372	32.1260	0.9481	0.0691	34.3883	0.8824	0.1939

Table 2. Ablation study on the effectiveness of the H-IPR and S-API components. The baseline is the original model, DiffIR [18].

Model Variant	H-IPR	S-API	PSNR	SSIM
(a) Baseline (DiffIR)	✗	✗	22.8874	0.8784
(b) H-IPR only	✓	✗	23.1091	0.8789
(c) S-API only	✗	✓	22.9922	0.8830
(d) H-IPR + S-API	✓	✓	23.2747	0.8837

Table 3. Efficiency comparison of the baseline (DiffIR) and the proposed HS-DiffIR framework. The metrics are evaluated on an input size of 512 × 512 using a single NVIDIA RTX 4090 GPU.

Method	Parameters (M)	FLOPs (G)	Inference Time (ms)
DiffIR (Baseline)	26.91	451.64	248.2
HS-DiffIR (Ours)	27.11	451.64	249.8

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ju, W.; Liang, Z.; Chen, H.; Shen, J. Hierarchical Scale-Adaptive Diffusion Priors for Efficient Remote Sensing Dehazing. Remote Sens. 2026, 18, 1907. https://doi.org/10.3390/rs18121907
