CGSTA-Net: A Cross-Domain Generative Prior-Assisted Structure–Texture Adaptive Network for Remote Sensing Image Dehazing

Li, Xiaoyan; Zhao, Yankun; Niu, Na

doi:10.3390/sym18061027


by Xiaoyan Li ¹, Yankun Zhao ¹ and Na Niu ²,*

¹ Aulin College, Northeast Forestry University, Harbin 150040, China

² College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China

* Author to whom correspondence should be addressed.

Symmetry 2026, 18(6), 1027; https://doi.org/10.3390/sym18061027 (registering DOI)

Submission received: 15 May 2026 / Revised: 4 June 2026 / Accepted: 12 June 2026 / Published: 14 June 2026

(This article belongs to the Section Computer)


Abstract

Image dehazing is important for the reliable interpretation of optical remote sensing images. However, current dehazing networks tend to suffer from limited receptive fields, from texture information loss caused by conventional downsampling, and from a failure to exploit complementary cross-domain information. To cope with these problems, we propose a Cross-domain Generative Prior-assisted Structure–Texture Adaptive Network (CGSTA-Net) for remote sensing image dehazing. It is a dual-stream encoder–decoder framework that enhances the domain-specific information of the RGB image and the generated prior, and then integrates them adaptively for haze-free reconstruction. To minimize information loss during downsampling, wavelet pooling is introduced to retain frequency-aware structural and textural features. Additionally, a Structure–Texture Calibration Block is designed to simultaneously enhance local frequency textures and construct sparse long-range structural dependencies, so as to achieve better restoration under spatially non-uniform haze. To fuse the representations from the RGB and generated prior images, a Prior-aware Gated Adaptive Fusion module is developed to balance the domain-specific features dynamically and preserve fine details during multi-level feature fusion. Finally, we use pixel-level contrastive learning to guide the latent space away from hazy distributions, thus enhancing the discriminability of the features. Extensive experiments on three datasets, namely RSID, RICE-I and HRSD, demonstrate that CGSTA-Net effectively restores images under varying haze conditions and significantly outperforms the latest dehazing methods in terms of visual quality and quantitative performance. Specifically, compared with the strongest competing method, CGSTA-Net increases the PSNR by 22.9% on RSID, by 13.2% on RICE-I, and by 7.2% on HRSD.

Keywords:

remote sensing image dehazing; cross-domain generative prior; wavelet pooling; structure–texture feature learning; contrastive learning

1. Introduction

High-fidelity optical remote sensing images form the basis of various earth observation activities, covering areas such as land use mapping [1], environmental monitoring [2], disaster management [3], and strategic reconnaissance [4]. Beyond human interpretation, these images play a vital role in automatic interpretation tasks such as object detection, target tracking and change detection. As such, the performance of subsequent analysis tasks depends heavily on the visual quality and consistency of the input images. In real scenarios, however, these images are frequently distorted by atmospheric haze, which degrades contrast, alters spectral characteristics and conceals spatial details. This not only degrades visual quality for human observers but also significantly affects the consistency of automatic interpretation systems, thereby limiting large-scale earth observation. To overcome this problem, image dehazing is essential as a preprocessing step to recover structure and color consistency, thus enhancing the stability and reliability of downstream remote sensing tasks.

Early dehazing methods were built on the Atmospheric Scattering Model (ASM) [5,6] combined with handcrafted priors, such as the Dark Channel Prior (DCP) [7] and the Color Attenuation Prior (CAP) [8]. Under ideal conditions, these methods could achieve good results, but they were overly dependent on prior assumptions. This made them less accurate in complex real-world scenarios and led to problems such as color distortion, loss of details, and unnatural enhancement. With the development of deep learning, the dehazing paradigm has shifted to data-driven end-to-end learning, abandoning multi-stage physical parameter estimation, which is prone to cumulative errors. Representative end-to-end supervised approaches, including GDN [9], FFA-Net [10], and CLDN [11], further advanced dehazing by incorporating attention mechanisms, multi-scale feature fusion, gradient-domain optimization, and contrastive learning to directly learn the mapping between hazy and clear images. In spite of their impressive performance, modern networks still suffer from several drawbacks. (1) The locality of convolution limits long-range dependency modeling, making it difficult to capture the spatial distribution of haze in remote sensing images. (2) Conventional downsampling loses high-frequency components, weakens edge responses and reduces structural details. (3) Although infrared-like priors can provide structure-aware cues, real paired RGB-IR dehazing data are rarely available in remote sensing. (4) Existing training strategies mainly rely on positive supervision but overlook negative hazy samples, which weakens the feature-level separability between hazy and haze-free components.

To address these limitations, we have developed the Cross-domain Generative Prior-assisted Structure–Texture Adaptive Network (CGSTA-Net), a hierarchical multi-stage dehazing framework specifically designed for the complexity of remote sensing images. In terms of architecture, CGSTA-Net adopts a dual-stream multi-level encoder–decoder structure, where the RGB restoration branch and the pseudo-IR prior branch first perform domain-specific feature enhancement and are then adaptively integrated for haze-free reconstruction. First, to reduce the information loss in conventional downsampling, we introduce a Haar Wavelet Downsampling (HWD) module. By decomposing features into low- and high-frequency sub-bands, HWD provides complementary structural cues during resolution reduction and helps preserve edge and texture information. Second, to better exploit the frequency-aware representations produced by HWD, we design a Structure–Texture Calibration Block (STCB). STCB serves as a unified calibration module that jointly models local frequency textures and long-range structural dependencies. This dual-branch design enables the network to enhance haze-degraded edges and fine details while maintaining global structural consistency under spatially non-uniform haze distributions. Third, inspired by the content-guided attention mechanism in DEA-Net [12], we develop the Prior-aware Gated Adaptive Fusion (PGAF) module for RGB-prior feature integration. PGAF generates prior-aware gated weights to adaptively balance RGB texture cues and complementary prior responses at the pixel level, thereby improving the reliability of cross-domain fusion. Finally, we adopt a training strategy inspired by the pixel-level contrastive regularization method [11], which constrains the optimization trajectory by treating clear images as positive samples and hazy inputs as negative samples. This strategy pushes the recovered features towards the clean domain and away from the haze-degraded domain. It thereby enhances feature discriminability and the model's generalization ability without adding computational overhead during inference.

In summary, the main contributions of this study are as follows:

(1) We propose CGSTA-Net, a hierarchical cross-domain remote sensing dehazing framework that introduces generated pseudo-IR images as structural priors to exploit cross-domain guidance.

(2) We introduce HWD and design STCB for frequency-aware structure–texture restoration, where HWD preserves frequency components during downsampling and STCB jointly models local textures and long-range structural dependencies.

(3) We develop PGAF, which dynamically generates gating weights to selectively integrate RGB and prior features, improving the reliability of cross-domain fusion.

(4) We adopt a pixel-level contrastive learning strategy that explicitly guides the features away from the hazy domain, enhancing the feature-level separability between clean and haze-degraded representations. Results on RICE-I [13], RSID [14], and HRSD [15] show that CGSTA-Net achieves state-of-the-art (SOTA) visual quality and quantitative accuracy in remote sensing haze removal.

2. Related Work

2.1. Prior-Based Dehazing Methods

Early dehazing methods typically relied on the ASM to reconstruct clear images, using manually defined physical priors to estimate the transmission map and atmospheric light. Among these, the DCP [7] is a representative technique. The DCP assumes that in most non-sky areas of clear scenes, the minimum intensity across the RGB channels tends to zero, which provides a powerful constraint for dehazing. However, in remote sensing scenarios with bright surfaces or homogeneous land cover, this assumption is likely to be violated, leading to errors in the transmission estimate and visible artifacts. To overcome these issues, various prior-based algorithms have been proposed. For example, CAP [8] estimates scene depth from the difference between brightness and saturation, while NCP [16] uses haze lines in the RGB space to constrain the transmission map. The channel prior has been extended to include the Bright Channel Prior (BCP) [17]. Recent developments include the optimization of transmission in BCCR [18] using boundary constraints, and the joint optimization of physical parameters in PWAB [19]. Despite these developments, prior-based methods still suffer from over-enhancement, halo artifacts, and spectral distortion when applied to complex remote sensing scenes.
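
To make the dark channel assumption concrete, the following is a minimal NumPy sketch of a DCP-style transmission estimate under the standard ASM, I = J·t + A·(1 − t). It is illustrative only: the function names, the 3 × 3 patch size, and the `omega` weight are assumptions chosen for the example and are not taken from this article.

```python
import numpy as np

def dark_channel(image, patch=3):
    """Dark channel of an H x W x 3 image with values in [0, 1]."""
    # Per-pixel minimum over the three color channels.
    min_rgb = image.min(axis=2)
    pad = patch // 2
    # Edge padding so that border pixels also see a full patch.
    padded = np.pad(min_rgb, pad, mode="edge")
    h, w = min_rgb.shape
    dark = np.empty_like(min_rgb)
    for i in range(h):
        for j in range(w):
            # Minimum over the local patch centered at (i, j).
            dark[i, j] = padded[i:i + patch, j:j + patch].min()
    return dark

def estimate_transmission(hazy, atmospheric_light, omega=0.95, patch=3):
    """DCP-style estimate: t = 1 - omega * dark_channel(I / A)."""
    normalized = hazy / atmospheric_light  # broadcast over the channel axis
    return 1.0 - omega * dark_channel(normalized, patch)
```

If the clear scene has a zero-valued channel everywhere, the dark channel of the hazy image directly reflects the haze density. Bright or homogeneous surfaces break that premise, which is the failure case described above.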

2.2. Learning-Based Dehazing Methods

Recent developments in deep learning have allowed data-driven methods to outperform earlier dehazing approaches, offering better robustness and higher-quality restored images for different types of sensors. These techniques can be broadly grouped into three categories according to their training framework: supervised, semi-supervised and unsupervised. Supervised learning is the most common and relies on paired images, which for remote sensing are typically simulated. In this supervised setting, current approaches can be broadly classified into two categories:

2.2.1. Parameter Estimation Methods

This approach employs convolutional neural networks (CNNs) to predict the intermediate physical parameters of the atmospheric scattering model (ASM), thus aiding image dehazing. DehazeNet [20] was one of the first methods in this category; it directly estimates the transmission map using an end-to-end convolutional network. AOD-Net [21] extends this idea by embedding the ASM into a learnable component, which leads to fewer reconstruction errors. To account for the diverse spectral characteristics and variable atmospheric conditions of Earth observation images, adaptive parameter estimation is also crucial. For example, Wang et al. [22] estimate the transmittance from full-color histogram features and estimate the local atmospheric light using a block thresholding approach. The Robust Light–Dark Prior [23] was proposed to ensure the stability of these estimates in inhomogeneous areas. Despite these improvements, such approaches are still restricted by ASM assumptions. As a consequence, errors in the intermediate parameters may compound during restoration, leading to structural and color degradation.

2.2.2. End-to-End Restoration Methods

In contrast, the end-to-end methods aim to directly reconstruct the image without haze, thereby bypassing the explicit estimation of atmospheric parameters. These approaches take advantage of deep learning’s excellent representational power to model nonlinear degradations, and are well able to cope with various complex and challenging Earth observation scenarios. Even in complex problems, such as varying aerosol distribution, complex scene structure, or non-uniform spectral content, these models can achieve high-quality image restoration.

Pioneering research in end-to-end restoration, such as GCA-Net [24], mainly focuses on aggregating context with smooth dilated convolution to reduce artifacts. FFA-Net [10] builds upon this by introducing a hybrid attention mechanism to adaptively emphasize haze-aware features, while GDN [9] places particular emphasis on edge enhancement through gradient-domain supervision. In our experiments, these early CNN-based methods achieved PSNR values ranging from 14.72 dB to 20.94 dB on RICE-I, and between 16.02 dB and 18.68 dB on RSID. Recent efforts have focused on advanced learning strategies and multi-domain collaboration. For example, C2PNet [25] integrates curriculum contrastive learning with physical constraints, while MB-TaylorFormer [26] adopts a Taylor-expanded multi-branch Transformer to improve long-range dependency modeling. Furthermore, Zheng et al. [27] proposed a two-domain model that enhances image restoration using haze prior data. These methods further improved restoration accuracy: MB-TaylorFormer and C2PNet are the strongest competitors on RICE-I and RSID, respectively.

The most recent developments in remote sensing restoration are adaptive feature collaboration and frequency-domain modeling. For instance, LCEFormer [28] adopts CNNs and Transformers to model local and global information, respectively. For low-altitude observation, UAVD-Net [29] exploits both multi-layer global context and adaptive local information via cross-channel attention, and can be easily extended to address non-uniform haze. Furthermore, to account for multi-scale feature variations, Wang et al. [30] proposed the Multi-Scale Adaptive Dehazing Network using dilated convolution and self-adaptive attention. Meanwhile, the Dual-Domain Feature Fusion Network proposed by Jin et al. [31] fully leverages complementary spatial and spectral features. This model fuses features from the two domains, which removes the low-frequency haze component and recovers the high-frequency spatial information that is essential to high-quality dehazing results. Such frequency-aware modeling is particularly important for remote sensing dehazing.

2.3. Recent Intelligent Learning Frameworks

Beyond image dehazing, recent intelligent learning frameworks have shown strong potential in pattern recognition, prediction, and image restoration tasks. Multimodal fusion models improve feature representation by exploiting complementary information from different sources [32]. Hybrid attention mechanisms enhance discriminative feature learning by combining channel, spatial, and long-range dependency modeling [33,34]. Sequence-based and Transformer-inspired architectures strengthen global dependency modeling, while spectral learning methods provide efficient frequency-domain alternatives for contextual representation [35,36]. Lightweight architectures further improve deployment feasibility in resource-constrained environments [37]. These advances are closely related to CGSTA-Net. The RGB-prior dual-stream design follows multimodal feature interaction, PGAF performs adaptive cross-domain fusion, STCB combines local texture enhancement with global structural modeling, and HWD introduces frequency-aware downsampling. Unlike general-purpose intelligent frameworks, CGSTA-Net is specifically designed for remote sensing image dehazing and simultaneously addresses haze degradation, detail loss, and cross-domain structure guidance.

2.4. Generative Priors for Image Restoration

Apart from network architecture design and learning strategies, auxiliary priors have been studied to boost image restoration performance. Recent image-to-image translation approaches have proven to be a practical way to build such priors when paired multi-sensor data are difficult to obtain. Pix2Pix [38] learns conditional mappings from paired samples, and CycleGAN [39] performs unpaired domain translation via cycle-consistency constraints. Given mappings between visual domains, these methods can be used to create pseudo-domain images that preserve scene-level structural correspondences and serve as auxiliary guidance for the restoration task. This is especially valuable for remote sensing image dehazing, because pseudo-IR images, which deliver IR-like structure-aware cues, help the dehazing network distinguish haze-induced degradation from real scene textures. Unlike handcrafted priors, such generative priors are learned from data and can adapt to complex remote sensing scenes. In addition, they do not require exactly paired RGB-IR observations, which are often unavailable in existing remote sensing dehazing benchmarks. In this work, the generated pseudo-IR image serves as a generative cross-domain prior, providing structure-aware assistance for RGB dehazing through adaptive cross-domain feature fusion.

2.5. Contrastive Learning Mechanism

Contrastive learning is a representation learning approach that brings positive pairs together and pushes negative pairs apart in the embedding space. It has been widely applied to image restoration, and has significantly improved reconstruction quality and feature discrimination. For instance, in low-light image enhancement it makes effective use of exposure variations and a contrast-aware loss to recover original details [40]. In remote sensing image dehazing, contrastive learning is well suited to the ill-posed haze compensation problem caused by the uneven distribution of haze. Traditional pixel-wise losses often over-smooth the output and lose spatial and spectral details. By using the ground truth image as the positive sample and the hazy image as the negative sample, contrastive learning pulls the output towards the haze-free distribution and away from the hazy one. Wu et al. [11] introduced a contrastive regularization term to maximise the mutual information between the dehazing result and the clear image while minimising its relationship with the hazy image. Based on this idea, Zheng et al. [25] proposed a curriculum-based framework, which dynamically adjusts the difficulty of negative samples during training. Overall, enlarging the feature-space distance between the hazy and restored images has proven to be an effective approach, which can alleviate haze-induced artifacts and retain the high-frequency spatial details required for remote sensing interpretation.
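
The pull-and-push idea can be illustrated with a small NumPy sketch. It assumes a simple ratio of mean absolute distances computed directly on pixels. The function name, the ratio form, and the `eps` constant are assumptions for illustration and do not reproduce the exact regularizer of [11] or the loss used in this article.

```python
import numpy as np

def contrastive_regularization(restored, clear, hazy, eps=1e-7):
    """Ratio of the distance to the positive over the distance to the negative."""
    # Distance between the restored output and the clear (positive) sample.
    d_pos = np.abs(restored - clear).mean()
    # Distance between the restored output and the hazy (negative) sample.
    d_neg = np.abs(restored - hazy).mean()
    # Small when the output is near the clear image and far from the hazy one.
    return d_pos / (d_neg + eps)
```

Minimizing this ratio rewards outputs that approach the clear image while staying far from the hazy input, which a plain pixel-wise loss does not encourage explicitly.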

3. Method

3.1. Overall Architecture

Figure 1 presents the proposed model, which employs a multi-stage U-shaped encoder–decoder architecture for dehazing. We chose the U-shaped architecture because of the characteristics of haze in remote sensing images. The U-shaped design alleviates the high computational cost of capturing large-scale context with the limited receptive fields of CNNs. The hierarchical encoder enlarges the receptive field to model the global haze distribution, while the skip connections avoid information loss through the bottleneck and pass textural information directly to the decoder, which helps to preserve the structures important for remote sensing data analysis. Given a degraded RGB image $I_{rgb} \in \mathbb{R}^{3 \times H \times W}$ and the corresponding prior image $I_{prior} \in \mathbb{R}^{1 \times H \times W}$, two shallow feature extraction branches output the initial RGB and prior feature maps:

$$F_{rgb} = \phi_{rgb}(I_{rgb}), \quad F_{prior} = \phi_{prior}(I_{prior}) \tag{1}$$

where $\phi_{rgb}(\cdot)$ and $\phi_{prior}(\cdot)$ are shallow feature extraction functions, and $F_{rgb}$ and $F_{prior}$ represent the output feature maps of the RGB and prior branches, respectively.

In the hierarchical encoder, the feature resolution gradually decreases and the channel dimension gradually increases. By adopting frequency-aware decomposition, HWD retains the key high-frequency structural details with minimal computational overhead, as described in Section 3.2. STCB is deployed at the end of each encoder stage and at the bottleneck. Through a cross-domain dual-branch topology, STCB simultaneously adjusts the local frequency textures and long-range structural dependencies of the RGB and prior features, as elaborated in Section 3.3. Finally, PGAF adaptively balances and fuses the complementary features extracted from the RGB and prior branches, ensuring precise information exchange across layers. The detailed mechanism is elaborated in Section 3.4. Because the architecture is multi-level, its receptive field expands with depth and it learns increasingly abstract semantic representations of the haze degradation. Additionally, the symmetric decoder is linked to the corresponding encoder layers with U-Net-like skip connections, enabling low- and high-level information to be combined. This symmetric structure supports accurate reconstruction of fine details. The reconstructed image $J_{rec}$ is obtained as:

$$J_{rec} = I_{rgb} + R\left(\mathrm{Dec}\left(\{F_{fuse}^{l}\}_{l=1}^{L}\right)\right) \tag{2}$$

where $I_{rgb}$ is the hazy RGB input image, $F_{fuse}^{l}$ is the fused feature of the $l$-th stage (Section 3.4), $\mathrm{Dec}(\cdot)$ is the decoder, $R(\cdot)$ denotes the final 3 × 3 convolution layer for residual prediction, and $L$ is the number of encoder stages.

3.2. HWD

Traditional downsampling operations tend to attenuate high-frequency information such as edges and textures. This is particularly problematic for remote sensing image dehazing, where fine details of buildings, roads, and agricultural boundaries, as well as densely textured areas, are important for accurate restoration. To alleviate this problem, we introduce HWD, as shown in Figure 2, as an auxiliary frequency-aware downsampling branch alongside the conventional stride-based downsampling path. By leveraging the 2D Haar Wavelet Transform, HWD decomposes the feature map into orthogonal low-frequency and high-frequency sub-bands, thereby providing complementary frequency-aware structural cues during resolution reduction. In the proposed cross-domain prior-guided framework, HWD is applied to the RGB and prior feature maps in the same manner, providing frequency-preserving downsampling features for both domain branches.

For the input feature map $F_{in} \in \mathbb{R}^{C \times H \times W}$ from the RGB or prior branch, the one-dimensional decomposition process is defined as follows:

$$\begin{cases} F_{low}(x) = \frac{1}{\sqrt{2}} \left[F_{in}(2x) + F_{in}(2x + 1)\right], \\ F_{high}(x) = \frac{1}{\sqrt{2}} \left[F_{in}(2x) - F_{in}(2x + 1)\right] \end{cases} \tag{3}$$

where $F_{low}$ extracts the global macroscopic structure, while $F_{high}$ captures the fine texture details. By extending this transformation to two dimensions, the original input is divided into four frequency sub-bands: $F_{LL}$, $F_{LH}$, $F_{HL}$, and $F_{HH}$. Each sub-band has a spatial size of $\frac{H}{2} \times \frac{W}{2}$. These four components are then concatenated along the channel dimension and projected by a $1 \times 1$ convolution:

$$F_{hwd} = C_{1 \times 1}([F_{LL}, F_{LH}, F_{HL}, F_{HH}]) \tag{4}$$

where $[\cdot]$ denotes channel-wise concatenation and $C_{1 \times 1}$ converts the $4C$-channel representation into the required output dimension.

We strategically integrate the HWD mechanism into the cross-level transformation path of the encoder. Specifically, at level $l$, the feature map of each domain is processed by two parallel downsampling paths: a conventional stride-based convolutional path and an HWD-based frequency-aware path. The HWD output is projected to match the channel dimension of level $l + 1$ and then added to the corresponding backbone downsampled feature:

$$F_{out}^{l + 1} = D_{l}(F_{in}^{l}) + \mathrm{HWD}_{l}(F_{in}^{l}) \tag{5}$$

where $D_{l}(\cdot)$ represents the conventional stride-based downsampling and $\mathrm{HWD}_{l}(\cdot)$ represents the Haar wavelet downsampling branch. In this way, HWD introduces frequency selection priors into the resolution reduction stage while maintaining compatibility with the main backbone, thereby alleviating the structural information loss caused by conventional downsampling.
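
Equations (3) to (5) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the 1 × 1 convolution is written as a channel-mixing `einsum` without bias, the sub-band naming (which of LH and HL refers to which axis) follows one common convention, and `strided_path` stands in for the learned stride-based convolution, whose details are not given here.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar transform of a C x H x W array (H and W even)."""
    s = 1.0 / np.sqrt(2.0)
    # Eq. (3) applied along the width axis.
    low_w = s * (x[:, :, 0::2] + x[:, :, 1::2])
    high_w = s * (x[:, :, 0::2] - x[:, :, 1::2])
    # Eq. (3) applied again along the height axis.
    ll = s * (low_w[:, 0::2, :] + low_w[:, 1::2, :])
    lh = s * (low_w[:, 0::2, :] - low_w[:, 1::2, :])
    hl = s * (high_w[:, 0::2, :] + high_w[:, 1::2, :])
    hh = s * (high_w[:, 0::2, :] - high_w[:, 1::2, :])
    return ll, lh, hl, hh

def conv1x1(x, weight):
    """1 x 1 convolution as channel mixing; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def hwd(x, weight):
    """Eq. (4): concatenate the four sub-bands (4C channels), then project."""
    return conv1x1(np.concatenate(haar_dwt2(x), axis=0), weight)

def downsample_stage(x, strided_path, weight):
    """Eq. (5): conventional downsampling path plus the HWD path."""
    return strided_path(x) + hwd(x, weight)
```

Because the Haar basis is orthonormal, the four sub-bands together carry exactly the energy of the input, so no information is discarded before the projection. Plain strided subsampling, by contrast, drops three of every four samples.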

3.3. STCB

To address the conflict between local texture restoration and global structural consistency during cross-domain prior-assisted dehazing, we insert the STCB module (Figure 3) at the end of each encoder stage and at the bottleneck of the network. Local frequency-aware calibration is effective for restoring edges and fine structures, but lacks long-range semantic constraints. Conversely, global sparse attention can stabilize structural correspondence, but may ignore subtle high-frequency degradation. STCB therefore constructs a dual-branch structure–texture calibration mechanism consisting of the MFMSA and MSCA modules, which operate in parallel. The MFMSA branch performs frequency-guided local texture recalibration, while the MSCA branch forms sparse semantic associations between multi-scale features to ensure stable feature matching and a consistent global structure.

Given the input feature $F_{in}$ and the aggregated low-level feature $\hat{Y}$, the two branches of STCB are formulated as:

$$F_{tex} = \mathrm{MFMSA}(F_{in}), \quad F_{str} = \mathrm{MSCA}(F_{in}, \hat{Y}) \tag{6}$$

The two complementary representations are then concatenated and projected by a lightweight fusion layer:

$$F_{STCB} = F_{in} + C_{1 \times 1}([F_{tex}, F_{str}]) \tag{7}$$

where $[\cdot, \cdot]$ denotes channel-wise concatenation. The residual connection preserves the original domain-specific representation while allowing STCB to inject frequency-enhanced texture cues and sparse global structural constraints.

3.3.1. MFMSA

Atmospheric haze usually weakens high-frequency structures and introduces spatially varying degradation. To recover fine details under such conditions, MFMSA (Figure 4) combines frequency response modeling with spatial saliency estimation. For a given input feature map $F_{in} \in \mathbb{R}^{C \times H \times W}$, we employ a two-dimensional discrete cosine transform (DCT) to obtain its frequency-domain representation:

$$F_{freq}(u, v) = \sum_{h = 0}^{H - 1} \sum_{w = 0}^{W - 1} F_{in}(h, w) B_{u, v}(h, w) \tag{8}$$

in which $F_{freq}$ represents the obtained frequency components, while $B_{u, v}(h, w)$ refers to the corresponding DCT basis function.

To recalibrate the channel response based on the spectral energy distribution, we apply global average, max, and min pooling to the frequency representation. The resulting descriptors are then processed by a shared multi-layer perceptron (MLP). The channel attention map $A_{chan}$ is calculated as:

$$A_{chan} = \sigma\left(\sum_{d \in \{avg, \max, \min\}} L_{2}(\delta(L_{1}(Z_{d})))\right) \tag{9}$$

where $L_{1}$ and $L_{2}$ represent the two fully connected layers, while $\sigma$ and $\delta$ denote the Sigmoid and ReLU activation functions, respectively. $Z_{d}$ represents the compressed frequency descriptor obtained through the pooling operation $d$. The recalibrated feature is obtained as:

$$F_{calib} = F_{in} \odot A_{chan} \tag{10}$$
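
A minimal NumPy sketch of Eqs. (8) to (10) is given below. It assumes an unnormalized DCT-II basis, a bias-free shared MLP with weights `w1` of shape (C/r, C) and `w2` of shape (C, C/r), and pooling over all frequency positions. These choices and all function names are assumptions for illustration, since the exact basis normalization and MLP configuration are not specified here.

```python
import numpy as np

def dct_matrix(n):
    """Unnormalized DCT-II basis: row k holds cos(pi * (2i + 1) * k / (2n))."""
    k = np.arange(n)[:, None]  # frequency index
    i = np.arange(n)[None, :]  # spatial index
    return np.cos(np.pi * (2 * i + 1) * k / (2 * n))

def dct2(x):
    """Eq. (8) applied to every channel of a C x H x W array."""
    dh = dct_matrix(x.shape[1])
    dw = dct_matrix(x.shape[2])
    return np.einsum("uh,chw,vw->cuv", dh, x, dw)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def frequency_channel_attention(x, w1, w2):
    """Eqs. (9) and (10): pooled frequency statistics -> shared MLP -> gate."""
    freq = dct2(x)
    flat = freq.reshape(freq.shape[0], -1)
    # Average, max and min descriptors Z_d, one value per channel.
    descriptors = [flat.mean(axis=1), flat.max(axis=1), flat.min(axis=1)]
    # The same two-layer MLP (ReLU in between) is shared by all descriptors.
    logits = sum(w2 @ np.maximum(w1 @ z, 0.0) for z in descriptors)
    a_chan = sigmoid(logits)
    # Broadcast the per-channel weight over the spatial dimensions.
    return x * a_chan[:, None, None], a_chan
```

With this basis the coefficient at $(u, v) = (0, 0)$ equals the plain sum of the channel, so the descriptors mix the usual global-average cue with higher-frequency energy.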

To further enhance boundary-sensitive regions, a spatial refinement stage controls the use of foreground and background responses:

$$F_{tex} = C_{3}(\gamma_{fg} \cdot (F_{calib} \odot M_{fg}) + \gamma_{bg} \cdot (F_{calib} \odot M_{bg})) \tag{11}$$

where $C_{3}$ stands for a $3 \times 3$ convolution, and $\gamma_{fg}$ and $\gamma_{bg}$ are learnable parameters that control the contributions of foreground and background features. The foreground mask $M_{fg}$ is produced by applying a $1 \times 1$ convolution and a Sigmoid activation function, and the background mask $M_{bg}$ is set as $M_{bg} = 1 - M_{fg}$. By emphasizing structural foreground responses while suppressing background noise that is irrelevant to the haze, this design helps preserve high-frequency structural details.

Although MFMSA is effective in enhancing local frequency details, it has limited ability to model long-range semantic correspondence. It therefore needs to be complemented by a global constraint.

3.3.2. MSCA

MSCA provides the structural stability that MFMSA lacks by linking regions that are spatially distant but semantically related. To enhance computational efficiency while maintaining key cross-scale interactions, the MSCA module employs a sparse attention mechanism, which captures multi-scale dependencies and filters out redundant correlations. Figure 5 illustrates the structure of MSCA.

Specifically, MSCA takes the high-level feature map $X$ of the input domain and a multi-scale aggregated low-level feature map $\hat{Y}$. The query, key, and value embeddings are generated as:

$$\begin{cases} Q = \mathrm{Flatten}(X) W_{1}, \\ K, V = \mathrm{Split}(\mathrm{LN}(\mathrm{Flatten}(\hat{Y}) W_{2})) \end{cases} \tag{12}$$

where $\mathrm{Flatten}(\cdot)$ converts spatial features to sequence data and $\mathrm{LN}(\cdot)$ denotes layer normalization. $W_{1}$ and $W_{2}$ are learnable projection parameters.

The correlation matrix is first computed as $M = Q K^{T} / \sqrt{c}$. To obtain a more selective attention distribution, a bifurcated TopK sparse strategy is adopted:

$$M_{adj} = \alpha \cdot \mathrm{Softmax}(\mathrm{TopK}_{1}(M)) + \beta \cdot \mathrm{Softmax}(\mathrm{TopK}_{2}(M)) \tag{13}$$

where $\mathrm{TopK}_{1}(\cdot)$ and $\mathrm{TopK}_{2}(\cdot)$ respectively extract the $k_{1}$ and $k_{2}$ most important elements from $M$. The learnable parameters $\alpha$ and $\beta$ are used as adaptive weights to balance the contributions of the two sparse selections.

The final structural representation is obtained by applying the sparse attention matrix to the value embedding:

$$F_{str} = M_{adj} V \tag{14}$$
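
The sparse attention of Eqs. (13) and (14) can be sketched as follows. Two points are assumptions for illustration: the TopK selection is applied per query row, with the discarded entries set to negative infinity before the softmax, and $\alpha$ and $\beta$ are passed as fixed numbers rather than learned. The function names are hypothetical.

```python
import numpy as np

def topk_mask(scores, k):
    """Keep the k largest entries of each row; set all others to -inf."""
    # Column indices of the k largest scores in every query row.
    idx = np.argpartition(scores, -k, axis=-1)[:, -k:]
    masked = np.full_like(scores, -np.inf)
    kept = np.take_along_axis(scores, idx, axis=-1)
    np.put_along_axis(masked, idx, kept, axis=-1)
    return masked

def softmax(z):
    # Subtract the row maximum for numerical stability; exp(-inf) gives 0.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dual_topk_attention(q, k, v, k1, k2, alpha=0.5, beta=0.5):
    """Eqs. (13) and (14) for q of shape (N, c) and k, v of shape (M, c) / (M, d)."""
    m = q @ k.T / np.sqrt(q.shape[-1])  # scaled correlation matrix
    m_adj = alpha * softmax(topk_mask(m, k1)) + beta * softmax(topk_mask(m, k2))
    return m_adj @ v, m_adj
```

With a small $k_1$ and a larger $k_2$, the first term concentrates on the strongest correspondences while the second keeps a wider context. Setting both to the full key length recovers dense softmax attention.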

Overall, STCB integrates frequency-guided texture recalibration and sparse cross-scale structural modeling in a parallel topology. This design enables the network to preserve local high-frequency details while maintaining global structural consistency, thereby producing more reliable haze-aware features for subsequent cross-domain interaction.

3.4. PGAF

After the STCB module independently enhances the RGB and prior features, we introduce the PGAF strategy to perform adaptive cross-domain feature fusion. Traditional addition or concatenation operations often fail to account for the significant mismatch in receptive fields, semantic scales, and domain responses between shallow and deep feature maps. In contrast, PGAF uses a channel- and pixel-level spatial weighting mechanism that adaptively balances the contributions of different features and domains. The architecture of PGAF is outlined in Figure 6.

Specifically, given the RGB and prior features $F_{rgb}^{l}$ and $F_{prior}^{l}$ produced by STCB at the $l$-th encoder stage, the initial cross-domain representation is obtained as:

$$F_{init}^{l} = F_{rgb}^{l} + F_{prior}^{l} \tag{15}$$

Based on this initial representation, PGAF jointly exploits channel attention and spatial attention to generate a coarse importance descriptor:

$$A_{coa}^{l} = \mathrm{CA}(F_{init}^{l}) + \mathrm{SA}(F_{init}^{l}) \tag{16}$$

where $\mathrm{CA}(\cdot)$ and $\mathrm{SA}(\cdot)$ denote the channel and spatial attention operations, respectively.

The coarse descriptor is further refined by pixel attention to generate the prior-aware gating map:

$$W^{l} = \sigma(\mathrm{PA}(F_{init}^{l}, A_{coa}^{l})) \tag{17}$$

where $\mathrm{PA}(\cdot)$ is the pixel attention, implemented by a $7 \times 7$ group convolution, and $\sigma(\cdot)$ is the Sigmoid activation function. The learned weight map $W^{l}$ takes values in $[0, 1]$ and modulates the domain-specific residual responses at the pixel level.

Lastly, the RGB and prior features are fused using a residual gated fusion approach:

$$F_{fuse}^{l} = C_{1 \times 1}(F_{init}^{l} + W^{l} \odot F_{rgb}^{l} + (1 - W^{l}) \odot F_{prior}^{l}) \tag{18}$$

where $\odot$ denotes element-wise multiplication and $C_{1 \times 1}(\cdot)$ is a $1 \times 1$ convolution that refines the channels. The weight map $W^{l}$ acts as a dynamic moderator: areas where visual texture information is reliable receive higher weights on the RGB features, while in haze-degraded areas, where the visible-spectrum response is weaker, more weight is given to the prior features to compensate. The residual term $F_{init}^{l}$ is introduced to stabilize the fusion process and aid gradient propagation.
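The element-wise part of Eqs. (15), (17), and (18) can be sketched as follows. This is a simplified illustration under stated assumptions: the attention stack $\mathrm{PA}(F_{init}^{l}, A_{coa}^{l})$ is replaced by a hypothetical input `gate_logits`, the final $1 \times 1$ convolution is omitted, and features are flattened to plain lists.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(f_rgb, f_prior, gate_logits):
    """Residual gated fusion of Eq. (18), before the 1x1 convolution.

    gate_logits stands in for PA(F_init, A_coa); it is supplied directly here
    instead of being computed by channel, spatial, and pixel attention.
    """
    fused = []
    for r, p, g in zip(f_rgb, f_prior, gate_logits):
        w = sigmoid(g)              # Eq. (17): per-pixel gate
        f_init = r + p              # Eq. (15): additive initial representation
        fused.append(f_init + w * r + (1.0 - w) * p)
    return fused

# Three positions with identical features but different gate values.
f_rgb = [1.0, 1.0, 1.0]
f_prior = [3.0, 3.0, 3.0]
out = gated_fusion(f_rgb, f_prior, gate_logits=[-20.0, 0.0, 20.0])
```

A strongly negative logit makes the output lean on the prior feature, a strongly positive one on the RGB feature, and a zero logit averages the two; the residual term keeps both domains present in every case.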

3.5. Loss Function

To balance the fidelity to the ground truth image and the visual quality, we adopt a hybrid objective function to train the model. Consistent with the current trend [41], the data fidelity term is the $L_{1}$ loss (Mean Absolute Error), which is better suited than the $L_{2}$ loss to retaining edges and details in the image. It is defined as follows:

$$L_{rec} = \|I_{gt} - J_{rec}\|_{1} \tag{19}$$

where $I_{gt}$ is the clean reference image and $J_{rec} = \phi(I_{rgb}, I_{prior}; w)$ is the restored output image generated by the network $\phi$ with parameters $w$. The operator $\|\cdot\|_{1}$ denotes the $L_{1}$ norm.

Although the reconstruction loss ensures pixel-level consistency, using it alone often leads to overly smooth outputs. To address this limitation, we integrate a contrastive learning paradigm for image restoration. Specifically, the hazy RGB input $I_{rgb}$ serves as the naturally aligned negative sample, while the ground truth image $I_{gt}$ provides the positive target. Using the pre-trained VGG-19 network as a fixed feature extractor, the contrastive loss is defined as follows:

$$L_{cr} = \sum_{s=1}^{3} \lambda_{s} \sum_{i=1}^{n} \omega_{i} \cdot \frac{D(E_{i}(I_{gt}^{(s)}), E_{i}(J_{rec}^{(s)}))}{D(E_{i}(I_{rgb}^{(s)}), E_{i}(J_{rec}^{(s)})) + \epsilon} \tag{20}$$

where $s$ indexes the output scale and $E_{i}$ is the feature mapping function of the $i$-th layer of the pre-trained VGG-19 network. In particular, we extract features at relu1_1 through relu5_1 to capture increasingly semantic representations. Minimizing the numerator pulls the restored output toward the clean reference, while enlarging the denominator pushes it away from the hazy RGB observation. The empirical coefficients $\lambda_{s}$ are set to $1.0$, $0.5$, and $0.25$ for $s = 1, 2, 3$ to balance the outputs at different scales. In addition, the hierarchical weights $\omega_{i}$ are set to $1/32, 1/16, 1/8, 1/4, 1$ to give more weight to the deeper layers, which carry more of the semantic structure.
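The ratio structure of Eq. (20) can be made concrete with a small sketch. It uses the scale and layer weights stated above, but several parts are assumptions: the distance $D$ is taken to be a mean absolute difference, `eps` is an arbitrary small constant, and the "features" are toy vectors rather than VGG-19 activations.

```python
SCALE_WEIGHTS = [1.0, 0.5, 0.25]                        # lambda_s for s = 1, 2, 3
LAYER_WEIGHTS = [1 / 32, 1 / 16, 1 / 8, 1 / 4, 1.0]     # omega_i, relu1_1 .. relu5_1

def l1_distance(a, b):
    """Mean absolute difference (assumed form of D in Eq. (20))."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def contrastive_loss(feats_gt, feats_rec, feats_hazy, eps=1e-7):
    """Eq. (20); each argument is indexed as feats[s][i] (scale s, layer i)."""
    loss = 0.0
    for s, lam in enumerate(SCALE_WEIGHTS):
        for i, omega in enumerate(LAYER_WEIGHTS):
            pull = l1_distance(feats_gt[s][i], feats_rec[s][i])     # to the positive
            push = l1_distance(feats_hazy[s][i], feats_rec[s][i])   # from the negative
            loss += lam * omega * pull / (push + eps)
    return loss

def constant_features(value, dim=4):
    """Toy stand-in for VGG features: the same constant vector at every scale/layer."""
    return [[[value] * dim for _ in LAYER_WEIGHTS] for _ in SCALE_WEIGHTS]

gt, hazy = constant_features(1.0), constant_features(0.0)
loss_good = contrastive_loss(gt, constant_features(0.9), hazy)   # output near the clean image
loss_bad = contrastive_loss(gt, constant_features(0.1), hazy)    # output near the hazy input
loss_perfect = contrastive_loss(gt, gt, hazy)
```

An output close to the reference and far from the hazy input gives a small ratio, whereas an output that stays near the hazy input is penalized heavily because the denominator shrinks.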

In addition, since the proposed framework introduces RGB and prior inputs simultaneously, a Cross-domain Consistency Loss is further employed to regularize the interaction between the two domains. Although RGB and prior features have different imaging characteristics, they correspond to the same scene and should remain consistent in a shared latent space. Therefore, we constrain the enhanced RGB and prior features produced by STCB as follows:

$$L_{cross} = \frac{1}{L} \sum_{l=1}^{L} \|\Pi_{rgb}(F_{rgb}^{l}) - \Pi_{prior}(F_{prior}^{l})\|_{1} \tag{21}$$

where $F_{rgb}^{l}$ and $F_{prior}^{l}$ are the enhanced RGB and prior features obtained by STCB at the $l$-th stage of the encoder, and $L$ is the number of encoder stages. $\Pi_{rgb}(\cdot)$ and $\Pi_{prior}(\cdot)$ are domain projection functions that map RGB and prior features into a shared space. Minimizing $L_{cross}$ encourages the network to maintain cross-domain consistency while retaining the complementary domain-specific information.

Finally, the overall objective function is formulated as:

$$L_{total} = L_{rec} + \lambda L_{cr} + \lambda_{cross} L_{cross} \tag{22}$$

where $\lambda$ and $\lambda_{cross}$ control the contributions of the contrastive regularization and the RGB-prior consistency constraint, respectively.

Overall, the three losses provide complementary supervision for the restoration process: $L_{rec}$ ensures pixel-level accuracy, $L_{cr}$ improves perceptual feature discrimination, and $L_{cross}$ aligns RGB and prior representations.
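As a rough illustration of how Eqs. (21) and (22) combine, the sketch below computes a cross-domain consistency term and the weighted total. The projections default to the identity, the $L_{1}$ term is averaged over elements, and the weights `lam` and `lam_cross` are placeholder values chosen for the example; none of these choices are taken from the paper.

```python
def l1_mean(a, b):
    """Element-averaged L1 distance between two flat feature lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cross_domain_loss(rgb_feats, prior_feats,
                      proj_rgb=lambda f: f, proj_prior=lambda f: f):
    """Eq. (21): mean L1 gap between projected RGB and prior features over all stages."""
    stages = len(rgb_feats)
    gaps = (l1_mean(proj_rgb(r), proj_prior(p)) for r, p in zip(rgb_feats, prior_feats))
    return sum(gaps) / stages

def total_loss(l_rec, l_cr, l_cross, lam=0.1, lam_cross=0.05):
    """Eq. (22); lam and lam_cross are hypothetical weights, not the paper's settings."""
    return l_rec + lam * l_cr + lam_cross * l_cross

# Two hypothetical encoder stages with two feature values each.
rgb_feats = [[1.0, 2.0], [3.0, 4.0]]
prior_feats = [[1.0, 1.0], [1.0, 4.0]]
l_cross = cross_domain_loss(rgb_feats, prior_feats)
l_total = total_loss(l_rec=0.15, l_cr=2.0, l_cross=l_cross)
```

Because the two regularizers enter with small weights, the reconstruction term dominates the total while the other two shape the feature space.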

4. Experiments

4.1. Experimental Settings

4.1.1. Datasets

To demonstrate the effectiveness of our proposed framework, we carried out a series of experiments on three typical remote sensing dehazing benchmark datasets: RICE-I [13], RSID [14], and HRSD [15]. These datasets cover both synthetic and real-world haze conditions and thus offer a platform for comprehensive evaluation. In particular, RSID is a synthetic dataset created using a standard physical scattering model, making it well suited to quantitative analysis under various atmospheric visibility conditions. RICE-I, on the other hand, is a real hazy dataset collected from Google Earth. With its complex, natural atmospheric haze, RICE-I helps us assess the generalizability and robustness of our network in real remote sensing applications. We also used HRSD, which is designed for high-resolution restoration. HRSD has two sub-datasets: light haze (LHID) and dense haze (DHID). LHID is further divided into test groups A and B, enabling a comprehensive analysis of different levels of visual degradation, while DHID contains heavily obscured images that test the limits of structure restoration under extreme haze conditions. The detailed statistical information regarding the pixel resolution, pixel count, and spatial resolution of these datasets is summarized in Table 1. To apply our approach to these standard benchmarks, we used image-to-image translation models, including CycleGAN [39] and Pix2Pix [38], to translate the hazy RGB inputs into pseudo-IR counterpart images for each of the three datasets. This pre-processing step builds the cross-domain inputs required for a comprehensive evaluation of CGSTA-Net under the existing benchmark settings.

4.1.2. Implementation Details

The proposed framework was implemented in PyTorch 2.1 [42]. All experiments were run on an Intel Xeon Platinum 8352V processor (12 virtual processors, each at 2.10 GHz) and an NVIDIA vGPU with 32 GB of memory. To enhance the diversity of data and reduce the risk of overfitting, various data augmentation strategies were integrated into the training stage. The detailed implementation hyper-parameters, optimization strategies, and architectural configurations are summarized in Table 2.

The batch size and training epochs in Table 2 were fixed to ensure a consistent optimization budget and fair comparison. The batch size was selected based on GPU memory constraints and training stability. In our code, the epoch parameter defines the maximum training duration rather than an automatic early-stopping criterion. To mitigate overfitting, we used data augmentation, learning-rate scheduling, EMA, and checkpoint monitoring, and the best PSNR checkpoint was used for final evaluation.
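The combination of EMA and best-PSNR checkpoint selection described above can be sketched in a framework-agnostic way. The decay value, the class name `BestPSNRTracker`, and the toy training history are illustrative assumptions; the paper does not state how these steps are coded.

```python
def ema_update(ema_params, params, decay=0.999):
    """One EMA step over flat parameter lists: ema <- decay * ema + (1 - decay) * param."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

class BestPSNRTracker:
    """Remembers the epoch with the highest validation PSNR seen so far."""

    def __init__(self):
        self.best_psnr = float("-inf")
        self.best_epoch = None

    def update(self, epoch, val_psnr):
        improved = val_psnr > self.best_psnr
        if improved:
            self.best_psnr, self.best_epoch = val_psnr, epoch
        return improved   # a training loop would save a checkpoint when this is True

ema = [0.0, 0.0]
tracker = BestPSNRTracker()
# Hypothetical (weights, validation PSNR) pairs standing in for three epochs.
history = [([1.0, 2.0], 20.1), ([1.0, 2.0], 22.4), ([1.0, 2.0], 21.9)]
for epoch, (weights, val_psnr) in enumerate(history):
    ema = ema_update(ema, weights, decay=0.5)   # small decay only to keep the toy numbers readable
    tracker.update(epoch, val_psnr)
```

Training runs for the full epoch budget, and the checkpoint from the best-scoring epoch (here the second one) is the one kept for evaluation.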

4.1.3. Evaluation Metrics

To evaluate the performance of the proposed model, three standard quantitative indicators were adopted: Peak Signal-to-Noise Ratio (PSNR) [45], Structural Similarity Index (SSIM) [46], and Learned Perceptual Image Patch Similarity (LPIPS) [47]. These indicators assess the image enhancement effect from multiple perspectives, covering pixel-level fidelity, structural integrity, and perceptual consistency with the real reference image.

PSNR [45] quantifies the accuracy of image reconstruction by measuring the intensity difference from the clear reference image. A higher PSNR value indicates a reduction in reconstruction error and an improvement in signal fidelity. Its mathematical definition is:

$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{\mathrm{MAX}_{I}^{2}}{\mathrm{MSE}(J_{rec}, I_{gt})}\right) \tag{23}$$

where $\mathrm{MAX}_{I}$ represents the peak pixel intensity. The mean squared error (MSE) is calculated as follows:

$$\mathrm{MSE} = \frac{1}{m \times n} \sum_{i=1}^{m} \sum_{j=1}^{n} (J_{rec}(i, j) - I_{gt}(i, j))^{2} \tag{24}$$
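Eqs. (23) and (24) translate directly into a few lines of code. The sketch below assumes single-channel 8-bit images stored as nested lists, so that $\mathrm{MAX}_{I} = 255$; the function names and the toy image are illustrative.

```python
import math

def mse(rec, gt):
    """Eq. (24): mean squared error over an m x n image."""
    m, n = len(gt), len(gt[0])
    return sum((rec[i][j] - gt[i][j]) ** 2 for i in range(m) for j in range(n)) / (m * n)

def psnr(rec, gt, max_i=255.0):
    """Eq. (23): peak signal-to-noise ratio in dB (infinite for identical images)."""
    err = mse(rec, gt)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_i ** 2 / err)

gt = [[100, 110], [120, 130]]
rec = [[101, 109], [121, 129]]   # every pixel is off by exactly one grey level
value = psnr(rec, gt)            # MSE = 1, so PSNR = 20 * log10(255), about 48.13 dB
```

Since the scale is logarithmic, halving the MSE raises PSNR by about 3.01 dB regardless of the starting value.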

SSIM [46] measures the consistency of an image by simultaneously evaluating luminance, contrast, and spatial structure information. A higher SSIM indicates a stronger structural consistency. Its formula is as follows:

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_{x}\mu_{y} + C_{1})(2\sigma_{xy} + C_{2})}{(\mu_{x}^{2} + \mu_{y}^{2} + C_{1})(\sigma_{x}^{2} + \sigma_{y}^{2} + C_{2})} \tag{25}$$

where $\mu$ and $\sigma^{2}$ represent the mean and variance, respectively, while $\sigma_{xy}$ is the covariance. $C_{1}$ and $C_{2}$ are constants that prevent numerical instability.
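A minimal sketch of Eq. (25) is given below. It evaluates the formula once over a whole patch using global statistics, whereas practical SSIM implementations usually average the score over local windows. The constants $C_{1} = (0.01 \cdot 255)^{2}$ and $C_{2} = (0.03 \cdot 255)^{2}$ are the commonly used defaults and are an assumption here, since the section does not state the values used.

```python
def ssim_global(x, y, max_i=255.0, k1=0.01, k2=0.03):
    """Eq. (25) with means, variances, and covariance taken over the whole patch."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1, c2 = (k1 * max_i) ** 2, (k2 * max_i) ** 2   # assumed default constants
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

reference = [50.0, 100.0, 150.0, 200.0]
low_contrast = [100.0, 115.0, 135.0, 150.0]   # same mean, compressed contrast (haze-like)
score_same = ssim_global(reference, reference)
score_hazy = ssim_global(reference, low_contrast)
```

The contrast-compressed patch keeps the same mean but scores below 1, which reflects how SSIM penalizes the loss of local contrast typical of haze.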

LPIPS [47] evaluates perceptual similarity using deep features extracted from a pre-trained neural network. A lower LPIPS indicates a smaller perceptual gap and higher visual realism. It is defined as:

$$d(x, x_{0}) = \sum_{l} \frac{1}{H_{l} W_{l}} \sum_{h, w} \|w_{l} \odot (\hat{y}_{hw}^{l} - \hat{y}_{0hw}^{l})\|_{2}^{2} \tag{26}$$

where $x$ and $x_{0}$ represent the dehazed output and the ground truth, respectively. $\hat{y}_{hw}^{l}$ and $\hat{y}_{0hw}^{l}$ denote the $L_{2}$-normalized feature vectors in the $l$-th layer. $H_{l}$ and $W_{l}$ are the dimensions of the feature map, and $w_{l}$ represents the channel-level weights.
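The structure of Eq. (26) can be illustrated without a neural network by supplying the per-layer feature maps directly. In the sketch below the features and channel weights are toy inputs (in real LPIPS they come from a pre-trained backbone and learned weights), and the small `eps` in the normalization is an assumption added to avoid division by zero.

```python
import math

def unit_normalize(vec, eps=1e-10):
    """Scale a channel vector to unit L2 length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]

def lpips_distance(feats_x, feats_x0, weights):
    """Eq. (26). feats[l] is an H_l x W_l grid of C_l-dim vectors; weights[l] has C_l entries."""
    total = 0.0
    for fx, f0, w in zip(feats_x, feats_x0, weights):
        height, width = len(fx), len(fx[0])
        layer_sum = 0.0
        for i in range(height):
            for j in range(width):
                a = unit_normalize(fx[i][j])
                b = unit_normalize(f0[i][j])
                layer_sum += sum((wc * (p - q)) ** 2 for wc, p, q in zip(w, a, b))
        total += layer_sum / (height * width)   # spatial average for this layer
    return total

# One layer, one spatial position, two channels.
weights = [[1.0, 1.0]]
d_same = lpips_distance([[[[1.0, 0.0]]]], [[[[1.0, 0.0]]]], weights)
d_scaled = lpips_distance([[[[2.0, 0.0]]]], [[[[1.0, 0.0]]]], weights)   # same direction
d_orthogonal = lpips_distance([[[[1.0, 0.0]]]], [[[[0.0, 1.0]]]], weights)
```

Because the features are normalized before comparison, the distance reacts to the direction of the feature vectors rather than their magnitude, so the rescaled example scores essentially zero.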

4.2. Comparison with State-of-the-Art Methods

4.2.1. Quantitative Evaluation

To assess our model, we compared it with ten representative dehazing methods. This comparison covered early models such as AOD-Net [21] and GCA-Net [24], as well as CNN-based networks such as DAD [48], FFA-Net [10], and C2PNet [25]. To ensure a comprehensive analysis, we also compared it against recent Transformer-inspired and hybrid architectures, including MB-TaylorFormer [26], DEA-Net [12], MixDehazeNet [49], DGFDNet [27], and DenseSNN [50]. Table 3, Table 4 and Table 5 provide detailed summaries of these quantitative results.

Table 3 presents the quantitative comparison results on the RICE-I and RSID datasets. Our method achieves the best performance across all evaluation metrics. Specifically, on the RICE-I dataset, the second-best method, MB-TaylorFormer, achieves a PSNR of 21.67 dB and an SSIM of 0.8813; our approach leads by 2.85 dB in PSNR and improves SSIM by 0.017. A similar trend is observed on the RSID dataset, where our method surpasses the strongest competing approach, C2PNet, raising PSNR from 19.45 dB to 23.91 dB, along with an improvement of 0.0207 in SSIM. Notably, our model achieves the lowest LPIPS scores among all competing methods, with 0.1264 on RICE-I and 0.1023 on RSID. Such a clear reduction in LPIPS suggests that our method not only optimizes pixel-wise accuracy but also excels at preserving perceptual similarity and fine visual details.
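The PSNR figures quoted for Table 3 can be checked against the relative gains reported in the abstract with a short calculation. The helper name is illustrative; the inputs are the values stated in the text.

```python
def relative_gain(baseline_db, ours_db):
    """Relative PSNR improvement in percent, computed on the dB values."""
    return 100.0 * (ours_db - baseline_db) / baseline_db

rice_ours = 21.67 + 2.85                        # second-best PSNR plus the reported margin
rice_gain = relative_gain(21.67, rice_ours)     # about 13.2 %
rsid_gain = relative_gain(19.45, 23.91)         # about 22.9 %

# For a single image, a PSNR gain of delta dB scales the MSE by 10 ** (-delta / 10).
rice_mse_ratio = 10 ** (-2.85 / 10)
rsid_mse_ratio = 10 ** (-(23.91 - 19.45) / 10)
```

Both percentages agree with the abstract. Because PSNR is logarithmic, a percentage of dB understates the change in error: for an individual image, a 2.85 dB gain corresponds to roughly halving the MSE, and a 4.46 dB gain to cutting it to about a third. These ratios are only indicative for dataset-level averages.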

The quantitative performance on the HRSD subsets is shown in Table 4 and Table 5. On LHID-A, our method achieved the best PSNR (23.60 dB) and SSIM (0.8344), significantly outperforming the second-place DenseSNN. Its LPIPS of 0.1672, however, was not the lowest on LHID-A. This may be due to the relatively mild and uniform distribution of haze in LHID-A: DEA-Net produces smoother results, which gives it a slight advantage in LPIPS. In contrast, CGSTA-Net tends to preserve clearer structures and local contrast, which improves indicators such as PSNR and SSIM, but may slightly increase the perceptual feature distance measured by LPIPS. Similarly, on LHID-B, our model reached a peak PSNR of 24.91 dB and SSIM of 0.8822. For the challenging DHID subset with severe haze, our model still preserves the high-frequency spatial details and achieves the highest PSNR (25.32 dB) and SSIM (0.8698), as well as the lowest LPIPS score (0.1429). Compared to the second-best GCA-Net, we achieved a 1.9 dB improvement in PSNR.

To further evaluate the generalization ability of the proposed CGSTA-Net, we trained the model only on the HRSD training set and directly tested it on RICE-I and RSID without target-domain fine-tuning. As shown in Table 6, the PSNR values on RICE-I and RSID were 16.78 dB and 18.38 dB, respectively. Although these results are lower than the performance after fine-tuning, the model still maintained reasonable restoration ability with unseen data. This performance decline is mainly due to the domain differences in datasets, including fog density, scene content, imaging conditions, and spatial resolution.

We also evaluate the cost-effectiveness and resource efficiency of the proposed model using FLOPs and parameters, as shown in Table 7. CGSTA-Net requires 99.969 G FLOPs and has 12.251 M parameters. Although its computational cost is higher than that of some lightweight methods, its FLOPs remain lower than those of several high-complexity competitors, such as FFA-Net and C2PNet. Considering its consistent improvements in PSNR, SSIM, and LPIPS across multiple datasets, CGSTA-Net achieves a reasonable balance between restoration performance and computational complexity.

4.2.2. Qualitative Evaluation

We performed extensive visual comparisons between the proposed CGSTA-Net and ten recent SOTA dehazing methods. Figure 7 and Figure 8 show visual results on the RICE-I, RSID and HRSD datasets.

Figure 7 illustrates the visual results on the RICE-I and RSID datasets. On real-world RICE-I images, most competing methods reduce haze to some extent but have difficulty preserving details and natural color. For example, FFA-Net and DGFDNet are prone to leaving residual haze artifacts, which distort the textures in the first column and the ridges in the second column. In forest scenes, the results of AOD-Net are too dark, making mountain ranges and vegetation textures difficult to perceive. Meanwhile, GCA-Net and DenseSNN are more prone to over-saturation or color shift, so their outputs show excessive contrast and a prominent red hue. In contrast, CGSTA-Net accurately recovers the real vegetation tones and mountain shapes.

For the RSID dataset, which contains synthetic haze over various land cover types, certain architectures such as AOD-Net and DGFDNet cannot fully remove haze, particularly in high-reflectivity regions such as snow scenes. In addition, although visibility is enhanced, some methods perform poorly in terms of color accuracy; for example, GCA-Net and DenseSNN introduce excessive color differences, making the soil appear in an unnatural orange-red tone. While DAD is effective in removing haze, it tends to over-smooth high-frequency details and over-brighten the image, which causes the loss of fine structural details of industrial buildings. Our method effectively removes haze without distorting the terrain's real colors and textures.

The results on the HRSD subsets are illustrated in Figure 8. The LHID-A subset includes intricate residential buildings and wind farms. In the residential scenario, these complex geometric structures are often distorted by FFA-Net and DGFDNet, so that the residential layout is hard to identify. AOD-Net introduces a slight dark green hue deviation, especially on the road, whereas our method preserves the details of the turbine blades and the grid-like pattern of the city streets. In addition, CGSTA-Net offers stable contrast between the dark and light regions, and thus achieves faithful image restoration in residential scenarios. On the LHID-B subset, SOTA algorithms like GCA-Net and DenseSNN can exhibit brightness distortion and severe color bias. For example, in the fourth column, the sea water in their results appears more yellow. DAD achieves a more uniform dehazing effect but suffers from serious color fading: the red rubber track and the surrounding roads take on an unrealistic light brown tone, and the edges of buildings become blurry. In contrast, our proposed framework effectively suppresses these artifacts and accurately restores the vivid red tones of the racetrack and the complex textures of the green fields and building roofs, which are highly consistent with the ground truth. For the DHID subset, an urban remote sensing benchmark characterized by dense fog with a uniform spatial distribution, most of the SOTA methods struggle to balance complete fog removal against structural integrity. For instance, DAD introduces obvious halo artifacts at the building boundaries and vegetation edges. FFA-Net, MB-TaylorFormer, and DGFDNet still leave a large amount of residual fog, which reduces the clarity of the urban scene. While GCA-Net removes fog more completely, it tends to over-saturate the image. Our CGSTA-Net generalizes well to non-uniform fog removal and provides precise road networks, uniform brightness, and realistic colors, supporting advanced remote sensing applications.

4.3. Ablation Study

In order to rigorously evaluate the contribution of each component in CGSTA-Net, we conducted a series of ablation studies on the DHID dataset. The RGB base denotes a plain encoder–decoder network that follows the same overall backbone scale as CGSTA-Net but removes all proposed components. The RGB-prior base introduces the prior input branch and performs simple concatenation for RGB-prior fusion. Based on these baselines, we progressively introduce the prior domain, PGAF, HWD, STCB, and PCL, demonstrating that these modules are indispensable for the overall performance of the model. The quantitative results of these experiments are summarized in Table 8, while the corresponding visual comparison is presented in Figure 9.

4.3.1. Effectiveness of Generative Prior and PGAF

We analyze the effectiveness of cross-domain input and adaptive fusion. The PSNR values of the RGB-only baseline and the RGB-Prior baseline are 19.92 dB and 20.58 dB, respectively, showing that the prior images offer additional information for remote sensing dehazing. When the prior branch is introduced, the SSIM also increases from 0.7407 to 0.7550. This is because the RGB and prior images have different imaging characteristics: the RGB image contains rich color and detail information, while the prior image mainly provides intensity and structural information. With PGAF, the SSIM is further boosted to 0.7794, while LPIPS drops from 0.2435 to 0.2198. This improvement indicates that, compared with simple concatenation, the proposed prior-aware gated fusion more effectively combines RGB texture information with complementary prior responses.

4.3.2. Effectiveness of STCB and HWD

Next, we evaluate the frequency-preserving and structure-texture modeling components. With HWD added to RGB-Prior + PGAF, the PSNR increases from 21.43 dB to 22.93 dB, while LPIPS decreases from 0.2198 to 0.1683. This indicates that wavelet-based downsampling retains more high-frequency information in the downsampled features, which helps reduce texture degradation and edge blurring. On the other hand, adding STCB yields a PSNR of 23.71 dB and an SSIM of 0.8303, a significant improvement over the variant without STCB. STCB accounts for both local detail enhancement and structural modeling, which helps to restore the outlines and fine patterns damaged by haze while also capturing broader spatial dependencies. It is worth noting that although STCB obtains higher PSNR and SSIM than HWD, its LPIPS is slightly worse, suggesting that STCB mainly benefits pixel-level fidelity through structure-texture modeling, whereas HWD is more effective in preserving perceptually relevant high-frequency textures. More importantly, when both HWD and STCB are used, the PSNR, SSIM and LPIPS values reach 24.94 dB, 0.8591 and 0.1488, respectively. This verifies their complementarity: HWD preserves low- and high-frequency representations to avoid detail loss, while STCB further refines the representations from local texture and global structure perspectives.
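To illustrate why wavelet-based downsampling can retain detail that ordinary pooling discards, the sketch below applies a one-level 2D Haar transform to a small grid. The Haar wavelet is an assumption made for this example (the section only refers to wavelet-based downsampling), and the function and sub-band names are illustrative.

```python
def haar_downsample(x):
    """One-level orthonormal 2D Haar transform of a grid with even height and width.

    Returns four half-resolution sub-bands: low-pass, horizontal detail,
    vertical detail, and diagonal detail.
    """
    low, horiz, vert, diag = [], [], [], []
    for i in range(0, len(x), 2):
        rows = [[], [], [], []]
        for j in range(0, len(x[0]), 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            rows[0].append((a + b + c + d) / 2.0)   # local average (scaled by 2)
            rows[1].append((a - b + c - d) / 2.0)   # left minus right
            rows[2].append((a + b - c - d) / 2.0)   # top minus bottom
            rows[3].append((a - b - c + d) / 2.0)   # diagonal difference
        for band, row in zip((low, horiz, vert, diag), rows):
            band.append(row)
    return low, horiz, vert, diag

def energy(grid):
    return sum(v * v for row in grid for v in row)

image = [[1.0, 3.0, 2.0, 2.0],
         [1.0, 3.0, 2.0, 2.0],
         [5.0, 5.0, 0.0, 4.0],
         [1.0, 1.0, 4.0, 0.0]]
bands = haar_downsample(image)
```

The four sub-bands together preserve the full signal energy of the input, so edges and textures move into the detail bands instead of being averaged away; average pooling, by comparison, keeps only a scaled copy of the low-pass band.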

4.3.3. Effectiveness of PCL

Finally, we study the effect of pixel-level contrastive learning (PCL). Compared with the variant without PCL, the full CGSTA-Net further improves PSNR from 24.94 dB to 25.32 dB and SSIM from 0.8591 to 0.8698, while reducing LPIPS from 0.1488 to 0.1429. Since the network architecture remains unchanged in this comparison, the improvement mainly comes from the contrastive constraint during training. This demonstrates that PCL can guide the restored features toward the haze-free domain and push them away from haze-degraded representations, improving feature discriminability without increasing inference cost.

5. Conclusions and Discussion

In this paper, we proposed CGSTA-Net, a multi-stage cross-domain dehazing framework for balancing haze removal and detail preservation in remote sensing imagery. CGSTA-Net adopts a hierarchical architecture to model haze distributions from local textures to global structures. The HWD module preserves edge and texture information by decomposing features into low- and high-frequency components during resolution reduction. Based on these frequency-aware representations, STCB jointly calibrates local texture cues and long-range structural dependencies, while PGAF adaptively integrates complementary RGB and prior features. In addition, a pixel-level contrastive learning strategy regularizes the feature space by guiding restored features toward the haze-free domain and away from the haze-degraded domain. The results on various remote sensing datasets (such as RSID, RICE-I, and HRSD) showed that CGSTA-Net achieved competitive results in objective metrics and visual quality. The improvements suggest that remote sensing image dehazing is not merely a pixel-level correction task, but a joint restoration problem involving frequency preservation, structure-texture calibration, and cross-domain feature interaction. Compared with previous CNN-based and attention-based methods, CGSTA-Net provides a more explicit framework for combining frequency-aware restoration with cross-domain structural guidance.

Despite these results, this study still has some limitations. The assessment mainly focused on relatively flat urban and forest scenarios, while terrain with slopes and mountainous areas with significant altitude changes were not fully reflected. Additionally, the current model emphasizes restoration quality, but the deployment cost for large-scale or real-time earth observation still needs further optimization. Future work will expand the assessment scope to more diverse scenarios, especially those with slopes and mountainous areas, and explore lightweight architectures to enhance real-time and large-scale remote sensing image processing capabilities.

Author Contributions

Methodology, X.L.; software, X.L.; writing—original draft preparation, X.L.; writing—review and editing, X.L., Y.Z. and N.N.; supervision, N.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The RICE-I dataset used in this study is publicly available at: https://github.com/BUPTLdy/RICE_DATASET (accessed date: 11 June 2026). The RSID dataset used in this study is publicly available at: https://github.com/chi-kaichen/Trinity-Net (accessed date: 11 June 2026). The HRSD dataset used in this study is publicly available at: https://github.com/Shan-rs/DCI-Net (accessed date: 11 June 2026).

Acknowledgments

During the preparation of this manuscript, the authors used DeepSeek-V4 for the purposes of optimizing the English expression, correcting grammatical errors, and enhancing the readability of the text. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. García, A.A. Relationship between Blue Economy, Cruise Tourism, and Urban Regeneration: Case Study of Olbia, Sardinia. J. Urban Plan. Dev. 2021, 147, 05021029.
2. Kulk, G.; Platt, T.; Dingle, J.; Jackson, T.; Jönsson, B.F.; Bouman, H.A.; Babin, M.; Brewin, R.J.W.; Doblin, M.; Estrada, M.; et al. Primary Production, an Index of Climate Change in the Ocean: Satellite-Based Estimates over Two Decades. Remote Sens. 2020, 12, 826, Correction in Remote Sens. 2021, 13, 3462. https://doi.org/10.3390/rs13173462.
3. Li, S.; Fang, H.; Zhang, Y. Determination of the Leaf Inclination Angle (LIA) through Field and Remote Sensing Methods: Current Status and Future Prospects. Remote Sens. 2023, 15, 946.
4. Lai, J.; Kang, X.; Lu, X.; Li, S. Review of Land Observation Satellite Remote Sensing Application Technology with New Generation Artificial Intelligence. Natl. Remote Sens. Bull. 2022, 26, 1530–1546.
5. Cantor, A. Optics of the atmosphere–Scattering by molecules and particles. IEEE J. Quantum Electron. 1978, 14, 698–699.
6. Narasimhan, S.G.; Nayar, S.K. Vision and the Atmosphere. Int. J. Comput. Vis. 2002, 48, 233–254.
7. He, K.; Sun, J.; Tang, X. Single Image Haze Removal Using Dark Channel Prior. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 2341–2353.
8. Zhu, Q.; Mai, J.; Shao, L. A Fast Single Image Haze Removal Algorithm Using Color Attenuation Prior. IEEE Trans. Image Process. 2015, 24, 3522–3533.
9. Liu, X.; Ma, Y.; Shi, Z.; Chen, J. GridDehazeNet: Attention-Based Multi-Scale Network for Image Dehazing. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2019; pp. 7313–7322.
10. Qin, X.; Wang, Z.; Bai, Y.; Xie, X.; Jia, H. FFA-Net: Feature Fusion Attention Network for Single Image Dehazing. arXiv 2019, arXiv:1911.07559.
11. Wu, H.; Qu, Y.; Lin, S.; Zhou, J.; Qiao, R.; Zhang, Z.; Xie, Y.; Ma, L. Contrastive Learning for Compact Single Image Dehazing. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2021; pp. 10546–10555.
12. Chen, Z.; He, Z.; Lu, Z.M. DEA-Net: Single Image Dehazing Based on Detail-Enhanced Convolution and Content-Guided Attention. IEEE Trans. Image Process. 2024, 33, 1002–1015.
13. Lin, D.; Xu, G.; Wang, X.; Wang, Y.; Sun, X.; Fu, K. A Remote Sensing Image Dataset for Cloud Removal. arXiv 2019, arXiv:1901.00600.
14. Chi, K.; Yuan, Y.; Wang, Q. Trinity-Net: Gradient-Guided Swin Transformer-Based Remote Sensing Image Dehazing and Beyond. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4702914.
15. Zhang, L.; Wang, S. Dense Haze Removal Based on Dynamic Collaborative Inference Learning for Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5631016.
16. Berman, D.; Treibitz, T.; Avidan, S. Non-local Image Dehazing. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2016; pp. 1674–1682.
17. Panagopoulos, A.; Wang, C.; Samaras, D.; Paragios, N. Estimating Shadows with the Bright Channel Cue. In Trends and Topics in Computer Vision; Kutulakos, K.N., Ed.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 1–12.
18. Meng, G.; Wang, Y.; Duan, J.; Xiang, S.; Pan, C. Efficient Image Dehazing with Boundary Constraint and Contextual Regularization. In 2013 IEEE International Conference on Computer Vision; IEEE: New York, NY, USA, 2013; pp. 617–624.
19. Yu, T.; Song, K.; Miao, P.; Yang, G.; Yang, H.; Chen, C. Nighttime Single Image Dehazing via Pixel-Wise Alpha Blending. IEEE Access 2019, 7, 114619–114630.
20. Cai, B.; Xu, X.; Jia, K.; Qing, C.; Tao, D. DehazeNet: An End-to-End System for Single Image Haze Removal. IEEE Trans. Image Process. 2016, 25, 5187–5198.
21. Li, B.; Peng, X.; Wang, Z.; Xu, J.; Feng, D. AOD-Net: All-in-One Dehazing Network. In 2017 IEEE International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2017; pp. 4780–4788.
22. Wang, H.; Ding, Y.; Zhou, X.; Yuan, G.; Sun, C. Dehazing of Panchromatic Remote Sensing Images Based on Histogram Features. Remote Sens. 2025, 17, 3479.
23. Ning, J.; Zhou, Y.; Liao, X.; Duo, B. Single Remote Sensing Image Dehazing Using Robust Light-Dark Prior. Remote Sens. 2023, 15, 938.
24. Chen, D.; He, M.; Fan, Q.; Liao, J.; Zhang, L.; Hou, D.; Yuan, L.; Hua, G. Gated Context Aggregation Network for Image Dehazing and Deraining. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV); IEEE: New York, NY, USA, 2019; pp. 1375–1383.
25. Zheng, Y.; Zhan, J.; He, S.; Dong, J.; Du, Y. Curricular Contrastive Regularization for Physics-Aware Single Image Dehazing. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2023; pp. 5785–5794.
26. Qiu, Y.; Zhang, K.; Wang, C.; Luo, W.; Li, H.; Jin, Z. MB-TaylorFormer: Multi-branch Efficient Transformer Expanded by Taylor Formula for Image Dehazing. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2023; pp. 12756–12767.
27. Zheng, L.; Li, Y.; Yu, R.; Zhang, K. Efficient Dual-domain Image Dehazing with Haze Prior Perception. arXiv 2025, arXiv:2507.11035.
28. Li, Y.; Zhang, K.; Wang, F.; Zhao, L. Remote Sensing Image Dehazing via a Local Context-Enriched Transformer (LCEFormer). Remote Sens. 2024, 16, 1422.
29. Zhou, Y.; Ning, J.; Liu, W.; Duo, B. A Dehazing Method for UAV Remote Sensing Based on Global and Local Feature Collaboration (UAVD-Net). Remote Sens. 2025, 17, 1688.
30. Wang, X.; Yuan, B.; Dong, H.; Hao, Q.; Li, Z. End-to-End Multi-Scale Adaptive Remote Sensing Image Dehazing Network. Sensors 2025, 25, 218.
31. Jin, H.; Chen, Z.; Song, Z.; Sun, K. DFFNet: A Dual-Domain Feature Fusion Network for Single Remote Sensing Image Dehazing. Sensors 2025, 25, 5125.
32. Tsai, Y.H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.P.; Salakhutdinov, R. Multimodal Transformer for Unaligned Multimodal Language Sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Florence, Italy, 2019; pp. 6558–6569.
33. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Computer Vision—ECCV 2018; Springer: Cham, Switzerland, 2018; pp. 3–19.
34. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Restormer: Efficient Transformer for High-Resolution Image Restoration. In Proceedings of the CVPR 2022, New Orleans, LA, USA, 21–24 June 2022.
35. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Advances in Neural Information Processing Systems 30; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 5998–6008.
36. Lee-Thorp, J.; Ainslie, J.; Eckstein, I.; Ontañón, S. FNet: Mixing Tokens with Fourier Transforms. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics: Seattle, WA, USA, 2022; pp. 4296–4313.
37. Mehta, S.; Rastegari, M. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. In Proceedings of the International Conference on Learning Representations, Virtual Event, 25–29 April 2022.
38. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-Image Translation with Conditional Adversarial Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2017; pp. 5967–5976.
39. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In 2017 IEEE International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2017.
40. Jiang, S.; Mei, Y.; Wang, P.; Liu, Q. Exposure difference network for low-light image enhancement. Pattern Recognit. 2024, 156, 110796.
41. Liu, P.; Zhang, H.; Zhang, K.; Lin, L.; Zuo, W. Multi-level Wavelet-CNN for Image Restoration. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); IEEE: New York, NY, USA, 2018; pp. 886–88609.
42. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv 2019, arXiv:1912.01703.
43. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980.
44. Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv 2016, arXiv:1608.03983.
45. Huynh-Thu, Q.; Ghanbari, M. Scope of validity of PSNR in image/video quality assessment. Electron. Lett. 2008, 44, 800–801.
Wang, Z.; Bovik, A.; Sheikh, H.; Simoncelli, E. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2018; pp. 586–595. [Google Scholar] [CrossRef]
Shao, Y.; Li, L.; Ren, W.; Gao, C.; Sang, N. Domain Adaptation for Image Dehazing. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2020; pp. 2805–2814. [Google Scholar] [CrossRef]
Lu, L.; Xiong, Q.; Xu, B.; Chu, D. MixDehazeNet: Mix Structure Block for Image Dehazing Network. In 2024 International Joint Conference on Neural Networks (IJCNN); IEEE: New York, NY, USA, 2024; pp. 1–10. [Google Scholar] [CrossRef]
Li, H.; Liu, H.; Liu, M.; Xiao, Y.; Li, P.; Zan, G. U-Net-Like Spiking Neural Networks for Single Image Dehazing. In 2025 International Joint Conference on Neural Networks (IJCNN); IEEE: New York, NY, USA, 2025; pp. 1–9. [Google Scholar] [CrossRef]

Figure 1. The overall architecture of the proposed CGSTA-Net.

Figure 2. Structure of HWD.

Figure 3. Structure of STCB.

Figure 4. Structure of MFMSA.

Figure 5. Structure of MSCA.

Figure 6. Structure of PGAF.

Figure 7. Visual comparison results on the RICE-I and RSID datasets. (a) Hazy image; (b) AOD-Net; (c) GCA-Net; (d) DAD; (e) FFA-Net; (f) C2PNet; (g) MB-TaylorFormer; (h) DEA-Net; (i) MixDehaze; (j) DGFDNet; (k) DenseSNN; (l) Ours; and (m) Ground truth.

Figure 8. Visual comparison results on the HRSD dataset. (a) Hazy image; (b) AOD-Net; (c) GCA-Net; (d) DAD; (e) FFA-Net; (f) C2PNet; (g) MB-TaylorFormer; (h) DEA-Net; (i) MixDehaze; (j) DGFDNet; (k) DenseSNN; (l) Ours; and (m) Ground truth.

Figure 9. Visual comparison of the ablation results on the DHID dataset. (a) Hazy image; (b) RGB base; (c) RGB-prior base; (d) RGB-prior + PGAF; (e) RGB-prior + STCB + PGAF; (f) RGB-prior + HWD + PGAF; (g) RGB-prior + HWD + STCB + PGAF; (h) CGSTA-Net; and (i) Ground truth.

Table 1. Detailed information of the datasets used in the experiments.

Dataset	Pixel Resolution	Pixel Count	Spatial Resolution
RICE-I	$512 \times 512$	262,144	∼5 m
RSID	$512 \times 512$	262,144	0.5–2 m
HRSD	$512 \times 512$	262,144	0.3–3 m

Table 2. Implementation details and training settings.

Parameter	Setting
Data augmentation	Random crop and horizontal flip
Patch size	$256 \times 256$
Optimizer	Adam [43]
Learning rate scheduler	Cosine annealing [44]
Learning rate	$1 \times 10^{-4}$ to $1 \times 10^{-6}$
Batch size	16
Training epochs	200
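
The learning-rate row of Table 2 can be made concrete with a short sketch of a cosine annealing schedule [44]. It assumes a single cosine cycle without warm restarts, updated once per epoch, decaying from $1 \times 10^{-4}$ to $1 \times 10^{-6}$ over the 200 training epochs; the table does not state these details, so they are assumptions, and the function name `cosine_annealing_lr` is hypothetical.

```python
import math


def cosine_annealing_lr(epoch, total_epochs=200, lr_max=1e-4, lr_min=1e-6):
    """Learning rate at `epoch` for one cosine cycle (assumed: no restarts).

    Defaults follow Table 2: 200 epochs, decaying from 1e-4 to 1e-6.
    """
    progress = epoch / total_epochs  # 0.0 at the start, 1.0 at the end
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, the midpoint, and the end of training.
schedule = [cosine_annealing_lr(e) for e in (0, 100, 200)]
```

Under these assumptions the schedule starts at the maximum rate, passes through the arithmetic mean of the two bounds at epoch 100, and reaches the minimum rate at epoch 200. In PyTorch [42], `torch.optim.lr_scheduler.CosineAnnealingLR` with `T_max` and `eta_min` provides a comparable schedule, although whether the authors used that class is not stated.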

Table 3. Quantitative comparison results on the RICE-I and RSID datasets. Best results are bold and second-best are underlined.

Methods	RICE-I PSNR	RICE-I SSIM	RICE-I LPIPS	RICE-I Runtime	RSID PSNR	RSID SSIM	RSID LPIPS	RSID Runtime
AOD-Net [21] (2017)	14.72	0.6584	0.2452	0.0009	18.22	0.8354	0.1816	0.0011
GCA-Net [24] (2019)	18.34	0.7361	0.3879	0.1843	17.49	0.8005	0.2358	0.0494
DAD [48] (2020)	20.94	0.8623	0.2140	0.0695	16.02	0.7680	0.2767	0.0186
FFA-Net [10] (2020)	19.89	0.8172	0.1790	0.6733	18.68	0.8536	0.1564	0.1773
C2PNet [25] (2023)	20.20	0.8777	0.1485	5.1203	19.45	0.8879	0.1378	0.5038
MB-TaylorFormer [26] (2023)	21.67	0.8813	0.1421	1.6406	17.96	0.8602	0.1558	0.6703
DEA-Net [12] (2024)	20.31	0.8703	0.1301	0.0305	19.11	0.8676	0.1405	0.0295
MixDehaze [49] (2024)	20.28	0.8551	0.1433	0.2501	18.96	0.8626	0.1438	0.0936
DGFDNet [27] (2025)	20.77	0.8539	0.1539	0.0340	18.15	0.8471	0.1613	0.0274
DenseSNN [50] (2025)	18.64	0.8388	0.2960	0.2305	19.27	0.8576	0.1666	0.1376
Ours	24.52	0.8983	0.1264	1.2651	23.91	0.9086	0.1023	0.0803

Table 4. Quantitative comparison results on the LHID-A and LHID-B datasets. Best results are bold and second-best are underlined.

Methods	LHID-A PSNR	LHID-A SSIM	LHID-A LPIPS	LHID-A Runtime	LHID-B PSNR	LHID-B SSIM	LHID-B LPIPS	LHID-B Runtime
AOD-Net [21]	20.74	0.8070	0.1889	0.0010	20.48	0.8396	0.1887	0.0012
GCA-Net [24]	20.34	0.7880	0.2403	0.1998	22.52	0.8606	0.1725	0.1976
DAD [48]	16.37	0.7184	0.2726	0.0619	17.33	0.7907	0.2108	0.0652
FFA-Net [10]	14.29	0.6999	0.2413	0.7220	14.44	0.7326	0.2276	0.6854
C2PNet [25]	18.32	0.7747	0.1809	5.2562	19.47	0.8204	0.1593	3.3637
MB-TaylorFormer [26]	17.90	0.7732	0.1776	1.6469	18.86	0.8412	0.1548	1.6406
DEA-Net [12]	20.58	0.8075	0.1577	0.0384	21.68	0.8595	0.1348	0.0378
MixDehaze [49]	18.01	0.7704	0.1877	0.2607	19.76	0.8363	0.1519	0.2875
DGFDNet [27]	15.25	0.7275	0.2173	0.0398	15.26	0.7593	0.2026	0.0416
DenseSNN [50]	21.83	0.8273	0.1594	0.5658	23.63	0.8821	0.1381	0.2080
Ours	23.60	0.8344	0.1672	1.3982	24.91	0.8822	0.1220	1.4066

Table 5. Quantitative comparison results on the DHID dataset. Best results are bold and second-best are underlined.

Methods	PSNR	SSIM	LPIPS	Runtime
AOD-Net [21]	16.90	0.7252	0.2904	0.0036
GCA-Net [24]	23.42	0.8449	0.1760	0.2301
DAD [48]	19.65	0.8131	0.2501	0.1002
FFA-Net [10]	14.25	0.6670	0.3173	1.0277
C2PNet [25]	16.89	0.7388	0.2732	3.4507
MB-TaylorFormer [26]	18.27	0.7767	0.2290	3.2166
DEA-Net [12]	19.09	0.7844	0.2126	0.0706
MixDehaze [49]	18.29	0.7587	0.2430	0.1048
DGFDNet [27]	15.29	0.6947	0.2971	0.0920
DenseSNN [50]	22.81	0.8627	0.1468	0.2190
Ours	25.32	0.8698	0.1429	1.9418
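
The relative PSNR gains quoted in the abstract can be checked against Tables 3–5 with a few lines of arithmetic. The sketch below copies the PSNR columns from the tables and computes, for each test set, the gain of the proposed method over the strongest competing method. Averaging the LHID-A, LHID-B and DHID gains to obtain a single HRSD figure is an assumption made here for illustration; the article does not state how the HRSD percentage was aggregated. The name `relative_gain` is hypothetical.

```python
# PSNR values (dB) copied from Tables 3-5, in table row order.
# The last entry of each list is the proposed method ("Ours").
psnr = {
    "RICE-I": [14.72, 18.34, 20.94, 19.89, 20.20, 21.67, 20.31, 20.28, 20.77, 18.64, 24.52],
    "RSID":   [18.22, 17.49, 16.02, 18.68, 19.45, 17.96, 19.11, 18.96, 18.15, 19.27, 23.91],
    "LHID-A": [20.74, 20.34, 16.37, 14.29, 18.32, 17.90, 20.58, 18.01, 15.25, 21.83, 23.60],
    "LHID-B": [20.48, 22.52, 17.33, 14.44, 19.47, 18.86, 21.68, 19.76, 15.26, 23.63, 24.91],
    "DHID":   [16.90, 23.42, 19.65, 14.25, 16.89, 18.27, 19.09, 18.29, 15.29, 22.81, 25.32],
}


def relative_gain(values):
    """Percentage PSNR gain of the last entry over the best of the others."""
    *competitors, ours = values
    best = max(competitors)
    return 100.0 * (ours - best) / best


gains = {name: relative_gain(values) for name, values in psnr.items()}

# Assumption: one HRSD figure is taken as the mean over its three test sets.
hrsd_mean = (gains["LHID-A"] + gains["LHID-B"] + gains["DHID"]) / 3

for name, gain in gains.items():
    print(f"{name}: +{gain:.1f}%")
print(f"HRSD (mean of LHID-A, LHID-B, DHID): +{hrsd_mean:.1f}%")
```

Rounded to one decimal place, this gives about 13.2% on RICE-I and 22.9% on RSID, and the assumed three-way mean gives about 7.2%, which matches the figures reported in the abstract.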

Table 6. Cross-dataset generalization evaluation.

Training Dataset	Testing Dataset	Fine-Tuning	PSNR ↑	SSIM ↑	LPIPS ↓
HRSD	RICE-I	No	16.78	0.7225	0.1823
HRSD	RSID	No	18.38	0.8467	0.1795

Note: ↑ indicates that higher values are better; ↓ indicates that lower values are better.

Table 7. Model complexity comparison.

Metric	AOD-Net	GCA-Net	DAD	FFA-Net	C2PNet	MB-TaylorFormer	DEA-Net	MixDehaze	DGFDNet	DenseSNN	Ours
FLOPs (G)	0.114	18.565	83.595	287.533	460.954	31.819	34.043	56.482	13.454	37.270	99.969
Parameters (M)	0.002	0.703	54.591	4.456	7.169	2.677	3.653	6.249	2.083	4.751	12.251

Table 8. Ablation study of the proposed CGSTA-Net on the DHID dataset. Best results are bold.

Variant	RGB	Prior	HWD	STCB	PGAF	PCL	PSNR ↑	SSIM ↑	LPIPS ↓
RGB base	✓						19.92	0.7407	0.2555
RGB-Prior base	✓	✓					20.58	0.7550	0.2435
RGB-Prior + PGAF	✓	✓			✓		21.43	0.7794	0.2198
RGB-Prior + STCB + PGAF	✓	✓		✓	✓		23.71	0.8303	0.1775
RGB-Prior + HWD + PGAF	✓	✓	✓		✓		22.93	0.8177	0.1683
RGB-Prior + HWD + STCB + PGAF	✓	✓	✓	✓	✓		24.94	0.8591	0.1488
CGSTA-Net	✓	✓	✓	✓	✓	✓	25.32	0.8698	0.1429

Note: ✓ indicates that the corresponding component is included. ↑ indicates that higher values are better; ↓ indicates that lower values are better.
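
To read Table 8 as per-component contributions, the PSNR difference between each variant and the variant it extends can be computed directly. The sketch below uses the PSNR column of Table 8. The pairing of each variant with a "parent" variant is an interpretation made here rather than something the table states, and the names `psnr` and `steps` are hypothetical.

```python
# PSNR (dB) on DHID for each ablation variant, copied from Table 8.
psnr = {
    "RGB base": 19.92,
    "RGB-Prior base": 20.58,
    "RGB-Prior + PGAF": 21.43,
    "RGB-Prior + STCB + PGAF": 23.71,
    "RGB-Prior + HWD + PGAF": 22.93,
    "RGB-Prior + HWD + STCB + PGAF": 24.94,
    "CGSTA-Net": 25.32,
}

# (component added, parent variant, child variant); the pairing is assumed.
steps = [
    ("Prior", "RGB base", "RGB-Prior base"),
    ("PGAF", "RGB-Prior base", "RGB-Prior + PGAF"),
    ("STCB", "RGB-Prior + PGAF", "RGB-Prior + STCB + PGAF"),
    ("HWD", "RGB-Prior + PGAF", "RGB-Prior + HWD + PGAF"),
    ("HWD + STCB", "RGB-Prior + PGAF", "RGB-Prior + HWD + STCB + PGAF"),
    ("PCL", "RGB-Prior + HWD + STCB + PGAF", "CGSTA-Net"),
]

# PSNR gain in dB attributed to each added component.
deltas = {added: round(psnr[child] - psnr[parent], 2) for added, parent, child in steps}
total = round(psnr["CGSTA-Net"] - psnr["RGB base"], 2)

for added, delta in deltas.items():
    print(f"+{added}: {delta:+.2f} dB")
print(f"Full model vs. RGB base: {total:+.2f} dB")
```

Read this way, STCB and HWD account for most of the improvement over the RGB-prior + PGAF variant, while the generative prior, PGAF and PCL each add less than 1 dB on this dataset.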
