Hierarchical Semantic-Guided Contextual Structure-Aware Network for Spectral Satellite Image Dehazing
Abstract
1. Introduction
- To better learn semantic structure information in spectral satellite images with non-homogeneous haze, we propose a hybrid CNN–Transformer architecture, in which a hierarchical semantic guidance (HSG) module is introduced to synergistically aggregate local structure features and non-local semantic information (a minimal sketch of this dual-branch design is given after this list).
- To fully account for the inconsistent attenuation across the image, we present a cross-layer fusion (CLF) module, which is significantly more effective than a traditional skip connection at integrating cross-layer features and reinforcing attention on the spatial regions and feature channels that suffer more severe attenuation.
- We establish a hierarchical semantic-guided contextual structure-aware network (SCSNet), which can effectively restore haze-free images from non-homogeneous hazy satellite images. Our SCSNet achieves superior performance on three challenging SID datasets.
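To make the first contribution concrete, the following is a minimal PyTorch sketch of a dual-branch encoder in which a CNN branch extracts local structure features and a Transformer branch extracts non-local semantic features. The module names, channel widths, patch size, and the way the two feature streams are returned are illustrative assumptions for exposition, not the authors' exact SCSNet implementation.

```python
import torch
import torch.nn as nn

class DualBranchEncoder(nn.Module):
    """Illustrative dual-branch encoder: a CNN branch for local structure
    features and a Transformer branch for non-local semantic features.
    Channel sizes and depths are placeholders, not the paper's settings."""

    def __init__(self, in_ch=3, dim=64, heads=4):
        super().__init__()
        # CNN branch: captures local structure with small receptive fields.
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Transformer branch: models long-range (non-local) dependencies
        # over a patch-embedded version of the input.
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=4, stride=4)  # 4x4 patches
        self.transformer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim, batch_first=True
        )

    def forward(self, x):
        f_local = self.cnn(x)                    # B x dim x H x W
        tokens = self.embed(x)                   # B x dim x H/4 x W/4
        b, c, h, w = tokens.shape
        seq = tokens.flatten(2).transpose(1, 2)  # B x (h*w) x dim
        seq = self.transformer(seq)
        f_semantic = seq.transpose(1, 2).reshape(b, c, h, w)
        # Upsample semantic features so both branches share a spatial size.
        f_semantic = nn.functional.interpolate(
            f_semantic, size=f_local.shape[-2:], mode="bilinear", align_corners=False
        )
        return f_local, f_semantic


if __name__ == "__main__":
    enc = DualBranchEncoder()
    local_feat, semantic_feat = enc(torch.randn(1, 3, 64, 64))
    print(local_feat.shape, semantic_feat.shape)  # both: [1, 64, 64, 64]
```

In a full model, the two returned streams would be fused by a module such as the HSG block described in Section 3.2 rather than by plain concatenation.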
2. Related Work
2.1. Prior-Based Methods
2.2. Deep Learning-Based Methods
3. The Proposed Method
3.1. Image Encoders
3.1.1. CNN Encoder Branch
3.1.2. Transformer Encoder Branch
3.2. Hierarchical Semantic Guidance Module
- (1) Coarse-guidance: First, the local structure features and the non-local semantic features are combined via concatenation and compressed through channel reduction, yielding a coarse-guidance feature.
- (2) Fine-guidance: We then apply the softmax function to the coarse-guidance feature, generating attention activations that adaptively re-calibrate the local structure features and the non-local semantic features. The re-calibrated streams are fused to produce the hierarchical semantic guidance feature (a hedged reconstruction of both steps is sketched after this list).
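The inline equations of the HSG module did not survive extraction, so the LaTeX block below is a hedged reconstruction of the two guidance steps from the surrounding description. The symbols $F_{ls}$ (local structure features), $F_{ns}$ (non-local semantic features), $F_{cg}$, $A_{ls}$, $A_{ns}$, and $F_{hsg}$ are notation introduced here for illustration and may differ from the paper's.

```latex
% Hedged reconstruction of the HSG module (notation is assumed).
% Coarse-guidance: concatenate and compress the two feature streams.
\begin{equation}
  F_{cg} = \mathrm{Conv}_{1\times 1}\!\big(\mathrm{Concat}(F_{ls}, F_{ns})\big)
\end{equation}
% Fine-guidance: softmax attention re-calibrates both streams, and the
% re-calibrated features are fused into the hierarchical guidance feature.
\begin{equation}
  [A_{ls}, A_{ns}] = \mathrm{Softmax}(F_{cg}), \qquad
  F_{hsg} = A_{ls} \odot F_{ls} + A_{ns} \odot F_{ns}
\end{equation}
```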
3.3. Image Restoration
3.4. Loss Function
4. Experiment and Results
4.1. Datasets
- (1) SateHaze1k: The SateHaze1k dataset [41] consists of 1200 pairs of hazy images, corresponding clear images, and SAR images. The dataset covers three haze levels (thin, medium, and thick), each with 400 pairs, which is beneficial for evaluating the robustness of the proposed method. Following the previous work [41], we split the data of each haze level into training, validation, and testing sets with an 8:1:1 ratio (see the sketch after this list). In addition, to better evaluate the dehazing effect in realistic conditions, we mixed the data of the three haze levels together.
- (2) RS-Haze: The RS-Haze dataset [8] is a challenging and representative large-scale image dehazing dataset consisting of 51,300 paired images, of which 51,030 are for training and 270 are for testing. The dataset covers a variety of scenes and haze intensities, including urban, forest, beach, and mountain scenes.
- (3) RSID: The RSID dataset [1] offers a collection of 1000 image pairs, each consisting of a hazy image and its haze-free counterpart. Following the previous work [42], we randomly selected 900 of these pairs for training. The remaining 100 pairs were reserved as a distinct test set to assess the model’s ability to generalize to unseen data.
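As a concrete illustration of the SateHaze1k protocol described above (an 8:1:1 split per haze level, plus a mixed set pooling all three levels), the following Python sketch shows one way to build such splits. The folder layout, file pattern, and helper name are assumptions for illustration, not the dataset's actual structure.

```python
import random
from pathlib import Path

def split_811(pairs, seed=0):
    """Shuffle (hazy, clear) pairs and split them 8:1:1 into
    train/val/test, following the protocol described above."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (pairs[:n_train],
            pairs[n_train:n_train + n_val],
            pairs[n_train + n_val:])

# Hypothetical folder layout: SateHaze1k/{Thin,Moderate,Thick}/{hazy,clear}/*.png
root = Path("SateHaze1k")
mixed = {"train": [], "val": [], "test": []}
for level in ["Thin", "Moderate", "Thick"]:
    hazy = sorted((root / level / "hazy").glob("*.png"))
    clear = sorted((root / level / "clear").glob("*.png"))
    tr, va, te = split_811(zip(hazy, clear))
    # The mixed-haze evaluation simply pools all three levels together.
    mixed["train"] += tr
    mixed["val"] += va
    mixed["test"] += te
```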
4.2. Implementation Details and Evaluation Metrics
4.3. Comparison with State-of-the-Art Methods
4.4. Ablation Analysis
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Chi, K.; Yuan, Y.; Wang, Q. Trinity-Net: Gradient-Guided Swin Transformer-based Remote Sensing Image Dehazing and Beyond. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–14.
- Sun, H.; Luo, Z.; Ren, D.; Hu, W.; Du, B.; Yang, W.; Wan, J.; Zhang, L. Partial Siamese with Multiscale Bi-codec Networks for Remote Sensing Image Haze Removal. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–16.
- Zheng, X.; Sun, H.; Lu, X.; Xie, W. Rotation-invariant attention network for hyperspectral image classification. IEEE Trans. Image Process. 2022, 31, 4251–4265.
- He, Y.; Li, C.; Bai, T. Remote Sensing Image Haze Removal Based on Superpixel. Remote Sens. 2023, 15, 4680.
- Zheng, X.; Chen, X.; Lu, X.; Sun, B. Unsupervised change detection by cross-resolution difference learning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16.
- Jiang, B.; Wang, J.; Wu, Y.; Wang, S.; Zhang, J.; Chen, X.; Li, Y.; Li, X.; Wang, L. A Dehazing Method for Remote Sensing Image Under Nonuniform Hazy Weather Based on Deep Learning Network. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–17.
- Dong, W.; Wang, C.; Sun, H.; Teng, Y.; Liu, H.; Zhang, Y.; Zhang, K.; Li, X.; Xu, X. End-to-End Detail-Enhanced Dehazing Network for Remote Sensing Images. Remote Sens. 2024, 16, 225.
- Song, Y.; He, Z.; Qian, H.; Du, X. Vision transformers for single image dehazing. IEEE Trans. Image Process. 2023, 32, 1927–1941.
- Peng, Y.T.; Lu, Z.; Cheng, F.C.; Zheng, Y.; Huang, S.C. Image haze removal using airlight white correction, local light filter, and aerial perspective prior. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 1385–1395.
- He, K.; Sun, J.; Tang, X. Single image haze removal using dark channel prior. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 2341–2353.
- Zhu, Q.; Mai, J.; Shao, L. A fast single image haze removal algorithm using color attenuation prior. IEEE Trans. Image Process. 2015, 24, 3522–3533.
- Cai, B.; Xu, X.; Jia, K.; Qing, C.; Tao, D. DehazeNet: An end-to-end system for single image haze removal. IEEE Trans. Image Process. 2016, 25, 5187–5198.
- Li, B.; Peng, X.; Wang, Z.; Xu, J.; Feng, D. AOD-Net: All-in-one dehazing network. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4770–4778.
- Qin, X.; Wang, Z.; Bai, Y.; Xie, X.; Jia, H. FFA-Net: Feature fusion attention network for single image dehazing. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11908–11915.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 3–7 May 2021.
- Bie, Y.; Yang, S.; Huang, Y. Single Remote Sensing Image Dehazing using Gaussian and Physics-Guided Process. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5.
- Li, Y.; Chen, X. A coarse-to-fine two-stage attentive network for haze removal of remote sensing images. IEEE Geosci. Remote Sens. Lett. 2020, 18, 1751–1755.
- Kulkarni, A.; Murala, S. Aerial Image Dehazing With Attentive Deformable Transformers. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 6305–6314.
- Ning, J.; Zhou, Y.; Liao, X.; Duo, B. Single Remote Sensing Image Dehazing Using Robust Light-Dark Prior. Remote Sens. 2023, 15, 938.
- Wei, J.; Wu, Y.; Chen, L.; Yang, K.; Lian, R. Zero-shot remote sensing image dehazing based on a re-degradation haze imaging model. Remote Sens. 2022, 14, 5737.
- Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5728–5739.
- Huang, Y.; Chen, X. Single Remote Sensing Image Dehazing Using a Dual-Step Cascaded Residual Dense Network. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 3852–3856.
- Frants, V.; Agaian, S.; Panetta, K. QCNN-H: Single-image dehazing using quaternion neural networks. IEEE Trans. Cybern. 2023, 53, 5448–5458.
- Berman, D.; Avidan, S. Non-local image dehazing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1674–1682.
- Berman, D.; Treibitz, T.; Avidan, S. Air-light estimation using haze-lines. In Proceedings of the 2017 IEEE International Conference on Computational Photography (ICCP), Stanford, CA, USA, 12–14 May 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–9.
- Bui, T.M.; Kim, W. Single image dehazing using color ellipsoid prior. IEEE Trans. Image Process. 2017, 27, 999–1009.
- Peng, Y.T.; Cao, K.; Cosman, P.C. Generalization of the dark channel prior for single image restoration. IEEE Trans. Image Process. 2018, 27, 2856–2868.
- Xu, L.; Zhao, D.; Yan, Y.; Kwong, S.; Chen, J.; Duan, L.Y. IDeRs: Iterative dehazing method for single remote sensing image. Inf. Sci. 2019, 489, 50–62.
- Zhao, D.; Xu, L.; Yan, Y.; Chen, J.; Duan, L.Y. Multi-scale Optimal Fusion model for single image dehazing. Signal Process. Image Commun. 2019, 74, 253–265.
- Gu, Z.; Zhan, Z.; Yuan, Q.; Yan, L. Single remote sensing image dehazing using a prior-based dense attentive network. Remote Sens. 2019, 11, 3008.
- Sun, H.; Li, B.; Dan, Z.; Hu, W.; Du, B.; Yang, W.; Wan, J. Multi-level Feature Interaction and Efficient Non-Local Information Enhanced Channel Attention for image dehazing. Neural Netw. 2023, 163, 10–27.
- Dong, H.; Pan, J.; Xiang, L.; Hu, Z.; Zhang, X.; Wang, F.; Yang, M.H. Multi-scale boosted dehazing network with dense feature fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2157–2167.
- Qiu, Y.; Zhang, K.; Wang, C.; Luo, W.; Li, H.; Jin, Z. MB-TaylorFormer: Multi-branch efficient transformer expanded by Taylor formula for image dehazing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 12802–12813.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
- Jiang, H.; Lu, N.; Yao, L.; Zhang, X. Single image dehazing for visible remote sensing based on tagged haze thickness maps. Remote Sens. Lett. 2018, 9, 627–635.
- Jiang, B.; Chen, G.; Wang, J.; Ma, H.; Wang, L.; Wang, Y.; Chen, X. Deep dehazing network for remote sensing image with non-uniform haze. Remote Sens. 2021, 13, 4443.
- Chen, X.; Li, Y.; Dai, L.; Kong, C. Hybrid high-resolution learning for single remote sensing satellite image dehazing. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5.
- Li, S.; Zhou, Y.; Xiang, W. M2SCN: Multi-Model Self-Correcting Network for Satellite Remote Sensing Single-Image Dehazing. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5.
- Huang, Y.; Xiong, S. Remote sensing image dehazing using adaptive region-based diffusion models. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5.
- Qin, X.; Zhang, Z.; Huang, C.; Dehghan, M.; Zaiane, O.R.; Jagersand, M. U2-Net: Going deeper with nested U-structure for salient object detection. Pattern Recognit. 2020, 106, 107404.
- Huang, B.; Zhi, L.; Yang, C.; Sun, F.; Song, Y. Single satellite optical imagery dehazing using SAR image prior based on conditional generative adversarial networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1806–1813.
- Wen, Y.; Gao, T.; Zhang, J.; Li, Z.; Chen, T. Encoder-free Multi-axis Physics-aware Fusion Network for Remote Sensing Image Dehazing. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15.
- Kendall, A.; Gal, Y.; Cipolla, R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7482–7491.
- Ullah, H.; Muhammad, K.; Irfan, M.; Anwar, S.; Sajjad, M.; Imran, A.S.; de Albuquerque, V.H.C. Light-DehazeNet: A novel lightweight CNN architecture for single image dehazing. IEEE Trans. Image Process. 2021, 30, 8968–8982.
Quantitative comparison (PSNR/SSIM) on the SateHaze1k dataset:

| Methods | Thin Fog PSNR | Thin Fog SSIM | Moderate Fog PSNR | Moderate Fog SSIM | Thick Fog PSNR | Thick Fog SSIM | Mixed Fog PSNR | Mixed Fog SSIM |
|---|---|---|---|---|---|---|---|---|
| DCP [10] | 19.1183 | 0.8518 | 19.8384 | 0.8812 | 16.7930 | 0.7701 | 18.5833 | 0.8344 |
| CEP [26] | 13.5997 | 0.7222 | 14.2122 | 0.7270 | 16.0824 | 0.7762 | 14.6950 | 0.7512 |
| MOF [29] | 15.3891 | 0.7291 | 14.7418 | 0.6256 | 16.2495 | 0.6767 | 15.5146 | 0.6859 |
| AOD-Net [13] | 19.0548 | 0.7777 | 19.4211 | 0.7015 | 16.4672 | 0.7123 | 17.4859 | 0.6332 |
| Light-DehazeNet [44] | 18.4868 | 0.8658 | 18.3918 | 0.8825 | 16.7662 | 0.7697 | 17.8132 | 0.8352 |
| FFA-Net [14] | 20.1410 | 0.8582 | 22.5586 | 0.9132 | 19.1255 | 0.7976 | 21.2873 | 0.8663 |
| Restormer [21] | 20.9829 | 0.8686 | 23.1574 | 0.9036 | 19.6984 | 0.7739 | 20.7892 | 0.8379 |
| DehazeFormer [8] | 21.9274 | 0.8843 | 24.4407 | 0.9268 | 20.2133 | 0.8049 | 22.0066 | 0.8659 |
| DCRD-Net [22] | 20.8473 | 0.8767 | 23.3119 | 0.9225 | 19.7250 | 0.8121 | 21.7468 | 0.8812 |
| FCTF-Net [17] | 18.3262 | 0.8369 | 20.9057 | 0.8553 | 17.2551 | 0.6922 | 19.5883 | 0.8434 |
| SCSNet | 26.1460 | 0.9415 | 28.3501 | 0.9566 | 24.6542 | 0.9015 | 25.1759 | 0.9223 |
Quantitative comparison (PSNR/SSIM) on the RS-Haze and RSID datasets:

| Methods | RS-Haze PSNR | RS-Haze SSIM | RSID PSNR | RSID SSIM |
|---|---|---|---|---|
| DCP [10] | 18.1003 | 0.6704 | 17.3256 | 0.7927 |
| CEP [26] | 15.9097 | 0.5772 | 14.2375 | 0.7034 |
| MOF [29] | 16.1608 | 0.5628 | 14.2375 | 0.7034 |
| AOD-Net [13] | 23.9677 | 0.7207 | 18.7037 | 0.7424 |
| Light-DehazeNet [44] | 25.5965 | 0.8209 | 17.9279 | 0.8414 |
| FFA-Net [14] | 29.1932 | 0.8846 | 21.2876 | 0.9042 |
| Restormer [21] | 25.6700 | 0.7563 | 11.7240 | 0.5971 |
| DehazeFormer [8] | 29.3419 | 0.8730 | 22.6859 | 0.9118 |
| DCRD-Net [22] | 29.6780 | 0.8878 | 22.1643 | 0.8926 |
| FCTF-Net [17] | 29.6240 | 0.8958 | 20.2556 | 0.8397 |
| SCSNet | 32.2504 | 0.9271 | 25.3821 | 0.9585 |
Ablation results (PSNR/SSIM) on the three datasets:

| Datasets | Ablated Components | Baselines | PSNR | SSIM |
|---|---|---|---|---|
| SateHaze1k | Feature Encoding | w/ plain ConvE | 22.0216 | 0.8954 |
| SateHaze1k | CNN Encoder Branch | w/ CNN Encoder Branch | 21.1281 | 0.8647 |
| SateHaze1k | Transformer Encoder Branch | w/ Transformer Encoder Branch | 20.6172 | 0.8470 |
| SateHaze1k | HSG | w/o HSG | 21.6719 | 0.8927 |
| SateHaze1k | CLF | w/ add | 23.5234 | 0.9026 |
| SateHaze1k | Full model (SCSNet) | — | 25.1759 | 0.9223 |
| RS-Haze | Feature Encoding | w/ plain ConvE | 30.4251 | 0.9126 |
| RS-Haze | CNN Encoder Branch | w/ CNN Encoder Branch | 29.0637 | 0.8813 |
| RS-Haze | Transformer Encoder Branch | w/ Transformer Encoder Branch | 28.5183 | 0.8729 |
| RS-Haze | HSG | w/o HSG | 29.4806 | 0.9081 |
| RS-Haze | CLF | w/ add | 31.6208 | 0.9217 |
| RS-Haze | Full model (SCSNet) | — | 35.2504 | 0.9471 |
| RSID | Feature Encoding | w/ plain ConvE | 23.7316 | 0.9218 |
| RSID | CNN Encoder Branch | w/ CNN Encoder Branch | 22.8219 | 0.9075 |
| RSID | Transformer Encoder Branch | w/ Transformer Encoder Branch | 22.5311 | 0.8962 |
| RSID | HSG | w/o HSG | 23.2430 | 0.9168 |
| RSID | CLF | w/ add | 24.0257 | 0.9349 |
| RSID | Full model (SCSNet) | — | 25.3821 | 0.9585 |