Article

U-Shaped Dual Attention Vision Mamba Network for Satellite Remote Sensing Single-Image Dehazing

1 Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China
2 Graduate School of Science Island, University of Science and Technology of China, Hefei 230026, China
3 Key Laboratory of Optical Calibration and Characterization, Chinese Academy of Sciences, Hefei 230031, China
4 National Synchrotron Radiation Laboratory, University of Science and Technology of China, Hefei 230029, China
5 Singapore-ETH Centre, 1 Create Way, CREATE Tower, Singapore 138602, Singapore
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(6), 1055; https://doi.org/10.3390/rs17061055
Submission received: 28 January 2025 / Revised: 7 March 2025 / Accepted: 13 March 2025 / Published: 17 March 2025

Abstract

In remote sensing single-image dehazing (RSSID), adjacency effects and the multi-scale characteristics of the land surface–atmosphere system highlight the importance of a network’s effective receptive field (ERF) and its ability to capture multi-scale features. Although multi-scale hybrid models combining convolutional neural networks and Transformers show promise, the quadratic complexity of Transformers complicates the balance between ERF and efficiency. Recently, Mamba achieved a global ERF with linear complexity and excelled in modeling long-range dependencies, yet its design for sequential data and channel redundancy limits its direct applicability to RSSID. To overcome these challenges and improve performance in RSSID, we present a novel Mamba-based dehazing network, the U-shaped Dual Attention Vision Mamba Network (UDAVM-Net) for satellite RSSID, which integrates multi-path scanning and incorporates dual attention mechanisms to better capture non-uniform haze features while reducing redundancy. The core modules, Residual Vision Mamba Blocks (RVMBs), are stacked within a U-Net architecture to enhance multi-scale feature learning. Furthermore, to enhance the model’s applicability to real-world remote sensing data, we abandon the overly simplified haze image degradation models commonly used in existing works and instead adopt an atmospheric radiative transfer model combined with a cloud distortion model to construct a submeter-resolution satellite RSSID dataset. Experimental results demonstrate that UDAVM-Net consistently outperforms competing methods on the StateHaze1K dataset, our newly proposed dataset, and real-world remote sensing images, underscoring its effectiveness in diverse scenarios.

1. Introduction

Satellite remote sensing images are increasingly vital across diverse industries. However, the Earth’s near-surface atmosphere contains molecules and aerosol particles that absorb and scatter light, leading to blurring and attenuation effects. These interactions degrade and contaminate image signals, resulting in detail loss, edge blurring, reduced contrast, and color distortion. Such degradation significantly hampers machine vision and image interpretation tasks, including object recognition, surveillance, and land-cover classification. Consequently, satellite image dehazing (SID) has emerged as a crucial preprocessing step to enhance image quality and ensure the accuracy of subsequent analytical processes.
Image dehazing is inherently a highly ill-posed problem, as illustrated in Figure 1, where two parameters must be estimated simultaneously. In recent years, numerous learning-based methods [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25] have been introduced to the dehazing domain, demonstrating significantly better performance compared to traditional approaches [26,27,28,29,30,31,32,33]. These methods typically build upon the framework depicted in Figure 1 or incorporate priors for additional constraints and can be categorized into convolutional neural network-based (CNN-based) methods [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] and vision Transformer-based (ViT-based) methods [20,21,22,23]. It is noteworthy that satellite images (SIs) possess unique characteristics. In SIs, blurring and contrast degradation are predominantly caused by stray light from neighboring terrain scattering into the target pixel’s field of view. Therefore, SID must account for the influence of surrounding terrain over an extensive area. Studies indicate that the broader the range of neighboring effects considered, the more effective the atmospheric compensation of the image becomes [34]. Additionally, the performance improvement of deep learning models heavily relies on increasing network depth and expanding the effective receptive field (ERF) [35,36], enabling more pixels to contribute to inferring the values of anchor pixels. This requirement closely aligns with the challenges posed by atmospheric compensation in satellite image dehazing.
Early SID methods primarily combined the high computational efficiency of convolutional neural networks (CNNs) with multi-scale architectures. However, expanding the ERF in these models often necessitates larger convolution (Conv) kernels, significantly increasing computational complexity. In contrast, vision Transformers (ViT) [37] eliminate the reliance on convolutional structures by utilizing self-attention mechanisms to capture long-range dependencies in images. Consequently, ViT-based methods that integrate multi-scale features are well suited for SID tasks. Despite their advantages, Transformers [38] inherently suffer from quadratic computational complexity, $O(n^2)$, and the patch embedding operation in ViT-based models limits pixel interactions between patches, which is suboptimal for image data. The recently proposed Swin Transformer (Swin-T) [39] partially addresses this issue by restricting the attention mechanism to local windows, thereby achieving linear scalability to some extent. However, Swin-T still fails to provide a true global ERF, and the trade-off between ERF and computational efficiency remains unresolved. A promising alternative lies in the recent advancements of structured state space models (SSMs), which have evolved from S4 [40] to selective state space models (S6), also known as Mamba [41,42]. By combining the strengths of recurrent neural networks (RNNs) and CNNs, and drawing inspiration from classic SSM concepts such as Kalman filtering [43,44], Mamba introduces dynamic parameter adjustment mechanisms and hardware-level scanning optimizations. This approach has demonstrated outstanding performance in long-sequence tasks. Furthermore, several Mamba-based models [45,46,47,48] have successfully applied SSMs to vision-related tasks, showcasing the tremendous potential of Mamba in low-level vision applications.
In this work, we explore the potential of applying the improved Mamba module to the SID task, aiming to achieve superior dehazing performance by leveraging an expanded network ERF. The core component of our network is designated as the Residual Visual Mamba Block (RVMB). Initially, we integrate the multi-path scanning mechanism with the SSMs in Mamba, thereby proposing the multi-path scanning Selective SSMs (MPS7) module. This module replaces the multi-head attention mechanism commonly employed in Transformers to perform intra-channel spatial attention extraction. Subsequently, we utilize the residual channel attention block (RCAB) module to substitute the multi-layer perceptron (MLP) layer for inter-channel attention extraction. Multiple RVMB modules are stacked following the U-Net architecture to extract the multi-scale features of surface objects and the atmosphere. The proposed U-shaped Dual Attention Vision Mamba Network (UDAVM-Net) attains a global ERF while maintaining linear scalability, thereby enabling high-quality dehazing.
Furthermore, it is noteworthy that most synthetic remote sensing dehazing datasets are generated using simple synthesis methods based on the formula depicted in Figure 1. Some extended datasets [10,21] perform band-to-band transformations using an exponent index [49]. For the synthesis of clouds and thin haze, relevant information from the cirrus cloud band is merely overlaid without considering the coupling mechanisms between aerosols and clouds. To address this limitation, we adopt the 6S radiative transfer model [50] to simulate the contamination process of atmospheric molecules and aerosols on images, followed by a cloud distortion model [51,52], thereby creating a physically consistent synthetic dataset named remote sensing simulation of top of atmosphere (RSSTOA). RSSTOA comprises satellite remote sensing images of atmospheric molecules and aerosol particles simulated using the 6S forward radiative transfer model. Additionally, it generates heterogeneous image pairs for the simulation of thin and cirrus clouds based on a cloud distortion model. This dataset is designed to provide realistic and physically consistent scenarios for evaluating dehazing models under conditions that closely resemble real-world atmospheric phenomena.

2. Related Work

2.1. Image Dehazing Algorithms

Traditional dehazing methods, such as those based on histogram equalization, Retinex theory, and filtering techniques [27,28], aim to enhance image clarity by directly addressing haze-induced degradations. However, these top–down approaches fail to fully consider the physical mechanisms of degradation, frequently resulting in over-enhancement and disruption of grayscale relationships. The category of model-based image dehazing methods incorporates more physical processes. These approaches leverage the physical mechanisms of light scattering and transmission in the atmosphere and propose simplified radiative transfer models [26]. Due to the ill-posed nature of image dehazing, subsequent studies have focused on estimating the transmission map and atmospheric light. Early works integrated a series of priors, such as DCP [29], CAP [32], and polarization-based methods [30], achieving notable results. Additionally, some algorithms based on image restoration models [31,33] effectively perform dehazing by segmenting scene depth through transmission maps and applying multi-layer deconvolution methods.
In recent years, a series of advanced deep learning methods have been introduced into the field of image dehazing, which can be broadly categorized into two main approaches: CNN-based and Transformer-based methods. Learning-driven dehazing algorithms, due to their strong representation capabilities, have demonstrated superior performance compared to traditional methods. Early approaches, such as DehazeNet [1], utilized CNNs to estimate the atmospheric light and transmission maps, achieving significant dehazing performance by combining these estimations with the scattering-transmission model. Subsequent methods, such as Light-DehazeNet [8], further improved performance by jointly estimating the transmission map and atmospheric light, while introducing color visibility restoration techniques to reduce color distortion, achieving excellent results across multiple datasets. Another category of methods directly adopts end-to-end prediction strategies. For example, AOD-Net [2] combines the transmission matrix and atmospheric light into a single parameter and optimizes the model by minimizing the reconstruction error between the output and the ground-truth haze-free image, effectively avoiding the error accumulation present in traditional methods. Inspired by residual networks [53], GridDehazeNet [5] leverages residual structures to predict haze-free images. FFA-Net [6], constructed on a feature fusion architecture, achieves high-quality dehazing by leveraging adaptive attention mechanisms and integrating multi-level features. This approach significantly enhances detail preservation and overall image clarity. Additionally, some networks attempt to incorporate physical priors to guide network construction, integrating simplified degradation models into the network. For example, Trinity-Net [20], based on gradient-guided Swin-T, incorporates the dark channel prior into the deep learning framework. It employs three sub-networks to jointly optimize the estimation of haze thickness, ambient light, and medium transmission, demonstrating excellent performance in remote sensing image enhancement and underwater image dehazing. Similarly, AU-Net [19] utilizes a two-stage network structure, where an initial dehazing image is generated using the atmospheric light map and transmittance map. This image is then refined by an asymmetric Unet, which integrates an attention mechanism, effectively removing haze from both remote sensing and natural images. The network’s superiority is validated across multiple datasets.

2.2. State Space Models

SSMs [43,44] are a class of recurrent architectures designed to efficiently model spatiotemporal dependencies in long-sequence data through dynamic state updates and output mappings. The development of SSMs has progressed from traditional linear time-invariant (LTI) systems to more complex dynamically parameterized models [42]. Initially, SSMs were primarily applied in control systems and signal processing, focusing on describing the time-varying behavior of dynamic systems through state transition equations. In recent years, advancements in deep learning have led to the increasing adoption of SSMs in sequence modeling tasks. For instance, S4 introduced the HiPPO projection operator, enabling the efficient retention of historical information from long sequences while significantly improving computational efficiency. Building on this, S6 or Mamba [42] further optimizes the model by incorporating dynamic parameter adjustment mechanisms, allowing for the selective filtering of contextual information. This enhancement makes S6 well suited for tasks such as long-sequence reasoning and complex logical inference. While the selective mechanism successfully overcomes the limitations of traditional LTI SSMs, it introduces computational complexity challenges. To address this, Mamba implements a hardware-aware parallel computation algorithm, replacing traditional Conv operations with cyclic scanning. This approach significantly improves computational efficiency while maintaining robust modeling capabilities for long-sequence dependencies. These advancements have laid the foundation for the broader application of SSMs across various domains. In the realm of low-level vision tasks, a series of Mamba-based models have recently been proposed, achieving outstanding performance. VMamba [45] was the first to explore the application of Mamba in vision tasks, demonstrating Transformer-level performance in image classification and semantic segmentation. MambaIR [46] successfully applied the Mamba model to tasks such as image super-resolution and denoising, exhibiting excellent results. Inspired by U-Mamba [48] and Mamba-Unet [54], UVM-Net [47] proposed a simple yet effective network architecture, showcasing the great potential of visual Mamba in image dehazing tasks.

3. Proposed Method

In this section, we first review the basic theories underlying SSMs and the Mamba model. Following this, we present an overview of the overall architecture of UDAVM-Net. Finally, we provide a comprehensive explanation of the core network module, the RVMB, along with detailed descriptions of its constituent components, including the MPS7, RCAB, visual state space module (VSSM), and channel attention (CA) modules.

3.1. SSMs and Mamba Model

All state space models (SSM, S4, and Mamba) [40,41,42,43,44] are derived from the classical LTI state space model described in (1). This model maps a one-dimensional (1D) input signal $x(t) \in \mathbb{R}$ to a 1D output signal $y(t) \in \mathbb{R}$ through an N-dimensional latent state $h(t) \in \mathbb{R}^N$, typically expressed as the following system of linear differential equations:
$\dot{h}(t) = A h(t) + B x(t), \qquad y(t) = C h(t) + D x(t)$
where $A \in \mathbb{R}^{N \times N}$ is the state transition matrix, $B \in \mathbb{R}^{N \times 1}$ is the input control matrix, which together with the latent state and input signal forms the state equation. $C \in \mathbb{R}^{1 \times N}$ is the output matrix, and $D \in \mathbb{R}$ is the direct transmission matrix, which forms the observation equation together with the latent state and input signal. To facilitate computation, the continuous state equations are discretized using the zero-order hold method, yielding the following recurrence form:
$\bar{A} = e^{\Delta A}, \qquad \bar{B} = (\Delta A)^{-1}\left(e^{\Delta A} - I\right) \cdot \Delta B, \qquad h_k = \bar{A} h_{k-1} + \bar{B} x_k, \qquad y_k = C h_k + D x_k$
In the equation, $\Delta$ is the step size, and k denotes the sequence index. Ignoring the skip connection (the $D x_k$ term), the state and observation equations can be computed sequentially and represented in Conv form as follows:
$\bar{K} = \left(C\bar{B},\ C\bar{A}\bar{B},\ \ldots,\ C\bar{A}^{k-1}\bar{B}\right), \qquad y = x * \bar{K}$
Based on the theory above, S4 performs Conv computations using static matrix parameters $(\bar{A}, \bar{B}, C)$, restricting it to processing only LTI models. In contrast, Mamba utilizes dynamic parameter matrices $(\bar{B}, C, \Delta)$, integrating simplified selective SSMs into end-to-end networks while incorporating hardware-aware and parallel acceleration optimizations (as shown in Figure 2). This approach overcomes the limitations of LTI systems, enabling Mamba to filter input data, highlight important information, and discard irrelevant content, thereby significantly enhancing performance in content-intensive tasks. Moreover, Mamba adopts the same recursive computation mode as (2), optimizing GPU memory usage to achieve linear time complexity and avoiding the high memory overhead associated with traditional SSMs. By eliminating reliance on conventional attention mechanisms or MLPs, Mamba achieves a throughput five times higher than Transformers, with sequence lengths scaling linearly. Its performance is particularly notable when handling real-world data sequences with lengths reaching millions.
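To make the discretization in Equation (2) and the sequential recurrence concrete, the following minimal NumPy sketch applies zero-order hold to a toy continuous LTI SSM and then runs the scan. All names and the random toy parameters are illustrative assumptions, not code from the paper.

```python
import numpy as np
from scipy.linalg import expm

def discretize_zoh(A, B, delta):
    """Zero-order-hold discretization of a continuous LTI SSM (Equation (2))."""
    N = A.shape[0]
    A_bar = expm(delta * A)
    B_bar = np.linalg.inv(delta * A) @ (A_bar - np.eye(N)) @ (delta * B)
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, D, x):
    """Sequential recurrence h_k = A_bar h_{k-1} + B_bar x_k, y_k = C h_k + D x_k."""
    h = np.zeros((A_bar.shape[0], 1))
    y = np.zeros_like(x)
    for k, xk in enumerate(x):
        h = A_bar @ h + B_bar * xk
        y[k] = (C @ h + D * xk).item()
    return y

# Toy usage with random, illustrative parameters.
rng = np.random.default_rng(0)
N = 4
A = -np.eye(N) + 0.1 * rng.standard_normal((N, N))   # roughly stable state matrix
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
D = np.zeros((1, 1))
A_bar, B_bar = discretize_zoh(A, B, delta=0.1)
y = ssm_scan(A_bar, B_bar, C, D, np.sin(np.linspace(0.0, 6.28, 100)))
```

Mamba replaces the static $(\bar{B}, C, \Delta)$ above with input-dependent values and computes the same recurrence with a hardware-aware parallel scan instead of an explicit Python loop.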

3.2. Overview of Proposed UDAVM-Net Architecture

The proposed UDAVM-Net architecture is depicted in Figure 3. Its core module, the RVMB, employs a dual attention mechanism to effectively fuse information across both channel and spatial dimensions. Multiple RVMBs are stacked to form an RVMB Group (RVMBG), as illustrated in Figure 4, which serves as the backbone of the U-shaped network structure. Given a haze-polluted image $I_{Haze} \in \mathbb{R}^{H \times W \times 3}$, the input first passes through a 3 × 3 Conv layer for embedding, reshaping image patches into vectors in $\mathbb{R}^{H \times W \times C}$. Here, H and W represent the height and width of the input image, respectively, and C denotes the number of channels. After dimensional expansion, the data are processed by the RVMBG in the encoder to capture long-range dependencies and salient features across multiple scales. As the RVMBG does not alter the feature matrix dimensions, a Downsampling Block follows each group to adjust spatial dimensions. The Downsampling Block integrates Conv and pixel unshuffle operations, ensuring maximal preservation of feature information.
In the decoder, the latent features extracted by the encoder are reconstructed and refined to restore the haze-free image. Similar to the encoder, the upsampling block integrates operations such as Conv and pixel shuffle to increase spatial resolution. Additionally, skip connections are established between corresponding encoder and decoder layers to cascade fine-grained information directly from the encoder, effectively preserving details like edges and textures that might otherwise be lost during downsampling. After feature cascading, a channel reduce operation is applied to maintain dimensional consistency (the green block shown in Figure 4). Finally, the features extracted by the network backbone are further refined through an RVMBG and a convolutional layer to generate a residual image. This residual image is then added element-wise to the original image, resulting in a high-quality haze-free image.
The UDAVM-Net model employs the L1 loss function to capture pixel-level discrepancies, providing an effective measure of performance in SID tasks. Additionally, the selective state space mechanism of the Mamba model enhances the network’s ability to highlight relevant information while filtering out noise and irrelevant content. By maximizing the ERF and utilizing multi-scale feature extraction, UDAVM-Net accurately models the large-scale characteristics of haze while preserving fine-grained local details such as edges and textures.
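As an illustration of the sampling blocks described above, the PyTorch sketch below pairs a convolution with pixel unshuffle/shuffle so that resolution changes preserve feature information. The channel counts and layer choices are assumptions for demonstration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Downsample(nn.Module):
    """Halve spatial resolution without discarding information: Conv, then PixelUnshuffle."""
    def __init__(self, channels):
        super().__init__()
        # halve channels before unshuffle so the output has 2x channels overall
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels // 2, 3, padding=1, bias=False),
            nn.PixelUnshuffle(2),          # (C/2, H, W) -> (2C, H/2, W/2)
        )
    def forward(self, x):
        return self.body(x)

class Upsample(nn.Module):
    """Double spatial resolution: Conv, then PixelShuffle (channels -> space)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels * 2, 3, padding=1, bias=False),
            nn.PixelShuffle(2),            # (2C, H, W) -> (C/2, 2H, 2W)
        )
    def forward(self, x):
        return self.body(x)

x = torch.randn(1, 48, 128, 128)
print(Downsample(48)(x).shape)                     # torch.Size([1, 96, 64, 64])
print(Upsample(96)(Downsample(48)(x)).shape)       # torch.Size([1, 48, 128, 128])
```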

3.3. Residual Visual Mamba Block

The design of RVMB is inspired by Transformer-based image restoration networks. However, because Mamba was originally developed for 1D sequential data, directly substituting the traditional attention mechanism with Mamba does not yield optimal performance in two-dimensional (2D) image restoration tasks. Consequently, two new network modules are introduced: MPS7 and RCAB. As illustrated in Figure 4, for the i-th RVMB in the N-layer RVMBG, the extracted feature matrix $F_{RVMB_N^i} \in \mathbb{R}^{H_{N_i} W_{N_i} \times C_{N_i}}$ is first passed through a normalization layer and then fed into the VSSM block to capture long-range dependencies. A residual connection with the input feature produces the intermediate feature $F_{RVMB_N^i}^{Mi}$. Subsequently, after another normalization layer and the RCAB module, as well as an additional residual connection, the updated feature $F_{RVMB_N^{i+1}}$ is generated and passed to the next RVMB layer for deeper feature extraction. This process is described by Equation (4), where the intermediate feature $F_{RVMB_N^i}^{Mi}$ produced by the i-th RVMB in the N-layer RVMBG is the input to the RCAB module, $W_1$ and $W_2$ represent learnable weight parameters, and $\oplus$ denotes element-wise addition.
$F_{RVMB_N^i}^{Mi} = W_1 F_{RVMB_N^i} \oplus MPS7\left(F_{RVMB_N^i}\right), \qquad F_{RVMB_N^{i+1}} = W_2 F_{RVMB_N^i}^{Mi} \oplus RCAB\left(F_{RVMB_N^i}^{Mi}\right)$
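A minimal PyTorch rendering of the data flow in Equation (4) might look as follows. The MPS7 and RCAB sub-modules are passed in as placeholders (their internals are sketched in Sections 3.4 and 3.5), and parameterizing $W_1$ and $W_2$ as learnable scalars is an assumption for illustration.

```python
import torch
import torch.nn as nn

class RVMB(nn.Module):
    """Residual Vision Mamba Block, following Equation (4):
    spatial attention (MPS7) then channel attention (RCAB), each fused
    with a learnably weighted residual connection."""
    def __init__(self, mps7: nn.Module, rcab: nn.Module):
        super().__init__()
        self.mps7 = mps7                        # dual-branch visual selective SSM (Section 3.4)
        self.rcab = rcab                        # residual channel attention block (Section 3.5)
        self.w1 = nn.Parameter(torch.ones(1))   # learnable residual scale W1
        self.w2 = nn.Parameter(torch.ones(1))   # learnable residual scale W2

    def forward(self, f):
        f_mid = self.w1 * f + self.mps7(f)            # F^{Mi} = W1*F (+) MPS7(F)
        f_out = self.w2 * f_mid + self.rcab(f_mid)    # F^{i+1} = W2*F^{Mi} (+) RCAB(F^{Mi})
        return f_out
```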

3.4. MPS7 Block

MPS7 is a dual-branch visual selective state space module (shown in blue in Figure 4), integrating Hungry Hungry Hippos (H3), a gated MLP, and an enhanced SSM. To capture long-range spatial dependencies, a multi-path scanning mechanism is incorporated into the conventional SSM, coupled with a selective mechanism. Specifically, the input feature undergoes layer normalization (LN) before being fed into two branches. In the main branch, the VSSM pre-module utilizes a linear-Conv-activation structure to enhance local feature extraction and mitigate information loss before the features are passed into VSSM. Within VSSM, global modeling is performed on the sequences obtained from the multi-path scanning mechanism, capturing long-range dependencies that improve the network’s capacity to extract haze-relevant features. Meanwhile, the secondary branch applies a simpler operation, passing the features through a linear layer and a SiLU activation, serving as a gating mechanism akin to a gated MLP. This branch adaptively selects and enhances the features captured by VSSM. Finally, a fully connected layer adjusts the channel dimensions, and the output features are fused with the input features through a residual connection, as described by Equation (5). By adopting this dual-branch design, MPS7 couples local feature enhancement with global information modeling, thereby improving both dehazing performance and the model’s generalization ability.
$F_{RVMB_N^i}^{MPS7} = MPS7\left(F_{RVMB_N^i}\right) = Linear\left(SiLU\left(LN\left(F_{RVMB_N^i}\right)\right) \odot VSSM\left(F_{RVMB_N^i}^{LCS}\right)\right)$
In which $F_{RVMB_N^i}^{LCS} \in \mathbb{R}^{H_{N_i} \times W_{N_i} \times d}$ is given by:
$F_{RVMB_N^i}^{LCS} = SiLU\left(ConV\left(LN\left(F_{RVMB_N^i}\right)\right)\right)$
where $F_{RVMB_N^i}^{MPS7}$ represents the feature matrix extracted by the MPS7 block, $ConV$ represents the Conv layer, $Linear$ refers to linear projection, and $\odot$ signifies element-wise multiplication.
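The dual-branch structure can be sketched as below. This is a hedged PyTorch approximation based on the description above; the layer widths, the depthwise convolution in the pre-module, and the channel-last tensor layout are assumptions rather than the paper's exact implementation, and the VSSM is passed in as a placeholder module.

```python
import torch
import torch.nn as nn

class MPS7(nn.Module):
    """Dual-branch visual selective-SSM block (Section 3.4).
    `vssm` stands for the multi-path-scanning state space module of Section 3.6."""
    def __init__(self, dim, vssm: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # main branch pre-module: linear -> depthwise conv -> activation (Equation (6))
        self.pre_linear = nn.Linear(dim, dim)
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.act = nn.SiLU()
        self.vssm = vssm
        # gating branch: linear -> SiLU, akin to a gated MLP
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.SiLU())
        # final projection adjusting the channel dimensions
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                                     # x: (B, H, W, C)
        y = self.norm(x)
        main = self.pre_linear(y).permute(0, 3, 1, 2)         # to (B, C, H, W) for conv
        main = self.act(self.dwconv(main)).permute(0, 2, 3, 1)
        main = self.vssm(main)                                # long-range spatial modeling
        return self.out(main * self.gate(y))                  # gated fusion (element-wise product)
```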

3.5. RCAB Block

We adopt the RCAB from residual channel attention networks (RCANs) [56], which incorporate a channel attention mechanism into a classical residual network. This design enables the transmission of primary features while capturing inter-channel dependencies, thereby enhancing the network’s discriminative capacity. RCAB is depicted as the pink block in Figure 4, and $F_{RVMB_N^i}^{RCAB}$ denotes the output of the RCAB module in the i-th RVMB within the N-layer RVMBG, as defined in Equation (7). The residual feature $R_{F_{RVMB_N^i}^{Mi}}$, as defined in Equation (8), is subsequently processed through a channel attention mechanism to capture inter-channel dependencies.
$F_{RVMB_N^i}^{RCAB} = RCAB\left(F_{RVMB_N^i}^{Mi}\right) = CA\left(R_{F_{RVMB_N^i}^{Mi}}\right)$
where $CA$ represents the channel attention module.
In RCAB, the preceding network follows the traditional Conv-activation-Conv residual structure. This processing flow can be mathematically expressed by the following equation.
$R_{F_{RVMB_N^i}^{Mi}} = ConV\left(Act\left(ConV\left(LN\left(F_{RVMB_N^i}^{Mi}\right)\right)\right)\right)$
The CA block generates channel descriptors via global average pooling, enabling the model to focus on the most salient features within the input. By employing a gating mechanism, the CA block effectively captures inter-channel dependencies, allowing the model to assess the importance of each channel. Through the use of a sigmoid function, the scaling factor for each channel is dynamically adjusted, thereby amplifying relevant features while suppressing less informative ones. This selective attention mechanism enhances the model’s capacity to focus on key features, significantly improving the performance in image restoration tasks. By preserving and emphasizing critical information, the CA module plays a crucial role in refining the model’s output. This process is formally represented in (9).
$CA\left(R_{F_{RVMB_N^i}^{Mi}}\right) = R_{F_{RVMB_N^i}^{Mi}} \odot Act\left(ConV\left(Act\left(ConV\left(GP\left(R_{F_{RVMB_N^i}^{Mi}}\right)\right)\right)\right)\right)$
where $GP$ represents global average pooling.
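Equation (9) corresponds to the familiar squeeze-and-excitation style channel attention; a compact PyTorch sketch is given below. The reduction ratio and activation choices are assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention of Equation (9): squeeze with global average pooling,
    excite with two 1x1 convs, and rescale each channel with a sigmoid gate."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                  # GP: global average pooling
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                    # per-channel scaling factor
        )

    def forward(self, x):                                    # x: (B, C, H, W)
        return x * self.body(self.pool(x))                   # amplify salient channels
```

In RCAB, this module follows the Conv-activation-Conv residual structure of Equation (8), and the outer residual addition is handled by Equation (4).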

3.6. VSSM Block

The standard Mamba was originally designed for 1D sequential data; however, 2D images contain edge textures and strong spatial context, posing unique challenges for a 1D-SSM framework. For example, spatially adjacent pixels may be separated when the 2D feature map is flattened into a 1D sequence, leading to “local pixel forgetting”. In addition, remote sensing image dehazing involves the adjacency effect [34], wherein the spatial distance and pixel intensity of neighboring pixels strongly influence the central pixel. Consequently, the strategy for flattening 2D features significantly impacts dehazing performance.
To address these issues, a deep convolutional layer is first introduced before the VSSM, expanding the local ERF and fully capturing each anchor pixel’s immediate surroundings. As shown in Figure 5, the 2D feature map is then scanned in four directions: horizontally and vertically from the top-left to the bottom-right, and then in reverse from the bottom-right back to the top-left. Stacking these four directions yields $F_{RVMB_N^i}^{SCAN_j} \in \mathbb{R}^{H_{N_i} W_{N_i} \times C_{N_i} \times 4}$, where $j \in [1, 2, 3, 4]$ indicates the path index. This bidirectional “S-shaped” scanning pattern preserves spatial continuity, effectively avoiding unintended pixel skipping. Each of the four 1D sequences is then passed through a linear projection layer into a latent space. Independent initialization is applied to obtain the input feature sequence $X \in \mathbb{R}^{H_{N_i} W_{N_i} \times C_{N_i} \times 4}$, the state transition matrix $A \in \mathbb{R}^{4d \times d_{SSM}}$ ($d_{SSM}$ denotes the SSM dimension), the control matrix $B \in \mathbb{R}^{H_{N_i} W_{N_i} \times d_{SSM} \times 4}$, the output matrix $C \in \mathbb{R}^{H_{N_i} W_{N_i} \times d_{SSM} \times 4}$, the direct transmission matrix $D \in \mathbb{R}^{4d}$, and the dynamic time step $\Delta_t \in \mathbb{R}^{H_{N_i} W_{N_i} \times 4d}$.
Subsequently, following Equation (2), each pixel’s historical information is dynamically integrated with current features via the pixel-wise recursive state update and the state-to-output mapping equations. This procedure yields a final feature representation for every pixel in each direction, enabling direction-aware feature propagation under dynamic temporal control. Because the state space operations are carried out independently along the four directions, four directional output features are obtained. A reverse scanning process aligns the feature positions, resulting in $y_S \in \mathbb{R}^{H_{N_i} W_{N_i} \times d \times 4}$, where $S \in [1, 2, 3, 4]$. Finally, the outputs in the four directions are summed, and an LN layer is applied to ensure stable feature distributions. As depicted in Figure 5, the features:
$F_{RVMB_N^i}^{VSSM} \in \mathbb{R}^{H_{N_i} \times W_{N_i} \times d} = LN(y_1 + y_2 + y_3 + y_4)$
captured by the VSSM subsequently interact with the secondary branch’s gating mechanism to further enhance feature extraction, thereby improving dehazing performance.
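One possible realization of the four-path S-shaped scanning and the reverse alignment preceding the summation in Equation (10) is sketched below in PyTorch. The serpentine ordering and tensor layout are illustrative assumptions rather than the paper's exact implementation; in practice, each 1D sequence would be processed by the selective SSM between the scan and merge steps.

```python
import torch

def serpentine(x):
    """S-shaped (boustrophedon) flattening of (B, C, H, W): odd rows are reversed
    so that consecutive sequence elements remain spatially adjacent."""
    xs = x.clone()
    xs[:, :, 1::2, :] = xs[:, :, 1::2, :].flip(-1)
    return xs.flatten(2)                                   # (B, C, H*W)

def unserpentine(seq, H, W):
    """Inverse of serpentine: reshape to (B, C, H, W) and re-flip the odd rows."""
    x = seq.reshape(seq.shape[0], seq.shape[1], H, W).clone()
    x[:, :, 1::2, :] = x[:, :, 1::2, :].flip(-1)
    return x

def four_path_scan(x):
    """Build the four 1D scan sequences: horizontal/vertical S-scans, each
    traversed forward and backward."""
    h = serpentine(x)                                      # top-left -> bottom-right, row-wise
    v = serpentine(x.transpose(2, 3))                      # top-left -> bottom-right, column-wise
    return [h, h.flip(-1), v, v.flip(-1)]                  # four (B, C, L) sequences

def merge_four_paths(ys, H, W):
    """Reverse-align the four directional SSM outputs to image space and sum them,
    i.e., the summation inside Equation (10) before the final LayerNorm."""
    y_h  = unserpentine(ys[0], H, W)
    y_hb = unserpentine(ys[1].flip(-1), H, W)
    y_v  = unserpentine(ys[2], W, H).transpose(2, 3)
    y_vb = unserpentine(ys[3].flip(-1), W, H).transpose(2, 3)
    return y_h + y_hb + y_v + y_vb
```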

4. RSSTOA Dataset

The performance of dehazing models is heavily influenced by the quality of the dataset. To date, numerous studies have introduced various dehazing datasets. Conventional real-world dehazing datasets are primarily generated using fog generators, which produce a set of images that often lack naturalness and uniformity [57,58,59,60]. On the other hand, conventional synthetic datasets are created based on the simplified model depicted in Figure 1, where images are synthesized directly through transmission maps [61,62]. In the domain of remote sensing, obtaining haze-free images is practically impossible. Therefore, synthetic datasets are typically generated by applying the model in Figure 1, with transmission maps transformed across spectral bands based on the exponential law [49], resulting in a range of remote sensing dehazing datasets [21,63]. However, this highly simplified model does not accurately capture the complex imaging processes inherent in remote sensing. Furthermore, most synthetic datasets rely on Landsat data with a 30 m spatial resolution, which limits their generalizability and application. To overcome these limitations, this section first outlines the theoretical framework for generating uniform haze using a radiative transfer model. Building upon this, we propose a high-resolution remote sensing dehazing dataset, RSSTOA, derived from Gao Fen Duo Mo (GFDM) [64] data with sub-meter spatial resolution. This new dataset addresses the shortcomings of existing datasets by providing higher spatial resolution and more realistic haze characteristics, thereby offering improved potential for training and evaluating dehazing models in the context of remote sensing.

4.1. Synthetic Model for Dehazing Based on Radiative Transfer Theory

This section presents a synthetic model of top-of-atmosphere (TOA) reflectance, grounded in radiative transfer theory, which accounts for the effects of atmospheric molecules, aerosols, and clouds. The radiative transfer model offers a physically consistent framework for simulating the variations in reflectance resulting from atmospheric interactions, including molecular and aerosol scattering and absorption, as well as cloud-related influences. These processes play a critical role in determining TOA reflectance and are essential for accurately modeling remote sensing data under real-world atmospheric conditions.

4.1.1. TOA Apparent Reflectance Without Cloud

In remote sensing image processing, atmospheric scattering and absorption significantly influence the signals received by satellites. Accurately simulating these atmospheric effects is crucial for image correction and restoration. To simplify the complex atmospheric radiative transfer process, a common assumption is that the atmosphere is in a plane-parallel state and that the surface is Lambertian. This assumption is widely adopted in remote sensing applications as it facilitates the development of analytically solvable radiative transfer models. Based on this approximation, the TOA reflectance received by satellites can be expressed as:
$TOA_\lambda(\theta_v, \theta_s, \varphi) = \left[ \rho_0^\lambda(\theta_v, \theta_s, \varphi) + \dfrac{T_\lambda(\theta_s)\, T_\lambda(\theta_v)\, \rho_{surf}^\lambda}{1 - \rho_{surf}^\lambda S_\lambda} \right] \cdot t_g^\lambda(\theta_v, \theta_s)$
The parameters $\theta_v$, $\theta_s$, and $\varphi$ describe the observation geometry, namely the satellite zenith angle, solar zenith angle, and relative azimuth angle. $T_\lambda(\theta_s)$ and $T_\lambda(\theta_v)$ represent the total atmospheric transmittance for the downward and upward paths, respectively, while $t_g^\lambda(\theta_v, \theta_s)$ denotes the total absorption transmittance of atmospheric molecules. $S_\lambda$ indicates the hemispherical albedo of the atmosphere, $\rho_{surf}^\lambda$ represents the real surface albedo, and $\rho_0^\lambda(\theta_v, \theta_s, \varphi)$ refers to the intrinsic atmospheric reflectance. These parameters can be derived using radiative transfer models, which simulate TOA reflectance under varying atmospheric conditions by forward modeling of remote sensing images.
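Given per-band atmospheric terms produced by a radiative transfer code such as 6S, Equation (11) reduces to a simple per-pixel computation, as in the sketch below. The function and argument names are illustrative, not part of any existing library.

```python
import numpy as np

def toa_reflectance(rho_surf, rho_0, T_down, T_up, S, t_g):
    """Lambertian plane-parallel TOA reflectance of Equation (11).
    rho_surf is the surface reflectance map; the atmospheric terms
    (rho_0, T_down, T_up, S, t_g) are per-band scalars or arrays
    broadcastable to rho_surf, e.g., as returned by a 6S run."""
    return (rho_0 + (T_down * T_up * rho_surf) / (1.0 - rho_surf * S)) * t_g
```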

4.1.2. TOA Reflectance Contaminated by Cloud

In practical scenarios, SIs are often affected by cloud cover, yet accurately modeling the radiative transfer process of cloud interference remains challenging. In this study, we adopt the cloud distortion model [51,52] to simulate the impact of clouds on image contamination. The apparent reflectance, considering the interference of clouds and aerosols, can be derived from (12):
$TOA_c^\lambda(\theta_v, \theta_s, \varphi) = TOA_\lambda(\theta_v, \theta_s, \varphi) \cdot t_c^\lambda + w \cdot \rho_c^\lambda$
$t_c^\lambda = 1 - w \cdot \rho_c^\lambda$
Here, $t_c^\lambda$ represents the cloud transmittance, which can be expressed using (13). $\rho_c^\lambda$ denotes the cirrus cloud band reflectance, interpolated across spectral bands, and its generation method is detailed in the subsequent sections. Additionally, w is an adjustment factor used for cloud intensity modulation.
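Equations (12) and (13) can be applied per pixel once the interpolated cirrus reflectance map for the target band is available; a minimal sketch (argument names are illustrative) is:

```python
def add_cloud(toa_clear, rho_cloud, w):
    """Cloud distortion model of Equations (12)-(13): attenuate the clear-sky
    TOA reflectance by the cloud transmittance and add the scaled cloud term."""
    t_c = 1.0 - w * rho_cloud                 # Equation (13): cloud transmittance
    return toa_clear * t_c + w * rho_cloud    # Equation (12): contaminated TOA reflectance
```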

4.1.3. Pipeline for Generating Cloud Transmission Maps

The process depicted in Figure 1 can be described using the Beer–Lambert law, as shown in (14). Additionally, the optical thickness τ of the scattering medium is defined according to (15) as follows.
$I_\lambda(s_1) = I_\lambda(0)\, \exp(-k_\lambda u), \qquad u = \int_0^{s_1} \rho \, ds$
$\tau = \int_0^{s_1} k_\lambda \rho \, ds$
Thus, the direct transmittance of the optical signal in the scattering medium can be expressed as:
$t(x) = e^{-\tau}$
According to the Ångström relationship [49], there is a power-law relationship between optical thickness and the wavelength of light. Therefore, when the exponent index is known, the optical thickness of the wavelength λ can be interpolated based on the optical thickness of a single band:
$\tau_\lambda = \tau_{\lambda_0} \left( \dfrac{\lambda}{\lambda_0} \right)^{-\alpha}$
Thus, the direct transmittance formula for any wavelength λ can be derived as:
$t_\lambda(x) = t_{\lambda_0}(x)^{\left( \lambda_0 / \lambda \right)^{\alpha}}$
Cloud masks are generated using oceanic data, as oceans provide an ideal dark background for target identification. By excluding dark pixels [65], background interference can be minimized. Given the sensitivity of the cirrus cloud band (1380 nm) to water vapor absorption, the influence of the surface or ocean background on the synthesis process is further reduced. Therefore, reflectance data from the cirrus cloud band of Landsat OLI over oceans are selected, with dark pixels removed to accurately characterize the spatial distribution of clouds. The exponent α for varying cloud intensities is computed based on the formula provided in [63]. Finally, the generated cloud mask can be interpolated to any desired wavelength.
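Equations (17) and (18) translate directly into per-pixel array operations. The NumPy sketch below is illustrative; the names are not taken from the paper's code.

```python
import numpy as np

def optical_thickness_at(tau_ref, lam_ref, lam, alpha):
    """Equation (17): Angstrom power-law scaling of optical thickness
    from a reference wavelength lam_ref to wavelength lam."""
    return tau_ref * (lam / lam_ref) ** (-alpha)

def transmittance_at(t_ref, lam_ref, lam, alpha):
    """Equation (18): scale a per-pixel direct transmittance map measured at
    lam_ref (e.g., the 1380 nm cirrus band) to wavelength lam."""
    return np.power(t_ref, (lam_ref / lam) ** alpha)
```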

4.2. Synthesis Pipeline

First, 374 image patches of 512 × 512 pixels acquired under clean atmospheric conditions were extracted from GFDM satellite data, and hazy image synthesis was performed following the workflow outlined in Figure 6. Specifically, the patches were first atmospherically corrected to obtain the true surface reflectance. Six aerosol optical thickness (AOT) levels (ranging from 0 to 1.0 with a step size of 0.2) were defined to represent varying atmospheric conditions. The necessary parameters for Equation (11) were then derived using the 6S radiative transfer model, and forward modeling of remote sensing images was employed to generate TOA reflectance images. Subsequently, a transmission map was constructed using the Landsat cirrus cloud band, and the final hazy synthesis was carried out using Equation (12). The resulting RSSTOA dataset contains 78,540 image pairs with varying levels of cloud and aerosol pollution (as shown in Figure 7). Furthermore, we consider that excessively high cloud optical depth (COD) can obstruct ground signals from reaching the satellite. Dehazing such scenarios would introduce false information, which would be detrimental to satellite image interpretation. Therefore, we only simulated scenarios where the ground is faintly visible.
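A high-level sketch of how the pieces above could be chained for one patch, one band, and one AOT level is given below, reusing the helpers sketched in Section 4.1. The run_6s wrapper is hypothetical (standing in for a call to the 6S model), and both the way the band cloud reflectance is recovered from the interpolated transmittance and the choice of the clear-sky TOA image as the paired reference are assumptions for illustration, not the paper's exact recipe.

```python
def synthesize_hazy_pair(rho_surf, rho_cirrus_1380, lam, aot, geometry, w, alpha):
    """Sketch of the per-band RSSTOA synthesis flow (Figure 6)."""
    atm = run_6s(aot=aot, geometry=geometry, wavelength=lam)        # hypothetical 6S wrapper
    toa_clear = toa_reflectance(rho_surf, atm["rho_0"], atm["T_down"],
                                atm["T_up"], atm["S"], atm["t_g"])  # Eq. (11): aerosol/molecular effects
    # Cloud mask: transmittance at 1380 nm from the cirrus reflectance (Eq. 13),
    # spectrally scaled to the target band (Eq. 18), then re-expressed as a
    # band cloud reflectance -- an illustrative choice for this sketch.
    t_c_1380 = 1.0 - w * rho_cirrus_1380
    t_c_lam = transmittance_at(t_c_1380, 1380.0, lam, alpha)
    rho_cloud_lam = (1.0 - t_c_lam) / w
    hazy = add_cloud(toa_clear, rho_cloud_lam, w)                   # Eq. (12): cloud contamination
    return hazy, toa_clear                                          # (degraded, reference) training pair
```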

5. Experimental Results

In this section, we first introduce the datasets, experimental setup, and evaluation methods in Section 5.1. We then present both the quantitative and qualitative evaluation results of our proposed method, comparing it with seven state-of-the-art dehazing approaches: DCP [29], AOD-Net [2], Light-DehazeNet [8], FFA-Net [6], GridDehazeNet [5], Trinity-Net [20] and AU-Net [19], on the StateHaze1K [66] and RSSTOA datasets. Finally, to further demonstrate the outstanding performance of UDAVM-Net, we applied the model trained on the RSSTOA dataset to dehazing tests on GFDM real image data.

5.1. Experimental Settings

5.1.1. Dataset

Our experiments were primarily conducted on the StateHaze1K dataset and the synthetic remote sensing dataset RSSTOA, which was generated using a radiative transfer model. Additionally, real remote sensing scenarios were utilized to evaluate the model’s performance under actual haze conditions. The StateHaze1K dataset [66] consists of 1200 pairs of GF-2 satellite RGB images with a resolution of 512 × 512, divided into three subsets: StateHaze1K-thin, StateHaze1K-moderate, and StateHaze1K-thick. Each subset contains 400 pairs of synthetic hazy remote sensing images, with 320 pairs used for training, 35 pairs for validation, and 45 pairs for performance evaluation. The RSSTOA dataset, as detailed in Section 4, consists of 78,540 pairs of images with varying levels of cloud and aerosol pollution. Of these, 74,540 pairs are used for training, 2000 pairs for validation, and 2000 pairs for performance evaluation. For real remote sensing data, we selected 170 images for testing.

5.1.2. Parameter Settings

The proposed UDAVM-Net was trained on the Ubuntu 20.04 operating system using the PyTorch framework with a single NVIDIA RTX 4090 GPU. For optimization, we employed the Adam optimizer [67] ($\beta_1 = 0.9$, $\beta_2 = 0.999$) with an initial learning rate of 0.001. A cosine annealing scheduler [68] was used, gradually reducing the learning rate to $1 \times 10^{-6}$. For the RVMBG modules, the number of modules in the encoder stage was set to [2, 4, 4, 6], while the decoder stage was configured as [4, 4, 2, 2]. Additionally, data augmentation was applied during training to enhance the performance on smaller datasets.
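The optimization settings above correspond to a standard PyTorch training loop such as the following sketch, where model, train_loader, and num_iterations are hypothetical placeholders for the network, data pipeline, and schedule length.

```python
import torch

# model, train_loader and num_iterations are hypothetical placeholders.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_iterations, eta_min=1e-6)   # cosine decay to 1e-6
criterion = torch.nn.L1Loss()                        # pixel-level loss (Section 3.2)

for hazy, clear in train_loader:
    pred = model(hazy)
    loss = criterion(pred, clear)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```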

5.1.3. Evaluation Metric and Benchmark Methods

The performance of the proposed model in the remote sensing image dehazing task is demonstrated using image quality evaluation metrics such as peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM). Comparisons are made with classical algorithms, including DCP [29], CNN-based models such as AOD-Net [2], Light-DehazeNet [8], FFA-Net [6], GridDehazeNet [5], AU-Net [19] as well as Swin-T-based dehazing models such as Trinity-Net [20]. For a fair comparison, we used the official code provided by the authors to train the deep learning-based methods. Furthermore, to evaluate the effectiveness of unsupervised remote sensing image dehazing methods in real-world scenarios, we employed no-reference natural image quality evaluation metrics: integrated local NIQE (IL-NIQE) [69]. Lower values of IL-NIQE metrics are associated with higher perceptual image quality.
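For reference, the full-reference metrics can be computed with scikit-image (version 0.19 or later for the channel_axis argument); the helper below is an illustrative sketch, not the paper's evaluation code.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(pred, gt):
    """PSNR and SSIM for one dehazed/ground-truth pair; pred and gt are
    H x W x 3 float arrays scaled to [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, data_range=1.0, channel_axis=-1)
    return psnr, ssim
```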

5.2. Qualitative Comparisons

In this section, we first qualitatively compared the ERFs of UDAVM-Net and several baselines. Subsequently, we present the dehazing results of the proposed method and seven other dehazing methods across various datasets, along with corresponding qualitative analyses.
(1) Effective receptive field comparisons: We adopted the method provided in [36] to generate the ERF map. The visualization results of the ERF are presented in Figure 8. It can be observed that our method achieves a global ERF compared to the CNN-based methods [5,20] and Swin-T-based method [20] for dehazing.
(2) Comparisons on StateHaze1K dataset: Figure 9 shows the visualization results of our proposed method and baselines on the StateHaze1K Thin dataset. In the dehazed images produced by DCP, there are color-abnormal blotches, which are caused by incorrect atmospheric light estimation due to haze and highly reflective targets. AOD-Net demonstrates decent dehazing performance, but the overall image appears greener compared to the ground truth (GT). Light-DehazeNet still leaves some residual haze in the dehazed images. FFA-Net, GridDehazeNet, Trinity-Net and AU-Net show relatively superior dehazing performance, although there are slight color discrepancies compared to the GT. In contrast, our method achieves the results closest to the GT in terms of color consistency, with no residual haze present.
Figure 10 shows dehazing results on three scenes with varying levels of uniformity and haze thickness selected from the StateHaze1K moderate dataset. In regions with relatively uniform and thin haze, all algorithms perform satisfactorily. However, in areas characterized by thicker and more non-uniform haze, DCP, AOD-Net, and Light-DehazeNet leave noticeable residual haze. Although FFA-Net, GridDehazeNet, and Trinity-Net exhibit promising performance, they display significant color deviations from the GT in extremely thick haze regions where ground information is fully obscured. In contrast, UDAVM-Net achieves haze-free results even under heavy and highly non-uniform haze conditions, maintaining minimal color discrepancy relative to the GT. Notably, AU-Net produces dehazing outcomes that are closely comparable to UDAVM-Net under moderate haze scenarios.
Figure 11 presents dehazing results on the StateHaze1K Moderate dataset, where three images with the thickest haze and highest non-uniformity were specifically chosen to evaluate the performance of various dehazing algorithms. It is apparent that DCP, AOD-Net, and Light-DehazeNet struggle under these severe conditions, leaving extensive hazy regions and uneven color patches in their outputs. In contrast, GridDehazeNet, Trinity-Net, and AU-Net demonstrate more robust dehazing capabilities, producing largely haze-free images with smoother color transitions. However, color distortions remain evident, and regions exhibiting extremely thick haze appear as blurry blocks, indicating limitations in handling extreme haze. By comparison, the proposed model successfully recovers ground surface information while delivering the most faithful structural textures and highest color fidelity relative to the GT, without leaving substantial residual haze.
(3) Comparisons on RSSTOA dataset: Figure 12 and Figure 13 present visualization results from the RSSTOA dataset, focusing on scenes with sparse ground objects (Figure 12) and dense ground objects (Figure 13) to evaluate the dehazing performance of various algorithms. DCP, AOD-Net, and light-DehazeNet struggle to remove thick haze effectively, leaving noticeable haze residues in their final outputs. Moreover, AOD-Net and Light-DehazeNet exhibit over-enhanced edges, and overall, these three methods perform better in sparse ground object areas than in denser regions. In contrast, FFA-Net, GridDehazeNet, Trinity-Net, and AU-Net demonstrate relatively strong dehazing performance in thin haze regions. Yet, they leave varying amounts of residual haze where the haze is thicker. Additionally, Trinity-Net and AU-Net perform slightly worse in dense-object scenes than those with sparse ground objects. Notably, FFA-Net and GridDehazeNet show noticeable color deviations from the ground truth (GT), often producing a yellowish tint, particularly in Scene 1 of Figure 12 and Scenes 2 and 4 of Figure 13. By comparison, the proposed UDAVM-Net and AU-Net yield results with the colors closest to the GT. Trinity-Net introduces some texture blurring, whereas FFA-Net, GridDehazeNet, AU-Net, and UDAVM-Net exhibit robust detail preservation. Furthermore, upon zooming in on Scene 4 of Figure 12, faint residual haze remains visible in the GT magnified region, yet both AU-Net and UDAVM-Net successfully remove nearly all of this thin haze. Lastly, the error maps for UDAVM-Net display the deepest blue across both sparse and dense ground object scenes, indicating minimal residual haze and underscoring the superior dehazing capability of the proposed method.

5.3. Quantitative Comparison

The quantitative evaluation results of UDAVM-Net and baselines are shown in Table 1, Table 2 and Table 3. Table 1 and Table 2 present the results on the StateHaze1K and RSSTOA datasets, respectively, using full-reference image quality evaluation metrics SSIM and PSNR. Bold text and underlining indicate the best and second-best results, respectively. Table 3 provides a quantitative evaluation of dehazing performance on real remote sensing data, where the results are described using no-reference image quality evaluation metrics: IL-NIQE.
(1) Results and analysis on StateHaze1K dataset: As shown in Table 1, UDAVM-Net achieved the highest PSNR on the StateHaze1K dataset, demonstrating excellent dehazing performance. Notably, UDAVM-Net achieved the second-best result in SSIM, trailing the top-performing Trinity-Net by only 0.0023 on average. This result is predictable, as Swin-T-based dehazing methods, such as Trinity-Net, leverage the shift-window mechanism to enable pixel interaction between windows, which helps capture long-range contextual dependencies and achieve superior performance. However, UDAVM-Net improves the average PSNR by 2.97 dB compared to Trinity-Net. The other notable results come from GridDehazeNet and AU-Net, both of which utilize attention mechanisms. Compared to these two networks, UDAVM-Net outperforms them by taking advantage of Mamba’s hardware-aware mechanism for intra-channel attention extraction, while the RCAB module extracts inter-channel attention. By stacking modules in a U-Net architecture, UDAVM-Net achieves superior dehazing performance, with PSNR improvements of 1.25 dB and 2.11 dB and SSIM improvements of 0.017 and 0.027, respectively, on the entire StateHaze1K dataset.
(2) Comparison and analysis on RSSTOA dataset: As shown in Table 2, UDAVM-Net achieves the highest PSNR and SSIM on the RSSTOA dataset, followed by the Swin-T-based Trinity-Net and the attention-based FFA-Net, and GridDehazeNet. Compared to the second-best results, UDAVM-Net improves the PSNR and SSIM by 2.99 dB and 0.053, respectively, demonstrating the model’s strong competitiveness on more realistic remote sensing datasets.

5.4. Results and Analysis on Real Remote Sensing Data

In this section, we conducted inference on real remote sensing data (derived from GFDM satellite data, as previously described) using the model trained on the RSSTOA dataset. Table 3 presents a quantitative comparison between UDAVM-Net and baseline models on real remote sensing data. As mentioned earlier, IL-NIQE was employed to quantitatively assess the images before and after dehazing. The results show that UDAVM-Net outperforms the other methods, achieving the best performance on real remote sensing data. Figure 14 showcases several dehazing examples on real remote sensing data. It is evident from the figure that AOD-Net and Light-DehazeNet struggle to handle regions with thick haze, while UDAVM-Net delivers remarkable dehazing results. The proposed model effectively removes haze even in areas with dense haze and excels in preserving fine details. Notably, AU-Net demonstrates strong detail restoration capabilities; however, its color rendering shows a slight greenish shift. Additionally, as shown in the upper-right corner of the first scene, FFA-Net erroneously conflates buildings with green vegetation, resulting in incorrect restoration compared to other algorithms. In contrast, UDAVM-Net avoids introducing false information in large, uniform regions, further highlighting its robustness and superior performance on real remote sensing data.

6. Discussion

In this section, we first present an ablation study on the StateHaze1K thin dataset to assess the effectiveness of the modules in the proposed network architecture. We then discuss the computational complexity of the algorithm.

6.1. Ablation Study

To validate the effectiveness of the proposed modules, the baseline framework was modified by removing the VSSM module from MPS7 and the CA module from RCAB, while all other configurations remained unchanged. Table 4 presents the corresponding quantitative evaluation results, and the specific module removals and modifications are outlined as follows:
(1)
Base: UDAVM-Net without the VSSM module in MSP7 and without the CA module in RCAB.
(2)
Base + CA: The base framework with CA in RCAB.
(3)
Base + VSSM_SHS + CA: The base framework with VSSM using single-horizontal-scanning (SHS), and with CA in RCAB.
(4)
Base + VSSM_HVS + CA: The base framework with VSSM using horizontal-vertical-scanning (HVS), and with CA in RCAB.
(5)
Base + VSSM + CA: The base framework with both VSSM in MSP7 and CA in RCAB, corresponding to the proposed UDAVM-Net.
In the ablation experiments, only the core modules were removed or modified to evaluate their impact on performance. Specifically, integrating the CA module into the baseline framework (Base + CA) improved PSNR by 0.33 dB, accompanied by a slight increase in SSIM. Next, incorporating the incomplete VSSM module (using the SHS mechanism) into the framework (Base + VSSM_SHS + CA) led to a 0.64 dB improvement in PSNR, with a slight gain in SSIM compared to using CA alone, thus demonstrating the effectiveness of VSSM. Further extending this mechanism to include vertical scanning (Base + VSSM_HVS + CA) yielded an additional 0.27 dB increase in PSNR, while SSIM remained essentially unchanged, indicating that the vertical scanning path is not redundant. When the complete VSSM module was employed (Base + VSSM + CA), which corresponds to the proposed UDAVM-Net, the performance reached its maximum level, yielding a 1.28 dB increase in PSNR over the baseline and underscoring the critical role that these components play in maintaining image quality. Moreover, the gains in SSIM across these experiments were smaller, potentially due to the minimal effect of thin haze on image structure and the inherent strength of the U-Net architecture in preserving structural similarity. Overall, these results underscore the importance of our proposed modules in enhancing image dehazing quality.

6.2. Model Complexity

Figure 15 illustrates the relationship between computational complexity and input scale for the proposed algorithm, compared with the full-attention Transformer [38] under baseline parameter settings. The Transformer’s global attention mechanism incurs prohibitively high computational costs when applied directly to computer vision tasks. Additionally, the patch-based structure of ViT [37] limits pixel interactions across patches, while Swin-T [39] constrains its attention mechanism within shifted windows, enabling linear scalability at the cost of losing a truly global ERF. Both methods effectively avoid quadratic computational complexity at the expense of the full-attention mechanism. Notably, UDAVM-Net exhibits computational scaling characteristics similar to ViT and Swin-T but without relying on patch operations, thereby avoiding the quadratic overhead associated with full-attention Transformers. Furthermore, UDAVM-Net achieves a truly global ERF, matching the performance level of standard full-attention mechanisms, as shown in Figure 8.
Model complexity is typically reflected by both the number of parameters and computational cost. In this study, we use the number of parameters (#Params) and floating-point operations (FLOPs) as evaluation metrics. Table 5 compares Params and FLOPs for the proposed algorithm and baseline methods, all evaluated with an input image size of 128 × 128. Additionally, Transformer [38], Vision Transformer-Base/16 (ViT-B/16) [37], and Swin Transformer-Base (Swin-B) [39] are included for completeness. Although the proposed algorithm does not achieve the lowest #Params or FLOPs (#Params lower than Trinity-Net and FLOPs lower than FFA-Net), it effectively balances the receptive field and computational efficiency, ultimately delivering superior prediction performance.

7. Conclusions

In this paper, we propose a U-shaped Dual Attention Vision Mamba Network for remote sensing image dehazing, along with a sub-meter scale remote sensing dehazing dataset that better aligns with the radiative transfer theory. The core module of UDAVM-Net, RVMB, integrates both spatial and channel dual attention mechanisms. The spatial attention module, extracted by the MPS7 block, adaptively processes unevenly distributed haze over large areas in remote sensing images while preserving fine-grained features. The channel attention module, extracted by the RCAB block, captures the relative relationships of haze across different spectral bands. The RVMB blocks are stacked based on hyperparameters, incorporating the multidimensional information of ground objects and haze into the U-Net backbone. The results demonstrate that UDAVM-Net effectively captures relevant haze information and achieves superior dehazing performance. The VSSM further enhances the filtering of adaptive feature information, improving the quality of the dehazed remote sensing images. Experiments show that UDAVM-Net outperforms other baseline methods and performs exceptionally well on real remote sensing data. In future work, we plan to integrate the radiative transfer model with remote sensing dehazing networks and construct a larger, more diverse remote sensing dehazing dataset to enhance the model’s stability and generalization capability.

Author Contributions

Conceptualization, T.S. and Z.Q.; Funding acquisition, Z.Q.; Methodology, T.S., G.X., and X.T.; Project administration, G.X., J.H., and Z.Q.; Supervision, X.T. and J.H. Validation, T.S., F.C., and Y.L.; Visualization, T.S.; Writing—original draft, T.S.; Writing—review and editing, T.S., G.X., and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

The study was funded by the National Civil Aerospace Project of China (No.D040102).

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Cai, B.; Xu, X.; Jia, K.; Qing, C.; Tao, D. DehazeNet: An End-to-End System for Single Image Haze Removal. IEEE Trans. Image Process. 2016, 25, 5187–5198.
2. Li, B.; Peng, X.; Wang, Z.; Xu, J.; Feng, D. AOD-Net: All-in-One Dehazing Network. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4780–4788.
3. Jiang, H.; Lu, N. Multi-Scale Residual Convolutional Neural Network for Haze Removal of Remote Sensing Images. Remote Sens. 2018, 10, 945.
4. Gu, Z.; Zhan, Z.; Yuan, Q.; Yan, L. Single Remote Sensing Image Dehazing Using a Prior-Based Dense Attentive Network. Remote Sens. 2019, 11, 3008.
5. Liu, X.; Ma, Y.; Shi, Z.; Chen, J. GridDehazeNet: Attention-Based Multi-Scale Network for Image Dehazing. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7313–7322.
6. Qin, X.; Wang, Z.; Bai, Y.; Xie, X.; Jia, H. FFA-Net: Feature Fusion Attention Network for Single Image Dehazing. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11908–11915.
7. Li, Y.; Chen, X. A Coarse-to-Fine Two-Stage Attentive Network for Haze Removal of Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2021, 18, 1751–1755.
8. Ullah, H.; Muhammad, K.; Irfan, M.; Anwar, S.; Sajjad, M.; Imran, A.S.; de Albuquerque, V.H. Light-DehazeNet: A Novel Lightweight CNN Architecture for Single Image Dehazing. IEEE Trans. Image Process. 2021, 30, 8968–8982.
9. Jiang, B.; Chen, G.; Wang, J.; Ma, H.; Wang, L.; Wang, Y.; Chen, X. Deep Dehazing Network for Remote Sensing Image with Non-Uniform Haze. Remote Sens. 2021, 13, 4443.
10. Guo, J.; Yang, J.; Yue, H.; Tan, H.; Hou, C.; Li, K. RSDehazeNet: Dehazing Network With Channel Refinement for Multispectral Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2021, 59, 2535–2549.
11. Wei, J.; Wu, Y.; Chen, L.; Yang, K.; Lian, R. Zero-Shot Remote Sensing Image Dehazing Based on a Re-Degradation Haze Imaging Model. Remote Sens. 2022, 14, 5737.
12. Li, J.; Chen, M.; Hou, S.; Wang, Y.; Luo, Q.; Wang, C. An Improved S2A-Net Algorithm for Ship Object Detection in Optical Remote Sensing Images. Remote Sens. 2023, 15, 4559.
13. Sun, H.; Luo, Z.; Ren, D.; Hu, W.; Du, B.; Yang, W.; Wan, J.; Zhang, L. Partial Siamese With Multiscale Bi-Codec Networks for Remote Sensing Image Haze Removal. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4106516.
14. Wei, J.; Cao, Y.; Yang, K.; Chen, L.; Wu, Y. Self-Supervised Remote Sensing Image Dehazing Network Based on Zero-Shot Learning. Remote Sens. 2023, 15, 2732.
15. Dong, W.; Wang, C.; Sun, H.; Teng, Y.; Liu, H.; Zhang, Y.; Zhang, K.; Li, X.; Xu, X. End-to-End Detail-Enhanced Dehazing Network for Remote Sensing Images. Remote Sens. 2024, 16, 225.
16. Fang, J.; Wang, X.; Li, Y.; Zhang, X.; Zhang, B.; Gade, M. GLUENet: An Efficient Network for Remote Sensing Image Dehazing with Gated Linear Units and Efficient Channel Attention. Remote Sens. 2024, 16, 1450.
17. He, Y.; Li, C.; Li, X.; Bai, T. A Lightweight CNN Based on Axial Depthwise Convolution and Hybrid Attention for Remote Sensing Image Dehazing. Remote Sens. 2024, 16, 2822.
18. Zhou, H.; Wang, L.; Li, Q.; Guan, X.; Tao, T. Multi-Dimensional and Multi-Scale Physical Dehazing Network for Remote Sensing Images. Remote Sens. 2024, 16, 4780.
19. Du, Y.; Li, J.; Sheng, Q.; Zhu, Y.; Wang, B.; Ling, X. Dehazing Network: Asymmetric Unet Based on Physical Model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5607412.
20. Chi, K.; Yuan, Y.; Wang, Q. Trinity-Net: Gradient-Guided Swin Transformer-Based Remote Sensing Image Dehazing and Beyond. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4702914.
21. Song, Y.; He, Z.; Qian, H.; Du, X. Vision Transformers for Single Image Dehazing. IEEE Trans. Image Process. 2023, 32, 1927–1941.
22. Zhang, X.; Xie, F.; Ding, H.; Yan, S.; Shi, Z. Proxy and Cross-Stripes Integration Transformer for Remote Sensing Image Dehazing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5640315.
23. Nie, J.; Xie, J.; Sun, H. Remote Sensing Image Dehazing via a Local Context-Enriched Transformer. Remote Sens. 2024, 16, 1422.
24. Yang, L.; Cao, J.; Wang, H.; Dong, S.; Ning, H. Hierarchical Semantic-Guided Contextual Structure-Aware Network for Spectral Satellite Image Dehazing. Remote Sens. 2024, 16, 1525.
25. Wang, Y.; Zhao, J.; Yao, L.; Fu, C. Depth-Guided Dehazing Network for Long-Range Aerial Scenes. Remote Sens. 2024, 16, 2081.
26. Narasimhan, S.G.; Nayar, S.K. Removing weather effects from monochrome images. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Kauai, HI, USA, 8–14 December 2001; p. 2.
27. Wang, J.; Lu, K.; Xue, J.; He, N.; Shao, L. Single Image Dehazing Based on the Physical Model and MSRCR Algorithm. IEEE Trans. Circuits Syst. Video Technol. 2018, 28, 2190–2199.
  28. Zhu, R.; Wang, L.J. Improved wavelet transform algorithm for single image dehazing. Optik 2014, 125, 3064–3066. [Google Scholar] [CrossRef]
  29. He, K.; Sun, J.; Tang, X. Single Image Haze Removal Using Dark Channel Prior. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 2341–2353. [Google Scholar] [CrossRef]
  30. Fang, S.; Xia, X.; Huo, X.; Chen, C. Image Dehazing Using Polarization Effects of Objects and Airlight. Opt. Express 2014, 22, 19523–19537. [Google Scholar] [CrossRef]
  31. Wang, R.; Wang, G. Single Image Recovery in Scattering Medium by Propagating Deconvolution. Opt. Express 2014, 22, 8114–8119. [Google Scholar] [CrossRef]
  32. Zhu, Q.; Mai, J.; Shao, L. A Fast Single Image Haze Removal Algorithm Using Color Attenuation Prior. IEEE Trans. Image Process. 2015, 24, 3522–3533. [Google Scholar] [CrossRef]
  33. Shi, X.; Huang, F.; Ju, L.; Fan, Z.; Zhao, S.; Chen, S. Hierarchical Deconvolution Dehazing Method Based on Transmission Map Segmentation. Opt. Express 2023, 31, 43234–43249. [Google Scholar] [CrossRef]
  34. Wang, T.; Du, L.; Yi, W.; Hong, J.; Zhang, L.; Zheng, J.; Li, C.; Ma, X.; Zhang, D.; Fang, W.; et al. An adaptive atmospheric correction algorithm for the effective adjacency effect correction of submeter-scale spatial resolution optical satellite images: Application to a WorldView-3 panchromatic image. Remote Sens. Environ. 2021, 259, 112412. [Google Scholar] [CrossRef]
  35. Luo, W.; Li, Y.; Urtasun, R.; Zemel, R. Understanding the Effective Receptive Field in Deep Convolutional Neural Networks. arXiv 2017, arXiv:1701.04128. [Google Scholar]
  36. Ding, X.; Zhang, X.; Zhou, Y.; Han, J.; Ding, G.; Sun, J. Scaling Up Your Kernels to 31 × 31: Revisiting Large Kernel Design in CNNs. arXiv 2022, arXiv:2203.06717. [Google Scholar]
  37. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar]
  38. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Advances in Neural Information Processing Systems, Proceedings of the Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Von Luxburg, U., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30, Available online: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf (accessed on 4 December 2017).
  39. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. Available online: http://doi.ieeecomputersociety.org/10.1109/ICCV48922.2021.00986 (accessed on 28 February 2022).
  40. Gu, A.; Dao, T.; Ermon, S.; Rudra, A.; Ré, C. HiPPO: Recurrent Memory with Optimal Polynomial Projections. In Advances in Neural Information Processing Systems, Proceedings of the Annual Conference on Neural Information Processing Systems 2020, Online, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; pp. 1474–1487. Available online: https://proceedings.neurips.cc/paper_files/paper/2020/file/102f0bb6efb3a6128a3c750dd16729be-Paper.pdf (accessed on 6 December 2020).
  41. Gu, A.; Johnson, I.; Goel, K.; Saab, K.; Dao, T.; Rudra, A.; Ré, C. Combining Recurrent, Convolutional, and Continuous-time Models with Linear State Space Layers. In Advances in Neural Information Processing Systems, Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Online, 6–14 December 2021; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P.S., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; pp. 572–585. Available online: https://proceedings.neurips.cc/paper_files/paper/2021/file/05546b0e38ab9175cd905eebcc6ebb76-Paper.pdf (accessed on 6 December 2021).
  42. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2024, arXiv:2312.00752. [Google Scholar]
  43. Kalman, R.E. A new approach to linear filtering and prediction problems. ASME J. Basic Eng. 1960, 82, 35–45. [Google Scholar] [CrossRef]
  44. Chen, Z.; Brown, E.N. State space model. Scholarpedia 2013, 8, 30868. Available online: http://www.scholarpedia.org/article/State_space_model (accessed on 1 March 2024). [CrossRef]
  45. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. VMamba: Visual State Space Model. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; Available online: https://openreview.net/forum?id=ZgtLQQR1K7 (accessed on 26 September 2024).
  46. Guo, H.; Li, J.; Dai, T.; Ouyang, Z.; Ren, X.; Xia, S.-T. MambaIR: A Simple Baseline for Image Restoration with State-Space Model. In Computer Vision—ECCV 2024, Proceedings of the 18th European Conference, Milan, Italy, 29 September–4 October 2024; Aleš, L., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer Nature Switzerland: Cham, Switzerland, 2025; pp. 222–241. [Google Scholar] [CrossRef]
  47. Zheng, Z.; Wu, C. U-shaped Vision Mamba for Single Image Dehazing. arXiv 2024, arXiv:2402.04139. [Google Scholar]
  48. Ma, J.; Li, F.; Wang, B. U-Mamba: Enhancing Long-range Dependency for Biomedical Image Segmentation. arXiv 2024, arXiv:2401.04722. [Google Scholar]
  49. Ångström, A. The Parameters of Atmospheric Turbidity. Tellus 1964, 16, 64–75. [Google Scholar] [CrossRef]
  50. Vermote, E.F.; Tanre, D.; Deuze, J.L.; Herman, M.; Morcette, J.-J. Second Simulation of the Satellite Signal in the Solar Spectrum, 6S: An Overview. IEEE Trans. Geosci. Remote Sens. 1997, 35, 675–686. [Google Scholar] [CrossRef]
  51. Mitchell, O.R.; Delp, E.J.; Chen, P.L. Filtering to Remove Cloud Cover in Satellite Imagery. IEEE Trans. Geosci. Electron. 1977, 15, 137–141. [Google Scholar] [CrossRef]
  52. Li, J.; Wu, Z.; Hu, Z.; Zhang, J.; Li, M.; Mo, L.; Molinier, M. Thin Cloud Removal in Optical Remote Sensing Images Based on Generative Adversarial Networks and Physical Model of Cloud Distortion. ISPRS J. Photogramm. Remote Sens. 2020, 166, 373–389. [Google Scholar] [CrossRef]
  53. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  54. Wang, Z.; Zheng, J.-Q.; Zhang, Y.; Cui, G.; Li, L. Mamba-UNet: UNet-Like Pure Visual Mamba for Medical Image Segmentation. arXiv 2024, arXiv:2402.05079. [Google Scholar]
  55. Fu, D.Y.; Dao, T.; Saab, K.K.; Thomas, A.W.; Rudra, A.; Ré, C. Hungry Hungry Hippos: Towards Language Modeling with State Space Models. arXiv 2023, arXiv:2212.14052. [Google Scholar]
  56. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image Super-Resolution Using Very Deep Residual Channel Attention Networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  57. Ancuti, C.O.; Ancuti, C.; Timofte, R. NH-HAZE: An Image Dehazing Benchmark with Non-Homogeneous Hazy and Haze-Free Images. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 1798–1805. [Google Scholar] [CrossRef]
  58. Ancuti, C.; Ancuti, C.O.; Timofte, R.; De Vleeschouwer, C. I-HAZE: A Dehazing Benchmark with Real Hazy and Haze-Free Indoor Images. In Advanced Concepts for Intelligent Vision Systems, Proceedings of the 19th International Conference, ACIVS 2018, Poitiers, France, 24–27 September 2018; Springer International Publishing: Cham, Switzerland, 2018; pp. 620–631. [Google Scholar] [CrossRef]
  59. Ancuti, C.O.; Ancuti, C.; Timofte, R.; De Vleeschouwer, C. O-HAZE: A Dehazing Benchmark with Real Hazy and Haze-Free Outdoor Images. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 867–8678. [Google Scholar] [CrossRef]
  60. Ancuti, C.O.; Ancuti, C.; Sbert, M.; Timofte, R. Dense-Haze: A Benchmark for Image Dehazing with Dense-Haze and Haze-Free Images. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 1014–1018. [Google Scholar] [CrossRef]
  61. Ancuti, C.; Ancuti, C.O.; De Vleeschouwer, C. D-HAZY: A Dataset to Evaluate Quantitatively Dehazing Algorithms. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 2226–2230. [Google Scholar] [CrossRef]
  62. Li, B.; Ren, W.; Fu, D.; Tao, D.; Feng, D.; Zeng, W.; Wang, Z. Benchmarking Single-Image Dehazing and Beyond. IEEE Trans. Image Process. 2019, 28, 492–505. [Google Scholar] [CrossRef]
  63. Qin, M.; Xie, F.; Li, W.; Shi, Z.; Zhang, H. Dehazing for Multispectral Remote Sensing Images Based on a Convolutional Neural Network With the Residual Architecture. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 1645–1655. [Google Scholar] [CrossRef]
  64. Li, Z.; Hou, W.; Qiu, Z.; Ge, B.; Xie, Y.; Hong, J.; Ma, Y.; Peng, Z.; Fang, W.; Zhang, D.; et al. Preliminary On-Orbit Performance Test of the First Polarimetric Synchronization Monitoring Atmospheric Corrector (SMAC) On-Board High-Spatial Resolution Satellite Gao Fen Duo Mo (GFDM). IEEE Trans. Geosci. Remote Sens. 2022, 60, 4104014. [Google Scholar] [CrossRef]
  65. Hadjimitsis, D.G.; Clayton, C.R.I.; Retalis, A. On the Darkest Pixel Atmospheric Correction Algorithm: A Revised Procedure Applied Over Satellite Remotely Sensed Images Intended for Environmental Applications. In Remote Sensing for Environmental Monitoring, GIS Applications, and Geology III, Proceedings of the Remote Sensing, Barcelona, Spain, 8–12 September 2003; SPIE: Bellingham, WA, USA, 2004; Volume 5239, pp. 464–471. [Google Scholar] [CrossRef]
  66. Huang, B.; Li, Z.; Yang, C.; Sun, F.; Song, Y. Single Satellite Optical Imagery Dehazing using SAR Image Prior Based on Conditional Generative Adversarial Networks. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass, CO, USA, 1–5 March 2020; pp. 1795–1802. [Google Scholar] [CrossRef]
  67. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980. [Google Scholar]
  68. Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv 2017, arXiv:1608.03983. [Google Scholar]
  69. Zhang, L.; Zhang, L.; Bovik, A.C. A Feature-Enriched Completely Blind Image Quality Evaluator. IEEE Trans. Image Process. 2015, 24, 2579–2591. [Google Scholar] [CrossRef]
Figure 1. The figure illustrates the simplified degradation model of hazy images. The signal at the entrance pupil irradiance consists of the scattered skylight, the forward scattering of the light source, and the transmission from the target pixel. The simplified model is generally expressed as L(x) = J(x)t(x) + (1 − t(x))A. Recovering the transmission map t(x) and atmospheric light A from a single image is an extremely ill-posed problem (the figure represents a general scenario, with SID corresponding to the vertical direction).
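As a worked example of the simplified degradation model in Figure 1, the sketch below synthesizes a hazy image from a clear image, a transmission map, and atmospheric light, and then inverts the model when t(x) and A are known exactly; all array shapes and values are illustrative.

```python
import numpy as np

def synthesize_haze(J, t, A):
    """Apply the simplified degradation model L(x) = J(x) t(x) + (1 - t(x)) A.
    J: clear image in [0, 1]; t: per-pixel transmission in (0, 1]; A: atmospheric light."""
    return J * t[..., None] + (1.0 - t[..., None]) * A

def invert_haze(L, t, A, t_min=0.1):
    """Recover J given (estimated) t and A; clamping t avoids amplifying noise."""
    t = np.clip(t, t_min, 1.0)[..., None]
    return (L - A) / t + A

# Toy example with a spatially varying transmission field (values are illustrative).
J = np.random.rand(64, 64, 3)
t = np.random.rand(64, 64) * 0.6 + 0.3
A = np.array([0.90, 0.90, 0.92])
L = synthesize_haze(J, t, A)
print(np.abs(invert_haze(L, t, A) - J).max())  # ~0 when t and A are known exactly
```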
Figure 2. Mamba adopts a streamlined module design that integrates Hungry Hungry Hippos (H3) [55] and the multi-layer perceptron (MLP), the core components of most structured state space model (SSM) architectures. Unlike the traditional design, which alternates between these two modules, the Mamba block achieves efficient modeling by homogeneously repeating a single block. Specifically, Mamba simplifies the structure by replacing the first multiplicative gate in the H3 block with an activation function. Additionally, it incorporates the SSM into the main branch, distinguishing it from conventional MLP blocks. For activation functions, Mamba utilizes SiLU or Swish to enhance its nonlinear expressive capability.
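To make the block layout described in Figure 2 concrete, a structural sketch is given below: an expanded main branch (depthwise convolution, SiLU activation, and an SSM stage) is modulated by a SiLU-gated branch and projected back to the model dimension. The selective SSM is replaced here by an identity placeholder, so the sketch only illustrates the layout, not the actual selective-scan kernel; the layer names and expansion factor are assumptions.

```python
import torch
import torch.nn as nn

class MambaBlockSketch(nn.Module):
    """Structural sketch of a gated Mamba-style block: expanded main branch
    (conv + SiLU + SSM placeholder) modulated by a SiLU gate branch."""
    def __init__(self, dim, expand=2):
        super().__init__()
        inner = dim * expand
        self.in_proj = nn.Linear(dim, 2 * inner)   # main branch + gate branch
        self.conv1d = nn.Conv1d(inner, inner, kernel_size=3, padding=1, groups=inner)
        self.ssm = nn.Identity()                   # placeholder for the selective SSM
        self.act = nn.SiLU()
        self.out_proj = nn.Linear(inner, dim)

    def forward(self, x):                          # x: (batch, length, dim)
        main, gate = self.in_proj(x).chunk(2, dim=-1)
        main = self.conv1d(main.transpose(1, 2)).transpose(1, 2)
        main = self.ssm(self.act(main))
        return self.out_proj(main * self.act(gate))

y = MambaBlockSketch(64)(torch.randn(2, 128, 64))
print(y.shape)  # torch.Size([2, 128, 64])
```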
Figure 3. The proposed U-shaped Dual Attention Vision Mamba Network (UDAVM-Net) architecture builds upon the classic U-Net framework. First, the input data undergoes shallow feature extraction before being flattened and fed into multiple Residual Visual Mamba Block Groups (RVMBGs) for deeper feature representation. The data are then processed through the encoder–decoder pathway of U-Net, ultimately being reshaped to its original spatial dimensions. Within each Residual Visual Mamba Block (RVMB), Conv operations, self-attention mechanisms, linear projection, and nonlinear activation functions are combined to capture local and global features effectively.
Figure 4. The RVMBG module is composed of multiple stacked RVMB blocks. Each RVMB contains two main components: a spatial attention component, the multi-path scanning selective SSM (MPS7) block centered on the visual state space module (VSSM), and a channel attention component, the residual channel attention block (RCAB).
Figure 5. VSSM block, which adopts a multi-path scanning mechanism that enhances the ability to capture the contextual relationships between pixels around the anchor point.
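The multi-path scanning idea in Figure 5 can be illustrated with a VMamba-style four-direction cross-scan [45], which flattens a feature map into row-major, column-major, and reversed sequences before the SSM processes each path; the exact scan paths used by the MPS7 block may differ from this minimal sketch.

```python
import torch

def cross_scan(feat):
    """Flatten a (C, H, W) feature map into four 1-D scan orders:
    row-major, column-major, and their reverses (illustrative cross-scan)."""
    C, H, W = feat.shape
    rows = feat.reshape(C, H * W)                   # left-to-right, top-to-bottom
    cols = feat.permute(0, 2, 1).reshape(C, H * W)  # top-to-bottom, left-to-right
    return torch.stack([rows, rows.flip(-1), cols, cols.flip(-1)], dim=0)

seqs = cross_scan(torch.randn(16, 8, 8))
print(seqs.shape)  # torch.Size([4, 16, 64]) — four directional sequences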
Figure 6. Simplified pipeline for synthesizing the remote sensing simulation of top of atmosphere (RSSTOA) dataset. Here, MUX denotes multispectral, Atm. represents atmospheric, and Refs refers to reflectance.
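The TOA simulation step sketched in Figure 6 can be related to the standard 6S formulation for a uniform Lambertian surface [50], shown below; the atmospheric values in the example are illustrative assumptions and are not the parameters used to build RSSTOA.

```python
def toa_reflectance(rho_s, rho_path, T_down, T_up, S):
    """Standard 6S-style TOA reflectance [50]:
    rho_TOA = rho_path + T_down * T_up * rho_s / (1 - S * rho_s),
    where rho_s is the surface reflectance, rho_path the atmospheric path
    reflectance, T_down/T_up the total downward/upward transmittances,
    and S the atmospheric spherical albedo."""
    return rho_path + (T_down * T_up * rho_s) / (1.0 - S * rho_s)

# Illustrative values for a moderately hazy atmosphere (assumed, not from the paper).
print(toa_reflectance(rho_s=0.10, rho_path=0.08, T_down=0.85, T_up=0.88, S=0.12))
```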
Figure 7. Examples of synthesized hazy images from the RSSTOA dataset.
Figure 8. Visualization results of the effective receptive field (ERF) [35,36]. A larger peach region indicates a larger ERF. It can be seen that our method achieves a global ERF.
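ERF maps such as those in Figure 8 can be produced with the gradient-based procedure of Luo et al. [35]: back-propagate the centre output activation to the input and visualize the absolute input gradients. A minimal sketch follows; the choice of probe input (all ones) and the normalization are assumptions for illustration.

```python
import torch

def effective_receptive_field(model, size=256, channels=3):
    """Estimate the ERF [35] by back-propagating the centre output activation
    to the input and averaging absolute gradients over channels."""
    x = torch.ones(1, channels, size, size, requires_grad=True)
    y = model(x)
    # Gradient of the centre pixel (summed over output channels) w.r.t. the input.
    y[0, :, size // 2, size // 2].sum().backward()
    erf = x.grad.abs().mean(dim=1)[0]
    return erf / erf.max()

erf = effective_receptive_field(torch.nn.Conv2d(3, 3, 3, padding=1))
print(erf.shape)  # torch.Size([256, 256]); non-zero only near the centre for a 3×3 conv
```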
Figure 9. Visual comparisons of dehazed results by different methods on the StateHaze1k Thin dataset. The even-numbered rows show the error maps between the dehazed images and the ground truth, where smaller errors are indicated by deeper shades of blue. (a) Hazy image; (b) DCP [29]; (c) AOD-Net [2]; (d) Light-DehazeNet [8]; (e) FFA-Net [6]; (f) GridDehazeNet [5]; (g) Trinity-Net [20]; (h) AU-Net [19]; (i) UDAVM-Net (ours); and (j) Ground truth.
Figure 10. Visual comparisons of dehazed results by different methods on the StateHaze1k moderate dataset. Smaller errors are indicated by deeper shades of blue. (a) Hazy image; (b) DCP [29]; (c) AOD-Net [2]; (d) Light-DehazeNet [8]; (e) FFA-Net [6]; (f) GridDehazeNet [5]; (g) Trinity-Net [20]; (h) AU-Net [19]; (i) UDAVM-Net (ours); (j) Ground truth.
Figure 11. Visual comparisons of dehazed results by different methods on the StateHaze1k thick dataset. Smaller errors are indicated by deeper shades of blue. (a) Hazy image; (b) DCP [29]; (c) AOD-Net [2]; (d) Light-DehazeNet [8]; (e) FFA-Net [6]; (f) GridDehazeNet [5]; (g) Trinity-Net [20]; (h) AU-Net [19]; (i) UDAVM-Net (ours); (j) Ground truth.
Figure 12. Visual comparisons of the dehazed results for sparse ground objects by different methods on the RSSTOA dataset. Smaller errors are indicated by deeper shades of blue. (a) Hazy image; (b) DCP [29]; (c) AOD-Net [2]; (d) Light-DehazeNet [8]; (e) FFA-Net [6]; (f) GridDehazeNet [5]; (g) Trinity-Net [20]; (h) AU-Net [19]; (i) UDAVM-Net (ours); and (j) ground truth.
Figure 13. Visual comparisons of the dehazed results for dense ground objects by different methods on the RSSTOA dataset. Smaller errors are indicated by deeper shades of blue. (a) Hazy image; (b) DCP [29]; (c) AOD-Net [2]; (d) Light-DehazeNet [8]; (e) FFA-Net [6]; (f) GridDehazeNet [5]; (g) Trinity-Net [20]; (h) AU-Net [19]; (i) UDAVM-Net (ours); and (j) ground truth.
Figure 14. Visual comparisons of dehazed results by different methods on real remote sensing data. (a) Hazy image; (b) AOD-Net [2]; (c) Light-DehazeNet [8]; (d) FFA-Net [6]; (e) GridDehazeNet [5]; (f) Trinity-Net [20]; (g) AU-Net [19]; (h) UDAVM-Net (ours).
Figure 15. Comparison of computational complexity at varying input image scales. The left y axis corresponds to UDAVM-Net, while the right y axis corresponds to Transformer. As the exact floating-point operations (FLOPs) values could not be directly obtained, they were estimated based on the network architecture and parameters. Notably, the proposed algorithm demonstrates a linear scaling trend, effectively avoiding the prohibitive quadratic computational complexity of the Transformer with full-attention mechanism.
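The scaling behaviour in Figure 15 follows from simple operation counts: global self-attention costs on the order of N² · d for N = H × W tokens, whereas a selective-scan layer costs on the order of N · d · d_state. The rough estimate below (constants omitted, values illustrative) shows how quickly this gap widens with image size.

```python
def attention_flops(h, w, dim):
    """Rough FLOPs of one global self-attention layer: O(N^2 * dim), N = h * w."""
    n = h * w
    return 2 * n * n * dim  # QK^T plus attention-weighted V (constants omitted)

def ssm_flops(h, w, dim, state=16):
    """Rough FLOPs of one selective-scan layer: O(N * dim * state), linear in N."""
    return 2 * h * w * dim * state

for s in (64, 128, 256, 512):
    print(s, f"attention ≈ {attention_flops(s, s, 64):.2e}", f"ssm ≈ {ssm_flops(s, s, 64):.2e}")
```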
Table 1. Comparison of quantitative results on the StateHaze1k datasets. Bold and underline indicate the best and second-best results, respectively.
Methods | Thin Haze (PSNR / SSIM) | Moderate Haze (PSNR / SSIM) | Thick Haze (PSNR / SSIM) | Average (PSNR / SSIM)
Classical: DCP [29] | 16.84 / 0.8215 | 19.04 / 0.8948 | 14.32 / 0.6931 | 16.73 / 0.8031
CNN based: AOD-Net [2] | 18.90 / 0.8521 | 18.50 / 0.8640 | 16.43 / 0.7505 | 17.94 / 0.8222
CNN based: Light-DehazeNet [8] | 18.20 / 0.8523 | 18.99 / 0.8876 | 15.49 / 0.7381 | 17.56 / 0.8260
CNN based: FFA-Net [6] | 23.86 / 0.9183 | 24.57 / 0.9344 | 21.38 / 0.8506 | 23.27 / 0.9011
CNN based: GridDehazeNet [5] | 25.18 / 0.9208 | 26.02 / 0.9390 | 21.81 / 0.8477 | 24.34 / 0.9025
CNN based: AU-Net [19] | 22.72 / 0.9062 | 26.47 / 0.9410 | 21.25 / 0.8318 | 23.48 / 0.8930
Swin-T based: Trinity-Net [20] | 22.77 / 0.9367 | 24.78 / 0.9400 | 20.31 / 0.8895 | 22.62 / 0.9221
Ours: UDAVM-Net | 26.79 / 0.9340 | 27.53 / 0.9521 | 23.48 / 0.8732 | 25.59 / 0.9198
Table 2. Comparison of quantitative results on the RSSTOA dataset. Bold and underline indicate the best and second-best results, respectively.
Methods | PSNR | SSIM
Classical: DCP [29] | 19.89 | 0.7429
CNN based: AOD-Net [2] | 21.48 | 0.7843
CNN based: Light-DehazeNet [8] | 22.64 | 0.8006
CNN based: FFA-Net [6] | 34.95 | 0.9619
CNN based: GridDehazeNet [5] | 35.04 | 0.9681
CNN based: AU-Net [19] | 32.36 | 0.9353
Swin-T based: Trinity-Net [20] | 33.25 | 0.9356
Ours: UDAVM-Net | 38.03 | 0.9734
Table 3. Comparison of quantitative results on real remote sensing data. Bold and underline indicate the best and second-best results, respectively.
Methods | Hazed Images | AOD-Net [2] | Light-DehazeNet [8] | FFA-Net [6] | GridDehazeNet [5] | Trinity-Net [20] | AU-Net [19] | UDAVM-Net
IL-NIQE | 58.94 | 64.80 | 73.92 | 44.62 | 49.22 | 45.43 | 44.45 | 43.82
Table 4. Ablation experiments on StateHaze1k thin dataset. Bold numbers represent the best results.
Model | PSNR | SSIM
Base | 24.32 | 0.9207
Base + CA | 24.65 | 0.9217
Base + VSSM_SHS + CA | 25.29 | 0.9260
Base + VSSM_HVS + CA | 25.56 | 0.9249
Base + VSSM + CA | 25.60 | 0.9273
Table 5. Comparison of number of parameters (#Params) and FLOPs across various models. For models lacking direct FLOPs data, estimates were derived based on their respective frameworks and are marked with ≈ in the table.
Methods | #Params | FLOPs
Classical: DCP [29] | - | -
CNN based: AOD-Net [2] | 0.02 M | 0.03 G
CNN based: Light-DehazeNet [8] | 0.03 M | 0.49 G
CNN based: FFA-Net [6] | 4.68 M | 75.54 G
CNN based: GridDehazeNet [5] | 0.96 M | 5.36 G
CNN based: AU-Net [19] | 7.14 M | 13.01 G
Transformer based: Transformer [38] | 65 M | ≈15.88 T
Transformer based: ViT-B/16 [37] | 86 M | ≈11.2 G
Transformer based: Swin-B [39] | 88 M | ≈5.03 G
Transformer based: Trinity-Net [20] | 20.14 M | 7.61 G
Ours: UDAVM-Net | 19.57 M | 21.34 G