FSSM: Frequency-Enhanced State Space Modeling with FFT-Based Two-Sided Non-Causal Convolution for Image Dehazing

Zeng, Li; Huang, Yinqing

doi:10.3390/jimaging12060260

Open AccessArticle

FSSM: Frequency-Enhanced State Space Modeling with FFT-Based Two-Sided Non-Causal Convolution for Image Dehazing

by

Li Zeng

and

Yinqing Huang

^*

School of Mechatronics and Vehicle Engineering, Chongqing Jiaotong University, Chongqing 400074, China

^*

Author to whom correspondence should be addressed.

J. Imaging 2026, 12(6), 260; https://doi.org/10.3390/jimaging12060260 (registering DOI)

Submission received: 19 April 2026 / Revised: 7 June 2026 / Accepted: 8 June 2026 / Published: 13 June 2026

(This article belongs to the Topic Computer Vision and Image Processing, 3rd Edition)

Download

Browse Figures

Versions Notes

Abstract

Image dehazing is a fundamental visual restoration task for improving visual perception under low-visibility weather conditions, especially in UAV-based remote sensing, traffic monitoring, and surveillance scenarios. Existing convolutional neural networks are effective in local feature extraction but remain limited in long-range dependency modeling, while Transformer-based methods improve global modeling at the cost of high computational complexity. To address these issues, this paper proposes an efficient image-dehazing framework termed FSSM, which integrates frequency-enhanced State Space Modeling with a hierarchical encoder–decoder architecture. Specifically, an FFT-based State Space Block (FFTSSB) is designed to reformulate state propagation as frequency-domain two-sided non-causal convolution, enabling efficient bidirectional global dependency modeling without explicit recursive scanning. Furthermore, a Frequency-Aware Discriminative Enhancement Block (FDEB) is introduced to enhance local textures, edges, and structural details through spatial gating and lightweight block-wise frequency modulation. Based on these two components, a Frequency-Aware State Interaction (FASI) block is constructed to progressively couple global state propagation and local frequency-aware enhancement. Experimental results on the HazyDet dataset demonstrate that FSSM achieves favorable restoration accuracy, structural consistency, and perceptual quality compared with representative dehazing methods. Ablation studies further validate the effectiveness of the proposed two-sided FFT-based state modeling, frequency-aware enhancement, and hierarchical multi-scale design.

Keywords:

image dehazing; state space model; FFT-based convolution; frequency-domain enhancement; UAV images

1. Introduction

Image dehazing is a fundamental visual perception task under adverse weather conditions and plays a critical role in a wide range of applications, including autonomous driving, video surveillance, and remote sensing [1,2]. In such scenarios, image dehazing not only improves visual quality but also enhances the stability and robustness of downstream vision systems. Traditional dehazing methods are typically derived from the atmospheric scattering model and benefit from clear physical interpretability. Although these prior-based methods can achieve satisfactory results in relatively controlled scenarios, their performance often deteriorates in complex real-world scenes with non-uniform illumination and spatially varying haze, thereby limiting their generalization capability [3,4].

With the rapid development of deep learning, convolutional neural networks (CNNs) and vision Transformers have significantly advanced the performance of image dehazing. CNN-based methods mainly rely on sophisticated architectural designs and module innovations, such as residual connections [5], attention mechanisms [6], and large-kernel convolutions [7,8], to enhance feature representation. However, the inherently local receptive field of convolution limits the ability of CNNs to model long-range dependencies, often leading to insufficient global consistency and suboptimal detail recovery [9]. In contrast, Transformer-based models introduce self-attention mechanisms to capture global contextual information and have achieved remarkable progress in image dehazing [10,11]. Nevertheless, the quadratic computational complexity of self-attention with respect to input resolution severely restricts their efficiency, especially for high-resolution images and real-time or mobile deployment scenarios [12,13].

Recently, State Space Models (SSMs) [14,15] have attracted considerable attention because of their ability to model long-range dependencies with linear computational complexity. By parameterizing sequence dynamics through exponential kernels, SSMs have demonstrated impressive performance in time-series modeling, natural language processing, and signal processing tasks. Notably, the improved SSM architecture Mamba [16,17] introduces a selective scanning mechanism that enables the model to efficiently retain task-relevant information while suppressing irrelevant components, offering an attractive balance between modeling capacity and computational efficiency. These properties motivate the exploration of SSMs for image restoration tasks, including image dehazing, where effective non-local modeling is crucial.

However, directly applying SSMs to image dehazing poses several challenges. Naïve adaptations often introduce redundant computations and directional bias, particularly when modeling bidirectional spatial dependencies inherent in images [18,19,20]. Moreover, existing SSM formulations are not explicitly designed to exploit frequency-domain characteristics, which are highly informative for haze-related degradation patterns.

To address these limitations, this paper proposes a novel and efficient image-dehazing framework termed FSSM (Frequency-Enhanced State Space Model). The proposed approach integrates the continuous dynamic modeling capability of State Space Models with frequency-domain enhancement mechanisms, enabling collaborative optimization of global dependency modeling and local detail restoration. Specifically, the main contributions of this work are summarized as follows:

We propose an FFT-based State Space Block (FFTSSB), which performs two-sided non-causal convolution in the frequency domain to replace conventional recursive state propagation. This design enables implicit state propagation with reduced computational redundancy, facilitating efficient global dependency modeling while preserving physical interpretability.
We design a Frequency-Aware State Interaction (FASI) Block, which tightly couples FFTSSB with the Frequency-Aware Discriminative Enhancement Block (FDEB). This unified spatial–frequency modeling unit enhances texture and edge restoration and improves robustness to complex haze structures.
We construct a hierarchical multi-scale encoder–decoder framework with skip connections, allowing effective cross-scale feature interaction and fusion. This design significantly improves detail recovery and structural consistency in dehazed images.
Extensive experiments on the HazyDet dataset demonstrate that FSSM achieves favorable quantitative and qualitative dehazing performance, with improvements in PSNR and SSIM over representative comparison methods. Ablation studies further verify the effectiveness of the proposed FFTSSB and FDEB modules in enhancing global structural consistency and local detail restoration.

2. Related Work

2.1. Traditional Image-Dehazing Methods

Traditional image-dehazing methods are mainly based on physical models or handcrafted statistical priors. The dark channel prior (DCP) proposed by He et al. [4] assumes that at least one color channel in non-sky regions has very low intensity, and has become a classical approach for single-image dehazing. Zhu et al. proposed the color attenuation prior (CAP) [21], which estimates scene depth via linear regression to recover haze-free images. In addition, Berman et al. introduced the non-local dehazing (NLD) method [22], which reconstructs the transmission map by exploiting color clustering properties in natural images.

Although these prior-based methods exhibit strong physical interpretability and can produce visually pleasing results under certain conditions, they generally rely on idealized assumptions. As a result, their performance often degrades in complex environments, such as nighttime scenes, non-uniform haze, or strong illumination conditions [23]. Moreover, due to the lack of adaptive learning capability, these methods struggle to capture fine-grained texture details, frequently leading to over-smoothed results or color distortions, which limits their practical applicability and generalization ability.

2.2. Deep Learning-Based Dehazing Methods

Deep learning techniques have significantly advanced the development of image dehazing. AOD-Net [24] was among the first to embed the atmospheric scattering model into a neural network, enabling end-to-end haze removal. FFA-Net [25] introduced feature attention mechanisms to enhance texture restoration, while MixDehaze-Net [26] adopted multi-scale dense architectures to improve detail reconstruction. OKNet [27] proposed an omni-kernel module that integrates global, large-scale, and local branches, achieving effective multi-scale feature learning for image dehazing, desnowing, and deblurring tasks.

Furthermore, DWTMA-Net [28] combined discrete wavelet transform (DWT) with multi-dimensional attention mechanisms to jointly recover spatial- and frequency-domain features, significantly improving detail fidelity and color consistency. Transformer-based methods, such as MB-TaylorFormer [29], U²-Former [30], and SelfPromer [31] leverage self-attention mechanisms to capture long-range dependencies and have achieved strong dehazing performance. However, their high computational complexity makes them difficult to deploy in real-time systems. Recently, hybrid architectures, including Trinity-Net [32] and Two-Branch Image Dehazing [33], have attempted to combine the local representation capability of CNNs with the global modeling strength of Transformers, achieving a more favorable trade-off between performance and efficiency.

More recently, researchers have further explored real-world and diverse haze removal from the perspectives of domain adaptation, generative modeling, degradation diversity, and lightweight deployment. UCL-Dehaze [34] introduced an unsupervised contrastive learning paradigm to alleviate the domain gap between synthetic and real hazy images. Dehaze-RetinexGAN [35] exploited Retinex-based decomposition and generative adversarial learning for real-world dehazing with unpaired data. MDST-Dehaze [36] considered the diversity of haze distributions across different domains and employed multi-domain haze-style transfer to improve robustness under heterogeneous haze conditions. From the perspective of generative restoration, PRDiff-Dehaze [37] adopted a progressive refinement diffusion framework to handle non-homogeneous haze degradation. Meanwhile, WaveLiteDehaze-Network [38] explored lightweight wavelet-domain processing for real-time dehazing with a small number of parameters.

In addition, recent Transformer-based and State Space Model-based image restoration methods have shown strong potential for long-range dependency modeling. Methods such as DehazeFormer, Restormer, UVM-Net, and Mamba-style image restoration networks provide useful references for improving global feature interaction and restoration quality. However, attention-based Transformer models usually introduce relatively high computational cost when processing high-resolution images, while many SSM/Mamba-style methods still rely on directional scanning or do not explicitly exploit haze-related frequency-domain characteristics. In contrast, the proposed FSSM reformulates state propagation as FFT-based two-sided non-causal convolution and further integrates frequency-aware discriminative enhancement. This design aims to jointly improve global structural consistency and local detail restoration while maintaining computational efficiency.

Overall, these recent studies indicate that image dehazing is moving toward real-world adaptability, degradation-aware modeling, generative restoration, long-range dependency modeling, and efficient deployment. Nevertheless, most existing methods still face challenges in simultaneously modeling long-range structural dependencies, enhancing frequency-domain details, and maintaining computational efficiency. This motivates us to explore a frequency-enhanced State-Space Modeling framework that can jointly improve global structural consistency and local detail restoration while preserving efficiency.

2.3. State Space Models for Image Dehazing

State Space Models (SSMs) have recently demonstrated great potential for long-range dependency modeling with linear or near-linear computational complexity, particularly in natural language processing tasks. The improved SSM architecture Mamba [39] introduces a selective scanning mechanism that adaptively allocates memory to task-relevant features, enabling efficient dynamic sequence modeling and achieving remarkable performance gains in various NLP and signal processing applications.

In the vision domain, researchers have begun to explore the application of SSMs to image restoration and dehazing. VMamba and Wave-MambaAD model implicit global dependencies in the spatial domain through parameterized convolutional kernels, improving structural consistency and dehazing quality. UVM-Net [40] proposed a bidirectional state-space module (Bi-SSM) that integrates convolutional local feature extraction with global modeling capability of SSMs, achieving a balance between performance and computational complexity for single-image dehazing. Laplace-Mamba [41] further combines Laplacian pyramid decomposition with a Mamba–CNN hybrid architecture to separately process low-frequency global structures and high-frequency details, significantly enhancing image hierarchy and edge sharpness.

Despite these advances, existing SSM-based dehazing methods still suffer from relatively high computational overhead and directional redundancy. This motivates the development of more efficient spatial–frequency joint modeling mechanisms to further improve dehazing performance and deployment efficiency.

3. Methodology

3.1. Notation

To improve mathematical clarity and readability, we adopt unified notation conventions throughout this paper. Bold uppercase symbols, such as

I

,

X

, and

F

, denote image or feature tensors. Bold uppercase symbols such as

A

,

B

,

C

, and

D

denote learnable matrices in the state-space formulation. Bold lowercase symbols, such as

x

,

h

, and

y

, denote vectors or feature sequences. Lowercase italic symbols, such as t, H, W, C, R, and

λ

, denote scalar variables. Calligraphic symbols, such as

F (\cdot)

,

F^{- 1} (\cdot)

, and

N (\cdot)

, denote operators or network mappings. The notation ⊙ denotes element-wise multiplication.

3.2. Overall Architecture

This paper proposes a frequency-enhanced State Space Modeling framework based on FFT-based two-sided non-causal convolution, termed FSSM, for efficient single-image dehazing. The core idea is to leverage frequency-domain convolution to efficiently model global dependencies while enhancing state-space representations, thereby achieving high-quality image dehazing.

As illustrated in Figure 1, the proposed FSSM adopts a hierarchical encoder–decoder architecture [42]. Given a hazy input image

I_{hazy} \in R^{H \times W \times 3}

, a

3 \times 3

convolution is first applied to extract shallow features

F_{s} \in R^{H \times W \times C}

, where

H \times W

denotes the spatial resolution and C represents the number of feature channels. The shallow features are then fed into a three-level symmetric encoder–decoder network, in which each encoder and decoder stage consists of stacked Frequency-Aware State Interaction (FASI) blocks for spatial–frequency feature modeling at different scales.

For the encoder and decoder at the l-th level, the input features are progressively transformed through stacked FASI blocks, generating intermediate representations

F_{enc}^{l}

and

F_{dec}^{l}

, where

F_{enc}^{l}, F_{dec}^{l} \in R^{\frac{H}{2^{l - 1}} \times \frac{W}{2^{l - 1}} \times C_{l}}

and

l = 1, 2, 3

. Spatial resolution changes and channel alignment are achieved via

1 \times 1

convolutions combined with bilinear interpolation for downsampling and upsampling, respectively. Skip connections are introduced between corresponding encoder and decoder stages to facilitate the fusion of shallow fine-grained details and deep semantic information, thereby preserving structural boundaries and texture details.

Finally, a

3 \times 3

convolution is applied to the output features of the last decoder stage to predict the residual image

R \in R^{H \times W \times 3}

. The residual image is defined as:

R = N (I_{hazy}),

(1)

where

N (\cdot)

denotes the proposed FSSM dehazing network. The final dehazed image is obtained by adding the predicted residual image to the hazy input:

I_{dehaze} = I_{hazy} + R .

(2)

This residual formulation is also explicitly illustrated in Figure 1a, where the network output is first treated as the residual image and then added to the input hazy image to generate the final dehazed result.

3.3. State Space Modeling and Its Extensions

3.3.1. State Space Model Foundation

State Space Models (SSMs) provide a general framework for modeling dynamic systems and have been widely applied in time-series prediction and signal processing. The relationship among the input, latent state, and output can be formulated as:

h^{'} (t) = A h (t) + B x (t), y (t) = C h (t) + D x (t),

(3)

where

x (t)

,

h (t)

, and

y (t)

denote the input, latent state, and output at continuous time t, respectively.

A

,

B

,

C

, and

D

are learnable system matrices.

By applying zero-order hold (ZOH) discretization, the continuous formulation can be rewritten as:

h_{t} = \bar{A} h_{t - 1} + \bar{B} x_{t}, y_{t} = C h_{t} + D x_{t},

(4)

where

x_{t}

,

h_{t}

, and

y_{t}

denote the discretized input, hidden state, and output at index t, respectively.

\bar{A}

and

\bar{B}

are the discretized state transition and input matrices.

Mamba [16] introduces a selective scanning mechanism (S6) that enables linear-complexity sequence modeling by adaptively controlling information flow. However, when applied to vision tasks, two major limitations arise: (1) unidirectional state propagation restricts the modeling of bidirectional spatial dependencies; and (2) explicit recursive updates incur considerable computational overhead for high-resolution feature maps.

3.3.2. FFT-Based State Space Block (FFTSSB)

To efficiently model long-range dependencies in vision tasks, we propose an FFT-Based State Space Block (FFTSSB). The core idea of this block is to reformulate state-space propagation as a convolution process and then implement the resulting two-sided non-causal convolution in the frequency domain, thereby enabling efficient implicit state propagation with reduced computational overhead.

For a discrete State Space Model, the hidden-state update can be written as a weighted summation over historical inputs. Under a zero-initial-state assumption, the state response can be unfolded as:

h_{t} = \sum_{τ = 0}^{t} {\bar{A}}^{t - τ} \bar{B} x_{τ},

(5)

where

\bar{A}

and

\bar{B}

denote the discretized state transition and input matrices, respectively, and

x_{τ}

denotes the input feature sequence at index

τ

. This formulation indicates that the State-Space Model can be interpreted as a discrete convolution process between the input sequence and the system impulse response.

Accordingly, the impulse response kernel can be defined as:

k [t - τ] = {\bar{A}}^{t - τ} \bar{B},

(6)

where

k [t - τ]

denotes the state-space impulse response kernel corresponding to the relative index

t - τ

. The state update can then be uniformly rewritten as:

h_{t} = \sum_{τ = 0}^{t} k [t - τ] x_{τ} .

(7)

To enable effective long-range dependency modeling with a compact parameterization, we adopt an exponential kernel generator (EKG) to parameterize the impulse response. Specifically, following the exponential kernel parameterization strategy in prior State Space Models [43], the i-th exponential basis function is defined as:

ϕ_{i} (t) = e^{- a_{i} t}, a_{i} = softplus ({\tilde{a}}_{i}) + ϵ, t \in {0, 1, \dots, T} .

(8)

where

a_{i}

denotes the learnable decay rate,

{\tilde{a}}_{i}

is the learnable parameter before activation,

ϵ

is a small positive constant for numerical stability, and T denotes the maximum sequence length.

Based on these exponential bases, the channel-wise single-sided impulse response for the c-th channel is constructed as:

h^{(c)} [t] = \sum_{i = 1}^{S} w_{i}^{(c)} e^{- a_{i} t},

(9)

where

h^{(c)} [t]

denotes the single-sided impulse response at index t in the c-th channel,

w_{i}^{(c)}

denotes the linear combination coefficient of the i-th basis in the c-th channel, and S is the number of basis functions. This parameterization corresponds to the exponential decay response of a continuous-time linear system and provides an interpretable and flexible mechanism for modeling dependencies at different scales.

However, conventional State Space Models generally adopt a causal propagation manner and only utilize historical information for state updating, which is insufficient for capturing the bidirectional spatial dependencies inherent in images. To address this limitation, we further extend the single-sided kernel to a symmetric two-sided non-causal kernel through mirror expansion:

k^{(c)} [τ] = \{\begin{matrix} h^{(c)} [- τ], & τ < 0, \\ h^{(c)} [τ], & τ \geq 0 . \end{matrix}

(10)

where

k^{(c)} [τ]

denotes the two-sided non-causal convolution kernel in the c-th channel, and

τ

represents the relative spatial offset.

Compared with one-sided FFT-based convolution, which only aggregates information from one direction, the proposed two-sided non-causal convolution incorporates both preceding and following spatial contexts. This is more consistent with the non-causal nature of image restoration, where the recovery of each pixel or patch should depend on surrounding structures from multiple directions rather than only historical positions. Therefore, the two-sided design reduces directional bias and enhances global structural consistency, which is particularly important for UAV dehazing scenes with large sky regions, long-range structures, and spatially non-uniform haze.

Given the input feature sequence

x [t] \in R^{C}

, where

x^{(c)} [t]

denotes the feature at position t in the c-th channel, the corresponding convolution output is expressed as:

y^{(c)} [t] = \sum_{τ = - R}^{R} k^{(c)} [τ] x^{(c)} [t - τ],

(11)

where R denotes the convolution radius and

y^{(c)} [t]

denotes the output feature at position t in the c-th channel. In this way, FFTSSB is able to simultaneously capture forward and backward spatial dependencies, thereby enhancing global structure modeling capability.

To further improve computational efficiency, the above convolution is implemented in the frequency domain according to the convolution theorem:

Y^{(c)} = F {x^{(c)}} ⊙ F {k^{(c)}}, y^{(c)} = F^{- 1} {Y^{(c)}},

(12)

where

F (\cdot)

and

F^{- 1} (\cdot)

denote the Fourier transform and inverse Fourier transform, respectively,

Y^{(c)}

denotes the frequency-domain representation of the convolution result, and ⊙ represents element-wise multiplication. By performing convolution in the frequency domain, FFTSSB avoids explicit recursive state updates and significantly reduces the computational burden while preserving strong global modeling capability.

From a theoretical perspective, the proposed FFTSSB can be viewed as a convolutional realization of the discrete solution to the state-space equations. Through the interpretable exponential kernel parameterization provided by the EKG and the symmetric two-sided non-causal extension, FFTSSB enables efficient bidirectional state propagation in the frequency domain. Compared with the unidirectional modeling strategies adopted in S4 [43] and Hyena [44], the proposed design is more suitable for image dehazing, where jointly modeling long-range structural dependencies and spatially distributed degradation patterns is crucial. Moreover, FFTSSB provides globally coherent feature priors for the subsequent Frequency-Aware State Interaction (FASI) block, facilitating the effective integration of global structure modeling and local detail restoration.

3.3.3. Frequency-Aware Discriminative Enhancement Block (FDEB)

Although State Space Models (SSMs) are effective in modeling long-range dependencies and global structures, accurate recovery of local textures and edge details remains equally important for image restoration tasks such as image dehazing. To compensate for the limited local discriminative modeling capability of SSM-based global representations, we introduce a Frequency-Aware Discriminative Enhancement Block (FDEB), which is designed to refine the features produced by the FFT-Based State Space Block (FFTSSB) through explicit local spatial filtering and lightweight frequency-domain modulation.

In deep neural networks, feed-forward structures are commonly used for nonlinear transformation and local feature enhancement, and have been shown to play an important role in image restoration tasks [45]. Recent studies further indicate that introducing frequency-domain operations into feed-forward networks can improve restoration quality by selectively enhancing informative spectral components. For example, FFTformer [46] employs frequency-domain feature modulation to strengthen texture and detail reconstruction. However, directly performing Fast Fourier Transform (FFT) on high-dimensional intermediate feature maps usually incurs considerable computational overhead, especially when the feature resolution and channel number are large.

To address this issue, FDEB adopts a lightweight spatial-frequency collaborative design. Instead of applying FFT directly to the original high-dimensional feature map, FDEB first performs local spatial screening through gated transformation and then conducts block-wise frequency-domain modulation on the filtered features. In this way, local discriminative enhancement can be achieved while keeping the computational cost under control.

Given an input feature tensor

X \in R^{H \times W \times C}

, FDEB first maps it into a higher-dimensional hidden space through a

1 \times 1

convolution:

X_{1} = {Conv}_{1 \times 1} (X),

(13)

where

X_{1}

denotes the projected hidden feature.

Then, a depth-wise convolution is applied to capture local spatial contextual information, and the resulting features are evenly split into two branches along the channel dimension:

X_{2}, X_{3} = Split (DConv (X_{1})),

(14)

where

DConv (\cdot)

denotes the depth-wise convolution operation, and

Split (\cdot)

represents channel-wise splitting.

X_{2}

and

X_{3}

denote two feature branches generated from

X_{1}

with the same spatial resolution. The former is activated by GELU to generate a nonlinear gating signal, while the latter provides local spatial features for modulation.

A gated activation mechanism is then introduced to perform local feature selection:

X_{4} = GELU (X_{2}) ⊙ X_{3},

(15)

where

GELU (\cdot)

is the Gaussian Error Linear Unit and ⊙ represents element-wise multiplication.

After local gating, a

1 \times 1

convolution is used to project the feature back to the original channel dimension, yielding the spatial feature for subsequent frequency enhancement:

X_{s} = {Conv}_{1 \times 1} (X_{4}),

(16)

where

X_{s}

denotes the spatially gated feature after channel projection.

To avoid the computational burden caused by directly applying frequency transformation to the entire feature map, FDEB performs block-wise frequency modeling. Specifically, the spatial feature

X_{s}

is first partitioned into a set of local patches:

X_{p} = P (X_{s}),

(17)

where

P (\cdot)

denotes the patch unfolding operation, and

X_{p}

represents the unfolded local patch features.

Then, a two-dimensional Fast Fourier Transform (2D-FFT) is applied to each local patch:

X_{f} = F_{2 D} (X_{p}),

(18)

where

F_{2 D} (\cdot)

denotes the two-dimensional Fourier transform, and

X_{f}

represents the complex-valued frequency-domain representation of local patch features.

To explicitly model the importance of different frequency components, a learnable frequency-weight matrix

W_{f}

is introduced to adaptively reweight the spectral responses:

{\hat{X}}_{f} = W_{f} ⊙ X_{f} .

(19)

where

W_{f}

denotes the learnable frequency modulation matrix, and

{\hat{X}}_{f}

denotes the enhanced frequency-domain feature. This adaptive reweighting mechanism enables the network to automatically emphasize frequency components that are more relevant to edge, texture, and structural restoration, while suppressing less informative or redundant frequency responses.

Finally, the enhanced frequency-domain feature is transformed back to the spatial domain through the inverse Fourier transform, and the local patches are reassembled to recover the original feature layout:

\hat{X} = P^{- 1} (F_{2 D}^{- 1} ({\hat{X}}_{f})),

(20)

where

F_{2 D}^{- 1} (\cdot)

and

P^{- 1} (\cdot)

denote the inverse two-dimensional Fourier transform and patch folding operation, respectively.

\hat{X}

denotes the reconstructed spatial-domain feature after frequency-aware enhancement.

Through the above process, FDEB forms a compact three-stage enhancement pipeline, namely, spatial projection and gating, block-wise frequency modulation, and spatial reconstruction. This design preserves the computational efficiency of FFT-based operations while enabling explicit discriminative modeling of frequency information. Compared with directly introducing complex attention-based local enhancement modules, FDEB provides a more efficient and controllable solution for improving local texture reconstruction and structural detail restoration, making it particularly suitable for lightweight image dehazing frameworks.

3.3.4. Frequency-Aware State Interaction Block (FASI Block)

To further integrate the advantages of state propagation and frequency-aware enhancement, we propose a Frequency-Aware State Interaction Block (FASI Block). This module is designed to couple the FFT-Based State Space Block (FFTSSB) and the Frequency-Aware Discriminative Enhancement Block (FDEB) within a unified residual interaction framework, so as to achieve collaborative modeling of global structural dependencies and local discriminative details.

In UAV image dehazing, haze degradation not only reduces global contrast and structural consistency over large spatial regions, but also causes local blurring of edges and textures, leading to the loss of high-frequency details. As analyzed above, FFTSSB mainly focuses on long-range dependency modeling through bidirectional non-causal state propagation and is effective in enhancing global structural consistency and cross-position information integration. In contrast, FDEB emphasizes local discriminative enhancement in the frequency domain, adaptively modulating key frequency responses associated with edge, texture, and structural restoration. Since these two modules operate at different modeling levels and focus on different feature properties, directly using either one alone is insufficient for fully exploiting their complementary strengths.

Specifically, if only global state propagation is employed, the fine-grained detail enhancement introduced by local frequency modeling may be weakened during feature integration. Conversely, if only local frequency enhancement is considered without global structural constraints, the coordination of spatial information may become inadequate, which can affect structural continuity and restoration stability. Moreover, in a multi-scale encoder–decoder architecture, different feature resolutions correspond to different restoration objectives. Therefore, a unified interaction mechanism is required to facilitate effective collaboration between global state modeling and local frequency-aware enhancement across different scales.

Given an input feature representation

X \in R^{H \times W \times C}

, FASI first feeds the normalized feature into FFTSSB to perform bidirectional non-causal state propagation, and then fuses the resulting feature with the input through a residual connection:

X_{s} = X + FFTSSB (Norm (X)),

(21)

where

Norm (\cdot)

denotes the normalization operation, and

X_{s}

denotes the state-enhanced feature. This branch mainly models long-range spatial dependencies and imposes global structural consistency constraints on the feature representation.

Based on the state-enhanced feature

X_{s}

, FASI further introduces FDEB to perform frequency-aware discriminative enhancement. Specifically, the normalized feature is fed into FDEB, and the enhanced result is added back through another residual connection:

X_{out} = X_{s} + FDEB (Norm (X_{s})) .

(22)

where

X_{out}

denotes the output feature of the FASI block. Here, FDEB explicitly emphasizes frequency components closely related to image dehazing, such as edges, textures, and contrast-sensitive structures, thereby complementing the global state propagation branch with stronger local detail restoration capability.

Through this serial residual interaction design, FASI forms a progressive feature refinement process of “state propagation enhancement–frequency discriminative enhancement”. Such a design preserves the global consistency modeling capability of state propagation while further improving local structural recovery through frequency-aware enhancement. At the same time, the two residual connections help alleviate optimization difficulty in deep networks, promote stable information transmission during repeated interaction, and preserve the original structural information of the input feature.

Structurally, FASI first performs feature normalization, then conducts global state propagation modeling and local frequency-aware enhancement through the FFTSSB and FDEB branches, respectively, and finally achieves unified feature fusion under residual constraints. In this way, the block is able to maintain the original structural basis of the input while progressively realizing the collaborative optimization of global consistency and local detail enhancement.

Furthermore, FASI is hierarchically embedded into multiple stages of the encoder–decoder architecture. As the feature resolution changes, FASI undertakes different but complementary modeling roles at different scales. In low-resolution encoder stages, FASI mainly focuses on global structure modeling and cross-region dependency enhancement; in high-resolution decoder stages, it places more emphasis on local detail recovery and edge continuity improvement; while in intermediate stages, it jointly balances global structure consistency and local texture optimization. Through this multi-scale embedding strategy, FASI continuously promotes the interaction between state propagation and frequency-aware enhancement throughout the network, enabling tighter collaboration among features at different resolutions and ultimately improving dehazing quality in complex UAV haze scenes.

Different from conventional frequency-domain modules that mainly perform spectral filtering or frequency reweighting, FASI introduces state space propagation into the spatial–frequency interaction process. FFTSSB captures long-range structural dependencies through FFT-based two-sided non-causal convolution, while FDEB restores haze-sensitive local textures and edges through frequency-aware enhancement. Compared with vanilla Mamba or SSM blocks that usually rely on directional scanning, FASI avoids one-directional information propagation and better matches the non-causal nature of image restoration. Therefore, FASI jointly improves global contextual consistency and local detail recovery, distinguishing it from CNN-based local modules, Transformer-based attention modules, conventional frequency-domain modules, and vanilla SSM/Mamba blocks.

3.4. Loss Function

The network is optimized by minimizing the following objective function:

L = ‖ I_{dehaze} - I_{gt} ‖_{1} + λ {∥F (I_{dehaze}) - F (I_{gt})∥}_{1},

(23)

where

I_{dehaze}

and

I_{gt}

denote the restored dehazed image and the ground-truth clear image, respectively.

F (\cdot)

denotes the discrete Fourier transform (DFT),

{‖ \cdot ‖}_{1}

denotes the

L_{1}

norm, and

λ

is a balancing coefficient empirically set to

0.1

. The first term enforces pixel-wise spatial fidelity, while the second term encourages frequency-domain consistency, compensating for the limitations of purely spatial loss in preserving high-frequency structures and textures.

4. Experiments

4.1. Experimental Settings

4.1.1. Dataset

To systematically evaluate the dehazing performance and generalization capability of the proposed FSSM (Frequency-Enhanced State Space Model) under complex haze conditions from UAV viewpoints, experiments are conducted on the HazyDet dataset [47]. HazyDet is a challenging UAV-oriented haze dataset that contains both synthetic and real hazy images, covering diverse haze densities, illumination conditions, and complex urban backgrounds. The dataset provides annotations for approximately 383,000 object instances, exhibiting high scene diversity and considerable difficulty.

In this work, paired hazy images and their corresponding ground-truth clear images are utilized for supervised dehazing training and evaluation. The dataset is split into training and testing sets with a ratio of 8:2. Moreover, the training and testing subsets are constructed to be scene-disjoint, ensuring an objective evaluation of model generalization performance.

It should be noted that this study mainly focuses on UAV-view image dehazing. Therefore, HazyDet is selected as the primary benchmark because it contains UAV-oriented hazy scenes with diverse haze densities, illumination conditions, and complex urban backgrounds, which are consistent with the application scenario considered in this work. Although additional commonly used dehazing benchmarks, such as RESIDE/SOTS and NH-HAZE, are valuable for broader cross-dataset evaluation, they differ from HazyDet in scene viewpoint, data collection conditions, and haze degradation characteristics. In this study, we focus on validating the effectiveness of FSSM under UAV-view haze degradation. More extensive multi-dataset evaluation will be considered in future work to further examine the cross-domain generalization capability and robustness of the proposed framework.

4.1.2. Comparative Methods

To comprehensively evaluate the effectiveness of the proposed FSSM framework, a set of representative image-dehazing methods is selected for comparison, including traditional prior-based approaches and deep learning-based methods with different architectural characteristics. Specifically, the compared methods include DCP [4], AOD-Net [24], GridDehaze-Net [8], FFA-Net [25], MixDehaze-Net [26], OK-Net [27], and DWTMA-Net [28].

These methods cover a wide range of dehazing strategies, from physics-driven priors and early CNN-based dehazing networks to recent deep learning methods with multi-scale, attention-based, and frequency-aware designs. In particular, MixDehaze-Net, OK-Net, and DWTMA-Net are included as recent comparison methods to provide an up-to-date benchmark. Meanwhile, representative Transformer-based and SSM-based restoration methods, such as DehazeFormer, Restormer, UVM-Net, and Mamba-style image restoration networks, are discussed and analyzed in Section 2 to clarify the methodological differences between the proposed FSSM and recent long-range dependency modeling methods. All comparative methods included in the experiments are trained and evaluated under the same dataset split and hardware environment to ensure fair and consistent comparisons.

4.1.3. Implementation Details

During data preprocessing, paired images from the HazyDet dataset are used for training and validation. Input images are cropped into patches of size

128 \times 128

, followed by random horizontal flipping and rotation for data augmentation. All input images are normalized to the range

[0, 1]

before being fed into the network.

The proposed FSSM adopts a three-level encoder–decoder architecture, with channel dimensions set to 48, 96, and 192, respectively. At the three corresponding scales, [6, 6, 12] Frequency-Aware State Interaction (FASI) blocks are stacked, and the feed-forward expansion ratio is set to 3. The network is optimized using the AdamW optimizer with momentum parameters

β_{1} = 0.9

and

β_{2} = 0.9

. A cosine annealing learning rate scheduler is employed, with the initial learning rate set to

2 \times 10^{- 4}

and gradually decayed to

1 \times 10^{- 6}

over 300,000 iterations. No warm-up strategy is applied during training.

4.1.4. Evaluation Metrics

To quantitatively evaluate dehazing performance, three widely used evaluation metrics are adopted:

PSNR (Peak Signal-to-Noise Ratio): measures the pixel-level reconstruction accuracy between the dehazed image and the ground-truth image;
SSIM (Structural Similarity Index): evaluates the consistency of structural information;
NIQE (Natural Image Quality Evaluator): reflects the perceptual naturalness of images, where lower values indicate better visual quality.

In addition to PSNR, SSIM, and NIQE, model complexity is also evaluated to assess the practical efficiency of different dehazing methods. Specifically, FLOPs and parameter count are reported to measure computational cost and model size, respectively. Since the proposed FSSM is motivated by efficient long-range dependency modeling, these complexity metrics provide important evidence for evaluating whether the proposed FFT-based State-Space Modeling strategy can balance dehazing performance and computational efficiency.

4.2. Quantitative Evaluations

The quantitative comparison results of different dehazing methods on the HazyDet dataset are summarized in Table 1.

PSNR, SSIM, and NIQE are employed to comprehensively evaluate dehazing performance from the perspectives of pixel-level reconstruction accuracy, structural consistency, and perceptual image quality, respectively.

As shown in Table 1, the proposed FSSM achieves the highest PSNR (29.97 dB) and SSIM (0.9396) among all compared methods, while also obtaining the lowest NIQE score (11.20). These results indicate that FSSM exhibits strong overall advantages in detail restoration, structural preservation, and perceptual naturalness. Compared with traditional prior-based methods such as DCP and early deep learning approaches like AOD-Net, FSSM achieves substantial improvements across all three metrics, demonstrating its effectiveness in addressing complex haze degradation in UAV-view scenarios.

When compared with CNN-based dehazing methods such as GridDehaze-Net and FFA-Net, FSSM consistently yields higher PSNR and SSIM values and achieves superior NIQE performance, suggesting that it is more capable of preserving natural image appearance while removing haze. Furthermore, compared with recently proposed methods including MixDehaze-Net, OK-Net, and DWTMA-Net, FSSM maintains a clear advantage in overall quantitative performance, highlighting its robustness and stability in challenging UAV haze scenes.

Overall, these quantitative results demonstrate that by incorporating frequency-enhanced State-Space Modeling, FSSM effectively improves the modeling of long-range dependencies and frequency-correlated features, leading to superior dehazing performance on the HazyDet dataset.

4.3. Computational Complexity Analysis

To further evaluate the computational efficiency of the proposed method, we compare the FLOPs and parameter count of different dehazing methods, as shown in Table 2. FLOPs reflect the computational cost of the model during inference, while the number of parameters indicates the model size and storage requirement. These two metrics provide a quantitative basis for evaluating the practical deployment potential of different dehazing models.

As shown in Table 2, different dehazing methods exhibit obvious differences in computational complexity. Traditional prior-based methods such as DCP do not involve learnable parameters, while deep learning-based methods introduce different levels of computational costs. The proposed FSSM contains 16.74 M parameters and requires 246.66 G FLOPs. Although its parameter count is larger than that of the compared CNN-based methods, its FLOPs are lower than those of FFA-Net and comparable to those of several recent high-performance dehazing networks.

Combining Table 1 and Table 2, FSSM achieves the best PSNR, SSIM, and NIQE results among all compared methods, while maintaining a moderate computational cost compared with high-complexity dehazing networks such as FFA-Net. This indicates that the performance improvement of FSSM is not only achieved by increasing model capacity, but also benefits from the proposed frequency-enhanced State-Space Modeling strategy. In particular, the FFT-based state space block improves long-range dependency modeling without relying on dense self-attention over high-resolution feature maps.

The efficiency of FSSM mainly benefits from the FFT-based implementation of state-space convolution. Instead of using explicit recursive state updates or dense self-attention, FFTSSB performs global feature interaction in the frequency domain. This design enables the model to capture long-range dependencies with an efficient convolutional form. Therefore, the proposed method provides a practical trade-off between restoration quality and computational complexity for UAV-view image dehazing tasks that require global structural recovery and detail preservation.

4.4. Qualitative Evaluations

To further provide an intuitive assessment of the dehazing performance of different methods under real UAV haze scenarios, several representative test samples from the HazyDet dataset are selected for qualitative comparison. The visual comparison results are shown in Figure 2.

As illustrated in Figure 2, traditional prior-based methods such as DCP tend to suffer from noticeable over-enhancement and color distortion in certain regions. Some CNN-based approaches, including AOD-Net and GridDehaze-Net, are able to remove a large portion of haze; however, residual haze and blurred details can still be observed in distant regions.

In contrast, the proposed FSSM demonstrates more stable and consistent restoration performance, particularly in sky regions, road boundaries, and long-range structures. FSSM effectively removes haze while preserving natural color distributions and clear structural details. Notably, in UAV scenes with varying haze density and complex backgrounds, FSSM shows superior capability in suppressing residual haze, enhancing local contrast, and maintaining structural consistency. These qualitative observations are highly consistent with the quantitative evaluation results, further validating the effectiveness of the proposed frequency-enhanced State Space Modeling framework.

The visual comparison further focuses on representative UAV-view regions, including distant structures, sky areas, road boundaries, and fine texture regions. Compared with the competing methods, FSSM better suppresses residual haze in distant regions and maintains a more natural color appearance in large sky areas. Meanwhile, it preserves clearer road boundaries and local structural details, reducing over-smoothed artifacts commonly observed in some CNN-based dehazing results. These visual improvements indicate that the proposed FDEB contributes to local detail preservation, while FFTSSB enhances long-range structural consistency. Overall, the qualitative results further support the quantitative improvements reported in Table 1.

4.5. Ablation Study

To further evaluate the effectiveness and necessity of the key components in the proposed framework, ablation studies are conducted on the two core modules of FSSM, namely the FFT-Based State Space Block (FFTSSB) and the Frequency-Aware Discriminative Enhancement Block (FDEB). These two modules directly correspond to the main design motivations of the proposed method: global dependency modeling through FFT-based state space propagation and local detail restoration through frequency-aware discriminative enhancement. All ablation experiments are carried out under the same training strategy, dataset split, and testing settings. Only the corresponding modules are replaced or removed to ensure fair comparison. PSNR and SSIM are adopted as the evaluation metrics, and the test set contains 2000 images.

It should be noted that the Frequency-Aware State Interaction (FASI) block is constructed by integrating FFTSSB and FDEB within a unified residual interaction framework. Therefore, the current ablation design mainly verifies the effectiveness of the two key functional branches inside FASI. The FFTSSB ablation evaluates the contribution of FFT-based State Space Modeling to global structural consistency, while the FDEB ablation evaluates the contribution of frequency-aware local enhancement to texture and edge restoration. The present study focuses on isolating the newly designed modules, namely FFTSSB and FDEB, because they directly correspond to the main technical contributions of the proposed FSSM. This ablation design provides direct evidence for verifying the effectiveness of the proposed core modules.

4.5.1. Ablation Study on FFTSSB

The FFTSSB serves as a core component of the proposed model for global dependency modeling and frequency-domain state propagation. To evaluate its effectiveness, FFTSSB is replaced with two commonly used attention mechanisms, namely standard self-attention and linear attention, while keeping all other components unchanged. The quantitative comparison results are reported in Table 3.

As reported in Table 3, the proposed method achieves the best performance, with PSNR and SSIM reaching 29.9695 dB and 0.9396, respectively. Replacing FFTSSB with standard self-attention results in a slight PSNR decrease of 0.0445 dB, while SSIM drops more noticeably to 0.9236. These results indicate that although self-attention can achieve comparable haze removal in terms of pixel-level error, it is less effective in preserving structural consistency and fine-grained details.

Furthermore, replacing FFTSSB with linear attention results in a more pronounced performance degradation, with PSNR and SSIM decreasing to 29.0133 dB and 0.9295, respectively. This suggests that while linear attention is computationally efficient, its ability to model global dependencies and frequency-related structures is limited in the image-dehazing task. Overall, these results demonstrate that FFTSSB is more suitable for image dehazing, as it effectively balances global modeling capability and structural consistency.

4.5.2. Ablation Study on FDEB

The Frequency-Aware Discriminative Enhancement Block (FDEB) integrates spatial gated convolution with FFT-based frequency modeling to achieve joint spatial–frequency feature enhancement. To quantitatively evaluate the contribution of each component in FDEB, two ablation experiments are conducted, and the corresponding results are summarized in Table 4.

As shown in Table 4, when the FFT frequency branch in FDEB is removed while retaining the spatial gated feed-forward structure, the PSNR and SSIM decrease to 29.7928 dB and 0.9389, respectively. This result indicates that although the spatial gated FFN is capable of modeling local structures to a certain extent, the absence of frequency-domain modeling leads to performance degradation in fine-detail restoration and structural consistency, highlighting the importance of the FFT branch for selective frequency recovery.

Furthermore, when FDEB is completely replaced by a standard feed-forward network (FFN/MLP), the performance further degrades, with PSNR and SSIM dropping to 29.2409 dB and 0.9365, respectively. This observation demonstrates that the performance gains of FDEB are not solely attributed to the frequency-domain branch, but also benefit from the joint design of spatial gating and frequency-aware enhancement. Compared with conventional FFN structures, FDEB more effectively captures complex spatial–frequency characteristics under haze degradation, thereby improving local texture recovery and structural fidelity.

4.5.3. Qualitative Analysis of Ablation Results

In addition to quantitative evaluations, qualitative comparisons under different ablation configurations are further conducted, as illustrated in Figure 3 and Figure 4.

As shown in Figure 3, in the ablation study of the FFT-based State Space Block (FFTSSB), the complete model effectively suppresses residual haze in sky regions, long-range structures, and object boundaries, while preserving clear and stable structural details. In contrast, when FFTSSB is replaced by Self-Attention or Linear Attention, the dehazing results exhibit varying degrees of residual haze, structural blurring, and insufficient contrast, particularly in distant regions. These observations indicate the limitations of conventional attention mechanisms in global dependency modeling and structural consistency under complex UAV haze conditions.

As shown in Figure 4, in the ablation study of the Frequency-Aware Discriminative Enhancement Block (FDEB), removing the frequency-domain branch or replacing FDEB with a standard feed-forward network (FFN) leads to noticeable degradation in detail restoration, characterized by blurred textures and weakened edge structures. By comparison, the complete FDEB achieves more stable enhancement of local contrast and finer structural details, especially in complex backgrounds and low-contrast regions.

Overall, these qualitative observations are highly consistent with the quantitative evaluation trends, further validating the critical role of FFTSSB in global dependency modeling and the effectiveness of FDEB in fine-grained detail enhancement and structural restoration.

5. Conclusions

In this paper, we addressed the challenges of insufficient global dependency modeling and limited computational efficiency in image dehazing under complex UAV-view haze scenarios. To this end, we proposed an efficient frequency-enhanced State Space Modeling framework, termed FSSM (Frequency-Enhanced State Space Model). By adopting State Space Modeling as the overall architectural backbone and introducing FFT-based two-sided non-causal convolution, FSSM enables efficient global dependency modeling while reducing the computational burden caused by explicit recursive scanning and dense self-attention, thereby achieving a favorable balance between dehazing performance and inference efficiency.

From a methodological perspective, we developed a novel FFT-based State Space Block (FFTSSB), which replaces conventional recursive scanning mechanisms with implicit state propagation in the frequency domain, effectively alleviating the limitations of unidirectional modeling and computational redundancy. Furthermore, we introduced a Frequency-Aware State Interaction module (FASI), which integrates FFTSSB with the Frequency-Aware Discriminative Enhancement Block (FDEB). This design facilitates collaborative optimization of global structural consistency and local detail restoration in both spatial and frequency domains. Based on these core components, a multi-scale hierarchical encoder–decoder architecture was constructed to enhance the modeling capability for complex haze degradation patterns.

Experimental results on the HazyDet dataset demonstrate that the proposed FSSM achieves favorable performance in terms of PSNR, SSIM, and NIQE, while also producing more stable and visually pleasing results in real UAV haze scenes. Ablation studies further verify the important roles of FFTSSB and FDEB in improving global structural consistency and local detail restoration, confirming the effectiveness of the proposed framework for challenging image-dehazing tasks.

Despite the promising performance of FSSM, there remains room for further improvement. First, the current framework is primarily designed for single-image dehazing and does not explicitly incorporate temporal consistency constraints. Second, although HazyDet provides diverse UAV-view hazy scenes, more extensive evaluations on commonly used dehazing benchmarks, such as RESIDE/SOTS and NH-HAZE, would further strengthen the analysis of cross-domain generalization. Third, although the proposed method has been evaluated in terms of PSNR, SSIM, NIQE, FLOPs, and parameter count, additional deployment-oriented indicators, such as inference latency, memory consumption, and FPS on edge devices, can provide more comprehensive evidence for practical deployment. Future work will explore extending FSSM to video dehazing and multi-weather image restoration tasks, as well as optimizing the inference pipeline for practical edge-device deployment.

Overall, the proposed FSSM provides an efficient and interpretable solution for applying State Space Models to image dehazing, offering useful insights and a practical foundation for UAV visual perception in complex real-world environments.

Author Contributions

Conceptualization, L.Z. and Y.H.; methodology, L.Z.; software, L.Z.; validation, L.Z. and Y.H.; formal analysis, L.Z.; investigation, L.Z.; resources, Y.H.; data curation, L.Z.; writing—original draft preparation, L.Z.; writing—review and editing, L.Z. and Y.H.; visualization, L.Z.; supervision, Y.H.; project administration, Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Chongqing Natural Science Foundation Joint Fund for Innovation and Development, grant number CSTB2023NSCQ-LZX0127, and by the Research and Innovation Program for Graduate Students in Chongqing Jiaotong University, grant number 2025S0057.

Institutional Review Board Statement

This study did not require ethical approval as it utilized only publicly available datasets without human or animal involvement.

Informed Consent Statement

Not applicable.

Data Availability Statement

The HazyDet dataset used in this study is publicly available at https://github.com/GrokCV/HazyDet, accessed on 19 April 2026. All experimental results and code are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the study design, data analysis, or manuscript preparation.

References

Qiu, Z.; Gong, T.; Liang, Z.; Chen, T.; Cong, R.; Bai, H.; Zhao, Y. Perception-Oriented UAV Image Dehazing Based on Superpixel Scene Prior. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5913519. [Google Scholar]
Liu, Y.; Wang, X.; Hu, E.; Wang, A.; Shiri, B.; Lin, W. VNDHR: Variational Single Nighttime Image Dehazing for Enhancing Visibility in Intelligent Transportation Systems via Hybrid Regularization. IEEE Trans. Intell. Transp. Syst. 2025, 26, 10189–10203. [Google Scholar] [CrossRef]
Shao, Y.; Li, L.; Ren, W.; Gao, C.; Sang, N. Domain Adaptation for Image Dehazing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 2808–2817. [Google Scholar]
He, K.; Sun, J.; Tang, X. Single Image Haze Removal Using Dark Channel Prior. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 2341–2353. [Google Scholar]
Cui, Y.; Ren, W.; Cao, X.; Knoll, A. Focal Network for Image Restoration. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 13001–13011. [Google Scholar]
Cui, Y.; Tao, Y.; Jing, L.; Knoll, A. Strip Attention for Image Restoration. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI 2023), Cape Town, South Africa, 19–25 August 2023; pp. 645–653. [Google Scholar]
Tong, L.; Liu, Y.; Ye, T.; Li, W.; Chen, L.; Chen, E. Parallel Cross Strip Attention Network for Single Image Dehazing. In Proceedings of the 2024 5th International Conference on Computing, Networks and Internet of Things (CNIOT 2024), Tokyo, Japan, 23–25 May 2024; pp. 148–154. [Google Scholar]
Liu, X.; Ma, Y.; Shi, Z.; Chen, J. GridDehazeNet: Attention-Based Multi-Scale Network for Image Dehazing. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7314–7323. [Google Scholar]
Dong, H.; Pan, J.; Xiang, L.; Hu, Z.; Yang, M.-H. Multi-Scale Boosted Dehazing Network with Dense Feature Fusion. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
Guo, C.-L.; Yan, Q.; Anwar, S.; Cong, R.; Ren, W.; Li, C. Image Dehazing Transformer with Transmission-Aware 3D Position Embedding. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 5812–5820. [Google Scholar]
Song, Y.; He, Z.; Qian, H.; Du, X. Vision Transformers for Single Image Dehazing. IEEE Trans. Image Process. 2023, 32, 1927–1941. [Google Scholar] [CrossRef] [PubMed]
Pushpalatha, D.; Prithvi, P. LDFormer: Lightweight Dehazing Transformer. Neural Comput. Appl. 2025, 37, 20049–20068. [Google Scholar] [CrossRef]
Zhang, Z.; Feng, Z.; Long, A.; Wang, Z. LWTD: A Novel Lightweight Transformer-like CNN Architecture for Driving Scene Dehazing. Int. J. Mach. Learn. Cybern. 2025, 16, 1303–1326. [Google Scholar] [CrossRef]
Sun, J.; Liu, H.; Wang, Y.; Zhang, X.P.; Wei, M. WDMamba: When Wavelet Degradation Prior Meets Vision Mamba for Image Dehazing. arXiv 2025, arXiv:2505.04369. [Google Scholar] [CrossRef]
Guo, H.; Guo, Y.; Zha, Y.; Zhang, Y.; Li, W.; Dai, T.; Xia, S.-T.; Li, Y. MambaIRv2: Attentive State Space Restoration. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 28124–28133. [Google Scholar]
Somvanshi, S.; Islam, M.M.; Mimi, M.S.; Polock, S.B.B.; Chhetri, G.; Das, S. From S4 to Mamba: A Comprehensive Survey on Structured State Space Models. arXiv 2025, arXiv:2503.18970. [Google Scholar]
Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. In Proceedings of the 2024 Conference on Language Modeling (COLM), Philadelphia, PA, USA, 7–9 October 2024. [Google Scholar]
Guo, H.; Li, J.; Dai, T.; Ouyang, Z.; Ren, X.; Xia, S.T. MambaIR: A Simple Baseline for Image Restoration with State Space Model. In Proceedings of the 2024 European Conference on Computer Vision (ECCV 2024), Milan, Italy, 29 September–4 October 2024; pp. 222–241. [Google Scholar]
Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. VMamba: Visual State Space Model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063. [Google Scholar]
Shi, Y.; Xia, B.; Jin, X.; Wang, X.; Zhao, T.; Xia, X.; Xiao, X.; Yang, W. VMambaIR: Visual State Space Model for Image Restoration. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 5560–5574. [Google Scholar] [CrossRef]
Zhu, Q.; Mai, J.; Shao, L. A Fast Single Image Haze Removal Algorithm Using Color Attenuation Prior. IEEE Trans. Image Process. 2015, 24, 3522–3533. [Google Scholar] [CrossRef]
Berman, D.; Avidan, S. Non-Local Image Dehazing. In Proceedings of the 2016 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1674–1682. [Google Scholar]
Su, Y.Z.; He, C.; Cui, Z.G.; Li, A.H.; Wang, N. Physical Model and Image Translation Fused Network for Single-Image Dehazing. Pattern Recognit. 2023, 142, 109700. [Google Scholar] [CrossRef]
Li, B.; Peng, X.; Wang, Z.; Xu, J.; Feng, D. AOD-Net: All-in-One Dehazing Network. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 22–29 October 2017; pp. 4770–4778. [Google Scholar]
Qin, X.; Wang, Z.; Bai, Y.; Xie, X.; Jia, H. FFA-Net: Feature Fusion Attention Network for Single Image Dehazing. In Proceedings of the 2020 AAAI Conference on Artificial Intelligence (AAAI-20), New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11908–11915. [Google Scholar]
Lu, L.; Xiong, Q.; Xu, B.; Chu, D. MixDehazeNet: Mix Structure Block for Image Dehazing Network. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN 2024), Yokohama, Japan, 30 June–5 July 2024; pp. 1–10. [Google Scholar]
Cui, Y.; Ren, W.; Knoll, A. Learning Efficient Image Restoration with Advanced Attention Mechanisms. In Proceedings of the 2024 AAAI Conference on Artificial Intelligence (AAAI-24), Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 1426–1434. [Google Scholar]
Guan, X.; He, R.; Wang, L.; Zhou, H.; Liu, Y.; Xiong, H. DWTMA-Net: Discrete Wavelet Transform and Multi-Dimensional Attention Network for Remote Sensing Image Dehazing. Remote Sens. 2025, 17, 2033. [Google Scholar] [CrossRef]
Qiu, Y.; Zhang, K.; Wang, C.; Luo, W.; Li, H.; Jin, Z. MB-TaylorFormer: Multi-Branch Efficient Transformer Expanded by Taylor Formula for Image Dehazing. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 2–6 October 2023; pp. 12802–12813. [Google Scholar]
Feng, X.; Ji, H.; Pei, W.; Li, J.; Lu, G.; Zhang, D. U²-Former: Nested U-Shaped Transformer for Image Restoration via Multi-View Contrastive Learning. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 168–181. [Google Scholar] [CrossRef]
Wang, C.; Pan, J.; Lin, W.; Dong, J.; Wang, W.; Wu, X.M. SelfPromer: Self-Prompt Dehazing Transformers with Depth Consistency. In Proceedings of the 2024 AAAI Conference on Artificial Intelligence (AAAI-24), Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 5327–5335. [Google Scholar]
Chi, K.; Yuan, Y.; Wang, Q. Trinity-Net: Gradient-Guided Swin Transformer-Based Remote Sensing Image Dehazing and Beyond. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–14. [Google Scholar] [CrossRef]
Liu, H.; Li, X.; Tan, T. Interaction-Guided Two-Branch Image Dehazing Network. In Proceedings of the 2024 Asian Conference on Computer Vision (ACCV 2024), Hanoi, Vietnam, 8–12 December 2024; pp. 4069–4084. [Google Scholar]
Wang, Y.; Yan, X.; Wang, F.L.; Xie, H.; Yang, W.; Zhang, X.P.; Wei, M. UCL-Dehaze: Toward Real-World Image Dehazing via Unsupervised Contrastive Learning. IEEE Trans. Image Process. 2024, 33, 1361–1374. [Google Scholar] [CrossRef]
Wang, X.; Yang, G.; Ye, T.; Liu, Y. Dehaze-RetinexGAN: Real-World Image Dehazing via Retinex-Based Generative Adversarial Network. In Proceedings of the 2025 AAAI Conference on Artificial Intelligence (AAAI-25), Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 7997–8005. [Google Scholar]
Huang, C.; Li, S.; Chen, X.; Liu, J.; Li, D. Haze Has Many Faces: Multi-Domain Haze Style Transfer for Diverse Haze Removal. Pattern Recognit. 2026, 175, 113094. [Google Scholar] [CrossRef]
Wang, Y.; Qu, Q.; He, K.; Liu, T.; Du, X.; Lei, T.; Nandi, A.K. PRDiff-Dehaze: Toward Non-Homogeneous Haze Image Restoration via Progressive Refinement Diffusion. Pattern Recognit. 2026, 179, 113688. [Google Scholar] [CrossRef]
Murtaza, A.; Khairuddin, U.; Mohd Fudzi, A.A.; Hamamoto, K.; Fang, Y.; Omar, Z. WaveLiteDehaze-Net: A Low-Parameter Wavelet-Based Method for Real-Time Dehazing. CAAI Trans. Intell. Technol. 2025, 10, 1033–1048. [Google Scholar] [CrossRef]
Zhang, Q.; Shao, M.; Chen, X.; Lv, X.; Xu, K. Wave-MambaAD: Wavelet-Driven State Space Model for Multi-Class Unsupervised Anomaly Detection. In Proceedings of the 2025 IEEE/CVF International Conference on Computer Vision (ICCV 2025), Honolulu, HI, USA, 19–25 October 2025; pp. 20868–20877. [Google Scholar]
Zheng, Z.; Wu, C. U-Shaped Vision Mamba for Single Image Dehazing. arXiv 2024, arXiv:2402.04139. [Google Scholar] [CrossRef]
Wang, Y.; Chen, L.; Hu, B.; Liu, H.; Zhang, X.P.; Wei, M. Laplace-Mamba: Laplace Frequency Prior-Guided Mamba-CNN Fusion Network for Image Dehazing. arXiv 2025, arXiv:2507.00501. [Google Scholar]
Su, X.; Li, S.; Cui, Y.; Cao, M.; Zhang, Y.; Chen, Z.; Wu, Z.; Wang, Z.; Zhang, Y.; Yuan, X. Prior-Guided Hierarchical Harmonization Network for Efficient Image Dehazing. In Proceedings of the 2025 AAAI Conference on Artificial Intelligence (AAAI-25), Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 7042–7050. [Google Scholar]
Gu, A.; Goel, K.; Ré, C. Efficiently Modeling Long Sequences with Structured State Spaces. arXiv 2021, arXiv:2111.00396. [Google Scholar]
Poli, M.; Massaroli, S.; Nguyen, E.; Fu, D.Y.; Dao, T.; Baccus, S.; Bengio, Y.; Ermon, S.; Ré, C. Hyena Hierarchy: Towards Larger Convolutional Language Models. In Proceedings of the 2023 International Conference on Machine Learning (ICML 2023), Honolulu, HI, USA, 23–29 July 2023; pp. 28043–28078. [Google Scholar]
Kong, L.; Dong, J.; Tang, J.; Yang, M.H.; Pan, J. Efficient Visual State Space Model for Image Deblurring. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA, 11–15 June 2025; pp. 12710–12719. [Google Scholar]
Kong, L.; Dong, J.; Tang, J.; Yang, M.H.; Pan, J. Efficient Frequency Domain-Based Transformers for High-Quality Image Deblurring. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
Feng, C.; Chen, Z.; Li, X.; Wang, C.; Yang, J.; Cheng, M.M.; Dai, Y.; Fu, Q. HazyDet: Open-Source Benchmark for Drone-View Object Detection with Depth Cues in Hazy Scenes. arXiv 2024, arXiv:2409.19833. [Google Scholar] [CrossRef]

Figure 1. Overall architecture of the proposed FSSM framework. (a) The overall FSSM architecture; (b) FFT-Based State Space Block (FFTSSB); (c) Frequency-Aware State Interaction (FASI) block; (d) Frequency-Aware Discriminative Enhancement Block (FDEB).

Figure 2. Qualitative comparison of different dehazing methods on the HazyDet dataset.

Figure 3. Qualitative ablation results of the FFT-based State-Space Block (FFTSSB).

Figure 4. Qualitative ablation results of the Frequency-Aware Discriminative Enhancement Block (FDEB).

Table 1. Quantitative comparison on the HazyDet dataset, where the best results are highlighted in bold.

Methods	Publication	PSNR (dB)	SSIM	NIQE
DCP	TPAMI 2011	17.03	0.8024	12.30
AOD-Net	ICCV 2017	18.99	0.7808	12.27
GridDehaze-Net	ICCV 2019	26.66	0.8801	11.33
FFA-Net	AAAI 2020	27.12	0.8782	11.31
MixDehaze-Net	IJCNN 2024	28.75	0.9068	11.30
OK-Net	AAAI 2024	27.76	0.8875	11.36
DWTMA-Net	Sensors 2025	29.02	0.9108	11.23
Ours	–	29.97	0.9396	11.20

Table 2. Comparison of computational complexity among different dehazing methods.

Method	FLOPs	Parameters
DCP	–	–
AOD-Net	457.70 M	1.76 K
GridDehaze-Net	85.72 G	955.75 K
FFA-Net	624.20 G	4.68 M
MixDehaze-Net	114.30 G	3.17 M
OK-Net	158.20 G	4.43 M
DWTMA-Net	188.72 G	8.34 M
Ours	246.66 G	16.74 M

Table 3. Comparison of ablation results for FFTSSB.

Method	Global Modeling Module	PSNR (dB)	SSIM
Linear Attention	Linear Attention	29.0133	0.9295
Multi-Head Attention	Self-Attention (MHSA)	29.9250	0.9236
Proposed Method	FFTSSB	29.9695	0.9396

Table 4. Comparison of ablation results for FDEB.

Method	FFT Branch	FFN Type	PSNR (dB)	SSIM
w/o FFT Branch	×	Spatial Gated FFN	29.7928	0.9389
Replace FDEB with Standard FFN	×	Standard FFN (MLP)	29.2409	0.9365
Proposed Method	✓	FDEB	29.9695	0.9396

×: Not included; ✓: Included.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zeng, L.; Huang, Y. FSSM: Frequency-Enhanced State Space Modeling with FFT-Based Two-Sided Non-Causal Convolution for Image Dehazing. J. Imaging 2026, 12, 260. https://doi.org/10.3390/jimaging12060260

AMA Style

Zeng L, Huang Y. FSSM: Frequency-Enhanced State Space Modeling with FFT-Based Two-Sided Non-Causal Convolution for Image Dehazing. Journal of Imaging. 2026; 12(6):260. https://doi.org/10.3390/jimaging12060260

Chicago/Turabian Style

Zeng, Li, and Yinqing Huang. 2026. "FSSM: Frequency-Enhanced State Space Modeling with FFT-Based Two-Sided Non-Causal Convolution for Image Dehazing" Journal of Imaging 12, no. 6: 260. https://doi.org/10.3390/jimaging12060260

APA Style

Zeng, L., & Huang, Y. (2026). FSSM: Frequency-Enhanced State Space Modeling with FFT-Based Two-Sided Non-Causal Convolution for Image Dehazing. Journal of Imaging, 12(6), 260. https://doi.org/10.3390/jimaging12060260

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Article metric data becomes available approximately 24 hours after publication online.

Article Menu

FSSM: Frequency-Enhanced State Space Modeling with FFT-Based Two-Sided Non-Causal Convolution for Image Dehazing

Abstract

1. Introduction

2. Related Work

2.1. Traditional Image-Dehazing Methods

2.2. Deep Learning-Based Dehazing Methods

2.3. State Space Models for Image Dehazing

3. Methodology

3.1. Notation

3.2. Overall Architecture

3.3. State Space Modeling and Its Extensions

3.3.1. State Space Model Foundation

3.3.2. FFT-Based State Space Block (FFTSSB)

3.3.3. Frequency-Aware Discriminative Enhancement Block (FDEB)

3.3.4. Frequency-Aware State Interaction Block (FASI Block)

3.4. Loss Function

4. Experiments

4.1. Experimental Settings

4.1.1. Dataset

4.1.2. Comparative Methods

4.1.3. Implementation Details

4.1.4. Evaluation Metrics

4.2. Quantitative Evaluations

4.3. Computational Complexity Analysis

4.4. Qualitative Evaluations

4.5. Ablation Study

4.5.1. Ablation Study on FFTSSB

4.5.2. Ablation Study on FDEB

4.5.3. Qualitative Analysis of Ablation Results

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI