Article

GMUD-Net: Global Modulated Unbalanced Dual-Branch Network for Image Restoration in Various Degraded Environments

1 College of Information Science and Engineering, Hunan Normal University, Changsha 410081, China
2 National Key Laboratory of Science and Technology on Near-Surface Detection, Wuxi 214035, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(6), 2854; https://doi.org/10.3390/app16062854
Submission received: 22 February 2026 / Revised: 11 March 2026 / Accepted: 13 March 2026 / Published: 16 March 2026
(This article belongs to the Special Issue AI-Driven Image and Signal Processing)

Abstract

Image restoration has wide applications in computer vision, yet existing methods have notable limitations. CNNs struggle to capture long-range dependencies, while transformers handle local details poorly and incur high computational complexity. Additionally, existing dual-branch networks fail to define a clear dominant–auxiliary role between branches, leading to redundancy and high computational costs. This paper proposes a Global Modulated Unbalanced Dual-Branch Network (GMUD-Net), which innovatively adopts an unbalanced structure with a CNN as the main branch and a transformer as the auxiliary branch. Specifically, the CNN branch achieves strong restoration capability by integrating the global–local hybrid backbone block (GLBB) and the frequency-based global attention module (FGAM). As the key building block in the CNN branch, GLBB integrates a local backbone branch, a global Fourier branch, and a residual branch to fuse local details with global context. Meanwhile, FGAM leverages the fast Fourier transform at the bottleneck to enhance cross-channel interaction and improve global restoration performance. In addition, the lightweight transformer branch employs efficient cross-channel attention to provide complementary global cues, which are filtered and injected into the CNN branch via the global attention guidance block (GAG). These designs integrate the advantages of both CNNs and transformers while significantly reducing computational burden, offering a new paradigm to address the limitations of traditional dual-branch architectures. Experimental results demonstrate that, compared with existing algorithms, the proposed method achieves state-of-the-art or highly competitive performance in both quantitative and qualitative evaluations across nine datasets.

1. Introduction

With the widespread application of computer vision technology in smart cities, autonomous driving, remote sensing monitoring, and other fields, image quality directly affects the accuracy and reliability of task execution. However, natural images are easily degraded by interference from complex environmental factors during acquisition. Under severe weather conditions, haze [1] reduces image contrast, while snow [2], rain [3], and underwater conditions [4] introduce artifacts and noise. Defects in optical equipment cause lens defocus blur [5], relative motion during shooting results in motion blur [6], and insufficient light leads to low-light imaging problems [7]. These degradations not only impair the visual quality of images but also seriously affect the accuracy of downstream visual tasks, such as object detection [8] and semantic segmentation [9]. To solve this problem, image restoration technology, which aims to restore degraded images to high-quality, clear images, has become a research hotspot in academia over the past few decades.
Current image restoration methods are mainly divided into traditional prior-based methods [1,10,11] and deep data-driven methods [2,12,13]. Traditional methods usually formulate image restoration as an optimization problem, relying on physical assumptions, such as atmospheric scattering models [14] and sparse priors [15], to constrain the solution space. Although they achieve certain results in simple scenarios, they tend to suffer from detail loss or artifact residues when textures and structures are complex. Data-driven methods represented by convolutional neural networks, with their end-to-end learning mode, have significantly improved restoration performance. In recent years, models based on the transformer architecture, with their strong global modeling capabilities, have refreshed the performance ceiling in multiple image restoration tasks and become the frontier direction of current research. The following describes these two data-driven models.
Traditional convolutional neural networks have long dominated image restoration [16,17,18]. With their local feature extraction and hierarchical feature learning capabilities, they perform well in tasks such as edge texture restoration. However, owing to the inherently local receptive field of convolutional kernels, CNNs have limited ability to capture long-range dependencies across an image. As a result, when addressing degradations that demand strong global consistency, such as dense haze or large-area snow accumulation, CNN-based methods often produce restored images with discontinuous textures or distorted structures. In contrast, transformer-based models break through the limitations of local modeling through self-attention mechanisms and show advantages in capturing global context. Recently, many transformer models for image restoration [13,19,20] have appeared, which approach or surpass CNN methods on objective metrics. However, the quadratic complexity of the transformer limits its application in real-time scenarios, and excessive reliance on global features can easily lead to blurring of local details, such as edge smoothing or texture loss.
To combine the advantages of the above two models, we consider adopting a dual-branch structure [21,22,23]. Existing dual-branch networks attempt to combine the advantages of CNNs and transformers, but mostly remain at the level of feature concatenation or simple superposition, lacking in-depth exploration of the complementary relationship between global and local features. Moreover, there is no significant distinction between the primary and secondary roles of the CNN branch and the transformer branch, resulting in substantial redundancy in the fused features and heavy computation. To combine the advantages of CNNs and transformers while alleviating the inherent problems of traditional dual-branch networks, we adopt a CNN-transformer dual-branch structure that distinguishes between primary and secondary branches for image restoration tasks and name this new dual-branch network the Global Modulated Unbalanced Dual-branch Network, namely GMUD-Net.
Different from conventional dual-branch architectures, our network adopts an unbalanced design in which the CNN branch serves as the primary branch and the transformer branch acts as an auxiliary branch. This choice is motivated by the observation that image restoration relies more heavily on recovering local structures, fine textures, and sharp boundaries than on long-range semantic dependency modeling. Although transformer-based designs are effective in capturing global context, assigning them a dominant role in high-resolution restoration usually introduces substantial computational cost and does not necessarily lead to proportional gains in local detail reconstruction. Therefore, instead of treating the two branches equally, we allocate most restoration and reconstruction responsibilities to the CNN branch, while keeping the transformer branch lightweight to provide complementary global context with low computational overhead. Specifically, the CNN branch is responsible for most restoration processing and thus accounts for the majority of computation, focusing on extracting degradation-aware local features and reconstructing clean images. In contrast, the transformer branch is designed to be lightweight and parameter-efficient, providing complementary global context with low computational overhead to guide the CNN branch. In the primary CNN branch, we incorporate the Global–Local Hybrid Backbone Block (GLBB) and the Frequency-Based Global Attention Module (FGAM). GLBB is placed in the main body of the CNN branch, while FGAM is inserted at the bottleneck, enabling the network to enhance global context modeling while preserving fine local details. In the auxiliary transformer branch, we stack computationally efficient global transformer blocks as the backbone. Moreover, compared with prior transformer-based image restoration networks [13,24], we adopt markedly fewer blocks per stage, which substantially reduces the parameter footprint of the transformer branch. To improve the quality of global cues passed from the transformer branch, we introduce the Global Attention Guidance Block (GAG) to filter redundant features before fusing them into the CNN branch.
The main contributions of this work are as follows:
  • We propose GMUD-Net, an unbalanced dual-branch architecture where the CNN branch performs the primary restoration, and the lightweight transformer branch provides filtered global guidance.
  • We design FGAM, which leverages the fast Fourier transform to enhance cross-channel interaction in the frequency domain and improve global restoration capability.
  • We introduce GLBB to integrate local detail modeling with global context aggregation, improving robustness under complex degradations.
  • Extensive experiments on nine widely used benchmarks demonstrate that GMUD-Net achieves state-of-the-art or highly competitive performance on representative restoration tasks, including defocus deblurring, image dehazing, and image desnowing.

2. Related Work

2.1. Single Image Restoration

Traditional image restoration methods are mainly designed based on physical imaging models and handcrafted priors, aiming to constrain the solution space of degradation problems through mathematical modeling. In early studies, the dark channel prior proposed by He et al. [1] realizes single image dehazing by assuming that the dark channel pixel values of fog-free images are close to zero. This method achieves significant results in most scenes but is prone to color distortion in high-brightness regions, such as the sky. The color attenuation prior proposed by Zhu et al. [25] constructs a mapping model between fog-free images and foggy images by analyzing the linear relationship between the three RGB channels and depth of field. Such prior-based methods rely on scene statistical characteristics and have limited generalization ability in complex environments.
The rise of convolutional neural networks has completely revolutionized the research paradigm of image restoration. The early DehazeNet [26] was the first to use an end-to-end CNN to directly estimate the transmission map, avoiding the dependence of traditional methods on physical models. However, limited by the representation capacity of shallow networks, its restoration performance in complex scenes remains modest. Subsequently, MSBDN [27], with its multi-scale feature fusion and dense connections, and FFA-Net [28], with its pixel and channel attention modules, gradually improved the performance of CNNs in detail preservation and global consistency. To break through the limitation of local connectivity in convolution operations, researchers began to explore large-kernel convolution and multi-scale receptive field design. In addition, FocalNet [29] introduces a dual-domain selection mechanism to focus on key regions and combines it with multi-scale residual blocks to provide multi-scale receptive fields. This provides a new solution for efficiently addressing image degradation problems. NAFNet [30] adopts a simple activation-free design and provides a strong baseline with an excellent efficiency–performance trade-off. For defocus deblurring, DRBNet [31] improves restoration robustness by jointly exploiting light-field-generated and real defocused images during training. Building upon the need for more explicit blur characterization, Quan et al. [32] further introduced GGKMNet, which models spatially varying blur through a Gaussian kernel mixture learning strategy in a degradation-aware manner. However, such methods are still limited by the local inductive bias of convolution operations and struggle to model long-range dependencies.

2.2. Global Modulation

The combination of frequency domain processing and CNN provides new ideas for global feature modeling. Mao et al. [33] embedded the fast Fourier transform (FFT) [34] into residual blocks to realize the collaborative learning of low-frequency and high-frequency signals in image deblurring. Cui et al. [35] further proposed SFNet, which guides reconstruction by dynamically selecting the most informative frequency components. TransWeather [36] presents an all-in-one adverse weather restoration framework that employs a unified encoder-decoder architecture to handle multiple degradations, such as rain, fog, and snow. Cui et al. [37] further proposed EENet, showing that efficient feature modeling can be well combined with strong dehazing performance.
The introduction of the transformer's self-attention mechanism has significantly improved the global modeling ability of image restoration. Restormer [13] realizes the modeling of long-range pixel dependencies through channel-wise self-attention and a multi-layer perceptron structure, and shows better global consistency than CNNs in high-resolution image restoration. DehazeFormer [20] introduces reflective padding and inter-window convolution aggregation on the basis of the Swin Transformer to reduce the loss of edge information, achieving a significant breakthrough in dehazing tasks. To balance the computational complexity of the transformer, researchers have proposed various optimization strategies. Fourmer [38] reduces the global modeling complexity from $O(N^2)$ to $O(N \log N)$ through Fourier space modeling and a channel evolution prior. Uformer [39] adopts a U-shaped transformer architecture to reduce the computational load of self-attention through hierarchical feature extraction. MB-TaylorFormer [24] reduces attention complexity through Taylor-expansion-based approximation while retaining effective global dependency modeling for image dehazing. However, transformers remain prone to texture blurring when processing local details.

2.3. Dual-Branch Networks

Dual-branch networks, which attempt to combine the CNN's strength in local detail extraction with the transformer's advantage in global modeling, have become a research hotspot in image restoration in recent years. The interactive guidance dual-branch network proposed by Liu et al. [22] generates a global attention map through the transformer branch to guide the CNN to focus on effective feature regions, achieving improvements in both subjective and objective indicators in dehazing tasks. Along a similar line, our GAG module derives global attention weights from full-image features to guide feature fusion and better capture long-range context. DeHamer [21] proposes a new image dehazing architecture that modulates CNN features by learning modulation matrices (i.e., coefficient matrices and bias matrices) conditioned on transformer features, rather than simple feature addition or concatenation, which solves the problem of feature inconsistency between the two. However, because the primary and secondary roles of the CNN and the transformer are not clearly divided, both branches operate at high load when processing high-resolution images, and the overall computational overhead remains too large for real-time applications. In contrast, GMUD-Net uses an unbalanced dual-branch design. It retains a strong CNN branch for restoration while allocating far fewer parameters to the transformer branch, which reduces the computational overhead common in hybrid architectures.

3. Methodology

In this section, we first introduce the overall workflow of the network. Then, we elaborate on the four core components: (1) global–local hybrid backbone block, (2) frequency-based global attention module, (3) global transformer block, and (4) global attention guidance block. Finally, we introduce the loss function in detail.

3.1. Overall Workflow

Figure 1 illustrates the proposed framework. Given a degraded input image, we feed it into a CNN branch and a transformer branch to extract local and global representations, respectively. In the CNN branch, the encoder and decoder are built with N residual blocks and the GLBB, and the FGAM is inserted at the bottleneck. In the transformer branch, we adopt global transformer blocks as the backbone to model long-range pixel dependencies, which is crucial for global context modeling in high-resolution images. Moreover, the features extracted at each transformer stage guide the corresponding CNN stage via the GAG, encouraging the CNN blocks to attend to informative global cues. To avoid redundant computation, we downsample the features before feeding them into each global transformer stage. Therefore, the transformer inputs have a smaller spatial resolution than the feature maps processed by the CNN blocks at the same stage. Finally, the multi-stage information is fused back into the CNN branch, and the CNN decoder restores fine details to produce the final restored image. Extensive experiments demonstrate the effectiveness of our network across multiple image restoration benchmarks. Next, we describe the key components in detail.
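To make the data flow concrete, the following is a deliberately simplified, schematic sketch of this unbalanced dual-branch layout in PyTorch. All class names (ToyStage, ToyGAG, ToyGMUD), the single-stage structure, and the channel widths are illustrative stand-ins for exposition, not the actual implementation; the real network uses multiple stages, residual blocks, the GLBB, FGAM, and the full GAG described below.

```python
# Schematic sketch of the unbalanced dual-branch data flow (toy stand-in modules).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyStage(nn.Module):
    """Stand-in for a CNN encoder/decoder stage or a global transformer stage."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.GELU())

    def forward(self, x):
        return self.body(x) + x


class ToyGAG(nn.Module):
    """Stand-in for the global attention guidance block (channel gating)."""
    def __init__(self, ch):
        super().__init__()
        self.fc = nn.Conv2d(ch, ch, 1)

    def forward(self, fused):
        w = torch.sigmoid(self.fc(F.adaptive_avg_pool2d(fused, 1)))
        return w * fused


class ToyGMUD(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.stem = nn.Conv2d(3, ch, 3, padding=1)
        self.cnn_stage = ToyStage(ch)      # primary branch (heavy in practice)
        self.trans_stage = ToyStage(ch)    # auxiliary branch (lightweight)
        self.gag = ToyGAG(ch)
        self.tail = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, x):
        feat = self.stem(x)
        # The transformer branch runs on a downsampled copy to save computation.
        t_in = F.interpolate(feat, scale_factor=0.5, mode="bilinear", align_corners=False)
        t_feat = self.trans_stage(t_in)
        t_up = F.interpolate(t_feat, size=feat.shape[-2:], mode="bilinear", align_corners=False)
        # GAG filters the merged global cues before they modulate the CNN branch.
        guided = self.gag(feat + t_up)
        out = self.cnn_stage(feat + guided)
        return self.tail(out) + x          # residual restoration output


if __name__ == "__main__":
    print(ToyGMUD()(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 3, 64, 64])
```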

3.2. Global–Local Hybrid Backbone Block

To enhance the restoration capability of the CNN branch, we propose a global–local hybrid backbone block, which differs from traditional blocks. Compared with the ResBlock, GLBB leverages the properties of FFT to achieve an image-level receptive field, enabling it to easily capture the global differences between blurred and clear image pairs. Additionally, the design of the local backbone branch strengthens the ability to capture information in local details, facilitating the learning of high-frequency differences. As shown in Figure 1d, the GLBB consists of three parts from top to bottom: the global Fourier branch, the local backbone branch, and the residual connection. The global Fourier branch is dedicated to global feature extraction in the GLBB. It comprises a 2D real FFT, an inverse 2D real FFT, two 1 × 1 convolutions, and the GELU activation function [42]. The specific processing flow of this branch is as follows: given an input feature $x \in \mathbb{R}^{C \times H \times W}$, where H, W, and C denote the height, width, and number of channels of the feature volume, respectively, we apply the 2D real FFT to x:
$\mathcal{F}(x) \in \mathbb{R}^{H \times \frac{W}{2} \times C}.$
Next, the real part $\mathcal{R}(\mathcal{F}(x))$ and the imaginary part $\mathcal{I}(\mathcal{F}(x))$ are concatenated along the channel dimension to facilitate subsequent processing, where ⓒ denotes the concatenation operation along the channel dimension:
$Z = \mathcal{R}(\mathcal{F}(x)) \; ⓒ \; \mathcal{I}(\mathcal{F}(x)) \in \mathbb{R}^{H \times \frac{W}{2} \times 2C}.$
Subsequently, the concatenated feature Z is fed into two convolution layers with an intermediate GELU activation. This operation enhances GLBB by strengthening channel interactions in the frequency domain. Here, $W_{1 \times 1}$ denotes a standard 1 × 1 convolution layer:
$f = W_{1 \times 1}\big(\mathrm{GELU}(W_{1 \times 1} Z)\big) \in \mathbb{R}^{H \times \frac{W}{2} \times 2C}.$
Finally, we apply the inverse 2D real FFT to convert f back to the spatial domain. Here, $\mathcal{R}(f)$ and $\mathcal{I}(f)$ denote the real and imaginary parts obtained by splitting f along the channel dimension, each of size $\mathbb{R}^{H \times \frac{W}{2} \times C}$, yielding:
$Y_{fft} = \mathcal{F}^{-1}\big(\mathcal{R}(f) + j\,\mathcal{I}(f)\big) \in \mathbb{R}^{H \times W \times C}.$
The above steps constitute the complete processing flow of the global Fourier branch.
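As a reference point, the following is a minimal sketch of this global Fourier branch, assuming a PyTorch implementation based on torch.fft.rfft2/irfft2; the layer names and the orthonormal FFT normalization are our own choices rather than details taken from the released code.

```python
# Minimal sketch of the global Fourier branch: rFFT -> [real, imag] concat ->
# two 1x1 convolutions with GELU -> recombine -> inverse rFFT.
import torch
import torch.nn as nn


class GlobalFourierBranch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # The 1x1 convolutions act on [real, imag] stacked along channels (2C).
        self.conv1 = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)
        self.conv2 = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x):                                       # x: (B, C, H, W)
        _, _, h, w = x.shape
        spec = torch.fft.rfft2(x, norm="ortho")                 # (B, C, H, W//2+1), complex
        z = torch.cat([spec.real, spec.imag], dim=1)            # (B, 2C, H, W//2+1)
        z = self.conv2(self.act(self.conv1(z)))                 # channel interaction in frequency domain
        real, imag = torch.chunk(z, 2, dim=1)
        return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")


if __name__ == "__main__":
    y = GlobalFourierBranch(32)(torch.randn(2, 32, 64, 64))
    print(y.shape)  # torch.Size([2, 32, 64, 64])
```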
Next, we introduce the local backbone branch, which is dedicated to local feature extraction in the GLBB. The design of this branch includes two key components:
First, central difference convolution (CDC) [40], pixel difference convolution (PDC) [41], and standard 3 × 3 convolution (VC) are incorporated into the local backbone branch for parallel computation, which is termed mixed detail convolution (MDC). Here, CDC and PDC are introduced to enhance local structure representation from different perspectives. CDC augments standard convolution by explicitly modeling central-difference responses, thereby emphasizing gradient-level and edge-aware information. PDC captures pixel-wise local intensity differences and is effective in describing fine-grained structural variations. The standard 3 × 3 convolution is used to obtain intensity-level information, while the difference convolutions (CDC and PDC) enhance gradient-level information. Parallel computation reduces computational cost and further improves the deblurring capability for local details. The formula is as follows:
$\mathrm{MDC}(X) = \sum_{i=1}^{3} X \ast K_i = X \ast \left(\sum_{i=1}^{3} K_i\right)$
where $\mathrm{MDC}(\cdot)$ denotes the proposed mixed detail convolution operation, $\ast$ denotes convolution, and $K_i$ ($i = 1, 2, 3$) represent the kernels of VC, CDC, and PDC, respectively.
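The re-parameterization behind this formula can be checked numerically. The snippet below is a small sketch, assuming PyTorch and representing CDC and PDC by their equivalent vanilla 3 × 3 kernels (difference convolutions can be rewritten in that form); it only illustrates the kernel-merging identity, not the authors' exact implementation.

```python
# Kernel-merging identity: summing the outputs of three same-sized convolutions
# equals one convolution with the element-wise sum of the three kernels.
import torch
import torch.nn as nn
import torch.nn.functional as F

c = 8
# vc/cdc/pdc stand in for the vanilla, central-difference, and pixel-difference kernels.
vc, cdc, pdc = (nn.Conv2d(c, c, 3, padding=1, bias=False) for _ in range(3))
x = torch.randn(1, c, 32, 32)

# Three parallel branches, computed separately.
y_parallel = vc(x) + cdc(x) + pdc(x)

# Equivalent single convolution with the summed kernel.
merged_kernel = vc.weight + cdc.weight + pdc.weight
y_merged = F.conv2d(x, merged_kernel, padding=1)

print(torch.allclose(y_parallel, y_merged, atol=1e-5))  # True (up to float error)
```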
Second, as shown in Figure 2, the local backbone branch employs channel attention and pixel attention as feature attention modules. Assume the input feature at this stage is $\tilde{X} \in \mathbb{R}^{C \times H \times W}$, where GAP denotes the global average pooling layer and ⊙ denotes element-wise multiplication.
The channel attention is calculated as follows:
$CA = \mathrm{Sigmoid}\big(W_{1 \times 1}(\mathrm{ReLU}(W_{1 \times 1}\,\mathrm{GAP}(\tilde{X})))\big)$
$F_{CA} = CA \odot \tilde{X}.$
The pixel attention is calculated as follows:
$PA = \mathrm{Sigmoid}\big(W_{1 \times 1}(\mathrm{ReLU}(W_{1 \times 1} F_{CA}))\big)$
$F_{PA} = PA \odot F_{CA}.$
By cascading the two key components mentioned above, the formula for the main network (i.e., the local backbone branch) is as follows:
$Y_{main} = W_{3 \times 3}\big(\mathrm{GELU}(\mathrm{MDC}(X)) + X\big) \odot CA \odot PA.$
Finally, the output of the GLBB is obtained by summing the residual X, the output of the global Fourier branch $Y_{fft}$, and the output of the local backbone branch $Y_{main}$:
$Y = X + Y_{fft} + Y_{main}.$
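Putting the pieces together, the following is a compact sketch of the local backbone branch and the final three-way sum, assuming PyTorch. The MDC is reduced to a single 3 × 3 convolution standing in for the merged VC/CDC/PDC kernels, and the channel-reduction ratios inside the attention modules are illustrative choices.

```python
# Sketch of the local backbone branch (MDC + channel/pixel attention) and the
# final GLBB combination; GlobalFourierBranch refers to the sketch above.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, ch, r=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                     # GAP
            nn.Conv2d(ch, ch // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // r, ch, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)                            # F_CA = CA * input


class PixelAttention(nn.Module):
    def __init__(self, ch, r=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Conv2d(ch, ch // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // r, 1, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)                            # F_PA = PA * input


class LocalBackboneBranch(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.mdc = nn.Conv2d(ch, ch, 3, padding=1)       # placeholder for the merged MDC kernel
        self.act = nn.GELU()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.ca = ChannelAttention(ch)
        self.pa = PixelAttention(ch)

    def forward(self, x):
        y = self.conv(self.act(self.mdc(x)) + x)         # W_3x3(GELU(MDC(X)) + X)
        return self.pa(self.ca(y))                       # cascade channel then pixel attention


# Final GLBB output (see the equation above):
#   y = x + global_fourier_branch(x) + local_backbone_branch(x)
if __name__ == "__main__":
    x = torch.randn(1, 16, 32, 32)
    print(LocalBackboneBranch(16)(x).shape)  # torch.Size([1, 16, 32, 32])
```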

3.3. Frequency-Based Global Attention Module

The Frequency-Based Global Attention Module (FGAM) is illustrated in Figure 1c. We apply FFT in two ways across the channel dimension to enhance cross-channel interaction, enabling each pixel to more comprehensively integrate information from the same spatial position across all channels in multiple spectral spaces.
In the first strategy, FGAM generates adaptive weights in the frequency domain from the global spatial statistics of the inputs. The weights are applied pointwise to the input spectrum, and the result is transformed back to the spatial domain, enabling adaptive enhancement of channel-wise features. Specifically, given an input feature $X \in \mathbb{R}^{C \times H \times W}$, the transformation is defined as follows:
$\tilde{X} = \mathcal{F}^{-1}\big(\mathcal{F}(X) \odot W_{1 \times 1}(\mathrm{GAP}(X))\big)$
where $W_{1 \times 1}(\cdot)$ denotes a 1 × 1 convolution layer and $\mathrm{GAP}(\cdot)$ represents global average pooling. Here, the FFT transforms the input feature into a complex-valued spectrum, and the modulation is performed on this complex representation. After the inverse FFT, the resulting complex-valued response is converted into a real-valued feature map by taking its magnitude before subsequent processing.
The second strategy focuses on channel-wise weighted fusion in the frequency domain to further refine global representations. Concretely, we apply the cross-channel FFT to $\tilde{X}$ to obtain its real and imaginary parts, concatenate them along the channel dimension, and modulate the concatenated spectrum using a 1 × 1 convolution. Finally, a cross-channel inverse FFT transforms the features back to the spatial domain. This process can be written as follows:
$Y = \mathcal{F}_C^{-1}\Big(W_{1 \times 1}\big(\mathcal{R}(\mathcal{F}_C(\tilde{X})) \; ⓒ \; \mathcal{I}(\mathcal{F}_C(\tilde{X}))\big)\Big).$
Here, FFT along the channel dimension also produces a complex-valued spectrum. To enable standard convolution on real-valued tensors, its real and imaginary parts are concatenated along the channel dimension. After convolution, they are recombined into a complex-valued spectrum and transformed back by inverse FFT. The inverse FFT output is then converted into a real-valued feature map by taking its magnitude.
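The following is a minimal sketch of the two FGAM strategies, assuming both FFTs are taken across the channel dimension with torch.fft and that the modulation convolutions keep the channel count unchanged; these layout details are assumptions for illustration rather than the released implementation.

```python
# Sketch of FGAM: (1) modulate the channel-wise spectrum with weights from GAP;
# (2) channel-wise weighted fusion of real/imag parts in the frequency domain.
import torch
import torch.nn as nn


class FGAMSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.weight_gen = nn.Conv2d(channels, channels, 1)          # weights from GAP(X)
        self.spec_conv = nn.Conv2d(2 * channels, 2 * channels, 1)   # modulates [real, imag]
        self.gap = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):                                  # x: (B, C, H, W)
        # Strategy 1: adaptive weights applied pointwise to the channel spectrum.
        w = self.weight_gen(self.gap(x))                   # (B, C, 1, 1), real-valued
        spec = torch.fft.fft(x, dim=1)                     # complex spectrum over channels
        x_tilde = torch.fft.ifft(spec * w, dim=1).abs()    # magnitude -> real-valued feature map

        # Strategy 2: concatenate real/imag parts, modulate with 1x1 conv, invert.
        spec2 = torch.fft.fft(x_tilde, dim=1)
        z = self.spec_conv(torch.cat([spec2.real, spec2.imag], dim=1))  # (B, 2C, H, W)
        real, imag = torch.chunk(z, 2, dim=1)
        return torch.fft.ifft(torch.complex(real, imag), dim=1).abs()


if __name__ == "__main__":
    print(FGAMSketch(32)(torch.randn(1, 32, 16, 16)).shape)  # torch.Size([1, 32, 16, 16])
```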

3.4. Global Transformer Block

The proposed transformer branch is constructed by stacking multiple global transformer blocks. As illustrated in Figure 3, each global transformer block follows a design similar to Restormer [13] and consists of a global self-attention module and a global depthwise separable feed-forward network. These design choices keep the transformer branch lightweight.
In the global self-attention module, the input feature map first undergoes multi-scale convolution processing, where feature extraction is performed using 1 × 1 vanilla convolution and 3 × 3 depthwise separable convolution to form an initial local representation. Subsequently, global attention is introduced to capture long-range dependencies, and finally, feature optimization is completed through residual connection and normalization layers. The global self-attention adopts a cross-channel computation mechanism, converting attention calculation in the spatial dimension to interaction in the channel dimension via matrix transposition, thereby reducing computational complexity. Specifically, the input feature $X \in \mathbb{R}^{H \times W \times C}$ first generates query (Q), key (K), and value (V) matrices through linear projection:
$Q = DW_{3 \times 3}^{Q} W_{1 \times 1}^{Q}(\mathrm{LN}(X)), \quad K = DW_{3 \times 3}^{K} W_{1 \times 1}^{K}(\mathrm{LN}(X)), \quad V = DW_{3 \times 3}^{V} W_{1 \times 1}^{V}(\mathrm{LN}(X))$
where $\mathrm{LN}(\cdot)$ denotes layer normalization and $DW_{3 \times 3}$ denotes a 3 × 3 depthwise separable convolution.
The tensors Q, K, and V are reshaped from the original $\mathbb{R}^{H \times W \times C}$ to $\mathbb{R}^{C \times HW}$ via a rearrange operation. The global attention map is computed using cross-channel matrix multiplication (⊗), where $d_k$ is a learnable scaling factor:
$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{T}}{d_k}\right)V.$
The global self-attention module integrates the original input and transformed features through a residual connection, forming a skip-connection mechanism:
$\hat{X} = W_{1 \times 1}\big(\mathrm{Attention}(Q, K, V)\big) + X.$
This design reduces the quadratic complexity of conventional self-attention, $O(H^2W^2)$, to a channel-wise complexity of $O(C^2)$. As a result, the proposed global transformer block is better suited to large-resolution image restoration. In addition, residual connections preserve the original feature information and help stabilize optimization by mitigating gradient vanishing. Similar to standard multi-head self-attention, we split the channel dimension into multiple heads and learn separate global attention maps in parallel.
Subsequently, the output of the global self-attention module, $\hat{X}$, is fed into the global feed-forward network, and the final result Y is obtained step by step:
$y = DW_{3 \times 3} W_{1 \times 1}(\mathrm{LN}(\hat{X}))$
$Y = W_{1 \times 1}\big(\mathrm{GELU}(y) \odot y\big).$
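For clarity, the following is a compact sketch of the global self-attention and gated feed-forward path described above, assuming a Restormer-style PyTorch layout with a single head; the normalization layers and exact projection shapes are simplified stand-ins.

```python
# Sketch of the global transformer block: channel-wise (transposed) attention
# followed by a gated depthwise feed-forward network.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalTransformerBlockSketch(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, ch)              # stand-in for LayerNorm on (B, C, H, W)
        self.qkv = nn.Conv2d(ch, ch * 3, 1)           # 1x1 projections for Q, K, V
        self.qkv_dw = nn.Conv2d(ch * 3, ch * 3, 3, padding=1, groups=ch * 3)  # depthwise 3x3
        self.d_k = nn.Parameter(torch.ones(1))        # learnable scaling factor
        self.proj = nn.Conv2d(ch, ch, 1)
        self.norm2 = nn.GroupNorm(1, ch)
        self.ffn_in = nn.Conv2d(ch, ch, 1)
        self.ffn_dw = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)
        self.ffn_out = nn.Conv2d(ch, ch, 1)

    def forward(self, x):                             # x: (B, C, H, W)
        b, c, h, w = x.shape
        q, k, v = self.qkv_dw(self.qkv(self.norm1(x))).chunk(3, dim=1)
        q, k, v = (t.reshape(b, c, h * w) for t in (q, k, v))          # (B, C, HW)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_k, dim=-1)  # (B, C, C)
        x_hat = x + self.proj((attn @ v).reshape(b, c, h, w))          # residual connection

        y = self.ffn_dw(self.ffn_in(self.norm2(x_hat)))                # y = DW(W(LN(x_hat)))
        return self.ffn_out(F.gelu(y) * y)                             # Y = W(GELU(y) * y)


if __name__ == "__main__":
    print(GlobalTransformerBlockSketch(32)(torch.randn(1, 32, 24, 24)).shape)
```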

3.5. Global Attention Guidance Block

As illustrated in Figure 1b, a global attention guidance block (GAG) is designed between the CNN branch and the transformer branch. This block extracts features from each corresponding layer of the two branches and sums them to obtain effective information about the entire image space, which is used to guide the CNN. The merged features are referred to as global image features. After the global image features pass through the GAG, a weight matrix is generated to guide the CNN. The GAG employs global average pooling (GAP) and global max pooling (GMP) to extract global information from the global image features, and then uses two 1 × 1 convolutions to generate a global attention weight feature map.
Given the input global image feature $X \in \mathbb{R}^{C \times H \times W}$, the global attention is defined as follows:
$X_{GLOBAL} = W_{1 \times 1}\big(\mathrm{ReLU}(W_{1 \times 1}(\mathrm{GMP}(X) + \mathrm{GAP}(X)))\big)$
$\mathrm{GLOBAL}_{att} = \mathrm{Sigmoid}(X_{GLOBAL}).$
$\mathrm{GLOBAL}_{att}$ is the global attention weight of the feature map. GAG captures cross-channel global context from both branches by estimating channel-wise importance from the full-image feature map. This suppresses redundant responses and strengthens attention to the most informative features. Then, the final output of GAG is obtained by multiplying the global attention weights with the input feature map:
$Y = \mathrm{GLOBAL}_{att} \odot X.$
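A short sketch of this block, assuming PyTorch, is given below; the channel-reduction ratio of the two 1 × 1 convolutions is an illustrative choice.

```python
# Sketch of the global attention guidance block: GMP + GAP statistics, two 1x1
# convolutions, sigmoid gating, and channel-wise re-weighting of the input.
import torch
import torch.nn as nn


class GAGSketch(nn.Module):
    def __init__(self, ch, r=4):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.gmp = nn.AdaptiveMaxPool2d(1)
        self.conv1 = nn.Conv2d(ch, ch // r, 1)
        self.conv2 = nn.Conv2d(ch // r, ch, 1)

    def forward(self, x):                              # x: merged global image feature (B, C, H, W)
        s = self.gmp(x) + self.gap(x)                  # GMP(X) + GAP(X)
        w = torch.sigmoid(self.conv2(torch.relu(self.conv1(s))))  # global attention weights
        return w * x                                   # Y = GLOBAL_att * X


if __name__ == "__main__":
    print(GAGSketch(32)(torch.randn(1, 32, 16, 16)).shape)  # torch.Size([1, 32, 16, 16])
```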

3.6. Loss Function

In image restoration tasks, the design of the loss function plays a crucial role in model performance. This model adopts the $L_1$ loss (mean absolute error) as the core optimization objective, whose mathematical expression is as follows:
$\mathcal{L}_{L_1} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \tilde{y}_i \right|$
where n represents the number of samples, $y_i$ is the true value of the i-th sample, and $\tilde{y}_i$ is the predicted value of the model. The $L_1$ loss directly measures the pixel-level reconstruction error by averaging the absolute differences between the predicted and true values, which effectively prompts the model to focus on detail restoration and avoids over-smoothing.
Considering that the high-frequency details (such as textures and contours) and low-frequency structures (such as illumination and global layout) of an image manifest differently in the frequency domain, we further introduce the frequency-domain $L_1$ loss:
$\mathcal{L}_{L_1}^{frequency} = \frac{1}{n}\sum_{i=1}^{n} \left| \mathcal{F}(y_i) - \mathcal{F}(\tilde{y}_i) \right|$
where $\mathcal{F}(\cdot)$ denotes the FFT operation. The frequency-domain loss guides the model to align the spectral distributions of the predicted and ground-truth images, which is particularly helpful for restoring high-frequency details destroyed by the degradation process and improving the clarity and naturalness of the image.
The final dual-domain $L_1$ loss function is constructed as a weighted combination of the spatial-domain and frequency-domain losses:
$\mathcal{L}_{dual\text{-}domain} = \mathcal{L}_{L_1} + \lambda\, \mathcal{L}_{L_1}^{frequency}$
where the weight parameter λ is set to 0.1 to balance the contribution of the two loss terms. This design allows the model to focus on both the accurate reconstruction in the pixel space and the consistency of frequency-domain features, forming a complementary optimization mechanism. The spatial-domain loss ensures the correctness of the basic content of the image, while the frequency-domain loss enhances the ability to express details, especially showing significant advantages when textures and edge structures are complex.
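A minimal sketch of this dual-domain loss, assuming PyTorch and a spatial 2D FFT with orthonormal normalization, is shown below; the complex spectra are compared through the magnitude of their element-wise difference.

```python
# Dual-domain L1 loss: pixel-space L1 plus lambda-weighted frequency-space L1.
import torch


def dual_domain_l1(pred: torch.Tensor, target: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """L = L1(pred, target) + lam * L1(FFT(pred), FFT(target))."""
    spatial = torch.mean(torch.abs(pred - target))
    pred_freq = torch.fft.fft2(pred, norm="ortho")
    target_freq = torch.fft.fft2(target, norm="ortho")
    frequency = torch.mean(torch.abs(pred_freq - target_freq))  # |.| of a complex tensor = magnitude
    return spatial + lam * frequency


if __name__ == "__main__":
    a, b = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
    print(dual_domain_l1(a, b).item())
```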

4. Experiments

In this section, experiments are conducted on nine different benchmark datasets to verify the effectiveness of our network in three representative image restoration tasks: image dehazing, image defocus deblurring, and image desnowing. Ablation experiments are presented in the final part.

4.1. Implementation Details and Evaluation Protocols

In the transformer branch, the training model remains unchanged for different datasets, and the number of global transformer blocks B from front to back is [1, 2, 4, 2, 1]. In the CNN branch, we train separate models for different datasets. We scale the model by setting different numbers of residual blocks in each encoder and decoder according to the task complexity, i.e., N = 3 for dehazing and desnowing, and N = 15 for defocus deblurring. Unless otherwise specified, the following default hyperparameters are used. The model is trained using the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The batch size is set to eight. The learning rate is initially set to $2 \times 10^{-4}$ and then gradually reduced to $1 \times 10^{-6}$ using a cosine annealing decay strategy. For data augmentation, cropped patches of size 256 × 256 are randomly horizontally flipped with a probability of 0.5. All experiments are conducted on an NVIDIA RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA).
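For reference, the optimizer and schedule described above can be set up with standard PyTorch utilities as sketched below; `model` and `num_iterations` are placeholders.

```python
# Adam optimizer with cosine annealing from 2e-4 down to 1e-6.
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)   # placeholder for GMUD-Net
num_iterations = 300_000                      # placeholder for the total number of updates

optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_iterations, eta_min=1e-6
)
# Inside the training loop, after each update:
#   loss.backward(); optimizer.step(); optimizer.zero_grad(); scheduler.step()
```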
We measure the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) between the predicted results and the real images of all datasets as criteria to evaluate the model performance. The defocus deblurring task also uses mean absolute error (MAE) and learned perceptual image patch similarity (LPIPS) for evaluation. FLOPs are measured on 256 × 256 patches. In the tables, the best and second-best results are marked in bold and underlined.

4.2. Datasets

Image Dehazing. We train and evaluate GMUD-Net on both synthetic and real-world datasets for image dehazing. Following common practice, we train separate models on the RESIDE indoor and outdoor training sets [43], i.e., ITS and OTS, and evaluate them on the corresponding RESIDE test sets, namely SOTS-indoor and SOTS-outdoor. Specifically, the indoor model is trained on ITS for 1000 epochs. For the outdoor setting, the model is trained on OTS for 30 epochs with an initial learning rate of $1 \times 10^{-4}$, and the default hyperparameters are used for the rest.
To further assess robustness in challenging real-world scenarios, we additionally conduct experiments on two real-world datasets, Dense-Haze [44] and NH-Haze [45]. For these datasets, models are trained for 5000 epochs with a patch size of 800 × 1200 and a batch size of two.
Besides daytime dehazing, we also evaluate GMUD-Net on the nighttime dehazing dataset NHR [46]. For NHR, the model is trained for 300 epochs with an initial learning rate of $1 \times 10^{-4}$.
Image Defocus Deblurring. The DPDD dataset [5] is used to train and evaluate our model, following the protocol adopted by prior defocus deblurring methods [5,31,33].
The DPDD dataset consists of 350 training scenes, 74 validation scenes, and 76 test scenes. Each scene includes four images, labeled as the central view, left view, right view, and fully focused ground truth image. The model is trained by taking the central view image as input and calculating the loss between the predicted clean image and the ground truth. Training is conducted for 4000 epochs with an initial learning rate of $1 \times 10^{-4}$.
Image Desnowing. For the image desnowing task, our model is trained and evaluated on three commonly used datasets: SRRS [47], CSD [48], and Snow100K [2]. The dataset selection and testing methods are consistent with recent algorithms [30,36,48], and the model is trained separately on each dataset for 150 epochs, 4000 epochs, and 2000 epochs, respectively, with an initial learning rate of $1 \times 10^{-4}$.

4.3. Image Dehazing Results

Quantitative Comparison. For daytime scenes, Table 1 reports the quantitative results on the RESIDE benchmark [43] (SOTS subset) and two real-world datasets, Dense-Haze [44] and NH-Haze [45]. GMUD-Net achieves the best performance on most metrics across both synthetic and real-world evaluations. In particular, on SOTS-indoor, GMUD-Net surpasses DEA-Net-CR [17], a recent method tailored for image dehazing, by 0.39 dB in PSNR. Moreover, compared with OKM [18], a state-of-the-art restoration model, GMUD-Net is more effective at handling dense haze in real-world scenarios, delivering a 0.21 dB PSNR gain on Dense-Haze. These results indicate that GMUD-Net is particularly robust under heavy haze conditions.
In addition, we evaluated the effectiveness of GMUD-Net on the nighttime dehazing dataset NHR [46]. As shown in Table 2, GMUD-Net achieves a 0.60 dB PSNR improvement over the recent FocalNet [29]. These results confirm the effectiveness of our proposed design.
Visual Comparison. This paper compares the visual effects of image dehazing across multiple methods for both daytime and nighttime scenes. Figure 4 presents comparisons between indoor and outdoor daytime environments, while Figure 5 shows low-light comparisons across four distinct visual scenarios. The visual results intuitively demonstrate that our method is more effective than other algorithms in removing haze blur from daytime (indoor and outdoor) and nighttime scenes, producing results that are closer to the ground truth images.

4.4. Defocus Image Deblurring Results

Quantitative Comparison. The quantitative results on DPDD are reported in Table 3. Across indoor, outdoor, and combined settings, GMUD-Net achieves consistently competitive performance against eleven representative methods. In terms of MAE, GMUD-Net matches the best-performing method (FocalNet [29]) across all splits. Moreover, GMUD-Net yields lower LPIPS than FocalNet in all cases, indicating improved perceptual quality. Although its PSNR is slightly lower than FocalNet, GMUD-Net remains comparable to other strong restoration models while using fewer parameters than Restormer [13], demonstrating a favorable accuracy–efficiency trade-off for defocus deblurring.
Visual Comparison. Visual comparisons are shown in Figure 6. GMUD-Net restores sharper structures and finer textures in both indoor and outdoor scenes, while reducing blurring artifacts and detail loss compared with competing methods.

4.5. Image Desnowing Results

Quantitative Comparison. The quantitative results of GMUD-Net on CSD [48], SRRS [47], and Snow100K [2] are summarized in Table 4. Compared with IRNeXt [63], GMUD-Net improves PSNR by 0.39 dB on CSD, 0.34 dB on SRRS, and 0.19 dB on Snow100K, while achieving comparable or higher SSIM. These results indicate a clear advantage of GMUD-Net for snow removal, particularly in complex scenes.
Visual Comparison. The visual comparisons on the CSD dataset are shown in Figure 7. As highlighted in the zoomed-in patches (yellow boxes), GMUD-Net produces cleaner desnowing results and better preserves fine details in challenging regions. For example, the zoomed-in results of FocalNet [29] still exhibit residual snow-induced blur, whereas GMUD-Net yields sharper structures that are closer to the ground truth.

4.6. Ablation Study

To thoroughly validate the effectiveness of the proposed GMUD-Net, we conducted comprehensive ablation studies on the SOTS-indoor dataset. These experiments were designed to evaluate the contribution of individual components and to justify the configuration of the unbalanced dual-branch architecture.

4.6.1. Effectiveness of Individual Components

We investigate the contributions of the global–local hybrid backbone block (GLBB), the frequency-based global attention module (FGAM), and the global attention guidance block (GAG). As reported in Table 5 and visualized in Figure 8, the baseline model (M1) is a simplified architecture in which the corresponding modules are replaced by standard convolutions. Features from the two branches are fused by simple feature addition (FA) without attention guidance.
M1 vs. M2. To verify the contribution of frequency domain modeling, we integrated the FGAM into the baseline M1, resulting in model M2. Quantitative results in Table 5 show that M2 achieves a PSNR of 33.26 dB, an improvement of 1.28 dB over M1. This gain suggests that applying FFT-based channel interaction at the bottleneck enhances global information aggregation. Visual comparisons in Figure 8 further support this, where M2 exhibits clearer structural details compared to the blurry outputs of M1.
M1 vs. M3. We then assess the effect of GLBB by replacing the standard residual blocks in M1 with the proposed GLBB, resulting in M3. This change delivers the largest improvement, increasing PSNR by 6.97 dB and SSIM by 0.0093, albeit at the expense of higher complexity. The parameter counts and FLOPs increase by 1.91M and 3.75G, respectively. The substantial gain suggests that GLBB effectively combines local details and global context through its local backbone branch and the global Fourier branch. As shown in Figure 8, M3 noticeably reduces haze artifacts and improves color fidelity compared with M1.
M4 vs. M5. Finally, we evaluated the effect of the global attention guidance module by comparing M4 (GLBB + FGAM without GAG) with the complete model M5. While M4 already achieves a high PSNR of 39.20 dB, the inclusion of GAG in M5 further improves performance to 39.31 dB and achieves the highest SSIM of 0.9947. The GAG effectively filters redundant features from the transformer branch before fusion. Visually, Figure 8 demonstrates that M5 restores the most natural textures and edge details, closest to the ground truth, thereby validating the necessity of the GAG in optimizing feature integration.

4.6.2. Analysis of Dual-Branch Configuration

We further analyze the rationale behind our unbalanced dual-branch design by varying the depth of the transformer branch. Table 6 summarizes the trade-off between restoration performance and computational cost in terms of parameters and FLOPs on SOTS-indoor.
Necessity of the Dual-Branch. The single-branch baseline (F1), which uses only the CNN branch, achieves 39.06 dB PSNR. After introducing the lightweight transformer branch (F2), PSNR increases to 39.31 dB, indicating that the auxiliary transformer branch provides complementary global cues that are difficult to capture with the CNN branch alone.
Advantages of the Unbalanced Design. We further compare transformer-branch configurations with different depths. As shown in Table 6, F4 adopts the same block numbers as Restormer [13], substantially increasing transformer depth. However, it yields only a marginal improvement over F3, while increasing the parameter count from 4.95 M to 6.41 M and FLOPs from 23.85 G to 29.47 G. In contrast, F3 intentionally uses fewer transformer blocks, resulting in an unbalanced parameter allocation between the CNN and transformer branches while maintaining strong restoration performance. Therefore, we adopt the transformer block setting in F3 as the final configuration of GMUD-Net.

5. Conclusions

This paper proposes GMUD-Net, an image restoration network with an unbalanced dual-branch architecture designed to handle multiple types of degradations. The model employs a CNN-based main branch to model local structures and fine textures, while an auxiliary transformer branch is introduced to capture global dependencies. By integrating the global–local hybrid backbone block (GLBB), the frequency-based global attention module (FGAM), and the global attention guidance block (GAG), GMUD-Net enables effective feature interaction and fusion across the spatial and frequency domains. Within this unified framework, GMUD-Net addresses representative restoration tasks, including image dehazing, defocus deblurring, and snow removal. Experiments on multiple benchmark datasets demonstrate that GMUD-Net achieves superior or comparable performance to existing methods, suggesting that a CNN-dominated, transformer-assisted dual-branch design with frequency-aware modeling can improve restoration quality and practical applicability across diverse scenarios.
Despite achieving a reasonable balance between restoration quality and efficiency, GMUD-Net remains relatively complex, leaving room for improvement in strictly real-time and resource-constrained settings. Moreover, current evaluations are limited to a restricted set of degradation types and datasets. Future work will focus on simplifying the architecture to reduce computational overhead, for example by exploring more lightweight attention and feature fusion mechanisms to streamline GLBB. We also plan to extend the framework to additional tasks, such as low-light enhancement, deraining, and video restoration, with the goal of improving generalization across diverse scenarios and datasets.

Author Contributions

S.W. was responsible for conceptualization and formal analysis. The methodology was developed by S.W. and H.Z. Software implementation, validation, and visualization were carried out by Y.L. Investigation was conducted by S.W. and Y.L. Resources, supervision, project administration, and funding acquisition were provided by S.W. and H.Z. Data curation was performed by H.Z. The original draft was prepared by S.W., H.Z. and Y.L., while review and editing were completed by S.W. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Basic Research Program of Jiangsu (Grant No. BK20220226).

Data Availability Statement

Data are available upon request due to privacy and ethical restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. He, K.; Sun, J.; Tang, X. Single image haze removal using dark channel prior. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 2341–2353. [Google Scholar] [CrossRef] [PubMed]
  2. Liu, Y.F.; Jaw, D.W.; Huang, S.C.; Hwang, J.N. Desnownet: Context-aware deep network for snow removal. IEEE Trans. Image Process. 2018, 27, 3064–3073. [Google Scholar] [CrossRef] [PubMed]
  3. Qian, R.; Tan, R.T.; Yang, W.; Su, J.; Liu, J. Attentive generative adversarial network for raindrop removal from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2482–2491. [Google Scholar]
  4. Li, C.; Guo, C.; Ren, W.; Cong, R.; Hou, J.; Kwong, S.; Tao, D. An underwater image enhancement benchmark dataset and beyond. IEEE Trans. Image Process. 2019, 29, 4376–4389. [Google Scholar] [CrossRef]
  5. Abuolaim, A.; Brown, M.S. Defocus deblurring using dual-pixel data. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 111–126. [Google Scholar]
  6. Nah, S.; Hyun Kim, T.; Mu Lee, K. Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 3883–3891. [Google Scholar]
  7. Cai, Y.; Bian, H.; Lin, J.; Wang, H.; Timofte, R.; Zhang, Y. Retinexformer: One-stage retinex-based transformer for low-light image enhancement. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2023, Paris, France, 1–6 October 2023; pp. 12504–12513. [Google Scholar]
  8. Zhang, H.; Xiao, L.; Cao, X.; Foroosh, H. Multiple adverse weather conditions adaptation for object detection via causal intervention. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 46, 1742–1756. [Google Scholar] [CrossRef] [PubMed]
  9. Huang, S.C.; Le, T.H.; Jaw, D.W. DSNet: Joint semantic learning for object detection in inclement weather conditions. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2623–2633. [Google Scholar] [CrossRef]
  10. Berman, D.; Treibitz, T.; Avidan, S. Air-light estimation using haze-lines. In Proceedings of the 2017 IEEE International Conference on Computational Photography (ICCP); IEEE: New York, NY, USA, 2017; pp. 1–9. [Google Scholar]
  11. Yi, Q.; Li, J.; Dai, Q.; Fang, F.; Zhang, G.; Zeng, T. Structure-preserving deraining with residue channel prior guidance. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 4238–4247. [Google Scholar]
  12. Li, B.; Peng, X.; Wang, Z.; Xu, J.; Feng, D. Aod-net: All-in-one dehazing network. In Proceedings of the IEEE International Conference on Computer Vision 2017, Venice, Italy, 22–29 October 2017; pp. 4770–4778. [Google Scholar]
  13. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 5728–5739. [Google Scholar]
  14. McCartney, E.J. Optics of the Atmosphere: Scattering by Molecules and Particles; Wiley: New York, NY, USA, 1976. [Google Scholar]
  15. Tipping, M.E. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 2001, 1, 211–244. [Google Scholar]
  16. Chen, D.; He, M.; Fan, Q.; Liao, J.; Zhang, L.; Hou, D.; Yuan, L.; Hua, G. Gated context aggregation network for image dehazing and deraining. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV); IEEE: New York, NY, USA, 2019; pp. 1375–1383. [Google Scholar]
  17. Chen, Z.; He, Z.; Lu, Z.M. DEA-Net: Single image dehazing based on detail-enhanced convolution and content-guided attention. IEEE Trans. Image Process. 2024, 33, 1002–1015. [Google Scholar] [CrossRef]
  18. Cui, Y.; Ren, W.; Knoll, A. Omni-kernel modulation for universal image restoration. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 12496–12509. [Google Scholar] [CrossRef]
  19. Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Deng, Y.; Liu, Z.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021, Nashville, TN, USA, 20–25 June 2021; pp. 12299–12310. [Google Scholar]
  20. Song, Y.; He, Z.; Qian, H.; Du, X. Vision transformers for single image dehazing. IEEE Trans. Image Process. 2023, 32, 1927–1941. [Google Scholar] [CrossRef]
  21. Guo, C.L.; Yan, Q.; Anwar, S.; Cong, R.; Ren, W.; Li, C. Image dehazing transformer with transmission-aware 3d position embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 5812–5820. [Google Scholar]
  22. Liu, H.; Li, X.; Tan, T. Interaction-Guided Two-branch image dehazing network. In Proceedings of the Asian Conference on Computer Vision 2024, Hanoi, Vietnam, 8–12 December 2024; pp. 4069–4084. [Google Scholar]
  23. Wang, S.; Li, H.; Liu, L.; Cai, R.; Yin, Z.; Zhu, H. TSFI-fusion: A dual-branch decoupled infrared and visible image fusion network based on transformer and spatial-frequency interaction. Opt. Lasers Eng. 2025, 195, 109287. [Google Scholar] [CrossRef]
  24. Qiu, Y.; Zhang, K.; Wang, C.; Luo, W.; Li, H.; Jin, Z. Mb-taylorformer: Multi-branch efficient transformer expanded by taylor formula for image dehazing. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2023, Paris, France, 1–6 October 2023; pp. 12802–12813. [Google Scholar]
  25. Zhu, Q.; Mai, J.; Shao, L. A fast single image haze removal algorithm using color attenuation prior. IEEE Trans. Image Process. 2015, 24, 3522–3533. [Google Scholar] [CrossRef] [PubMed]
  26. Cai, B.; Xu, X.; Jia, K.; Qing, C.; Tao, D. Dehazenet: An end-to-end system for single image haze removal. IEEE Trans. Image Process. 2016, 25, 5187–5198. [Google Scholar] [CrossRef] [PubMed]
  27. Dong, H.; Pan, J.; Xiang, L.; Hu, Z.; Zhang, X.; Wang, F.; Yang, M.H. Multi-scale boosted dehazing network with dense feature fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA, 13–19 June 2020; pp. 2157–2167. [Google Scholar]
  28. Qin, X.; Wang, Z.; Bai, Y.; Xie, X.; Jia, H. FFA-Net: Feature fusion attention network for single image dehazing. Proc. AAAI Conf. Artif. Intell. 2020, 34, 11908–11915. [Google Scholar] [CrossRef]
  29. Cui, Y.; Ren, W.; Cao, X.; Knoll, A. Focal network for image restoration. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2023, Paris, France, 1–6 October 2023; pp. 13001–13011. [Google Scholar]
  30. Chen, L.; Chu, X.; Zhang, X.; Sun, J. Simple baselines for image restoration. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 17–33. [Google Scholar]
  31. Ruan, L.; Chen, B.; Li, J.; Lam, M. Learning to deblur using light field generated and real defocus images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 16304–16313. [Google Scholar]
  32. Quan, Y.; Wu, Z.; Xu, R.; Ji, H. Deep single image defocus deblurring via gaussian kernel mixture learning. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 11361–11377. [Google Scholar] [CrossRef]
  33. Mao, X.; Liu, Y.; Shen, W.; Li, Q.; Wang, Y. Deep residual fourier transformation for single image deblurring. arXiv 2021, arXiv:2111.11745. [Google Scholar]
  34. Nussbaumer, H.J. The fast Fourier transform. In Fast Fourier Transform and Convolution Algorithms; Springer: Berlin/Heidelberg, Germany, 1981; pp. 80–111. [Google Scholar]
  35. Cui, Y.; Tao, Y.; Bing, Z.; Ren, W.; Gao, X.; Cao, X.; Huang, K.; Knoll, A. Selective frequency network for image restoration. In Proceedings of the Eleventh International Conference on Learning Representations 2023, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  36. Valanarasu, J.M.J.; Yasarla, R.; Patel, V.M. Transweather: Transformer-based restoration of images degraded by adverse weather conditions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 2353–2363. [Google Scholar]
  37. Cui, Y.; Wang, Q.; Li, C.; Ren, W.; Knoll, A. EENet: An effective and efficient network for single image dehazing. Pattern Recognit. 2025, 158, 111074. [Google Scholar] [CrossRef]
  38. Zhou, M.; Huang, J.; Guo, C.L.; Li, C. Fourmer: An efficient global modeling paradigm for image restoration. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2023; pp. 42589–42601. [Google Scholar]
  39. Wang, Z.; Cun, X.; Bao, J.; Zhou, W.; Liu, J.; Li, H. Uformer: A general u-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 17683–17693. [Google Scholar]
  40. Yu, Z.; Zhao, C.; Wang, Z.; Qin, Y.; Su, Z.; Li, X.; Zhou, F.; Zhao, G. Searching central difference convolutional networks for face anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA, 13–19 June 2020; pp. 5295–5305. [Google Scholar]
  41. Su, Z.; Liu, W.; Yu, Z.; Hu, D.; Liao, Q.; Tian, Q.; Pietikäinen, M.; Liu, L. Pixel difference networks for efficient edge detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 5117–5127. [Google Scholar]
  42. Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  43. Li, B.; Ren, W.; Fu, D.; Tao, D.; Feng, D.; Zeng, W.; Wang, Z. Benchmarking single-image dehazing and beyond. IEEE Trans. Image Process. 2018, 28, 492–505. [Google Scholar] [CrossRef]
  44. Ancuti, C.O.; Ancuti, C.; Sbert, M.; Timofte, R. Dense-haze: A benchmark for image dehazing with dense-haze and haze-free images. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP); IEEE: New York, NY, USA, 2019; pp. 1014–1018. [Google Scholar]
  45. Ancuti, C.O.; Ancuti, C.; Timofte, R. NH-HAZE: An image dehazing benchmark with non-homogeneous hazy and haze-free images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops 2020, Seattle, WA, USA, 14–19 June 2020; pp. 444–445. [Google Scholar]
  46. Zhang, J.; Cao, Y.; Zha, Z.J.; Tao, D. Nighttime dehazing with a synthetic benchmark. In Proceedings of the 28th ACM International Conference on Multimedia 2020, Seattle, WA, USA, 12–16 October 2020; pp. 2355–2363. [Google Scholar]
  47. Chen, W.T.; Fang, H.Y.; Ding, J.J.; Tsai, C.C.; Kuo, S.Y. JSTASR: Joint size and transparency-aware snow removal algorithm based on modified partial convolution and veiling effect removal. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 754–770. [Google Scholar]
  48. Chen, W.T.; Fang, H.Y.; Hsieh, C.L.; Tsai, C.C.; Chen, I.; Ding, J.J.; Kuo, S.Y. All snow removed: Single image desnowing algorithm using hierarchical dual-tree complex wavelet representation and contradict channel loss. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 4196–4205. [Google Scholar]
  49. Liu, X.; Ma, Y.; Shi, Z.; Chen, J. Griddehazenet: Attention-based multi-scale network for image dehazing. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2019, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7314–7323. [Google Scholar]
  50. Wu, H.; Qu, Y.; Lin, S.; Zhou, J.; Qiao, R.; Zhang, Z.; Xie, Y.; Ma, L. Contrastive learning for compact single image dehazing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021, Nashville, TN, USA, 20–25 June 2021; pp. 10551–10560. [Google Scholar]
  51. Ye, T.; Zhang, Y.; Jiang, M.; Chen, L.; Liu, Y.; Chen, S.; Chen, E. Perceiving and modeling density for image dehazing. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 130–145. [Google Scholar]
  52. Tu, Z.; Talebi, H.; Zhang, H.; Yang, F.; Milanfar, P.; Bovik, A.; Li, Y. Maxim: Multi-axis mlp for image processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 5769–5780. [Google Scholar]
  53. Li, Y.; Tan, R.T.; Brown, M.S. Nighttime haze removal with glow and multiple light colors. In Proceedings of the IEEE International Conference on Computer Vision 2015, Santiago, Chile, 7–13 December 2015; pp. 226–234. [Google Scholar]
  54. Zhang, J.; Cao, Y.; Fang, S.; Kang, Y.; Wen Chen, C. Fast haze removal for nighttime image using maximum reflectance prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 7418–7426. [Google Scholar]
  55. Wang, T.; Tao, G.; Lu, W.; Zhang, K.; Luo, W.; Zhang, X.; Lu, T. Restoring vision in hazy weather with hierarchical contrastive learning. Pattern Recognit. 2024, 145, 109956. [Google Scholar] [CrossRef]
  56. Cui, Y.; Ren, W.; Cao, X.; Knoll, A. Image restoration via frequency selection. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 1093–1108. [Google Scholar] [CrossRef]
  57. Karaali, A.; Jung, C.R. Edge-based defocus blur estimation with adaptive scale selection. IEEE Trans. Image Process. 2017, 27, 1126–1137. [Google Scholar] [CrossRef]
  58. Lee, J.; Lee, S.; Cho, S.; Lee, S. Deep defocus map estimation using domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019, Long Beach, CA, USA, 15–20 June 2019; pp. 12222–12230. [Google Scholar]
  59. Shi, J.; Xu, L.; Jia, J. Just noticeable defocus blur detection and estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2015, Boston, MA, USA, 7–12 June 2015; pp. 657–665. [Google Scholar]
  60. Son, H.; Lee, J.; Cho, S.; Lee, S. Single image defocus deblurring using kernel-sharing parallel atrous convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 2642–2650. [Google Scholar]
  61. Lee, J.; Son, H.; Rim, J.; Cho, S.; Lee, S. Iterative filter adaptive network for single image defocus deblurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021, Nashville, TN, USA, 20–25 June 2021; pp. 2034–2042. [Google Scholar]
  62. Abuolaim, A.; Afifi, M.; Brown, M.S. Improving single-image defocus deblurring: How dual-pixel images help through multi-task learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision 2022, Waikoloa, HI, USA, 3–8 January 2022; pp. 1231–1239. [Google Scholar]
  63. Cui, Y.; Ren, W.; Yang, S.; Cao, X.; Knoll, A. Irnext: Rethinking convolutional network design for image restoration. In Proceedings of the 40th International Conference on Machine Learning 2023, Honolulu, HI, USA, 23–29 July 2023; pp. 6545–6564. [Google Scholar]
  64. Engin, D.; Genç, A.; Kemal Ekenel, H. Cycle-dehaze: Enhanced cyclegan for single image dehazing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops 2018, Salt Lake City, UT, USA, 18–22 June 2018; pp. 825–833. [Google Scholar]
  65. Li, R.; Tan, R.T.; Cheong, L.F. All in one bad weather removal using architectural search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA, 13–19 June 2020; pp. 3175–3185. [Google Scholar]
Figure 1. Architectural details of GMUD-Net. (a) GMUD-Net consists of a CNN branch and a transformer branch. (b) The GAG sits between the two branches and filters the global information acquired by the transformer branch before it is injected into the CNN branch. (c) The FGAM applies the FFT along the channel dimension in two ways to enhance cross-channel interaction and capture global information. (d) Each encoder/decoder block of the CNN branch contains N regular residual blocks and one GLBB, a modified residual block. (e) The Mixed Detail Convolution (MDC) runs a regular convolution, a Central Difference Convolution (CDC) [40], and a Pixel Difference Convolution (PDC) [41] in parallel.
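As a concrete point of reference for Figure 1(e), the following PyTorch sketch shows how the parallel MDC branches can be realized. The CDC follows the formulation of [40] (a vanilla convolution minus a θ-weighted center term); the PDC of [41] has several variants, and its central variant reduces to a CDC with θ = 1, which is the simplification adopted here. The elementwise-sum fusion, channel counts, and θ = 0.7 are illustrative assumptions rather than our exact implementation settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralDifferenceConv(nn.Module):
    """Central Difference Convolution (CDC) [40]:
    y = Conv(x) - theta * (sum of kernel weights) * x."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1, theta=0.7):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding, bias=False)
        self.theta = theta

    def forward(self, x):
        out = self.conv(x)
        # Summing the kernel over its spatial support gives an equivalent 1x1
        # kernel; applying it to x realizes the central-difference term.
        kernel_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)
        return out - self.theta * F.conv2d(x, kernel_sum)


class MixedDetailConv(nn.Module):
    """Sketch of the MDC in Figure 1(e): a regular convolution, a CDC, and a
    (central-variant) PDC in parallel, fused here by elementwise summation."""
    def __init__(self, channels, theta=0.7):
        super().__init__()
        self.vanilla = nn.Conv2d(channels, channels, 3, padding=1)
        self.cdc = CentralDifferenceConv(channels, channels, theta=theta)
        # The central PDC variant of [41] coincides with a CDC at theta = 1.
        self.pdc = CentralDifferenceConv(channels, channels, theta=1.0)

    def forward(self, x):
        return self.vanilla(x) + self.cdc(x) + self.pdc(x)


# Usage: a 64-channel feature map keeps its shape after the MDC.
y = MixedDetailConv(64)(torch.randn(1, 64, 32, 32))
```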
Figure 2. Illustration of the channel attention and pixel attention mechanisms.
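For readers unfamiliar with the two attention forms in Figure 2, the sketch below shows their conventional design (as popularized by FFA-Net [28]): channel attention pools globally and rescales each channel, while pixel attention predicts a one-channel spatial mask. The reduction ratio and exact layer choices are assumptions and may differ from the modules used in our network.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: global average pooling followed by a two-layer 1x1
    gate that rescales every feature channel."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)


class PixelAttention(nn.Module):
    """Pixel attention: a two-layer 1x1 gate that produces a single-channel
    spatial mask and rescales every spatial position."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)
```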
Figure 3. The global transformer block used in the transformer branch of GMUD-Net.
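The global transformer block in Figure 3 relies on efficient cross-channel attention. As an assumption-labeled sketch of one common realization (Restormer's transposed attention [13]), the block below computes a (C/heads × C/heads) attention map, so the cost grows linearly rather than quadratically with the number of pixels; the block in our transformer branch may differ in its exact projections and normalization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossChannelAttention(nn.Module):
    """Cross-channel (transposed) attention in the spirit of Restormer [13]:
    attention is taken over channels, not pixels, keeping cost linear in H*W."""
    def __init__(self, channels, heads=4):
        super().__init__()
        assert channels % heads == 0
        self.heads = heads
        self.temperature = nn.Parameter(torch.ones(heads, 1, 1))
        self.qkv = nn.Conv2d(channels, channels * 3, 1)
        self.dwconv = nn.Conv2d(channels * 3, channels * 3, 3, padding=1, groups=channels * 3)
        self.project = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.dwconv(self.qkv(x)).chunk(3, dim=1)
        split = lambda t: t.reshape(b, self.heads, c // self.heads, h * w)
        q, k, v = split(q), split(k), split(v)
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # (b, heads, c/h, c/h)
        out = attn.softmax(dim=-1) @ v                        # (b, heads, c/h, h*w)
        return self.project(out.reshape(b, c, h, w)) + x
```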
Figure 4. Comparisons of image dehazing on the SOTS dataset [43]. The top and bottom images are obtained from SOTS-indoor and SOTS-outdoor, respectively. Our results are generated by GMUD-Net.
Figure 5. Comparisons of nighttime image dehazing on the NHR [46] dataset.
Figure 6. Single-image defocus deblurring results on the DPDD dataset [5]. The top and bottom images are obtained from indoor and outdoor scenes, respectively.
Figure 7. Qualitative comparisons for image desnowing on the CSD dataset [48]. The top row shows the full images, and the bottom row shows magnified local details.
Figure 8. Comparison of dehazing results produced by different component configurations of GMUD-Net on the SOTS-indoor dataset.
Table 1. Comparisons on four daytime synthetic and real-world dehazing datasets, namely SOTS-indoor [43], SOTS-outdoor [43], Dense-Haze [44], and NH-Haze [45].
| Method | SOTS-indoor (PSNR / SSIM) | SOTS-outdoor (PSNR / SSIM) | Dense-Haze (PSNR / SSIM) | NH-Haze (PSNR / SSIM) | Params (M) | FLOPs (G) | Venue |
|---|---|---|---|---|---|---|---|
| DehazeNet [26] | 19.82 / 0.8210 | 24.75 / 0.9271 | 13.84 / 0.43 | 16.62 / 0.52 | 0.009 | 0.581 | TIP 2016 |
| GridDehazeNet [49] | 32.16 / 0.9845 | 30.86 / 0.9827 | 13.31 / 0.37 | 13.80 / 0.54 | 0.956 | 21.49 | ICCV 2019 |
| MSBDN [27] | 33.67 / 0.9856 | 33.48 / 0.9824 | 15.37 / 0.49 | 19.23 / 0.71 | 31.35 | 24.44 | CVPR 2020 |
| FFA-Net [28] | 36.39 / 0.9894 | 33.57 / 0.9842 | 14.39 / 0.45 | 19.87 / 0.69 | 4.456 | 287.8 | AAAI 2020 |
| AECR-Net [50] | 37.17 / 0.9901 | - | 15.80 / 0.47 | 19.88 / 0.72 | 2.611 | 52.20 | CVPR 2021 |
| PMNet [51] | 38.41 / 0.9900 | 34.74 / 0.9850 | 16.79 / 0.51 | 20.42 / 0.73 | 18.90 | 81.13 | ECCV 2022 |
| MAXIM-2S [52] | 38.11 / 0.9910 | 34.19 / 0.9850 | - | - | 14.1 | 216 | CVPR 2022 |
| Dehamer [21] | 36.63 / 0.9881 | 35.18 / 0.9860 | 16.62 / 0.56 | - | 132.50 | 60.3 | CVPR 2022 |
| Fourmer [38] | 37.32 / 0.9901 | - | 15.95 / 0.49 | 19.91 / 0.72 | 1.29 | 20.6 | ICML 2023 |
| Dehazeformer [20] | 38.46 / 0.9940 | 34.29 / 0.9830 | - | 19.11 / 0.66 | 4.634 | 48.64 | TIP 2023 |
| MB-TaylorFormer-B [24] | 40.71 / 0.9920 | 37.42 / 0.9890 | 16.66 / 0.56 | - | 2.68 | 38.5 | ICCV 2023 |
| DEA-Net-CR [17] | 41.31 / 0.9945 | 36.59 / 0.9897 | - | - | 3.653 | 32.23 | TIP 2024 |
| Restormer [13] | 38.88 / 0.9910 | - | 15.78 / 0.55 | - | 25.31 | 87.7 | CVPR 2022 |
| FocalNet [29] | 40.82 / 0.9960 | 37.71 / 0.9950 | 17.07 / 0.63 | 20.43 / 0.79 | 3.74 | 30.63 | ICCV 2023 |
| OKM [18] | 40.79 / 0.9960 | 37.68 / 0.9950 | 16.92 / 0.64 | 20.48 / 0.80 | 4.72 | 39.67 | AAAI 2024 |
| Ours | 41.70 / 0.9966 | 38.57 / 0.9953 | 17.13 / 0.64 | 20.47 / 0.79 | 6.28 | 37.15 | - |
Table 2. Performance on the nighttime dehazing dataset NHR [46].
| Method | GS [53] | MRPF [54] | MRP [54] | OSFD [46] | HCD [55] | FSNet-S [56] | Restormer [13] | FocalNet [29] | Ours |
|---|---|---|---|---|---|---|---|---|---|
| PSNR | 17.32 | 16.95 | 19.93 | 21.32 | 23.43 | 24.35 | 25.01 | 25.35 | 25.95 |
| SSIM | 0.629 | 0.667 | 0.777 | 0.804 | 0.953 | 0.965 | 0.967 | 0.969 | 0.973 |
Table 3. Comparisons of image defocus deblurring on the DPDD dataset [5].
| Method | Indoor Scenes (PSNR / SSIM / MAE / LPIPS) | Outdoor Scenes (PSNR / SSIM / MAE / LPIPS) | Combined (PSNR / SSIM / MAE / LPIPS) | Params (M) |
|---|---|---|---|---|
| EBDB [57] | 25.77 / 0.772 / 0.040 / 0.297 | 21.25 / 0.599 / 0.058 / 0.373 | 23.45 / 0.683 / 0.049 / 0.336 | - |
| DMENet [58] | 25.50 / 0.788 / 0.038 / 0.298 | 21.43 / 0.644 / 0.063 / 0.397 | 23.41 / 0.714 / 0.051 / 0.349 | - |
| JNB [59] | 26.73 / 0.828 / 0.031 / 0.273 | 21.10 / 0.608 / 0.064 / 0.355 | 23.84 / 0.715 / 0.048 / 0.315 | - |
| DPDNet [5] | 26.54 / 0.816 / 0.031 / 0.239 | 22.25 / 0.682 / 0.056 / 0.313 | 24.34 / 0.747 / 0.044 / 0.277 | 31.03 |
| KPAC [60] | 27.97 / 0.852 / 0.026 / 0.182 | 22.62 / 0.701 / 0.053 / 0.269 | 25.22 / 0.774 / 0.040 / 0.227 | 2.06 |
| DeepRFT [33] | - | - | 25.71 / 0.801 / 0.039 / 0.218 | 9.60 |
| IFAN [61] | 28.11 / 0.861 / 0.026 / 0.179 | 22.76 / 0.720 / 0.052 / 0.254 | 25.37 / 0.789 / 0.039 / 0.217 | 10.48 |
| MDP [62] | 28.02 / 0.841 / 0.027 / - | 22.82 / 0.690 / 0.052 / - | 25.35 / 0.763 / 0.040 / 0.303 | 46.86 |
| DRBNet [31] | - | - | 25.73 / 0.791 / - / 0.183 | 11.69 |
| Restormer [13] | 28.87 / 0.882 / 0.025 / 0.145 | 23.24 / 0.743 / 0.050 / 0.209 | 25.98 / 0.811 / 0.038 / 0.178 | 26.16 |
| FocalNet [29] | 29.10 / 0.876 / 0.024 / 0.173 | 23.41 / 0.743 / 0.049 / 0.246 | 26.18 / 0.808 / 0.037 / 0.210 | 12.82 |
| Ours | 28.89 / 0.878 / 0.024 / 0.149 | 23.25 / 0.741 / 0.049 / 0.212 | 25.99 / 0.808 / 0.037 / 0.182 | 15.58 |
Table 4. Results on three widely used desnowing datasets: CSD [48], SRRS, and Snow100K.
| Method | CSD (PSNR / SSIM) | SRRS (PSNR / SSIM) | Snow100K (PSNR / SSIM) |
|---|---|---|---|
| DesnowNet [2] | 20.13 / 0.81 | 20.38 / 0.84 | 30.50 / 0.94 |
| CycleGAN [64] | 20.98 / 0.80 | 20.21 / 0.74 | 26.81 / 0.89 |
| All in One [65] | 26.31 / 0.87 | 24.98 / 0.88 | 26.07 / 0.88 |
| JSTASR [47] | 27.96 / 0.88 | 25.82 / 0.89 | 23.12 / 0.86 |
| HDCW-Net [48] | 29.06 / 0.91 | 27.78 / 0.92 | 31.54 / 0.95 |
| TransWeather [36] | 31.76 / 0.93 | 28.29 / 0.92 | 31.82 / 0.93 |
| NAFNet [30] | 33.13 / 0.96 | 29.72 / 0.94 | 32.41 / 0.95 |
| Restormer [13] | 37.07 / 0.99 | 31.12 / 0.97 | 33.51 / 0.95 |
| FocalNet [29] | 37.18 / 0.99 | 31.34 / 0.98 | 33.53 / 0.95 |
| IRNeXt [63] | 37.29 / 0.99 | 31.91 / 0.98 | 33.61 / 0.95 |
| Ours | 37.68 / 0.99 | 32.25 / 0.98 | 33.80 / 0.96 |
Table 5. Ablation study of individual components of GMUD-Net on the SOTS-indoor dataset. FA denotes feature addition for fusing the two branches.
| Method | FA | FGAM | GLBB | GAG | PSNR | SSIM | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|---|---|---|
| M1 |  |  |  |  | 31.98 | 0.9852 | 2.01 | 17.47 |
| M2 |  |  |  |  | 33.26 | 0.9883 | 2.09 | 17.50 |
| M3 |  |  |  |  | 38.95 | 0.9945 | 3.92 | 21.22 |
| M4 |  |  |  |  | 39.20 | 0.9947 | 4.00 | 21.49 |
| M5 |  |  |  |  | 39.31 | 0.9947 | 4.01 | 21.52 |
Table 6. Ablation study of the two branches in GMUD-Net on the SOTS-indoor dataset. N denotes the number of regular residual blocks in each encoder/decoder block of the CNN branch, and B denotes the number of transformer blocks at each stage of the transformer branch.
| Method | N | B | PSNR | SSIM | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|---|
| F1 | 1 | [0, 0, 0, 0, 0] | 39.06 | 0.9946 | 3.47 | 19.47 |
| F2 | 1 | [1, 1, 1, 1, 1] | 39.31 | 0.9947 | 4.01 | 21.52 |
| F3 | 1 | [1, 2, 4, 2, 1] | 39.41 | 0.9947 | 4.95 | 23.85 |
| F4 | 1 | [8, 8, 8, 8, 8] | 39.44 | 0.9947 | 6.41 | 29.47 |