Symmetry
  • Article
  • Open Access

28 January 2026

All-in-One Weather Image Restoration with Asymmetric Mixture-of-Experts Model and Multi-Task Collaboration

School of Photonics Science and Engineering, South China Normal University, Guangzhou 510631, China
* Author to whom correspondence should be addressed.

Abstract

Recent studies have achieved significant advances in all-in-one adverse weather image restoration, primarily driven by the development of sophisticated model architectures. In this work, we find that effectively coordinating the complex interactions and potential optimization conflicts among different restoration tasks is also a critical factor determining the overall performance of all-in-one adverse weather image restoration models. To this end, we propose an effective all-in-one adverse weather image restoration framework, named MOE-WIRNet, designed to harmonize the learning process across various degradation types and ensure well-balanced performance among different restoration tasks. To enhance training equilibrium, we integrate a multi-task collaboration optimization strategy into the framework, coordinating the convergence dynamics of distinct restoration objectives. Furthermore, we incorporate an asymmetric mixture-of-experts (MoE) architecture into the framework to effectively address the distinct degradation patterns and varying severity levels presented by different tasks. Extensive experiments demonstrate that our framework consistently outperforms current state-of-the-art models on multiple real-world adverse weather benchmark datasets.

1. Introduction

Visual perception systems are critically important under adverse weather conditions such as rain, snow, and haze, with their performance significantly impacting practical applications including autonomous driving [1,2], intelligent transportation [3,4], and outdoor surveillance [5,6]. Such weather phenomena introduce challenges like occlusion, scattering, and reduced contrast, which severely degrade image quality and consequently impair the accuracy of downstream vision tasks such as object detection and tracking [7,8]. Therefore, the development of robust all-in-one adverse weather image restoration techniques capable of effectively eliminating multiple degradation factors is essential for enhancing visual clarity and ensuring the reliability of high-level perception systems.
While deep learning-based specialized models have exhibited outstanding performance in individual tasks such as desnowing [9], deraining [10,11,12], and dehazing [13,14,15], they confront a fundamental limitation in real-world applications. A model demonstrating superior performance on one specific task may entirely fail when applied to other tasks, revealing its limited robustness in cross-task scenarios. To advance this field, general image restoration approaches [16,17] have developed foundational frameworks capable of addressing multiple subtasks. While these methods show enhanced generalization compared to task-specific restoration methods [18,19,20], they still necessitate separate training processes for distinct restoration tasks, resulting in substantial resource consumption.
Recently, numerous all-in-one adverse weather image restoration frameworks [21,22,23,24] have been considered potential solutions for foundational models, as they can simultaneously handle multiple restoration tasks within a single model. However, these methods commonly adopt a mixed training paradigm that directly combines the optimization objectives of multiple tasks. We observe that due to the complex interdependencies and potential conflicts among different degradation types, the simultaneous convergence of all tasks is hindered during the joint optimization process, resulting in significant performance bottlenecks [25]. By introducing a training scheme with sequential and prompt learning strategies, MioIR [26] has enhanced all-in-one image restoration performance. Unfortunately, this method fails to resolve inter-task conflicts, which remains a critical limitation in achieving robust multi-degradation restoration. Therefore, it is urgent to develop a multi-objective optimization strategy for unified models to facilitate all-in-one adverse weather image restoration.
In practice, we note that most existing all-in-one models rely on homogeneous network designs to handle diverse restoration tasks [27,28]. However, if a model employs a uniform and rigid architecture, it inherently lacks the flexibility to address the distinct and heterogeneous reconstruction requirements across different restoration tasks [29]. For example, dehazing [30] requires broader contextual understanding, while deraining [31] demands localized processing with strong spatial awareness, necessitating adaptive processing aligned to the task requirements. Thus, this motivates us to develop an adaptive task-capacity matching framework, improving the joint optimization of various restoration capabilities under diverse and composite adverse weather conditions.
To address these issues, we propose MOE-WIRNet, an effective all-in-one adverse weather image restoration framework that integrates a multi-objective optimization strategy and leverages an asymmetric mixture-of-experts model. Different from existing all-in-one restoration frameworks that rely on mixed multi-task training, as well as prior MoE-based architectures and adaptive loss reweighting strategies, our method explicitly models task-wise optimization behaviors and their interactions during joint training. To achieve better training balance, we incorporate a multi-task collaborative training strategy into the framework to harmonize the optimization of multiple restoration tasks. In particular, we introduce two convergence-aware indicators: a relative convergence index (RCI), which characterizes short-term convergence inconsistency among tasks within individual iterations, and an absolute convergence index (ACI), which monitors long-term performance deviation and dynamically adjusts optimization targets when a task underperforms. By jointly considering these complementary convergence signals, our optimization strategy goes beyond static heuristics or gradient-based reweighting, enabling stable and fine-grained coordination across heterogeneous degradation tasks. Furthermore, we develop an asymmetric mixture-of-experts (MoE) architecture tailored to accommodate the heterogeneous reconstruction requirements associated with different image restoration tasks. The encoder employs a soft MoE module that utilizes a global weighting strategy to mitigate forward information attenuation, while the decoder adopts a hard MoE design, leveraging Top-K routing to achieve high-fidelity texture reconstruction. Experimental results demonstrate that our approach performs favorably against state-of-the-art methods on the benchmark datasets.
The main contributions are summarized as follows:
  • We propose MOE-WIRNet, a unified framework for adverse weather image restoration, addressing inconsistency and conflicts among multiple restoration tasks.
  • We integrate a multi-task collaborative training strategy that dynamically balances learning across heterogeneous degradation types, ensuring stable and harmonized optimization.
  • We design an asymmetric mixture-of-experts architecture to address the distinct and heterogeneous reconstruction requirements across different restoration tasks.
  • Experimental results on both synthetic and real-world benchmarks demonstrate that our approach performs favorably against state-of-the-art methods.

2. Related Work

2.1. Single-Weather Image Restoration

Single-weather image restoration refers to a class of image processing and computer vision techniques aimed at removing visual degradations caused by a single type of adverse weather condition. In real-world outdoor environments, imaging systems often suffer from weather-induced distortions such as haze, rain streaks, or snow particles, which significantly reduce scene visibility, contrast, and structural fidelity. Single-weather restoration models focus on learning the specific degradation characteristics of one weather phenomenon and generating a clean, high-quality image that approximates the underlying scene radiance. Single-image weather restoration primarily encompasses tasks such as image dehazing, deraining, and desnowing.
Despite their success, single-weather restoration methods are typically designed under task-specific assumptions and optimized independently for each degradation type. As a result, these models often lack generalization across different weather conditions and cannot effectively handle scenarios where multiple or unknown degradations coexist. This task-isolated design also limits their scalability and applicability in real-world settings, motivating the exploration of more unified and adaptive restoration frameworks.
Image dehazing seeks to remove haze or fog that arises from the scattering and absorption of light by atmospheric particles. Haze reduces global contrast and introduces depth-dependent attenuation, making distant objects appear washed out. Dehazing algorithms estimate the transmission map and atmospheric light or learn the haze formation process through deep networks to recover clear scene information. Early dehazing methods heavily rely on physical priors or simplified atmospheric models, which often break down in complex real-world scenes with non-uniform haze or unknown depth distributions. Zhang et al. [13] proposed a depth-assisted dual-task collaborative mutual promotion framework that jointly optimizes image dehazing and depth estimation, addressing the limitation that many dehazing methods overlook the intrinsic correlation between scene depth and haze distribution. Feng et al. [14] introduced a real-world image dehazing framework incorporating a multi-factor haze degradation model, a localization and removal network, and a Gaussian perceptual contrastive loss, effectively alleviating the domain gap and poor generalization issues commonly observed in prior dehazing approaches. Nevertheless, these methods remain specialized for haze removal and are not readily extensible to other weather degradations.
Image deraining aims to eliminate rain streaks, raindrops, or rain accumulation that degrade outdoor images. Rain causes structured, high-frequency streaks and spatially varying occlusions. Deraining methods focus on separating rain components from background content by leveraging priors, motion cues, or data-driven features. Many existing deraining models assume a specific rain scale or pattern, which limits their robustness when facing diverse rain types or mixed degradations. Chen et al. [11] proposed a bidirectional multi-scale Transformer enhanced with implicit neural representations to jointly capture scale-specific and shared rain degradation features, addressing the limitation of single-scale Transformer models. Song et al. [10] proposed ESDNet, an efficient spiking neural network with spiking residual blocks and mixed attention to mitigate information loss and training instability in SNN-based deraining. Wu et al. [32] proposed MSPformer, which leverages conditional soft prompts and multi-scale feature fusion to dynamically adapt to varying rain patterns. Although effective, these approaches are still tailored to rain-specific artifacts and lack a unified mechanism to adapt across different weather conditions.
Image desnowing deals with removing snow particles or snow accumulation that occlude the visual scene. Compared to rain, snow introduces irregular shapes, varying opacity, and large opaque regions, making restoration more challenging. The high diversity and randomness of snow distributions further exacerbate the data scarcity and generalization issues in desnowing models. Lai et al. [9] proposed the RealSnow10K dataset along with an expert-ranked preference learning strategy and the SnowMaster framework, which improves evaluation reliability and pseudo-label filtering for real-world snowy images. However, similar to dehazing and deraining methods, existing desnowing models are developed in isolation and cannot generalize beyond the snow-specific setting.
Overall, while single-weather restoration methods achieve impressive performance within their respective domains, their task-specific design and limited adaptability hinder their effectiveness in open-world scenarios involving diverse or mixed weather degradations. These limitations motivate the development of more unified and flexible restoration frameworks capable of generalizing across weather conditions, which is the focus of our proposed approach.

2.2. All-in-One Weather Image Restoration

All-in-one weather image restoration refers to a unified restoration framework designed to handle multiple types of weather-induced degradations within a single model, such as haze, rain, snow, and low-light conditions. Instead of training separate networks for each weather scenario, all-in-one approaches aim to learn a generalized degradation representation that can adaptively identify diverse weather patterns and recover clean images accordingly. This unified paradigm not only reduces model redundancy and improves computational efficiency but also enhances robustness and generalization in real-world environments where different weather conditions often coexist or appear unpredictably.
Li et al. [33] proposed AirNet, an end-to-end unified framework for multi-degradation image restoration that does not require prior knowledge of the type or intensity of degradation. AirNet addresses the limitations of existing image restoration methods, where models designed for a single degradation can only handle specific types of degradation, and multi-degradation methods depend on degradation priors or complex multi-branch structures. It is capable of handling unknown, mixed, and spatially varying degradations in real-world scenarios. Valanarasu et al. [21] proposed TransWeather, a transformer-based model with a single encoder and a single decoder. The model introduces novel intra-patch transformer blocks and learnable weather type queries in the decoder, enabling unified restoration of images degraded by various adverse weather conditions. TransWeather overcomes key limitations of existing methods, such as the high computational cost of multi-encoder architectures, the limited generalization of models fine-tuned for a single task, and the inefficiency of single models in handling multiple weather degradations. At the same time, it achieves superior quantitative and qualitative performance while offering faster inference speeds. Patil et al. [22] proposed a unified framework for multi-weather image restoration based on domain translation. This framework addresses the limitations of existing multi-weather restoration methods, including the reliance on weather-specific encoders or separate training, limited ability to handle complex real-world weather, and high computational cost, while improving domain generalization and enhancing performance in downstream tasks. Wen et al. [23] proposed MPMF-Net for all-in-one weather-degraded image restoration, which addresses the issues of poor image quality, long inference time, single-dimensional prompt learning, weak adaptive feature fusion, and difficult degradation normalization in existing all-in-one weather image restoration methods.

2.3. Multi-Task Learning

Multi-task learning (MTL) is a machine learning paradigm in which a single model is trained to simultaneously learn multiple related tasks by sharing representations across them. Instead of training separate models for each task, MTL leverages the intrinsic correlations among tasks to promote mutual reinforcement, enabling the model to learn more robust and generalized features. By sharing knowledge across tasks, MTL improves learning efficiency, reduces overfitting, and often yields superior performance compared to single-task approaches, especially when tasks exhibit strong semantic or structural relationships [34,35,36,37,38].
Despite these advantages, naïvely applying multi-task learning often leads to negative transfer, task imbalance, or optimization conflicts, especially when task difficulties, data distributions, or convergence behaviors differ significantly. These challenges limit the effectiveness of conventional MTL frameworks when directly extended to complex real-world restoration scenarios. Chen et al. [39] systematically studied optimization techniques for MTL, including loss construction, gradient regularization, data sampling, and task scheduling, and summarized their applications across auxiliary MTL, joint MTL, and multilingual/multimodal NLP tasks. Fifty et al. [40] proposed task affinity grouping (TAG), which quantifies inter-task gradient interactions to determine optimal task groupings, addressing performance degradation caused by indiscriminately training all tasks together and the high cost of exhaustive task grouping search. These works highlight that effective MTL requires careful modeling of task relationships and controlled knowledge sharing, rather than uniform parameter sharing across all tasks. Wu et al. [24] addressed the challenge of restoring UAV remote sensing images under diverse adverse weather conditions by proposing a multiple-in-one image restoration strategy. They designed a scale-aware Trident Mamba architecture with multi-resolution and multi-patch modeling to enable unified feature extraction across different weather degradations.
However, existing multi-task or all-in-one restoration models often rely on shared backbones with limited task-specific adaptation, which may struggle to accommodate heterogeneous degradation characteristics and dynamic task interactions. In this work, we leverage multi-task learning not merely as a parameter-sharing strategy, but as a collaborative learning mechanism. By integrating a mixture-of-experts feature extraction module with a multi-task collaborative training strategy, our model dynamically routes task-relevant features and mitigates task interference. This design enables more flexible knowledge sharing across weather restoration tasks while preserving task-specific discrimination, effectively overcoming the limitations of prior MTL-based restoration approaches.

3. Proposed Method

3.1. Overall Network Architecture

We present a unified image restoration framework tailored for adverse weather conditions, named MOE-WIRNet, as shown in Figure 1. The proposed network adopts a symmetric encoder-decoder framework characterized by an asymmetric routing mechanism across its feature extraction and reconstruction stages. The asymmetric MoE routing strategy is intentionally aligned with the optimization characteristics of different network stages. The encoder prioritizes adaptive representation learning through soft routing to accommodate diverse degradation patterns, while the decoder enforces stable expert selection via hard routing to ensure consistent high-frequency reconstruction across tasks. Specifically, both the encoder and decoder incorporate a comprehensive mixture-of-experts (MoE) feature extraction module at each level. This module concurrently integrates three specialized self-attention experts, spanning the channel, spatial, and frequency domains, to capture diverse degradation characteristics. In the encoder stage, a soft routing mechanism is employed to perform dynamic weighted fusion of these multi-dimensional features, thereby adaptively synthesizing task-specific representations $F_{enc}^{l}$. Conversely, the decoder utilizes a hard routing mechanism, which assigns uniform weights to the three expert branches to ensure stable and balanced feature restoration within a unified parameter space. From the perspective of symmetry, the proposed encoder–decoder framework adopts a structurally symmetric design, where all degradation types are projected into a shared latent space through a unified encoder and reconstructed via decoder branches of consistent topology. This architectural symmetry enforces a common representational basis, facilitating feature alignment and knowledge sharing across tasks. However, functional asymmetry is deliberately introduced at the decoding stage, where task-specific decoding pathways adapt the shared latent representations to heterogeneous degradation characteristics. Such asymmetric decoding enables the model to preserve degradation-specific restoration behaviors while maintaining global structural consistency, thereby balancing universality and specificity.
Figure 1. The overall structure of the MOE-WIRNet, designed for all-in-one weather image restoration.
In the operational workflow, a degraded image $I \in \mathbb{R}^{H \times W \times 3}$ initially undergoes overlap patch embedding $H_{embed}(\cdot)$ via $3 \times 3$ convolutions to generate high-dimensional shallow features $F_0 = H_{embed}(I)$. The encoder subsequently processes these features through a four-tier hierarchy, where downsampling operations progressively increase channel depth while expanding the semantic receptive field. This architectural design enables the encoder to decouple complex degradation components through scale-aware feature synthesis under soft routing guidance, while the decoder accurately reconstructs the clean image by integrating features through a consistent hard routing integration.
After capturing highly abstract context in the bottleneck layer, the decoder reinstates spatial resolution through symmetrical upsampling. Skip connections [11] facilitate the fusion of multi-scale encoder details $F_{enc}^{l}$ with decoder features to replenish lost high-frequency information. This process is mathematically formulated as:
$$F_{dec}^{l} = H_{dec}^{l}\big(\big[H_{up}(F_{dec}^{l+1}),\, F_{enc}^{l}\big]\big),$$
where $[\cdot, \cdot]$ signifies the channel-wise concatenation, and $H_{up}$ and $H_{dec}^{l}$ represent the upsampling and decoder layer transformations, respectively. The final output is derived through global residual learning: the residual map is generated via the refinement module $H_{refine}(\cdot)$ and added to the original input:
$$\hat{I} = I + H_{refine}(F_{dec}^{l}).$$
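To make the data flow above concrete, the following is a minimal PyTorch sketch of the encoder–decoder skeleton with overlap patch embedding, a skip connection, and global residual learning. The module name (TinyWIRNet), the channel widths, and the use of plain convolutions in place of the per-level MoE blocks are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the encoder-decoder workflow: overlap patch embedding,
# downsampling, skip connection via concatenation, and global residual output.
import torch
import torch.nn as nn

class TinyWIRNet(nn.Module):
    def __init__(self, dim=48):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, 3, padding=1)            # H_embed: 3x3 overlap patch embedding
        self.enc1 = nn.Conv2d(dim, dim, 3, padding=1)           # stand-in for a level-1 MoE encoder block
        self.down = nn.Conv2d(dim, dim * 2, 3, stride=2, padding=1)
        self.enc2 = nn.Conv2d(dim * 2, dim * 2, 3, padding=1)   # stand-in for a level-2 MoE encoder block
        self.up = nn.ConvTranspose2d(dim * 2, dim, 2, stride=2)  # H_up: upsampling transformation
        self.dec1 = nn.Conv2d(dim * 2, dim, 3, padding=1)        # H_dec: fuses upsampled + skip features
        self.refine = nn.Conv2d(dim, 3, 3, padding=1)             # H_refine: predicts the residual map

    def forward(self, x):
        f0 = self.embed(x)
        e1 = self.enc1(f0)
        e2 = self.enc2(self.down(e1))
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))      # skip connection [H_up(F_dec), F_enc]
        return x + self.refine(d1)                                 # global residual: I_hat = I + H_refine(F_dec)

img = torch.randn(1, 3, 64, 64)
print(TinyWIRNet()(img).shape)  # torch.Size([1, 3, 64, 64])
```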
To harmonize learning across diverse weather tasks within a unified parameter space, we introduce an active reweighting-based multi-objective optimization strategy during training. By framing multi-weather restoration as a multi-task learning challenge, this approach equilibrates optimization rates across various degradation tasks, effectively resolving gradient interference and convergence disparities inherent in mixed training. This synergy of a robust perception architecture and dynamic optimization enables high-fidelity restoration for complex weather scenarios, including rain, haze, and snow.

3.2. MoE Feature Extraction Module

The asymmetric routing strategy in MOE-WIRNet is grounded in the distinct functional requirements of the encoder and decoder. Specifically, the encoder focuses on adaptive feature extraction from diverse degradation scenarios, allowing the soft routing mechanism to dynamically weight multiple experts based on the input feature characteristics. In contrast, the decoder prioritizes stable and high-quality reconstruction, where hard routing ensures balanced integration of expert features and prevents oscillations or artifacts. This alignment of routing strategy with stage-specific functional needs motivates the encoder-decoder asymmetry in our design.
To efficiently mitigate adverse weather degradations with distinct physical characteristics, such as rain streaks, haze, and snow, within a unified framework, the proposed architecture integrates an MoE feature extraction module in the encoder stages. The core philosophy of this module involves distributing the feature stream among multiple parallel expert sub-networks to address diverse degradation representations, subsequently utilizing a dynamic soft routing mechanism for adaptive feature fusion. At each hierarchical level of the encoder, the module accepts input features $X \in \mathbb{R}^{C \times H \times W}$ from the preceding layer. To facilitate high-efficiency multi-dimensional feature extraction, the module first applies a $1 \times 1$ convolutional layer for linear mapping and subsequently partitions the features along the channel dimension into three distinct sub-feature spaces. These spaces correspond to the channel, spatial, and frequency self-attention experts:
$$[X_c, X_s, X_f] = \mathrm{Split}\big(\mathrm{Conv}_{1 \times 1}(\mathrm{Linear}(X))\big),$$
where $X_c, X_s, X_f \in \mathbb{R}^{\frac{C}{3} \times H \times W}$ denote the respective input features for each specialized expert branch. $\mathrm{Linear}(\cdot)$ denotes a linear transformation that projects the input features into a more discriminative embedding space, which helps the soft router make more effective routing decisions and improves feature selection. This design allows the encoder to capture complex degradation cues across the channel, spatial, and frequency domains simultaneously, ensuring a robust perception of heterogeneous weather distortions.
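As a minimal illustration of the split in the equation above, the sketch below applies a pointwise linear mapping, a 1 × 1 convolution, and a channel-wise split into three C/3 streams; the class name ExpertSplit and the realization of the linear mapping as a 1 × 1 convolution are assumptions.

```python
# Minimal sketch of the expert-input preparation: Linear -> Conv_1x1 -> Split.
import torch
import torch.nn as nn

class ExpertSplit(nn.Module):
    def __init__(self, channels=48):
        super().__init__()
        assert channels % 3 == 0
        self.linear = nn.Conv2d(channels, channels, 1)  # pointwise stand-in for Linear(.)
        self.proj = nn.Conv2d(channels, channels, 1)    # Conv_1x1 before the split

    def forward(self, x):
        x = self.proj(self.linear(x))
        x_c, x_s, x_f = torch.chunk(x, 3, dim=1)        # channel / spatial / frequency expert inputs
        return x_c, x_s, x_f

xc, xs, xf = ExpertSplit(48)(torch.randn(2, 48, 32, 32))
print(xc.shape, xs.shape, xf.shape)  # each: torch.Size([2, 16, 32, 32])
```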
Channel Self-attention Expert. The channel self-attention expert is specifically engineered to mitigate global image distortions, such as the contrast reduction characteristic of haze induced by dust storms, where degradation exhibits high spatial correlation. This expert utilizes MDTA [27] to extract global contrast information and model inter-channel dependencies while deliberately bypassing redundant spatial-dimension interactions. By prioritizing cross-covariance modeling, the expert effectively reconstructs scenes affected by global clarity loss, rectifying anomalous and skewed luminance distributions.
The process begins by projecting the input feature $X_c \in \mathbb{R}^{\frac{C}{3} \times H \times W}$ into query (Q), key (K), and value (V) representations. To compute attention without sacrificing local spatial integrity, the module integrates $1 \times 1$ point-wise convolution with $3 \times 3$ depth-wise convolution:
$$Q = W_d^Q W_p^Q X, \quad K = W_d^K W_p^K X, \quad V = W_d^V W_p^V X,$$
where $W_p^{(\cdot)}$ aggregates information across channels via $1 \times 1$ point-wise convolution, while $W_d^{(\cdot)}$ encodes localized spatial constraints during the projection stage via $3 \times 3$ depth-wise convolution. The transposed attention strategy shifts the computational bottleneck from spatial pixels to feature channels. The generated $Q$, $K$, and $V$ tensors are reshaped from $\mathbb{R}^{\frac{C}{3} \times H \times W}$ to $\mathbb{R}^{\frac{C}{3} \times N}$, where $N = H \times W$. To enhance training stability and constrain feature magnitudes, $L_2$ normalization is applied to $Q$ and $K$ prior to computing their inner product. The resulting $\frac{C}{3} \times \frac{C}{3}$ channel-wise attention map $A_{channel}$ is derived as:
$$A_{channel} = \mathrm{Softmax}\!\left(\frac{\hat{Q} \cdot \hat{K}^{T}}{\tau}\right),$$
where $\hat{Q}$ and $\hat{K}$ represent the normalized matrices, and $\tau$ is a learnable temperature coefficient that adaptively smooths the attention distribution. This map effectively captures global feature correlations, such as the unique wavelength distribution of haze particles. Finally, the attention map is applied to the value tensor $V$, followed by a linear transformation and a residual connection to produce the restored features:
$$\hat{F}_c = W_o(A_{channel} \cdot V),$$
where $W_o$ signifies the output projection layer. This mechanism identifies and rectifies global weather-related degradations at a minimal computational cost.
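A minimal PyTorch sketch of such a channel self-attention expert, in the spirit of MDTA [27], is given below: depth-wise-augmented Q/K/V projections, L2-normalized query and key, a learnable temperature, and a C × C transposed attention map. The layer names and the single-head formulation are simplifying assumptions.

```python
# Minimal sketch of channel (transposed) self-attention: attention over channels, not pixels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelExpert(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Conv2d(dim, dim * 3, 1)                                   # W_p: 1x1 point-wise projections
        self.qkv_dw = nn.Conv2d(dim * 3, dim * 3, 3, padding=1, groups=dim * 3)  # W_d: 3x3 depth-wise
        self.temperature = nn.Parameter(torch.ones(1))                           # learnable tau
        self.out = nn.Conv2d(dim, dim, 1)                                        # W_o output projection

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv_dw(self.qkv(x)).chunk(3, dim=1)
        q = F.normalize(q.flatten(2), dim=-1)                                    # (B, C, HW), L2-normalized
        k = F.normalize(k.flatten(2), dim=-1)
        v = v.flatten(2)
        attn = torch.softmax((q @ k.transpose(-2, -1)) / self.temperature, dim=-1)  # (B, C, C) attention map
        out = (attn @ v).view(b, c, h, w)
        return self.out(out)

print(ChannelExpert(16)(torch.randn(1, 16, 32, 32)).shape)  # torch.Size([1, 16, 32, 32])
```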
Spatial Self-attention Expert. Unlike the channel self-attention expert, which emphasizes cross-channel global modeling, the spatial self-attention expert focuses on extracting pixel-level spatial dependencies to address degradations with pronounced spatial non-uniformity, such as rain streaks and snowflakes. While both experts employ a similar parallel branch structure during the feature projection phase, the spatial self-attention expert exhibits a distinct interaction logic centered on a sparsification pipeline designed to enhance restoration specificity. Upon generating the Query (Q) and Key (K) tensors, this expert does not compute a cross-covariance matrix as its channel-based counterpart does. Instead, it establishes $HW \times HW$ associations directly in the pixel dimension and immediately applies a mask operation to filter the interaction matrix. This step suppresses interference from clear background regions on the attention distribution. To further minimize computational redundancy and precisely isolate significant degradation areas, the expert utilizes a unique Top-k selection strategy, formulated as:
$$A_{spatial} = \mathrm{Softmax}\big(\mathrm{Scatter}\big(\mathrm{Top}\text{-}k\big(\mathrm{mask}(QK^{T})\big)\big)\big),$$
where $QK^{T}$ represents the raw correlation scores between any two spatial coordinates. The mask operation prunes physically irrelevant long-range or low-quality interactions. The critical Top-k operator then implements the spatial sparsification logic: for each query pixel, the model retains only the $k$ most correlated points. This process is vital for tasks like snow or heavy rain removal, as it forces the model to concentrate its attention on the most salient degradation features, such as snowflake edges or rain streak trajectories, while disregarding redundant clear background information. The filtered scores are subsequently remapped to the full matrix dimensions via a scatter operation and normalized using a softmax function to produce the sparse spatial attention map $A_{spatial}$. Following the feature aggregation stage, the spatial self-attention expert introduces a distinct architectural refinement: an additional $3 \times 3$ depth-wise convolution (DConv) layer at the output. This process is defined as:
$$\hat{F}_s = \mathrm{DConv}_{3 \times 3}\big(\mathrm{Reshape}(A_{spatial} V)\big),$$
where $A_{spatial} V$ performs feature aggregation based on sparse weights, and the reshape operation restores the aggregated sequence into a four-dimensional spatial tensor. The final $3 \times 3$ DConv acts as a local structure reinforcer. Since the self-attention mechanism primarily captures long-range dependencies, and the Top-k selection may introduce spatial discretization or block artifacts, this convolutional layer provides secondary filtering over local neighborhoods. This significantly enhances the continuity between restored regions and surrounding pixels. This supplementary spatial smoothing ensures that restored edge details possess higher coherence in visual perception, resulting in superior fidelity when processing weather tasks with complex spatial structures.
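The sketch below illustrates the sparse spatial attention described above: pixel-level correlation scores, per-query Top-k selection with a scatter back to the full matrix, softmax normalization, and a final 3 × 3 depth-wise convolution. The masking rule is simplified to the Top-k step itself, and the keep ratio is an assumption.

```python
# Minimal sketch of sparse (Top-k) spatial self-attention with a depth-wise refinement.
import torch
import torch.nn as nn

class SpatialExpert(nn.Module):
    def __init__(self, dim, keep_ratio=0.7):
        super().__init__()
        self.qkv = nn.Conv2d(dim, dim * 3, 1)
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # local structure reinforcement
        self.keep_ratio = keep_ratio

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)
        q = q.flatten(2).transpose(1, 2)                              # (B, HW, C)
        k = k.flatten(2).transpose(1, 2)
        v = v.flatten(2).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / c ** 0.5                    # (B, HW, HW) pixel correlations
        kth = max(1, int(self.keep_ratio * scores.shape[-1]))
        topk_val, topk_idx = scores.topk(kth, dim=-1)                  # keep the k strongest responses per query
        sparse = torch.full_like(scores, float('-inf'))
        sparse.scatter_(-1, topk_idx, topk_val)                        # scatter back to the full matrix
        attn = torch.softmax(sparse, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)           # reshape to a spatial tensor
        return self.dwconv(out)

print(SpatialExpert(16)(torch.randn(1, 16, 16, 16)).shape)  # torch.Size([1, 16, 16, 16])
```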
Frequency Self-attention Expert. The frequency self-attention expert aims to achieve precise separation of degradation components from a signal processing perspective. This is necessitated by the fact that weather-related degradations, such as rain streaks or structural noise, frequently overlap with image textures in the spatial domain but exhibit distinct energy distributions in the frequency domain. Fundamentally different from the channel or spatial experts, this module relies on non-linear mappings across the spatial and frequency domains combined with global feature interactions. Following the parallel projection to generate the Query (Q), Key (K), and Value (V) tensors, the frequency expert initiates a feature transformation process. To efficiently capture local frequency patterns and reduce computational complexity, the input feature maps are partitioned into patches before applying the Fourier transform. Patchification enables the module to focus on localized spectral components, improving the discrimination between degradation signals (e.g., rain streaks) and natural image textures, while keeping the attention computation tractable. For the partitioned feature patches, a 2D fast Fourier transform (FFT) maps the spatial features into the complex frequency domain:
$$Q_{fft} = \mathcal{F}(Q), \quad K_{fft} = \mathcal{F}(K),$$
where $\mathcal{F}(\cdot)$ denotes the FFT operator. It decomposes local features into a representation where amplitude information reflects the energy distribution of the signal, and phase information encapsulates its structural characteristics. The core interaction of the frequency expert occurs within the feature plane, implementing a self-attention mechanism through the complex conjugate dot product of the frequency components:
$$F_{fft} = Q_{fft} \cdot \mathrm{conj}(K_{fft}),$$
where $\mathrm{conj}(\cdot)$ represents the complex conjugate operation. This operation performs a correlation retrieval between the query and key in the frequency domain, identifying feature components in $Q_{fft}$ that are similar to $K_{fft}$ and highlighting energy distributions associated with weather-induced degradations (e.g., rain streaks or structural noise). Since these degradations often exhibit distinct periodic patterns compared with natural image textures, the frequency-domain dot product selectively amplifies the degradation components, enabling their separation from authentic textures. The resulting weight matrix $F_{fft}$ is then mapped back to the spatial domain via an inverse fast Fourier transform (IFFT) operator, $\mathcal{F}^{-1}$, yielding a spatial weight map $F_{space} = \mathcal{F}^{-1}(F_{fft})$ that emphasizes degraded regions while integrating global frequency insights. The subsequent element-wise multiplication with $V$ ensures that only the degraded components are filtered or suppressed while the underlying image structures are preserved, allowing precise and physically meaningful restoration. In the final feature synthesis stage, the frequency expert generates the refined features $\hat{F}_f$ through the following mapping:
$$\hat{F}_f = \mathrm{DConv}_{3 \times 3}\big(\mathrm{Reshape}(F_{space} \otimes V)\big),$$
where ⊗ denotes the element-wise multiplication, which utilizes the learned feature saliency weights to re-calibrate the value tensor V. This enables the precise filtering of pixel components contaminated by weather degradations in the spatial domain. To mitigate periodic artifacts potentially introduced by feature truncation or the initial patchification process, the outermost layer incorporates a 3 × 3 depth-wise convolution. This convolutional layer functions as a feature smoothing filter, ensuring that the final output maintains frequency fidelity while exhibiting natural spatial coherence and continuity.
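A minimal sketch of the frequency expert follows: FFTs of Q and K, correlation via the complex-conjugate product, an inverse FFT back to a spatial weight map, element-wise re-weighting of V, and a 3 × 3 depth-wise smoothing convolution. Patch partitioning is omitted for brevity, which is a simplification relative to the description above.

```python
# Minimal sketch of frequency-domain self-attention: FFT correlation of Q and K,
# IFFT to a spatial weight map, element-wise re-calibration of V, and smoothing.
import torch
import torch.nn as nn
import torch.fft

class FrequencyExpert(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Conv2d(dim, dim * 3, 1)
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # suppresses periodic artifacts

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=1)
        q_fft = torch.fft.rfft2(q)                        # F(Q)
        k_fft = torch.fft.rfft2(k)                        # F(K)
        corr = q_fft * torch.conj(k_fft)                  # F_fft = Q_fft . conj(K_fft)
        weight = torch.fft.irfft2(corr, s=q.shape[-2:])   # F_space: back to the spatial domain
        out = weight * v                                   # F_space (x) V, element-wise re-calibration
        return self.dwconv(out)

print(FrequencyExpert(16)(torch.randn(1, 16, 32, 32)).shape)  # torch.Size([1, 16, 32, 32])
```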
Soft Routing Mechanism. The soft routing mechanism adaptively modulates the contributions of various experts by learning the distribution patterns of different degradation types within the feature space. The module defines a set of learnable gating parameters $G = \{g_c, g_s, g_f\}$, which represent the fundamental weight baselines for the channel, spatial, and frequency experts at the current hierarchical level. To establish a balanced relationship of both competition and collaboration among experts, the system employs a Softmax function during the inference phase to perform a non-linear mapping on these gating parameters, yielding normalized mixing coefficients $W = \{w_c, w_s, w_f\}$:
$$w_i = \frac{\exp(g_i)}{\sum_{j \in \{c, s, f\}} \exp(g_j)}.$$
Through exponential computation, the softmax operator amplifies the disparities between weights. This allows the model to significantly elevate the confidence of a specific expert, such as the spatial expert ($s$) when encountering periodic rain streaks, while simultaneously suppressing noise interference from irrelevant dimensions, thereby achieving precise task focusing.
Upon obtaining these adaptive weights, the output feature maps from each expert branch $\{\hat{F}_c, \hat{F}_s, \hat{F}_f\}$ undergo weighted scaling before entering the fusion layer via a channel-wise concatenation operation:
$$F_{fusion} = \mathrm{Concat}\big([w_c \hat{F}_c,\, w_s \hat{F}_s,\, w_f \hat{F}_f]\big),$$
where the $\mathrm{Concat}[\cdot]$ operator denotes concatenation along the channel dimension. Since each expert branch processes only one-third of the original channel count ($\frac{C}{3}$), the concatenation re-aggregates these heterogeneous features into a mixed tensor of dimension $C$. A subsequent $1 \times 1$ fusion convolution layer facilitates cross-expert information interaction, performing a non-linear integration of features derived from the spectral, local spatial, and global channel domains. To maintain training stability when the deep network processes extreme weather conditions, the module ultimately outputs the final features $X_{out}$ through Layer Normalization (LN) and a global residual connection:
$$X_{out} = \mathrm{LN}(X + F_{fusion}),$$
where the LN layer standardizes the feature distribution by its mean and variance, which effectively prevents potential gradient explosion or vanishing issues that might arise from drastic fluctuations in gating weights. This ensures the robustness of the multi-task collaborative optimization process.
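The following sketch summarizes the soft routing path: learnable gates, softmax-normalized mixing weights, weighted concatenation of the three expert outputs, a 1 × 1 fusion convolution, and a normalized residual connection. The expert modules are supplied from outside, and GroupNorm is used here as a channel-wise stand-in for Layer Normalization (an assumption).

```python
# Minimal sketch of the soft routing mechanism over three experts.
import torch
import torch.nn as nn

class SoftRouter(nn.Module):
    def __init__(self, channels, experts):
        super().__init__()
        self.experts = nn.ModuleList(experts)                  # [channel, spatial, frequency] experts
        self.gates = nn.Parameter(torch.zeros(len(experts)))   # g_c, g_s, g_f
        self.fuse = nn.Conv2d(channels, channels, 1)           # 1x1 fusion convolution
        self.norm = nn.GroupNorm(1, channels)                  # stand-in for LN over channels

    def forward(self, x):
        x_c, x_s, x_f = torch.chunk(x, 3, dim=1)               # expert-specific C/3 inputs
        w = torch.softmax(self.gates, dim=0)                    # w_i = exp(g_i) / sum_j exp(g_j)
        outs = [w[i] * e(xi) for i, (e, xi) in enumerate(zip(self.experts, (x_c, x_s, x_f)))]
        fused = self.fuse(torch.cat(outs, dim=1))               # Concat + fusion
        return self.norm(x + fused)                              # X_out = LN(X + F_fusion)

router = SoftRouter(48, [nn.Conv2d(16, 16, 3, padding=1) for _ in range(3)])
print(router(torch.randn(1, 48, 32, 32)).shape)  # torch.Size([1, 48, 32, 32])
```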
Hard Routing Mechanism. In contrast to the soft routing mechanism, the hard routing mechanism enforces a discrete selection of a single expert for each feature map, effectively choosing the most relevant pathway based on the current degradation characteristics. Specifically, the gating parameters $G = \{g_c, g_s, g_f\}$ are first computed in the same manner as in soft routing. Instead of applying a softmax to obtain weighted contributions, the expert with the highest gating value is selected:
$$i^{*} = \arg\max_{i \in \{c, s, f\}} g_i,$$
and only the corresponding feature map $\hat{F}_{i^{*}}$ is forwarded to the fusion stage. The selected feature is then passed through the same $1 \times 1$ fusion convolution, residual connection, and Layer Normalization as in the soft routing design. This hard selection strategy reduces computational overhead by avoiding unnecessary processing of irrelevant experts, and enforces a more decisive expert specialization. While it may be less flexible than soft routing in blending complementary features, it can improve efficiency and encourage expert modules to fully focus on their designated degradation types.
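For comparison, a minimal sketch of the hard routing variant described in this subsection is given below: the same learnable gates, but only the arg-max expert is evaluated and forwarded to the fusion stage. The 1 × 1 convolution that lifts the selected C/3 slice back to C channels is an implementation assumption.

```python
# Minimal sketch of the hard routing mechanism: arg-max expert selection.
import torch
import torch.nn as nn

class HardRouter(nn.Module):
    def __init__(self, channels, experts):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        self.gates = nn.Parameter(torch.zeros(len(experts)))    # g_c, g_s, g_f
        self.fuse = nn.Conv2d(channels // 3, channels, 1)       # lift the selected C/3 slice back to C
        self.norm = nn.GroupNorm(1, channels)                   # stand-in for LN over channels

    def forward(self, x):
        slices = torch.chunk(x, 3, dim=1)
        i_star = int(torch.argmax(self.gates))                  # i* = argmax_i g_i
        selected = self.experts[i_star](slices[i_star])         # only one expert pathway is evaluated
        return self.norm(x + self.fuse(selected))

router = HardRouter(48, [nn.Conv2d(16, 16, 3, padding=1) for _ in range(3)])
print(router(torch.randn(1, 48, 32, 32)).shape)  # torch.Size([1, 48, 32, 32])
```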

3.3. Multi-Task Collaborative Training Strategy

In adverse weather restoration tasks, traditional joint training paradigms with fixed weights often suffer from unbalanced optimization rhythms due to the significant heterogeneity in feature distributions, optimization difficulties, and gradient magnitudes across various degradation types. From a multi-task optimization perspective, such imbalance arises because different tasks progress at different convergence rates and may temporarily dominate shared parameters, leading to task interference and suboptimal joint solutions. To address this challenge, this study formulates the restoration process as a multi-task collaborative system and introduces an online reweighting strategy based on convergence state awareness. By quantifying the convergence status of each task in real-time, this strategy dynamically evolves an optimal loss weight distribution, facilitating seamless synergy among sub-tasks within a unified parameter space. Rather than heuristically assigning fixed or pre-defined weights, our strategy explicitly monitors task-wise convergence behaviors and dynamically coordinates their optimization priorities, aiming to equalize effective learning progress across tasks.
Relative Convergence Index. To accurately characterize the instantaneous optimization dynamics of each restoration task during specific training stages, we define the relative convergence index (RCI), denoted as $\eta_t$. Conceptually, RCI serves as a normalized indicator of relative convergence speed, measuring whether a task is currently learning faster or slower than its recent historical trend under the same optimization conditions. Given the stochastic fluctuations of loss values in deep learning, we first establish a smoothed historical loss baseline, $H_t$, utilizing an exponential moving average (EMA) strategy to suppress high-frequency noise inherent in mini-batch training:
$$H_t^{iter} = m \cdot H_t^{iter-1} + (1 - m) \cdot L_t^{iter},$$
where $H_t^{iter}$ denotes the smoothed loss baseline of task $t$ at the current iteration, $L_t^{iter}$ represents the raw loss value of task $t$ at the current iteration, and $m \in [0, 1]$ serves as the momentum coefficient controlling the decay of historical information. Based on this baseline, the RCI for task $t$ is defined as the normalized ratio of the current loss to the smoothed historical baseline:
$$\eta_t = \frac{L_t^{iter}}{H_t^{iter}},$$
where $\eta_t$ functions as a metric for the adaptive evolutionary pace. An $\eta_t > 1$ indicates that the learning progress of the task lags behind its historical average, suggesting a weakened ability of the model to capture those specific degradation features. Conversely, $\eta_t < 1$ signifies that the task is in a stable state of convergence.
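A minimal sketch of the RCI bookkeeping is shown below: an exponential moving average of each task's loss serves as the baseline, and the index is the ratio of the current loss to that baseline. The class name and the momentum value are illustrative assumptions.

```python
# Minimal sketch of the relative convergence index (RCI) per task.
class RCITracker:
    def __init__(self, momentum=0.9):
        self.m = momentum
        self.baseline = {}                               # H_t: smoothed historical loss per task

    def update(self, task, loss):
        h = self.baseline.get(task, loss)
        h = self.m * h + (1 - self.m) * loss             # H_t <- m * H_t + (1 - m) * L_t
        self.baseline[task] = h
        return loss / h                                   # eta_t = L_t / H_t

tracker = RCITracker()
for step_loss in [1.0, 0.9, 0.8, 1.1]:
    print(round(tracker.update("derain", step_loss), 3))  # a value > 1 means the task lags its recent trend
```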
Absolute Convergence Index. Monitoring RCI alone is insufficient to reflect the model’s progress relative to the global optimum. A task may exhibit stable short-term convergence while still remaining far from its best-achieved performance, which motivates the need for an absolute reference. Therefore, we propose the absolute convergence index (ACI), denoted as $\phi_t$, to serve as a long-term steady-state anchor. This factor constrains the optimization direction by quantifying the absolute deviation between the current performance state and the limit performance achieved during its training history:
$$\phi_t = \ln\!\left(\frac{L_t^{iter}}{L_t^{opt}}\right),$$
where $L_t^{opt}$ represents the global minimum loss recorded for task $t$ since the start of training. The introduction of the logarithmic transformation aims to compress the fluctuation range of loss values, thereby providing a more robust feedback signal. A rising $\phi_t$ alerts the system that the model is drifting away from its best-attained state for a specific weather condition, triggering a restorative force to pull the optimizer back toward the high-fidelity restoration trajectory.
Online Reweighting Index. By integrating the RCI ($\eta_t$) with the ACI ($\phi_t$), we generate the final online reweighting index (ORI), denoted as $\lambda_t$, for the dynamic allocation of loss weights across tasks. The joint use of RCI and ACI provides complementary feedback: RCI emphasizes short-term convergence imbalance, while ACI enforces long-term performance consistency, enabling principled task coordination during optimization. To ensure higher discriminative power while maintaining weight normalization, we employ a Softmax function with a temperature coefficient $\kappa$ for adaptive synthesis:
$$\lambda_t = \frac{\exp(\eta_t \cdot \phi_t / \kappa)}{\sum_{j=1}^{K} \exp(\eta_j \cdot \phi_j / \kappa)},$$
where the product term $\eta_t \cdot \phi_t$ jointly reflects both the speed and the quality of convergence. The temperature coefficient $\kappa$ modulates the smoothness of the weight distribution; a lower $\kappa$ encourages the model to exhibit a stronger punishment bias towards extremely difficult tasks. Consequently, this strategy enables the model to actively identify and intensify training on weak tasks during every iteration. By harmonizing the optimization paths of the rain, haze, and snow tasks, the ORI mechanism significantly enhances restoration accuracy and robustness in complex multi-degradation scenarios.
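Putting the two indicators together, the following sketch computes the per-task weights of the equation above from the current losses, their EMA baselines, and the best losses observed so far; all variable names, the sample numbers, and the temperature value are illustrative assumptions.

```python
# Minimal sketch of the online reweighting index (ORI): combine RCI and ACI
# per task and normalize with a temperature-scaled softmax.
import math

def ori_weights(losses, baselines, best_losses, kappa=1.0):
    """losses, baselines, best_losses: dicts keyed by task name."""
    scores = {}
    for t, loss in losses.items():
        rci = loss / baselines[t]                        # eta_t: short-term convergence ratio
        aci = math.log(loss / best_losses[t])             # phi_t: deviation from the best-ever loss
        scores[t] = math.exp(rci * aci / kappa)
    z = sum(scores.values())
    return {t: s / z for t, s in scores.items()}           # lambda_t, normalized over tasks

weights = ori_weights(
    losses={"derain": 0.20, "dehaze": 0.35, "desnow": 0.25},
    baselines={"derain": 0.22, "dehaze": 0.30, "desnow": 0.26},
    best_losses={"derain": 0.18, "dehaze": 0.21, "desnow": 0.20},
)
print(weights)  # the lagging task (dehaze) receives the largest weight
```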

3.4. Loss Function

To facilitate comprehensive feature learning, we design a composite loss function that integrates pixel-, edge-, and frequency-level constraints for adverse weather image restoration. We adopt the robust Charbonnier loss to penalize pixel-wise discrepancies between the reconstructed image $\hat{T}_i$ and the ground truth $T_i$ at multiple scales ($i = 1, 2, 3$):
$$\mathcal{L}_{\mathrm{char}} = \sum_{i=1}^{3} \sqrt{\lVert \hat{T}_i - T_i \rVert^2 + \varepsilon^2}, \quad \varepsilon = 10^{-3}.$$
To preserve structural details, an edge-aware loss based on the Laplacian operator $\Delta$ is introduced:
$$\mathcal{L}_{\mathrm{edge}} = \sum_{i=1}^{3} \sqrt{\lVert \Delta(\hat{T}_i) - \Delta(T_i) \rVert^2 + \varepsilon^2}.$$
Spectral consistency is enforced via an $\ell_1$ loss in the Fourier domain ($FT$ denotes the Fourier transform):
$$\mathcal{L}_{\mathrm{freq}} = \sum_{i=1}^{3} \lVert FT(\hat{T}_i) - FT(T_i) \rVert_1.$$
The total loss is a weighted combination of the above terms:
$$\mathcal{L}_{\mathrm{total}} = \lambda_1 \mathcal{L}_{\mathrm{char}} + \lambda_2 \mathcal{L}_{\mathrm{edge}} + \lambda_3 \mathcal{L}_{\mathrm{freq}},$$
where the weights are set as $\lambda_1 = 1$, $\lambda_2 = 0.05$, and $\lambda_3 = 0.01$, following common practice in prior work [31].
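A minimal PyTorch sketch of this composite objective is given below, combining the multi-scale Charbonnier, Laplacian edge, and Fourier-domain terms with the weights stated above. The explicit Laplacian kernel and the averaging of the frequency term are simplifying assumptions.

```python
# Minimal sketch of the composite loss: Charbonnier + Laplacian edge + frequency terms.
import torch
import torch.nn.functional as F

LAPLACIAN = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)

def laplacian(x):
    k = LAPLACIAN.to(x.device).repeat(x.shape[1], 1, 1, 1)
    return F.conv2d(x, k, padding=1, groups=x.shape[1])     # channel-wise Laplacian filtering

def composite_loss(preds, targets, eps=1e-3, w=(1.0, 0.05, 0.01)):
    """preds, targets: lists of tensors at the three scales (i = 1, 2, 3)."""
    l_char = l_edge = l_freq = 0.0
    for p, t in zip(preds, targets):
        l_char += torch.sqrt(((p - t) ** 2).sum() + eps ** 2)                          # Charbonnier term
        l_edge += torch.sqrt(((laplacian(p) - laplacian(t)) ** 2).sum() + eps ** 2)    # edge term
        l_freq += (torch.fft.rfft2(p) - torch.fft.rfft2(t)).abs().mean()               # frequency L1 term (averaged)
    return w[0] * l_char + w[1] * l_edge + w[2] * l_freq

scales = [torch.rand(1, 3, s, s) for s in (64, 32, 16)]
print(composite_loss(scales, [torch.rand_like(x) for x in scales]))
```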

4. Experiments

4.1. Experiments Setup

This research utilizes WeatherBench [28] and FoundIR-Weather [4] as the primary benchmarks for training and evaluation. These datasets are specifically curated to assess the performance of all-in-one restoration algorithms within heterogeneous outdoor environments. WeatherBench encompasses not only isolated degradations such as rain streaks, snow accumulation, and dense haze but also addresses composite degradation processes resulting from overlapping weather phenomena. The WeatherBench dataset contains a total of 42,002 image pairs, including 14,729 training pairs in the Rain subset, 13,614 training pairs in the Haze subset, and 13,059 training pairs in the Snow subset; each subset includes 200 image pairs for testing. While the FoundIR dataset offers an extensive range of scenarios, this study focuses exclusively on four weather-centric categories: haze, rain streaks, raindrops, and nighttime rain. In the FoundIR-Weather dataset, the haze subset contains 79,800 images for training and 200 images for testing; the rainstreak subset contains 39,900 images for training and 100 images for testing; the raindrop subset contains 44,828 images for training and 100 images for testing; and the nighttime rain subset contains 40,111 images for training and 50 images for testing. To ensure a fair comparison, we use the officially released code for all comparison methods and retrain and test them under the same experimental settings.
In our network, we adopt the same attention-head configuration as MDTA [27], i.e., [1, 2, 4, 8], and set the k value to 70%. Regarding implementation, the computational framework operates on a single NVIDIA RTX 4090D GPU, where the Adam optimizer facilitates end-to-end parameter learning. The initial learning rate is established at $2 \times 10^{-4}$ and undergoes adaptive reduction to $1 \times 10^{-6}$ through a cosine annealing strategy to ensure a stable and smooth optimization trajectory. The training configuration maintains a batch size of 1 to balance memory constraints and gradient stability. The entire optimization procedure spans approximately 300,000 iterations, allowing the model to fully converge and capture the underlying statistical distributions of diverse weather-induced artifacts.
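For reference, the optimizer and learning-rate schedule described above can be configured as in the following sketch (Adam, cosine annealing from 2e-4 down to 1e-6 over roughly 300,000 iterations); the helper name is an assumption.

```python
# Minimal sketch of the training configuration: Adam + cosine annealing schedule.
import torch

def build_optimizer(model, total_iters=300_000):
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=total_iters, eta_min=1e-6
    )
    return optimizer, scheduler
```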

4.2. Evaluation on the WeatherBench Dataset

In the experimental evaluation on the WeatherBench dataset, the proposed framework demonstrates competitive overall performance among the compared methods in both quantitative metrics (as shown in Table 1) and qualitative visual results. Quantitative assessments reveal that the model consistently outperforms state-of-the-art (SOTA) restoration algorithms across two primary metrics, PSNR and SSIM. This comprehensive performance gain is largely attributed to the MoE feature extraction module, which accurately decouples heterogeneous degradation features, and the multi-task collaborative training strategy, which effectively synchronizes multi-task convergence trajectories. Consequently, the restored outputs achieve high-level semantic alignment with ground-truth imagery.
Table 1. Quantitative comparison of different categories of methods on the WeatherBench [28] dataset. The best and second-best values are bold and underlined. “↑” indicates better performance with higher values.
Regarding qualitative visual comparisons, the architecture exhibits exceptional high-fidelity reconstruction capabilities, as shown in Figure 2, Figure 3 and Figure 4. In rainy scenarios, the frequency self-attention expert leverages its sensitivity to spectral cues to precisely identify and eliminate rain patterns characterized by periodic physical patterns. When addressing global degradation caused by dense haze, the channel expert utilizes cross-covariance modeling to rectify spectral shifts induced by atmospheric scattering. As shown in Figure 2, compared with AdaIR [49], our method can more accurately restore colors and preserve finer structural details, such as the wall cracks highlighted in the figure. Simultaneously, the spatial self-attention expert employs a sparse Top-k selection strategy to handle non-uniform snow degradation, thereby ensuring the structural continuity and integrity of object boundaries.
Figure 2. The restored results of dehazing on the WeatherBench dataset. Our method restores high-quality images with more defined details than the restored results shown in (b)–(h).
Figure 3. The restored results of deraining on the WeatherBench dataset. Our method restores high-quality images with more defined details than the restored results shown in (b)–(h).
Figure 4. The restored results of desnowing on the WeatherBench dataset. Our method restores high-quality images with more defined details than the restored results shown in (b)–(h).

4.3. Evaluation on the FoundIR-Weather Dataset

In extended experiments on the FoundIR-Weather dataset, the proposed method achieves superior performance across four highly challenging weather scenarios: haze, rain streaks, raindrops, and nighttime rain. Quantitative assessments reveal that our approach consistently outperforms SOTA algorithms in PSNR and SSIM for all degradation categories. As shown in Table 2 and Table 3, our method outperforms the dedicated dehazing algorithm DehazeFormer [41] by approximately 5.48 dB in PSNR on the dehazing task. It also surpasses the multi-weather restoration method TransWeather [21] by about 1.46 dB in average PSNR and 0.0169 in SSIM. This superiority substantiates that the multi-task optimization training strategy effectively prevents the model from converging to local optima or suffering from performance degradation during the simultaneous optimization of multiple tasks. Qualitative visual analysis further confirms the model’s robustness against complex physical degradations, as depicted in Figure 5. This collaborative multi-expert paradigm ensures that the model maintains macro-level contrast and visual naturalness while simultaneously reconstructing micro-level geometric details across the diverse degradation distributions of FoundIR-Weather.
Table 2. Quantitative evaluations on the FoundIR-Weather [4] dataset. The best and second-best values are bold and underlined. “↑” indicates better performance with higher values.
Table 3. Quantitative evaluation using no-reference metrics on the RainStreak and Haze subsets of the FoundIR-Weather [4] dataset. The best and second-best values are bold and underlined. “↑” indicates higher is better, while “↓” indicates lower is better.
Figure 5. The restored results of raindrop removal on the FoundIR-Weather dataset. Our method restores high-quality images with more defined details than the restored results shown in (b)–(f).

4.4. Model Complexity

To evaluate the model complexity (number of parameters) and computational cost (floating-point operations, FLOPs), we compare our method with representative state-of-the-art approaches, including task-specific methods (DehazeFormer [41], DRSformer [12], and NeRD-Rain [11]), task-general methods (MPRNet [43] and Restormer [27]), and the all-in-one method PromptIR [44]. Specifically, all efficiency evaluations are conducted under a unified input resolution of 256 × 256 , and the results are summarized in Table 4. The analysis indicates that our method achieves lower model complexity while maintaining competitive performance, thereby attaining a better balance between restoration quality and computational efficiency.
Table 4. Comparisons of model complexity. The size of the test image is 256 × 256 pixels.

4.5. Ablation Studies

Effectiveness of the asymmetric MOE network architecture. This study validates the pivotal role of the proposed asymmetric routing mechanism in the all-in-one image restoration task through ablation experiments, as summarized in Table 5. During the encoding phase, the model encounters highly heterogeneous degradation features such as rain, haze, and snow. The implementation of a soft routing mechanism enables the model to dynamically and adaptively calibrate the contribution ratios of various experts according to the specific degradation distribution of the input image, thereby facilitating precise feature decoupling and artifact capture.
Table 5. Ablation study of the asymmetric mixture-of-experts network architecture. “✓” indicates that the corresponding component is used, and “✗” indicates that the corresponding component is not used. “↑” indicates better performance with higher values.
Conversely, the decoding stage prioritizes high-fidelity image reconstruction and synthesis. By adopting a hard routing strategy that assigns uniform weights (1/3 each) to the three expert branches, the decoder provides a robust and balanced feature integration environment. This asymmetric design significantly outperforms all variants from (a) to (d) in terms of evaluation metrics, substantiating that customizing routing strategies based on the functional characteristics of the encoder and decoder is essential for enhancing the precision of all-in-one restoration models.
Effectiveness of different expert types in the MoE architecture. To ascertain the individual efficacy of various expert types within the MoE, we evaluate several configurations of channel, spatial, and frequency self-attention experts, as summarized in Table 6. Case (e) serves as a baseline utilizing only channel self-attention, which exhibits limited capacity in mitigating degradations characterized by intricate geometric structures or periodic textures.
Table 6. Ablation study of different MoE expert types in Hard-MoE and Soft-MoE. “✓” indicates that the corresponding component is used, and “✗” indicates that the corresponding component is not used. “↑” indicates better performance with higher values.
The integration of spatial self-attention in case (f) markedly bolsters the model’s ability to localize non-uniform physical distortions, yielding substantial gains in PSNR and SSIM. Simultaneously, case (g) validates the pivotal role of the frequency self-attention expert in eliminating periodic physical artifacts. Ultimately, our comprehensive MoE framework, which integrates all three specialized dimensions, achieves superior performance across all evaluation metrics. This confirms the high degree of complementarity among these feature domains; their collaborative decoupling allows the model to attain optimal restoration fidelity when addressing diverse and heterogeneous weather-induced degradations.
Effectiveness of multi-task collaborative training strategy. To validate the effectiveness of the proposed multi-task collaborative training strategy, we conduct a comprehensive ablation study by comparing it with alternative schemes adopting different optimization strategies. The quantitative results are reported in Table 7. It is observed that, without any optimization strategy, the model exhibits a clear performance bias, resulting in inferior restoration quality on more challenging degradation scenarios. Introducing the RCI-based reweighting approach alleviates this issue to some extent by adjusting task weights according to short-term optimization dynamics; however, this method remains sensitive to stochastic loss fluctuations and may lead to unstable weight oscillations. In contrast, the full collaborative training strategy consistently achieves the best performance across all degradation types, confirming that the absolute convergence index provides a crucial long-term constraint that anchors each task to its historical optimum. These results demonstrate that the proposed multi-task collaborative training strategy effectively balances convergence speed and restoration quality among heterogeneous tasks, enabling more coordinated optimization and exhibiting superior robustness in multi-degradation image restoration scenarios.
Table 7. Ablation study on our multi-task collaborative training strategy. “✓” indicates that the corresponding component is used, and “✗” indicates that the corresponding component is not used. “↑” indicates better performance with higher values.
Effect of different values of k. To investigate the performance variation of the model under different top-k values, we conduct an ablation study, as shown in Figure 6. The results indicate that setting k to 70% achieves the best performance, with PSNR reaching 25.15 dB. When k is small, some useful information may be filtered out, leading to performance degradation. When k is increased to 80%, the performance instead shows a declining trend, which further demonstrates the effectiveness of the top-k selection mechanism. This mechanism reduces redundant information and suppresses irrelevant interference, thereby enhancing the model’s image restoration capability.
Figure 6. Ablation studies for different values of k of the top-k token selection mechanism in the model.

4.6. Discussion and Limitations

While MOE-WIRNet demonstrates strong performance on unified adverse weather image restoration, several limitations should be acknowledged. First, the proposed framework is primarily designed for adverse weather restoration tasks (e.g., rain, haze, and snow), which share common physical characteristics such as atmospheric scattering, occlusion, and visibility degradation. When extending MOE-WIRNet to more general all-in-one image restoration scenarios that include heterogeneous degradations (e.g., noise, blur, compression artifacts, and low-light conditions), performance gains may become less consistent. This is mainly because the assumption of partial task relatedness, which underlies both the collaborative optimization strategy and the expert-sharing mechanism, becomes weaker in such settings. In future work, we plan to extend the framework to more general all-in-one restoration scenarios covering degradations such as sandstorms, noise, blur, and compression artifacts.

5. Conclusions

We have presented an effective all-in-one adverse weather image restoration framework, named MOE-WIRNet. We formulate a multi-task collaborative optimization strategy that dynamically coordinates the training of multiple restoration objectives to harmonize cross-task learning and enhance overall performance. Furthermore, we integrate an asymmetric mixture-of-experts architecture into the framework to adaptively address the distinct degradation patterns and varying severity levels across different adverse weather conditions. Extensive experiments and ablation analysis confirm the importance of multi-task equilibrium in training, and demonstrate that our method achieves state-of-the-art results on multiple real-world benchmarks.

Author Contributions

Methodology, J.C.; Software, J.C.; Validation, J.C.; Formal analysis, J.C.; Data curation, Z.Z.; Writing—original draft, J.C.; Writing—review & editing, Z.Z.; Supervision, Z.Z.; Funding acquisition, Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 62475077).

Data Availability Statement

The data presented in this study are openly available in [WeatherBench] at https://github.com/guanqiyuan/WeatherBench (accessed on 28 December 2025).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chen, X.; Pan, J.; Dong, J.; Tang, J. Towards unified deep image deraining: A survey and a new benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 5414–5433. [Google Scholar] [CrossRef]
  2. Lou, G.; Deng, Y.; Zheng, X.; Zhang, M.; Zhang, T. Testing of autonomous driving systems: Where are we and where should we go? In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Singapore, 14–18 November 2022; pp. 31–43. [Google Scholar]
  3. Chavhan, S.; Gupta, D.; Gochhayat, S.P.; Chandana, B.N.; Khanna, A.; Shankar, K.; Rodrigues, J.J. Edge computing AI-IoT integrated energy-efficient intelligent transportation system for smart cities. ACM Trans. Internet Technol. 2022, 22, 106. [Google Scholar] [CrossRef]
  4. Li, H.; Chen, X.; Dong, J.; Tang, J.; Pan, J. Foundir: Unleashing million-scale training data to advance foundation models for image restoration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–23 October 2025; pp. 12626–12636. [Google Scholar]
  5. Yang, Q.; Chen, X.; Li, P.; Guan, Q.; Jin, G.; Jin, J. Rethinking Rainy 3D Scene Reconstruction via Perspective Transforming and Brightness Tuning. arXiv 2025, arXiv:2511.06734. [Google Scholar] [CrossRef]
  6. Sheng, H.; Yao, K.; Goel, S. Surveilling surveillance: Estimating the prevalence of surveillance cameras with street view data. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, Virtual, USA, 19–21 May 2021; pp. 221–230. [Google Scholar]
  7. Guan, Q.; Chen, X.; Jin, G.; Jin, J.; Fan, S.; Song, T.; Pan, J. Rethinking Nighttime Image Deraining via Learnable Color Space Transformation. arXiv 2025, arXiv:2510.17440. [Google Scholar] [CrossRef]
  8. Li, P.; Shu, X.; Feng, C.M.; Feng, Y.; Zuo, W.; Tang, J. Surgical video workflow analysis via visual-language learning. npj Health Syst. 2025, 2, 5. [Google Scholar] [CrossRef]
  9. Lai, J.; Chen, S.; Lin, Y.; Ye, T.; Liu, Y.; Fei, S.; Xing, Z.; Wu, H.; Wang, W.; Zhu, L. SnowMaster: Comprehensive Real-world Image Desnowing via MLLM with Multi-Model Feedback Optimization. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10–17 June 2025; pp. 4302–4312. [Google Scholar]
  10. Song, T.; Jin, G.; Li, P.; Jiang, K.; Chen, X.; Jin, J. Learning a spiking neural network for efficient image deraining. arXiv 2024, arXiv:2405.06277. [Google Scholar] [CrossRef]
  11. Chen, X.; Pan, J.; Dong, J. Bidirectional multi-scale implicit neural representations for image deraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 25627–25636. [Google Scholar]
  12. Chen, X.; Li, H.; Li, M.; Pan, J. Learning a sparse transformer network for effective image deraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 5896–5905. [Google Scholar]
  13. Zhang, Y.; Zhou, S.; Li, H. Depth information assisted collaborative mutual promotion network for single image dehazing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 2846–2855. [Google Scholar]
  14. Feng, Y.; Ma, L.; Meng, X.; Zhou, F.; Liu, R.; Su, Z. Advancing real-world image dehazing: Perspective, modules, and training. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 9303–9320. [Google Scholar] [CrossRef] [PubMed]
  15. Song, T.; Fan, S.; Li, P.; Jin, J.; Jin, G.; Fan, L. Learning an effective transformer for remote sensing satellite image dehazing. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  16. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, USA, 11–17 October 2021; pp. 1833–1844. [Google Scholar]
  17. Lin, J.; Zhang, Z.; Wei, Y.; Ren, D.; Jiang, D.; Tian, Q.; Zuo, W. Improving image restoration through removing degradations in textual representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 2866–2878. [Google Scholar]
  18. Song, T.; Li, P.; Jin, G.; Jin, J.; Fan, S.; Chen, X. Image deraining transformer with sparsity and frequency guidance. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 1889–1894. [Google Scholar]
  19. Fan, S.; Song, T.; Jin, G.; Jin, J.; Li, Q.; Xia, X. A lightweight cloud and cloud shadow detection transformer with prior-knowledge guidance. IEEE Geosci. Remote Sens. Lett. 2024, 21, 8003405. [Google Scholar] [CrossRef]
  20. Song, T.; Li, P.; Fan, S.; Jin, J.; Jin, G.; Fan, L. Exploring a context-gated network for effective image deraining. J. Vis. Commun. Image Represent. 2024, 98, 104060. [Google Scholar] [CrossRef]
  21. Valanarasu, J.M.J.; Yasarla, R.; Patel, V.M. Transweather: Transformer-based restoration of images degraded by adverse weather conditions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 2353–2363. [Google Scholar]
  22. Patil, P.W.; Gupta, S.; Rana, S.; Venkatesh, S.; Murala, S. Multi-weather image restoration via domain translation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 21696–21705. [Google Scholar]
  23. Wen, Y.; Gao, T.; Zhang, J.; Li, Z.; Chen, T. Multi-axis prompt and multi-dimension fusion network for all-in-one weather-degraded image restoration. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025. [Google Scholar]
  24. Wu, X.; Xiao, Z.; He, J.; Lei, J.; Zeng, X.; Xu, G. Multi-weather unmanned aerial vehicle remote sensing image restoration via scale-aware Trident Mamba. J. Appl. Remote Sens. 2025, 19, 046507. [Google Scholar] [CrossRef]
  25. Gong, Z.; Yu, H.; Liao, C.; Liu, B.; Chen, C.; Li, J. Coba: Convergence balancer for multitask finetuning of large language models. arXiv 2024, arXiv:2410.06741. [Google Scholar] [CrossRef]
  26. Kong, X.; Dong, C.; Zhang, L. Towards effective multiple-in-one image restoration: A sequential and prompt learning strategy. arXiv 2024, arXiv:2401.03379. [Google Scholar] [CrossRef]
  27. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 5728–5739. [Google Scholar]
  28. Guan, Q.; Yang, Q.; Chen, X.; Song, T.; Jin, G.; Jin, J. Weatherbench: A real-world benchmark dataset for all-in-one adverse weather image restoration. In Proceedings of the 33rd ACM International Conference on Multimedia, Dublin, Ireland, 27–31 October 2025; pp. 12607–12613. [Google Scholar]
  29. Zamfir, E.; Wu, Z.; Mehta, N.; Tan, Y.; Paudel, D.P.; Zhang, Y.; Timofte, R. Complexity Experts are Task-Discriminative Learners for Any Image Restoration. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10–17 June 2025. [Google Scholar]
  30. Jin, W.; Yang, Q.; Wu, X.; Chen, H.; Li, P.; Chen, X. SmokeBench: A Real-World Dataset for Surveillance Image Desmoking in Early-Stage Fire Scenes. In Proceedings of the 33rd ACM International Conference on Multimedia, Dublin, Ireland, 27–31 October 2025; pp. 12722–12728. [Google Scholar]
  31. Chen, H.; Chen, X.; Lu, J.; Li, Y. Rethinking Multi-Scale Representations in Deep Deraining Transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024. [Google Scholar]
  32. Wu, X.; Chen, H.; Chen, X.; Xu, G. Multi-scale transformer with conditioned prompt for image deraining. Digit. Signal Process. 2025, 156, 104847. [Google Scholar] [CrossRef]
  33. Li, B.; Liu, X.; Hu, P.; Wu, Z.; Lv, J.; Peng, X. All-in-one image restoration for unknown corruption. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 17452–17462. [Google Scholar]
  34. Zhang, M.; Yin, R.; Yang, Z.; Wang, Y.; Li, K. Advances and challenges of multi-task learning method in recommender system: A survey. arXiv 2023, arXiv:2305.13843. [Google Scholar] [CrossRef]
  35. Samant, R.M.; Bachute, M.R.; Gite, S.; Kotecha, K. Framework for deep learning-based language models using multi-task learning in natural language understanding: A systematic literature review and future directions. IEEE Access 2022, 10, 17078–17097. [Google Scholar] [CrossRef]
  36. Allenspach, S.; Hiss, J.A.; Schneider, G. Neural multi-task learning in drug design. Nat. Mach. Intell. 2024, 6, 124–137. [Google Scholar] [CrossRef]
  37. Zhang, Y.; Yang, Q. A survey on multi-task learning. IEEE Trans. Knowl. Data Eng. 2021, 34, 5586–5609. [Google Scholar] [CrossRef]
  38. Zhang, T.; Liu, C.; Liu, X.; Liu, Y.; Meng, L.; Sun, L.; Jiang, W.; Zhang, F.; Zhao, J.; Jin, Q. Multi-task learning framework for emotion recognition in-the-wild. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 143–156. [Google Scholar]
  39. Chen, S.; Zhang, Y.; Yang, Q. Multi-task learning in natural language processing: An overview. ACM Comput. Surv. 2024, 56, 1–32. [Google Scholar] [CrossRef]
  40. Fifty, C.; Amid, E.; Zhao, Z.; Yu, T.; Anil, R.; Finn, C. Efficiently identifying task groupings for multi-task learning. Adv. Neural Inf. Process. Syst. 2021, 34, 27503–27516. [Google Scholar]
  41. Song, Y.; He, Z.; Qian, H.; Du, X. Vision transformers for single image dehazing. IEEE Trans. Image Process. 2023, 32, 1927–1941. [Google Scholar] [CrossRef] [PubMed]
  42. Chen, S.; Ye, T.; Liu, Y.; Chen, E. SnowFormer: Context interaction transformer with scale-awareness for single image desnowing. arXiv 2022, arXiv:2208.09703. [Google Scholar]
  43. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H.; Shao, L. Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, USA, 19–25 June 2021; pp. 14821–14831. [Google Scholar]
  44. Potlapalli, V.; Zamir, S.W.; Khan, S.; Khan, F. PromptIR: Prompting for All-in-One Image Restoration. Adv. Neural Inf. Process. Syst. 2023, 36, 71275–71293. [Google Scholar]
  45. Zhu, Y.; Wang, T.; Fu, X.; Yang, X.; Guo, X.; Dai, J.; Qiao, Y.; Hu, X. Learning weather-general and weather-specific features for image restoration under multiple adverse weather conditions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 21747–21758. [Google Scholar]
  46. Zheng, D.; Wu, X.M.; Yang, S.; Zhang, J.; Hu, J.F.; Zheng, W.S. Selective hourglass mapping for universal image restoration based on diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 25445–25455. [Google Scholar]
  47. Zhu, R.; Tu, Z.; Liu, J.; Bovik, A.C.; Fan, Y. Mwformer: Multi-weather image restoration using degradation-aware transformers. IEEE Trans. Image Process. 2024, 33, 6790–6805. [Google Scholar] [CrossRef]
  48. Sun, S.; Ren, W.; Gao, X.; Wang, R.; Cao, X. Restoring images in adverse weather conditions via histogram transformer. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 111–129. [Google Scholar]
  49. Cui, Y.; Zamir, S.W.; Khan, S.; Knoll, A.; Shah, M.; Khan, F.S. AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation. In Proceedings of the International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
  50. Zhang, Y.; Li, D.; Law, K.L.; Wang, X.; Qin, H.; Li, H. IDR: Self-Supervised Image Denoising via Iterative Data Refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022. [Google Scholar]
  51. Luo, Z.; Gustafsson, F.K.; Zhao, Z.; Sjölund, J.; Schön, T.B. Image restoration with mean-reverting stochastic differential equations. arXiv 2023, arXiv:2301.11699. [Google Scholar] [CrossRef]
  52. Luo, Z.; Gustafsson, F.K.; Zhao, Z.; Sjölund, J.; Schön, T.B. Controlling vision-language models for universal image restoration. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  53. Chen, X.; Li, Z.; Pu, Y.; Liu, Y.; Zhou, J.; Qiao, Y.; Dong, C. A Comparative Study of Image Restoration Networks for General Backbone Network Design. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024. [Google Scholar]
  54. Guo, H.; Li, J.; Dai, T.; Ouyang, Z.; Ren, X.; Xia, S.T. Mambair: A simple baseline for image restoration with state-space model. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 222–241. [Google Scholar]
  55. Guo, H.; Guo, Y.; Zha, Y.; Zhang, Y.; Li, W.; Dai, T.; Xia, S.T.; Li, Y. Mambairv2: Attentive state space restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 10–17 June 2025; pp. 28124–28133. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
