Article

BAT-Net: Bidirectional Attention Transformer Network for Joint Single-Image Desnowing and Snow Mask Prediction †

State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications (BUPT), No. 10 Xitucheng Road, Beijing 100876, China
This article is a revised and expanded version of a paper entitled Simultaneous Snow Mask Prediction and Single Image Desnowing with a Bidirectional Attention Transformer Network, which was presented at PRCV2024, Urumqi, China, 18–20 October 2024.
Information 2025, 16(11), 966; https://doi.org/10.3390/info16110966
Submission received: 23 September 2025 / Revised: 31 October 2025 / Accepted: 5 November 2025 / Published: 7 November 2025
(This article belongs to the Special Issue Intelligent Image Processing by Deep Learning, 2nd Edition)

Abstract

In the wild, snow is not merely additive noise; it is a non-stationary, semi-transparent veil whose spatial statistics vary with depth, illumination, and wind. Because conventional two-stage pipelines first detect a binary mask and then inpaint the occluded regions, any early mis-classification is irreversibly baked into the final result, leading to over-smoothed textures or ghosting artifacts. We propose BAT-Net, a Bidirectional Attention Transformer Network that frames desnowing as a coupled representation learning problem, jointly disentangling snow appearance and scene radiance in a single forward pass. Our core contributions are as follows: (1) A novel dual-decoder architecture where a background decoder and a snow decoder are coupled via a Bidirectional Attention Module (BAM). The BAM implements a continuous predict–verify–correct mechanism, allowing the background branch to dynamically accept, reject, or refine the snow branch’s occlusion hypotheses, dramatically reducing error accumulation. (2) A lightweight yet effective multi-scale feature fusion scheme comprising a Scale Conversion Module (SCM) and a Feature Aggregation Module (FAM), enabling the model to handle the large scale variance among snowflakes without a prohibitive computational cost. (3) The introduction of the FallingSnow dataset, curated to eliminate the label noise caused by irremovable ground snow in existing benchmarks, providing a cleaner benchmark for evaluating dynamic snow removal. Extensive experiments on synthetic and real-world datasets demonstrate that BAT-Net sets a new state of the art. It achieves a PSNR of 35.78 dB on the CSD dataset, outperforming the best prior model by 1.37 dB, and also achieves top results on the SRRS (32.13 dB) and Snow100K (34.62 dB) datasets. The proposed method has significant practical applications in autonomous driving and surveillance systems, where accurate snow removal is crucial for maintaining visual clarity.

1. Introduction

When snow begins to fall, the camera’s projection model loses its usual simplicity. Flakes are neither opaque occluders nor homogeneous noise; they are semi-transparent, depth-dependent stochastic scatterers whose size, velocity, and albedo change from frame to frame. The resulting image is therefore a volatile mixture of three stochastic processes: the static scene’s radiance, the transmission modulation induced by overlapping flakes, and the forward-scattering glow surrounding each flake [1].
Most computer vision pipelines assume that the input is a faithful projection of a Lambertian world—an assumption that falling snowflakes systematically violate. Consequently, modern detectors and segmentation networks trained on fair-weather data suffer a dramatic drop in precision under even light snowfall, a fact that has become a safety concern for winter autonomous driving.
The early snow removal literature tried to invert a single physical model of the form:
$$I(x) = K(x)\,T(x) + A(x)\,\bigl(1 - T(x)\bigr),\tag{1}$$
where $I(x)$ represents the snowy image, and $T(x)$ and $A(x)$ are the transmission map and atmospheric illumination coefficient, respectively. $K(x)$, the scene without veiling or haze effects, was further decomposed into the following:
$$K(x) = J(x)\,\bigl(1 - Z(x)R(x)\bigr) + C(x)\,Z(x)R(x),\tag{2}$$
where $J(x)$ corresponds to the clear background without snow; $C(x)$ and $Z(x)$ constitute the snow chromatic map and binary snow coverage mask; and $R(x)$ encodes the spatial locations affected by snowfall.
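To make the forward model concrete, the following minimal NumPy sketch composes a snowy observation from Equations (1) and (2). It is an illustrative rendering only: the array shapes, value ranges, and the helper name `compose_snowy_image` are our own assumptions, not part of any released implementation.

```python
import numpy as np

def compose_snowy_image(J, C, Z, R, T, A):
    """Render a snowy observation I(x) from Equations (1) and (2).

    Assumed shapes (illustrative only): J and C are (H, W, 3) arrays in
    [0, 1]; Z, R, and T are (H, W, 1) maps in [0, 1]; A is a scalar or an
    (H, W, 1) map of atmospheric illumination.
    """
    ZR = Z * R                           # pixels actually covered by snow
    K = J * (1.0 - ZR) + C * ZR          # Eq. (2): snow composited over background
    I = K * T + A * (1.0 - T)            # Eq. (1): atmospheric veiling
    return np.clip(I, 0.0, 1.0)

# Example: sparse white flakes plus light uniform veiling over a random scene.
H, W = 64, 64
I = compose_snowy_image(J=np.random.rand(H, W, 3),
                        C=np.ones((H, W, 3)),
                        Z=(np.random.rand(H, W, 1) > 0.95).astype(float),
                        R=np.ones((H, W, 1)),
                        T=np.full((H, W, 1), 0.9),
                        A=1.0)
```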
While these equations are convenient for rendering synthetic snow, they treat every pixel as an independent micro-snow event, ignoring the long-range spatial correlations that falling snow inevitably creates. Hand-crafted priors like color asymmetry, patch-wise sparse representation, or chromatic aberration cues [2,3,4,5,6] were therefore added to regularize the ill-posed inversion. Yet such priors are valid only within narrow meteorological bounds; a heavy-snow night scene or a back-lit blizzard easily breaks their underlying statistics.
Convolutional neural networks [7,8] shifted the burden from explicit priors to data-driven priors. Encoder–decoder architectures now learn an end-to-end mapping between synthetic pairs and achieve visually pleasing results—as long as the test distribution matches the synthetic prior. Unfortunately, the moment real snow arrives, two problems emerge. First, CNN receptive fields are too local to capture the meter-long streaks produced by fast-falling flakes close to the lens. Second, the community’s best CNN generators still confuse removable (falling) snow with permanent (frozen) ground cover, so networks learn to hallucinate gray patches instead of truly removing weather effects.
Recent deep desnowing pipelines [9,10,11,12] no longer treat Equations (1) and (2) as mere post hoc justifications; instead, they embed the forward model inside the network, forcing the decoder to predict an explicit snow layer and then subtract it from the observation. While this physically inspired design improves interpretability, it also introduces a brittle dependency; any error in the predicted mask is propagated backward and amplified during the subsequent restoration. Empirically, we observe that state-of-the-art models tend to generate either over-dilated snow regions that erase legitimate texture or under-segmented masks that leave translucent residues. Both failure modes originate from the same bottleneck—once the mask is produced, it is never re-examined, depriving the restoration branch of the opportunity to correct earlier mis-estimates.
Instead of building a better snow synthesizer, we ask a different question: Can we make the network doubt its own mask and iteratively revise it? Our hypothesis is that snow removal and snow localization are mutually reinforcing: a better mask yields a cleaner image, and a cleaner image makes the mask easier to refine. To implement this closed-loop idea, we designed BAT-Net, a single-forward-pass model that contains two equally deep Transformer decoders fed by a shared hierarchical encoder. The background decoder reconstructs the clean radiance, while the snow decoder produces a soft per-flake occlusion map. Critically, every Transformer block inside the background path is preceded by a Bidirectional Attention Module (BAM) that can read the current snow belief, negate it to reveal residual artifacts, and write the corrected feature back. The operation is asymmetric: the background branch sees the snow features, but the snow branch is agnostic to the background, preventing the mask from collapsing into a trivial copy of the input. To handle the extreme scale range of snow (from sub-pixel crystals to macro-streaks), we equipped the encoder with a Scale Conversion Module (SCM) that dynamically re-calibrates the convolutional stride and a Feature Aggregation Module (FAM) that fuses pyramidal descriptors without heavy up-/downsampling. Together, these components form BAT-Net, an efficient and lightweight architecture that enables cooperative reasoning between tasks and handles extreme scale variations, as shown in Figure 1.
We also release FallingSnow, a carefully filtered dataset of real-world images that contain only dynamic flakes. By removing frozen ground snow, we reduce label noise and provide a benchmark where mask accuracy is unambiguous. Across five datasets (three synthetic, two real), BAT-Net improves the PSNR by 0.9 dB over the previous best single-stage network.
Our contributions are summarized as follows:
  • We propose BAT-Net, a coupled Transformer framework that jointly disentangles snow and scenes without explicit physical priors.
  • We design a Bidirectional Attention Module (BAM) that enables the background branch to see and undo the current snow belief.
  • We present the SCM and FAM, two lightweight encoder upgrades that handle extreme flake scale variance.
  • We release FallingSnow, a new real-world benchmark with only falling snow and pixel-accurate masks.

2. Related Works

2.1. Single-Image Snow Removal

Prior-Based Methods: In the nascent stages of single-image snow removal research, methodologies predominantly relied on prior-based approaches [2,3,4,5,6,16,17]. These methods often involved handcrafted rules or assumptions to tackle the task of snow removal. For instance, Xu et al. [2] proposed a method that involved refining the guidance image to preserve crucial detailed information during snow removal. Similarly, Yu et al. [6] approached the problem by treating it as a bilateral filtering problem and introduced a content-based saliency prior. Additionally, Wang et al. [5] developed a hierarchical scheme aimed at enhancing snow-removed images, emphasizing the sensitivity of variance across color channels.
These prior-based methods, while effective to some extent, often struggled with generalization across diverse snow conditions and failed to fully exploit the complex spatial and spectral characteristics of snow.
Learning-Based Methods: The advent of deep CNNs has led to the rise of learning-based desnowing methods [7,8,12,18,19,20,21,22]. Liu et al. [10] introduced DesnowNet, a model equipped with a multi-scale design throughout the network to capture the diverse characteristics of snow. Chen et al. [9] proposed HDCWNet, which leverages a hierarchical decomposition paradigm to gain a better understanding of snow particle sizes. Moreover, Cheng et al. [12] presented SMGARN, the Snow Mask-Guided Adaptive Residual Network, where Mask-Net predicts the snow mask and guides snow removal in the Guidance Fusion Network.
While the aforementioned learning-based methods have pushed the boundaries of desnowing performance, a critical analysis reveals a common architectural trend: they predominantly rely on unidirectional information flow. For instance, in the SMGARN [12], a snow mask is predicted first and then used to guide the background restoration. This sequential, open-loop process is inherently fragile, as any error in the initial mask estimation is directly propagated and often amplified in the final output. BAT-Net is designed to fill this critical gap. Unlike the SMGARN’s unidirectional guidance, BAT-Net introduces a bidirectional attention mechanism that allows the snow and background decoders to mutually correct and refine each other’s predictions within a single closed-loop forward pass, thereby mitigating error accumulation.

2.2. General Image Restoration Methods

In contrast to task-specific desnowing methods, numerous image restoration techniques have been developed to address a wide range of degradations, including haze, rain, noise, and more [14,15,23,24,25,26]. Zamir et al. [25] introduced MPRNet, a method that progressively learns restoration functions for degraded inputs using a multi-stage architecture. By iteratively refining the restoration process across multiple stages, MPRNet effectively enhances the quality of images affected by different types of degradation, including snow. Chen et al. [26] adopted a two-stage knowledge learning mechanism to train a model with a unified architecture and shared weights. This approach enables the model to address multiple types of adverse weather conditions after collating and examining knowledge from diverse sources. By leveraging a unified framework, the model gains versatility in handling various forms of degradation, offering practical utility in real-world scenarios where images are often subject to multiple types of deterioration simultaneously.
These general image restoration methods not only provide solutions for single degradation types but also offer a more comprehensive approach to image enhancement, ensuring improved visual quality across a broad spectrum of adverse conditions.

2.3. Vision Transformer in Image Restoration

The utilization of Vision Transformers (ViTs) in image restoration tasks has garnered attention due to their ability to effectively capture long-range dependencies in images, crucial for understanding global context and relationships among different parts of an image. Recent research has demonstrated the efficacy of ViTs in various image restoration applications [13,27,28,29,30,31]. Wang et al. [13] proposed Uformer, a U-shaped hierarchical encoder–decoder network built using the Vision Transformer block. Uformer employs a non-overlapping window-based self-attention mechanism instead of global self-attention, allowing it to capture both local and global dependencies effectively for image restoration tasks. By leveraging this innovative approach, Uformer achieves remarkable results in restoring images affected by various types of degradation. Chen et al. [31] introduced DRSformer, which incorporates a learnable top-k selection operator in the Vision Transformer block. This operator enables DRSformer to adaptively retain the most crucial attention scores from the keys, enhancing its ability to focus on relevant information during the restoration process. Their method demonstrated promising performance in single-image deraining, showcasing the potential of Vision Transformer-based approaches in addressing challenging image restoration tasks.
ViT-based restoration models like Uformer [13] and Restormer [27] excel at capturing long-range dependencies through self-attention. However, their application to desnowing often treats it as a general image-to-image translation problem without explicitly modeling the underlying physical interplay between the snow occlusion and the background scene. BAT-Net leverages the Transformer’s strength in modeling global context but uses it for a specific purpose: to facilitate a continuous, attentive dialogue between two competing interpretations of the image (snow vs. background). This represents a shift from using Transformers as a generic backbone to designing a task-specific interactive architecture where attention is used not just for feature enhancement, but for cross-task verification and refinement. The comparison of desnowing and general image restoration methods is summarized in Table 1.

3. Proposed Method

This section presents BAT-Net, a unified Transformer framework that simultaneously recovers a clean image J and a soft snow mask Z from a single snowy observation I. The overall architecture is outlined in Section 3.1. We then detail the core components, the Scale Conversion Module (SCM) and Feature Aggregation Module (FAM) for multi-scale feature unification (Section 3.2), and the Bidirectional Attention Module (BAM), which enables interactive reasoning between the two decoding tasks (Section 3.3).

3.1. Overall Pipeline

As illustrated in Figure 2, BAT-Net takes a snowy image $I \in \mathbb{R}^{H \times W \times 3}$ and produces two aligned outputs, the desnowed image $J$ and a soft snow mask $Z$, without any post-processing.
Encoder. The input is first projected into feature space by a $3 \times 3$ convolution. A three-level Transformer encoder then extracts multi-scale representations. Each level uses Channel-Wise Self-Attention (CSA) [27] to model global dependencies with $\mathcal{O}(C^2)$ complexity, crucial for capturing long-range flake patterns. Spatial resolution is halved and channel depth doubled at each level, building a feature pyramid.
Feature Unification. Multi-scale encoder features $\{F_e^i\}_{i=1}^{3}$ are harmonized by the SCM and aggregated by the FAM into a unified representation $F_u$, which encapsulates both local detail and global context.
Dual Decoder. The unified feature $F_u$ is processed by a shared Transformer block before branching into two parallel decoders. Both decoders symmetrically upsample features, doubling spatial size and halving channels at each level. The core of our framework is the Bidirectional Attention Module (BAM), integrated within each background decoder block. The BAM dynamically fuses features from the concurrent snow decoder, enabling mutual refinement between the emerging background reconstruction and snow mask prediction. This closed-loop interaction is the key to mitigating error propagation.
Training Protocol. The model was trained in an end-to-end supervised manner using a composite objective function that simultaneously supervises both output branches:
$$\mathcal{L}_{total} = \lVert J - J_{gt} \rVert_1 + \alpha \lVert Z - Z_{gt} \rVert_1,\tag{3}$$
where α is a balancing weight between background reconstruction and mask estimation losses.
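A minimal PyTorch rendering of this objective is given below; the function name and the illustrative value of α are our assumptions, as the balancing weight is not fixed here. Note that `F.l1_loss` computes a mean-reduced L1 distance, the usual practice in restoration training.

```python
import torch
import torch.nn.functional as F

def bat_net_loss(J_pred, J_gt, Z_pred, Z_gt, alpha=0.5):
    """Composite objective of Equation (3).

    alpha = 0.5 is an illustrative default, not a value reported in the
    paper; both terms are mean-reduced L1 distances.
    """
    background_term = F.l1_loss(J_pred, J_gt)   # ||J - J_gt||_1
    mask_term = F.l1_loss(Z_pred, Z_gt)         # ||Z - Z_gt||_1
    return background_term + alpha * mask_term
```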
To ensure fair evaluation and dataset-specific optimization, we adopted a separate training protocol where individual models were trained and evaluated on each benchmark dataset (CSD, SRRS, and Snow100K) independently. This approach allowed each model to specialize in the particular snow characteristics and image statistics of its respective dataset. The AdamW optimizer with cosine annealing learning rate scheduling was employed across all experiments. Comprehensive implementation details, including hyperparameter values and computational environment information, are provided in Section 4.1.

3.2. Scale Conversion Module and Feature Aggregation Module

Each encoder level in BAT-Net outputs feature tensors at different resolutions, carrying a unique mixture of local flake details and global atmospheric context. To enable the decoder to fuse these heterogeneous features without aliasing, the Scale Conversion Module (SCM) first projects them onto a common spatial grid.
The SCM begins with a lightweight $1 \times 1$ convolution to standardize the channel count to 64. It then employs two parallel downsampling branches: a max-pooling path to preserve crisp flake boundaries and an average-pooling path to maintain smooth illumination gradients. Both branches reduce the feature map to the target stride-8 size ($H/8 \times W/8$). The operations are defined as follows:
$$F_m^i = \mathrm{MaxPool}\bigl(\mathrm{Conv}_1(F_e^i)\bigr), \qquad F_a^i = \mathrm{AvgPool}\bigl(\mathrm{Conv}_1(F_e^i)\bigr),\tag{4}$$
where $\mathrm{Conv}_1$ denotes the $1 \times 1$ convolution.
The resulting pooled tensors are concatenated channel-wise. A final $1 \times 1$ convolution compresses this combined descriptor to $8C$ channels, yielding a compact yet expressive representation ready for aggregation:
$$F_k^i = \mathrm{Conv}_1\bigl(\mathrm{Concat}(F_m^i, F_a^i)\bigr).\tag{5}$$
This dual-path fusion ensures that both sharp flake boundaries and gentle illumination ramps are preserved.
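A compact PyTorch sketch of this dual-path design is given below. The 64-channel projection and the concatenated 128-channel descriptor follow the text; reaching the target stride-8 grid with adaptive pooling and the module's exact interface are our assumptions.

```python
import torch
import torch.nn as nn

class ScaleConversionModule(nn.Module):
    """Sketch of the SCM (Equations (4) and (5)): project an encoder feature
    onto the stride-8 grid via parallel max-/average-pooling paths."""
    def __init__(self, in_channels, out_channels, target_size):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, 64, kernel_size=1)    # standardize channels
        self.max_pool = nn.AdaptiveMaxPool2d(target_size)        # crisp flake boundaries
        self.avg_pool = nn.AdaptiveAvgPool2d(target_size)        # smooth illumination
        self.fuse = nn.Conv2d(128, out_channels, kernel_size=1)  # compress the concat

    def forward(self, f_e):
        f = self.proj(f_e)
        f_m, f_a = self.max_pool(f), self.avg_pool(f)            # Eq. (4)
        return self.fuse(torch.cat([f_m, f_a], dim=1))           # Eq. (5)
```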
Following scale normalization by the SCM, the Feature Aggregation Module (FAM) integrates the three converted encoder maps $\{F_k^i\}_{i=1}^{3}$. The FAM first concatenates them, then uses a $1 \times 1$ convolution for channel compression, producing a unified feature $F_u$:
$$F_u = \mathrm{Conv}_1\bigl(\mathrm{Concat}(F_k^1, F_k^2, F_k^3)\bigr).\tag{6}$$
A Channel-Wise Self-Attention (CSA) block then models long-range inter-channel dependencies, followed by a second $1 \times 1$ convolution for feature refinement. A residual connection stabilizes gradient flow, yielding the final aggregated feature $\hat{F}_u$:
$$\hat{F}_u = \mathrm{Conv}_1\bigl(\mathrm{CSA}(F_u)\bigr) + F_u.\tag{7}$$
The CSA block computes pair-wise affinities in the C × C space, providing a global receptive field across channels without quadratic spatial cost. A learnable temperature parameter λ allows the module to adaptively sharpen its focus on snow-related statistics. By re-weighting channel contributions, CSA amplifies informative responses while suppressing noise. The residual shortcut ensures stable gradient propagation.
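The following sketch pairs a simplified single-head CSA block with the FAM of Equations (6) and (7). The CSA here omits details such as multi-head splitting and query/key normalization used in Restormer [27], and the module interfaces are assumed for illustration.

```python
import torch
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    """Single-head channel-wise self-attention: affinities live in C x C
    space, so cost scales with O(C^2) rather than O((HW)^2)."""
    def __init__(self, channels):
        super().__init__()
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)
        self.temperature = nn.Parameter(torch.ones(1))   # learnable lambda

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).reshape(b, 3, c, h * w).unbind(dim=1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.temperature, dim=-1)
        return (attn @ v).reshape(b, c, h, w)    # (b, c, c) affinities re-weight v

class FeatureAggregationModule(nn.Module):
    """Sketch of the FAM (Equations (6) and (7)): concatenate, compress,
    attend over channels, refine, and add the residual shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.compress = nn.Conv2d(channels * 3, channels, kernel_size=1)
        self.csa = ChannelSelfAttention(channels)
        self.refine = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f_k1, f_k2, f_k3):
        f_u = self.compress(torch.cat([f_k1, f_k2, f_k3], dim=1))  # Eq. (6)
        return self.refine(self.csa(f_u)) + f_u                    # Eq. (7)
```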

3.3. Bidirectional Attention Module

The Bidirectional Attention Module (BAM) serves as the core interactive component between the two decoders, enabling the background branch to incorporate snow-specific contextual cues from the snow branch without significant parameter overhead. As illustrated in Figure 3, the BAM performs complementary forward and reverse attention passes, allowing the background decoder to simultaneously focus on snow-occluded regions for inpainting and preserve fine details in snow-free areas. This dual-path design implements a continuous “predict–verify–correct” mechanism that is crucial for mitigating the error propagation common in sequential estimation pipelines.
Forward Attention for Snow Localization. At decoder level $i$, the background feature $F_c^i$ and snow feature $F_s^i$ are first projected to a unified channel depth via independent $1 \times 1$ convolutions. The projected snow feature is then transformed into a soft attention map $A_s \in [0,1]^{H \times W \times C}$ via a sigmoid function, which can be interpreted as a pixel-wise probability of snow occupancy. Element-wise multiplication between the transformed background feature and $A_s$ selectively amplifies tokens likely to be snow-occluded:
$$F_a^i = \mathrm{Conv}_1(F_c^i) \odot \sigma\bigl(\mathrm{Conv}_1(F_s^i)\bigr),\tag{8}$$
where σ denotes the sigmoid function. This operation explicitly highlights regions requiring restoration, such as translucent flakes and their halos, by assigning higher weights to features activated in the snow decoder. It provides a direct pathway for the snow branch to indicate where the background reconstruction should intensify its inpainting efforts.
Reverse Attention for Detail Preservation. In parallel, the reverse pathway functions as a conservative mechanism to safeguard legitimate scene content that might be erroneously suppressed. An inverse confidence map $(1 - A_s)$ is generated, which peaks in areas the snow decoder considers clear. Element-wise multiplication with the transformed background feature reactivates and reinforces pixels in these snow-free regions:
$$F_r^i = \mathrm{Conv}_1(F_c^i) \odot \bigl(1 - \sigma(\mathrm{Conv}_1(F_s^i))\bigr).\tag{9}$$
To enhance the adaptability of this process, a learnable temperature parameter τ is incorporated to modulate the sigmoid slope, allowing the network to dynamically sharpen the inverse mask during heavy snowfall when precise localization is critical and soften it when flakes are sparse. Furthermore, a subsequent depth-wise convolution acts as a channel-wise scaling factor, adaptively boosting high-frequency responses in the reverse path. This ensures the preservation of fine structures like edges and textures, which are essential for visual realism but vulnerable to over-smoothing.
Feature Fusion and Stable Integration. The forward descriptor $F_a^i$ and reverse descriptor $F_r^i$ are concatenated along the channel dimension, providing the subsequent layers with a comprehensive view that encapsulates both snow-highlighted and snow-suppressed contexts of the same spatial location. A $1 \times 1$ convolution then compresses this combined tensor back to the original channel count of $F_c^i$, effectively fusing the complementary information. This is followed by GroupNorm and a GELU activation function to introduce beneficial non-linearity and inter-channel competition.
The refined feature is then added back to the input $F_c^i$ via a gated residual connection:
$$F_b^i = \mathrm{Conv}_1\bigl(\mathrm{Concat}(F_a^i, F_r^i)\bigr) + \gamma \cdot F_c^i,\tag{10}$$
where γ is a learnable scalar parameter initialized near zero. This gating mechanism stabilizes the training process in its early stages by allowing the network to initially rely on the identity skip connection and gradually incorporate the more complex transformations from the bidirectional attention paths as training progresses. This design not only preserves gradient flow but also enables BAT-Net to synergistically leverage contextual information in both “highlight” and “shadow” forms within a single forward pass, effectively enhancing edge sharpness and reducing ghost artifacts with minimal parameter addition.
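Putting Equations (8)–(10) together, a minimal PyTorch sketch of the BAM is shown below. The exact placement of the temperature τ, the depth-wise convolution, and the GroupNorm group count are our assumptions; the forward/reverse gating and the near-zero-initialized residual gate follow the description above.

```python
import torch
import torch.nn as nn

class BidirectionalAttentionModule(nn.Module):
    """Sketch of the BAM (Equations (8)-(10)). Forward attention highlights
    likely snow regions; reverse attention protects snow-free detail; a gated
    residual folds both back into the background feature. The GroupNorm group
    count of 8 is an assumption (channels must be divisible by 8 here)."""
    def __init__(self, channels):
        super().__init__()
        self.proj_bg = nn.Conv2d(channels, channels, kernel_size=1)
        self.proj_snow = nn.Conv2d(channels, channels, kernel_size=1)
        self.dw = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.fuse = nn.Conv2d(channels * 2, channels, kernel_size=1)
        self.norm = nn.GroupNorm(8, channels)
        self.act = nn.GELU()
        self.tau = nn.Parameter(torch.ones(1))      # learnable sigmoid temperature
        self.gamma = nn.Parameter(torch.zeros(1))   # residual gate, init near zero

    def forward(self, f_c, f_s):
        a_s = torch.sigmoid(self.tau * self.proj_snow(f_s))  # soft snow occupancy
        bg = self.proj_bg(f_c)
        f_fwd = bg * a_s                      # Eq. (8): intensify snow regions
        f_rev = self.dw(bg * (1.0 - a_s))     # Eq. (9): reinforce clear regions
        fused = self.act(self.norm(self.fuse(torch.cat([f_fwd, f_rev], dim=1))))
        return fused + self.gamma * f_c       # Eq. (10): gated residual
```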

3.4. Computational Complexity Analysis

To assess the practical efficiency of the proposed architecture, we analyze its computational complexity. The overall complexity of BAT-Net is primarily determined by its Transformer blocks with Channel-Wise Self-Attention (CSA) and the lightweight design of the Scale Conversion Module (SCM) and Bidirectional Attention Module (BAM). The CSA mechanism reduces the computational complexity of standard self-attention from $\mathcal{O}((HW)^2)$ to $\mathcal{O}(C^2)$, making global receptive field modeling feasible without excessive computational cost. Meanwhile, the SCM and BAM are deliberately constructed using standard convolutions and element-wise operations, introducing minimal parameter overhead while delivering significant performance benefits. A comprehensive quantitative comparison of model size, FLOPs, and inference speed against those of other state-of-the-art methods is provided in Section 4.7, which demonstrates that BAT-Net maintains a highly competitive efficiency profile suitable for real-world applications.
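As a back-of-the-envelope illustration (the feature size and channel count below are plausible examples of ours, not measured values), the saving from channel-wise attention is easy to quantify:

```python
# Attention cost for one 64x64 feature map with C = 192 (illustrative numbers):
hw, c = 64 * 64, 192
spatial_attention = hw ** 2 * c   # standard self-attention ~ O((HW)^2 * C), ~3.2e9
channel_attention = c ** 2 * hw   # channel-wise attention ~ O(C^2 * HW),    ~1.5e8
print(f"reduction: {spatial_attention / channel_attention:.0f}x")  # ~21x
```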

4. Experiments and Analysis

4.1. Datasets and Implementation Details

We conducted extensive experiments on both synthetic and real-world datasets to evaluate the performance of our proposed method. For synthetic evaluation, we utilized three public benchmarks: CSD [9], SRRS [11], and Snow100K [10]. A summary of all datasets used in this study is provided in Table 2.
The CSD dataset consists of a training set with 8000 synthetic image pairs and a test set with 2000 pairs. Similarly, the SRRS dataset comprises 10,000 training and 2000 test pairs. The Snow100K dataset is larger, with 20,000 training and 2000 test pairs, and includes three levels of snowfall severity.
For evaluation on real images, we used the Snow100K-Real dataset. However, we observed that existing real-snow datasets often contain a mixture of falling snow and accumulated snow on the ground, which conflates two distinct restoration challenges. To specifically study the problem of removing falling snow, we curated a new real-world dataset, FallingSnow. This dataset contains 1394 high-resolution real snowy images sourced from publicly available internet repositories under licenses that permit academic use. Each image was manually verified to ensure it contains dynamic falling snowflakes without significant ground snow accumulation. Example images from the FallingSnow dataset are provided in Figure 4.
Our framework was implemented using PyTorch 1.10.0. For all experiments, models were separately trained and evaluated on each dataset (CSD, SRRS, Snow100K) to ensure domain-specific performance. The training configuration was consistent across datasets: we employed the Adam optimizer with parameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, and a weight decay of $1 \times 10^{-4}$. The models were trained for 200 epochs on their respective training sets using the L1 loss function. We initialized the learning rate at $1 \times 10^{-4}$, which was gradually reduced to $1 \times 10^{-6}$ using cosine annealing [32]. We specified the numbers of layers $N_1, N_2, N_3, N_4$ as 2, 4, 4, and 6, respectively, while the numbers of attention heads were 1, 2, 4, and 8 and the numbers of channels were 48, 96, 192, and 384. The batch size was set to 16. During training, all images were randomly cropped into patches of size $256 \times 256$, with pixel values normalized to the range [−1, 1].
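The reported optimization setup can be reproduced with standard PyTorch components, as in the sketch below; the one-layer stand-in network and random tensors are placeholders of ours for BAT-Net and the actual data loaders.

```python
import torch
import torch.nn as nn

# Sketch of the reported setup: Adam, weight decay 1e-4, cosine annealing
# from 1e-4 down to 1e-6 over 200 epochs [32].
model = nn.Conv2d(3, 3, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=200, eta_min=1e-6)

for epoch in range(200):
    snowy = torch.rand(16, 3, 256, 256) * 2 - 1   # 16 random crops in [-1, 1]
    clean = torch.rand(16, 3, 256, 256) * 2 - 1
    loss = nn.functional.l1_loss(model(snowy), clean)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```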

4.2. Comparison with State-of-the-Art Methods

We benchmarked BAT-Net against nine competitive baselines: four snow-specific algorithms (DesnowNet [10], JSTASR [11], HDCWNet [9], SMGARN [12]) and five general-purpose restoration networks (MPRNet [25], Chen et al. [26], Uformer [13], WGWS [14], Patil et al. [15]). See Table 3 for architecture summaries. All metrics were obtained either by using officially released weights or by re-training the public code on our training splits until convergence.

4.2.1. Synthetic Image Desnowing

To quantitatively appraise desnowing quality on synthetic data, we adopted the following full-reference metrics: the PSNR (in dB) and the SSIM. As summarized in Table 4, BAT-Net surpasses every competing snow-specific and general restoration model across all three benchmarks, securing the top rank in both indices. On the challenging CSD dataset, the margin between BAT-Net and the second-best approach widens to 1.37 dB in terms of the PSNR, while the SSIM is improved by 0.012—an increment that is clearly visible to human observers. The visual comparisons in Figure 5, Figure 6 and Figure 7 corroborate the numbers.
Rival outputs frequently retain hazy flake remnants and blur fine structures such as railings, distant text, or leaf veins. In contrast, BAT-Net yields nearly residue-free images with crisper texture fidelity and more accurate color reproduction. Zoomed-in patches reveal that competing methods either over-smooth or leave translucent ghosts, whereas our results retain brickwork edges and road markings that are almost indistinguishable from the ground truth image. The consistent quantitative lead and perceptual clarity jointly underscore BAT-Net’s superior efficacy on synthetic snowy scenes, providing a reliable foundation for subsequent evaluations on real-world data.
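For reference, the full-reference scores above can be computed with standard implementations, as in the following scikit-image sketch. The uint8 RGB evaluation convention is an assumption on our part, since some desnowing works compute the PSNR on the luminance channel only.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(restored: np.ndarray, ground_truth: np.ndarray):
    """Full-reference PSNR (dB) and SSIM for one HxWx3 uint8 image pair."""
    psnr = peak_signal_noise_ratio(ground_truth, restored, data_range=255)
    ssim = structural_similarity(ground_truth, restored,
                                 channel_axis=-1, data_range=255)
    return psnr, ssim
```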

4.2.2. Real-Image Desnowing

We further validate BAT-Net on two real-world benchmarks—Snow100K-Real and the newly curated FallingSnow—whose images were captured under authentic meteorological conditions with varying flake sizes, densities, and illumination. Representative results are displayed in Figure 8 and Figure 9. Consistent with the synthetic evaluation, competing snow-specific networks and general restoration backbones frequently leave hazy white residues or produce unnaturally desaturated regions. In contrast, BAT-Net—although trained exclusively on synthetic pairs—demonstrates strong cross-domain generalization. The bidirectional attention gate accurately separates cool-toned flake reflections from legitimate bright objects, so most snow is removed, while the original hue and micro-texture remain intact. This favorable real-world performance highlights BAT-Net’s robustness and adaptability, confirming that the snow–background disentanglement learned from synthetic data generalizes well to authentic meteorological conditions without additional fine-tuning.

4.3. Qualitative Summary

To holistically assess the performance and generalization capability of all compared methods, we provide a comprehensive visual summary in Figure 10.
This unified grid juxtaposes the outputs of key models across all five benchmark datasets, encompassing both synthetic and real-world snowy conditions. The side-by-side comparison immediately reveals that, while other methods perform well on certain datasets or scene types, their results can be inconsistent, often exhibiting residual haze, over-smoothing, or incomplete snow removal in more challenging cases. In contrast, BAT-Net demonstrates remarkable consistency, delivering high-quality, artifact-free restoration across the entire spectrum of conditions—from controlled synthetic snow to complex real-world captures. This visual evidence strongly corroborates the quantitative findings and underscores the practical advantage of our method’s closed-loop, cooperative architecture in achieving robust desnowing performance.

4.4. Snow Mask Prediction

Beyond recovering a clean backdrop, we evaluated BAT-Net’s ability to predict precise snow masks, adopting both pixel-level metrics and visual scrutiny. Table 5 reports the PSNR and SSIM achieved on the synthetic CSD benchmark, while Figure 11 provides qualitative comparisons with the current best-published method, SMGARN [12]. Quantitatively, BAT-Net attains a PSNR of 26.67 dB and an SSIM of 0.732, surpassing the second-best competitor by 1.72 dB and 0.065 points, respectively. Qualitatively, BAT-Net reproduces minute crystals that SMGARN [12] blurs into uniform blobs. These results confirm that the bidirectional attention mechanism not only removes snow but also delineates it with pixel-level accuracy, providing a reliable cue for downstream tasks such as flake motion estimation or winter-scene editing.

4.5. Object Detection

Adverse weather such as snow, rain, or haze degrades contrast and edge clarity, causing standard object detectors to drop instances or mis-classify categories. We therefore quantified how well different desnowing strategies restore detection reliability. A YOLOv4 [34] model pre-trained on fair-weather data was applied to the entire test set after snow removal; no fine-tuning on wintry scenes was performed, so the detector’s performance reflected the visual quality produced by each method. The results are depicted in Figure 12. Upon reviewing the results, it is evident that YOLOv4 [34] mistakenly detects a bicycle in the snowy image. Additionally, it fails to detect the car on the right in the results obtained by SMGARN [12]. In contrast, the detection results of BAT-Net closely align with the ground truth, showcasing its effectiveness in snow removal and background restoration. Analysis reveals that competing methods frequently leave translucent residues or over-smooth regions, causing the detector to assign low-confidence or incorrect labels. BAT-Net, in contrast, produces radiometrically faithful backgrounds that stay within the detector’s training distribution, thereby reducing false negatives and category confusion. These results corroborate that accurate snow removal is not merely an esthetic improvement but a practical prerequisite for reliable high-level vision in winter environments.

4.6. Ablation Studies

4.6.1. Contributions of Core Components

To rigorously isolate the contribution of each proposed module, we systematically removed them from the full BAT-Net model and report the results achieved on the CSD dataset in Table 6.
The findings are clear and definitive:
  • SCM: Removing the Scale Conversion Module results in a significant drop of 0.94 dB in PSNR, confirming that harmonizing multi-scale features onto a uniform grid is crucial for effective subsequent processing.
  • FAM: Ablating the Feature Aggregation Module while keeping the SCM leads to a 0.73 dB decrease, demonstrating that cross-scale channel-wise attention is independently vital for integrating fine edges with global illumination contexts.
  • Snow Decoder: Disabling the snow decoder and its bidirectional attention causes the most substantial performance loss of 2.47 dB in PSNR, underscoring that explicit snow mask prediction and the cooperative “challenge” between decoders are the most indispensable elements for high-quality background inpainting.
The visual comparisons in Figure 13 corroborate these numerical findings. The model without the SCM produces blocky artifacts, the one without the FAM fails to restore contrast adequately, and the full model yields the cleanest edges with minimal residual haze. Collectively, these ablations verify that every design choice in BAT-Net—scale harmonization, token aggregation, and bidirectional snow–background dialogue—contributes uniquely and meaningfully to the final desnowing quality.

4.6.2. Study of the Bidirectional Attention Module

Table 7 and Figure 14 summarize the ablation experiments evaluating the Bidirectional Attention Module (BAM) using the CSD dataset. As shown quantitatively, both forward and reverse attention mechanisms individually improve desnowing performance, while their integration in the full BAM yields further gains of 1.57 dB in PSNR and 0.010 in SSIM. These numerical results are corroborated by the visual comparisons in Figure 14, collectively validating the effectiveness of the proposed BAM architecture in BAT-Net.

4.7. Cross-Weather Validation

To further probe the robustness and generalization capability of the proposed BAT-Net beyond the domain it was trained on, we conducted a cross-weather validation experiment. We evaluated its zero-shot performance on two representative adverse weather datasets: the RTTS dataset [35] for real-world haze and the SPA dataset [36] for rain streaks. Notably, the model was not retrained or fine-tuned on any data from these new weather conditions.
The qualitative results, showcased in Figure 15, reveal a compelling finding: BAT-Net exhibits a notable ability to mitigate other weather degradations. As shown in the first pair of images, the model effectively removes a significant portion of the haze from the input, improving visibility and contrast. Similarly, for the rainy scene, the model successfully removes a majority of the rain streaks without introducing noticeable blurring or artifacts into the background.
We posit that this emergent capability stems from the shared underlying physics of weather degradations—often modeled as a combination of attenuation and additive noise—and the powerful, content-aware restoration learned by our bidirectional architecture. While its performance understandably falls short of models trained specifically for deraining or dehazing, this intrinsic generalization ability underscores the robustness and potential utility of BAT-Net as a more universal adverse weather restoration tool, which is a promising direction for future work.

4.8. Failure Cases Analysis

To objectively assess the boundaries of our model’s capability, we analyzed its performance in challenging scenarios where it underperformed. Figure 16 showcases two representative failure cases.
Night Scenes with Snow. In low-light conditions from the Snow100K-Real dataset (Figure 16a,b), BAT-Net exhibits two primary shortcomings: incomplete snow removal and color distortion in the output. The model struggles to separate snowflakes from noise and grain inherent in poorly lit scenes, partly because our training data primarily consist of well-illuminated synthetic snow. Furthermore, the model’s difficulty in accurately estimating and preserving the color palette under complex nighttime illumination leads to undesirable color shifts.
Heavy Snowfall with Light Sources. In dense snowfall from the FallingSnow dataset (Figure 16c,d), the model fails to remove all snow particles, leaving residual noise. More notably, it mistakenly removes certain bright light sources, likely because intense, localized lights share high-intensity characteristics with snowflakes, confusing the model’s identification process. This results in a loss of the scene’s atmospheric context.
These limitations highlight specific avenues for future work. Augmenting the training set with more diverse and challenging real-world examples, particularly night scenes and various heavy snow conditions, is crucial. Furthermore, enhancing the model’s ability to understand global scene context and illumination could help it better distinguish between light sources and snow, preventing over-correction.

4.9. Study of Hyperparameters and Model Complexity

In this subsection, we investigate the impact of hyperparameters and model complexity on the performance of snow removal networks. Four hyperparameter configurations were tested, with varying layer numbers, attention heads, and channel numbers. Specifically, we considered two settings for layer numbers (4,4,4,4 and 2,4,4,6) and two settings for attention heads (2,2,4,4 and 1,2,4,8). The channel numbers corresponding to attention heads were 96,96,192,192 and 48,96,192,384, respectively. The comparison results are summarized in Table 8.
Furthermore, we analyzed the model complexity across different configurations. The results, presented in Table 9, indicate that BAT-Net exhibits the fewest FLOPs and the shortest inference time, making it a relatively lightweight model. Notably, BAT-Net achieves an inference speed of 9.32 FPS, which is the highest among the compared methods and approaches real-time performance, fulfilling the computational demands of practical applications. This characteristic holds significant practical value, particularly for scenarios such as autonomous driving and monitoring systems, where computing resources are constrained and real-time processing is crucial. The lightweight nature of BAT-Net ensures higher performance within resource-constrained environments.

5. Conclusions

In this paper, we have shown that the apparent conflict between “removing snow” and “estimating where snow is” can be turned into a cooperative game. BAT-Net couples the two tasks inside a single Transformer backbone and lets each decoder continuously challenge the other; the snow branch proposes an occlusion hypothesis, while the background branch uses a bidirectional attention gate to accept, reject, or refine that hypothesis on the fly. The resulting loop is closed within one forward pass, eliminating the need for costly iterative inference yet yielding images that are 0.9 dB cleaner than those produced by the best cascading schemes. Beyond the raw numbers, our study shows that long-range modeling matters: the combination of the SCM and FAM allows the network to consider flakes that span three octaves in scale without exploding the parameter count.
Looking toward practical deployment, BAT-Net’s efficient snow elimination is well-suited for real-world systems. Its ability to operate in real time makes it a viable pre-processor for autonomous driving pipelines and outdoor surveillance systems, ensuring reliable performance in heavy snowfall. Furthermore, the underlying architecture holds promise for generalizing to other weather degradations like rain and haze, thereby positioning it as a versatile framework for all-weather computer vision. By providing autonomous vehicles and outdoor robots with clean visual inputs, our method enhances situational awareness and operational safety during winter conditions.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Zhang, Y.; Yan, D. Simultaneous snow mask prediction and single image desnowing with a bidirectional attention transformer network. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Urumqi, China, 18–20 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 294–308. [Google Scholar]
  2. Xu, J.; Zhao, W.; Liu, P.; Tang, X. An improved guidance image based method to remove rain and snow in a single image. Comput. Inf. Sci. 2012, 5, 49. [Google Scholar] [CrossRef]
  3. Zheng, X.; Liao, Y.; Guo, W.; Fu, X.; Ding, X. Single-image-based rain and snow removal using multi-guided filter. In Proceedings of the Neural Information Processing: 20th International Conference, ICONIP 2013, Daegu, Korea, 3–7 November 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 258–265, Proceedings, Part III 20. [Google Scholar]
  4. Pei, S.C.; Tsai, Y.T.; Lee, C.Y. Removing rain and snow in a single image using saturation and visibility features. In Proceedings of the 2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), Chengdu, China, 14–18 July 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 1–6. [Google Scholar]
  5. Wang, Y.; Liu, S.; Chen, C.; Zeng, B. A hierarchical approach for rain or snow removing in a single color image. IEEE Trans. Image Process. 2017, 26, 3936–3950. [Google Scholar] [CrossRef] [PubMed]
  6. Yu, S.; Zhao, Y.; Mou, Y.; Wu, J.; Han, L.; Yang, X.; Zhao, B. Content-adaptive rain and snow removal algorithms for single image. In Proceedings of the Advances in Neural Networks–ISNN 2014: 11th International Symposium on Neural Networks, ISNN 2014, Hong Kong SAR, China; Macao SAR, China, 28 November–1 December 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 439–448, Proceedings 11. [Google Scholar]
  7. Zhang, K.; Li, R.; Yu, Y.; Luo, W.; Li, C. Deep dense multi-scale network for snow removal using semantic and depth priors. IEEE Trans. Image Process. 2021, 30, 7419–7431. [Google Scholar] [CrossRef] [PubMed]
  8. Li, P.; Yun, M.; Tian, J.; Tang, Y.; Wang, G.; Wu, C. Stacked dense networks for single-image snow removal. Neurocomputing 2019, 367, 152–163. [Google Scholar] [CrossRef]
  9. Chen, W.T.; Fang, H.Y.; Hsieh, C.L.; Tsai, C.C.; Chen, I.; Ding, J.J.; Kuo, S.Y. All snow removed: Single image desnowing algorithm using hierarchical dual-tree complex wavelet representation and contradict channel loss. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 4196–4205. [Google Scholar]
  10. Liu, Y.F.; Jaw, D.W.; Huang, S.C.; Hwang, J.N. DesnowNet: Context-aware deep network for snow removal. IEEE Trans. Image Process. 2018, 27, 3064–3073. [Google Scholar] [CrossRef] [PubMed]
  11. Chen, W.T.; Fang, H.Y.; Ding, J.J.; Tsai, C.C.; Kuo, S.Y. JSTASR: Joint size and transparency-aware snow removal algorithm based on modified partial convolution and veiling effect removal. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 754–770, Proceedings, Part XXI 16. [Google Scholar]
  12. Cheng, B.; Li, J.; Chen, Y.; Zeng, T. Snow mask guided adaptive residual network for image snow removal. Comput. Vis. Image Underst. 2023, 236, 103819. [Google Scholar] [CrossRef]
  13. Wang, Z.; Cun, X.; Bao, J.; Zhou, W.; Liu, J.; Li, H. Uformer: A general u-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on CVPR, New Orleans, LA, USA, 19–24 June 2022; pp. 17683–17693. [Google Scholar]
  14. Zhu, Y.; Wang, T.; Fu, X.; Yang, X.; Guo, X.; Dai, J.; Qiao, Y.; Hu, X. Learning Weather-General and Weather-Specific Features for Image Restoration Under Multiple Adverse Weather Conditions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 21747–21758. [Google Scholar]
  15. Patil, P.W.; Gupta, S.; Rana, S.; Venkatesh, S.; Murala, S. Multi-weather Image Restoration via Domain Translation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 21696–21705. [Google Scholar]
  16. Wang, C.; Shen, M.; Yao, C. Rain streak removal by multi-frame-based anisotropic filtering. Multimed. Tools Appl. 2017, 76, 2019–2038. [Google Scholar] [CrossRef]
  17. Ding, X.; Chen, L.; Zheng, X.; Huang, Y.; Zeng, D. Single image rain and snow removal via guided L0 smoothing filter. Multimed. Tools Appl. 2016, 75, 2697–2712. [Google Scholar] [CrossRef]
  18. Fazlali, H.; Shirani, S.; Bradford, M.; Kirubarajan, T. Single image rain/snow removal using distortion type information. Multimed. Tools Appl. 2022, 81, 14105–14131. [Google Scholar] [CrossRef]
  19. Jaw, D.W.; Huang, S.C.; Kuo, S.Y. DesnowGAN: An efficient single image snow removal framework using cross-resolution lateral connection and GANs. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 1342–1350. [Google Scholar] [CrossRef]
  20. Li, Z.; Zhang, J.; Fang, Z.; Huang, B.; Jiang, X.; Gao, Y.; Hwang, J.N. Single image snow removal via composition generative adversarial networks. IEEE Access 2019, 7, 25016–25025. [Google Scholar] [CrossRef]
  21. Cheng, Y.; Ren, H.; Zhang, R.; Lu, H. Context-aware coarse-to-fine network for single image desnowing. Multimed. Tools Appl. 2023, 83, 55903–55920. [Google Scholar] [CrossRef]
  22. Li, Y.; Qian, Q.; Duan, H.; Min, X.; Xu, Y.; Jiang, X. Boosting power line inspection in bad weather: Removing weather noise with channel-spatial attention-based UNet. Multimed. Tools Appl. 2023, 83, 88429–88445. [Google Scholar] [CrossRef]
  23. Li, B.; Liu, X.; Hu, P.; Wu, Z.; Lv, J.; Peng, X. All-in-one image restoration for unknown corruption. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 17452–17462. [Google Scholar]
  24. Valanarasu, J.M.J.; Yasarla, R.; Patel, V.M. Transweather: Transformer-based restoration of images degraded by adverse weather conditions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 2353–2363. [Google Scholar]
  25. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H.; Shao, L. Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021; pp. 14821–14831. [Google Scholar]
  26. Chen, W.T.; Huang, Z.K.; Tsai, C.C.; Yang, H.H.; Ding, J.J.; Kuo, S.Y. Learning multiple adverse weather removal via two-stage knowledge learning and multi-contrastive regularization: Toward a unified model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17653–17662. [Google Scholar]
  27. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5728–5739. [Google Scholar]
  28. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 1833–1844. [Google Scholar]
  29. Lee, H.; Choi, H.; Sohn, K.; Min, D. Knn local attention for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2139–2149. [Google Scholar]
  30. Song, Y.; He, Z.; Qian, H.; Du, X. Vision transformers for single image dehazing. IEEE Trans. Image Process. 2023, 32, 1927–1941. [Google Scholar] [CrossRef] [PubMed]
  31. Chen, X.; Li, H.; Li, M.; Pan, J. Learning A Sparse Transformer Network for Effective Image Deraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–23 June 2023; pp. 5896–5905. [Google Scholar]
  32. Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
  33. Yang, W.; Tan, R.T.; Feng, J.; Liu, J.; Guo, Z.; Yan, S. Deep joint rain detection and removal from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 June 2017; pp. 1357–1366. [Google Scholar]
  34. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  35. Li, B.; Ren, W.; Fu, D.; Tao, D.; Feng, D.; Zeng, W.; Wang, Z. Benchmarking single-image dehazing and beyond. IEEE Trans. Image Process. 2018, 28, 492–505. [Google Scholar] [CrossRef] [PubMed]
  36. Wang, T.; Yang, X.; Xu, K.; Chen, S.; Zhang, Q.; Lau, R.W. Spatial attentive single-image deraining with a high quality real rain dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 12270–12279. [Google Scholar]
Figure 1. Quantitative comparisons of our method and other state-of-the-art algorithms (Uformer [13], WGWS [14], and Patil [15]) applied to the CSD dataset [9] demonstrate the superior performance of our BAT-Net. The symbols ↑ and ↓ indicate higher is better, and lower is better, respectively.
Figure 2. The framework of BAT-Net comprises an image encoder, background decoder, and snow decoder. It incorporates a Scale Conversion Module (SCM) and a Feature Aggregation Module (FAM) for multi-scale feature utilization. The Bidirectional Attention Module (BAM) enhances interaction between background and snow features.
Figure 3. Architecture of the Bidirectional Attention Module (BAM), comprising complementary forward and reverse attention paths. The forward attention branch utilizes the snow feature for image desnowing, while the reverse attention branch utilizes the reverse snow feature.
Figure 4. Representative samples from the proposed FallingSnow dataset, illustrating the diversity of its real-world scenarios. The collection encompasses various scenes, including urban environments, portraits, night conditions, top-down views, and heavy snowstorms, ensuring comprehensive coverage of the challenges posed by falling snow. The Korean text on the signboard reads “Youanta Securities”.
Figure 5. Qualitative evaluation of synthetic snow from the CSD dataset [9]. “GT” refers to ground truth in this and all subsequent figures. Our BAT-Net effectively removes snow artifacts while preserving image structures and textures, outperforming existing methods in visual quality. The Chinese text on the signboard is the shop name “Yi Bao”.
Figure 6. Visual comparison on the SRRS dataset [11]. The proposed method demonstrates robust snow removal across varying snow densities and complex scenes, producing cleaner and more natural-looking results.
Figure 7. Desnowing performance on the Snow100K synthetic dataset [10]. BAT-Net achieves comprehensive snow elimination while maintaining color fidelity and edge sharpness.
Figure 8. Real-world snow removal evaluation on the Snow100K-Real dataset [10]. Our approach shows superior generalization to authentic snow patterns, effectively handling semi-transparent flakes.
Figure 9. Performance assessment on the FallingSnow real-world dataset. The results validate BAT-Net’s capability in processing dynamic falling snow in diverse natural scenes with minimal residual artifacts and well-preserved background details. The Japanese text on the lantern reads “Kobune-chō”.
Figure 10. Visual comparison summary across datasets and methods. The rows, from top to bottom, correspond to the CSD, SRRS, Snow100K (synthetic), Snow100K-Real, and FallingSnow (real-world) datasets. The columns show the input image followed by the outputs of the compared desnowing methods. BAT-Net consistently produces the most visually coherent and artifact-free results across all conditions, demonstrating superior robustness and generalization. The Chinese characters on the signboard form part of a shop name.
Figure 11. Snow mask prediction results on the CSD dataset [9]. Our method accurately localizes both dense and sparse snowflakes, providing precise spatial guidance for the restoration process.
Figure 12. Downstream object detection performance on restored images from the CSD dataset [9]. Detection results on BAT-Net-processed images show improved accuracy and reliability, demonstrating the practical utility of our desnowing method for vision-based applications.
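The comparison behind Figure 12 follows a restore-then-detect pipeline. The sketch below is one way to reproduce such an evaluation; the detector choice (torchvision’s Faster R-CNN, requiring torchvision ≥ 0.13) and the `desnow_model` interface are assumptions, as the caption does not specify them.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Stand-in detector; the paper does not name its detection model here
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def detect_after_desnow(desnow_model, snowy_batch):
    """Restore first, then detect. snowy_batch: (N, 3, H, W) floats in [0, 1]."""
    clean = desnow_model(snowy_batch)   # desnowed images, same shape and range
    return detector(list(clean))        # per-image dicts with boxes, labels, scores
```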
Figure 13. Ablation study of the Scale Conversion Module (SCM), Feature Aggregation Module (FAM), and snow decoder, applied to an image from CSD [9]. (a) Without snow decoder: severe residual snow and poor restoration. (b) Without SCM: blocky artifacts. (c) Without FAM: inadequately restored image contrast. (d) Full BAT-Net model: the cleanest edges with minimal residual haze. The inscription on the plaque is a celebrated honorific title meaning “The Greatest Fortress Under Heaven”.
Figure 14. Ablation study on the Bidirectional Attention Module (BAM) using a sample from the CSD dataset [9]. Visual results correspond to the following configurations: (a) baseline (without BAM), (b) reverse attention only, (c) forward attention only, and (d) full BAM. The complete module yields the most effective snow removal and detail preservation.
Figure 15. Qualitative results of cross-weather generalization. BAT-Net, trained solely on snowy images, demonstrates a remarkable capability to generalize to other adverse weather conditions. From left to right: (a) a hazy image from the RTTS dataset and (b) its restored result by BAT-Net, showing effective haze removal; (c) a rainy image from the SPA dataset and (d) its restored result, where most rain streaks are eliminated and the image structure preserved. The Chinese texts on the traffic sign are place names.
Figure 16. Typical failure cases of BAT-Net: (a,b) a night scene from Snow100K-Real, showing incomplete snow removal and a noticeable color shift in the restored image; (c,d) a heavy snow scenario from FallingSnow, showing residual snow particles and over-aggressive removal of light sources.
Table 1. Comparative summary of representative snow removal and image restoration models.

| Model | Architecture | Input/Output | Core Mechanism | Limitations/Addressed Gap |
|---|---|---|---|---|
| DesnowNet [10] | Multi-scale CNN | Snowy Image / Clean Image | Multi-scale convolutional network to capture snow of different sizes. | Lacks explicit snow modeling; struggles with complex, semi-transparent snow. |
| SMGARN [12] | Two-stage CNN | Snowy Image / Clean Image + Snow Mask | Predicts a snow mask first, then uses it to guide restoration. | Unidirectional flow: errors in the initial mask are baked into the final output. |
| Uformer [13] | U-shaped Transformer | Degraded Image / Clean Image | Non-overlapping window-based self-attention for local–global modeling. | A general-purpose restorer; does not explicitly model the snow–background physical interaction. |
| WGWS [14] | CNN with gating | Degraded Image / Clean Image | A gating mechanism to fuse features from snowy and pre-denoised images. | Implicitly handles snow; lacks a dedicated, interactive module for occlusion reasoning. |
| BAT-Net (ours) | Dual-decoder Transformer | Snowy Image / Clean Image + Snow Mask | Bidirectional Attention Module (BAM) for closed-loop, mutual correction between decoders. | Addresses the issues of fragile unidirectional pipelines by enabling real-time cross-decoder verification and refinement. |
Table 2. Summary of the datasets used for evaluation.

| Dataset | Type | Training Images | Test Images | Characteristics |
|---|---|---|---|---|
| CSD [9] | Synthetic | 8000 | 2000 | Large-scale synthetic dataset with diverse snow degradation. |
| SRRS [11] | Synthetic | 10,000 | 2000 | Features snow of different sizes and transparency levels. |
| Snow100K [10] | Synthetic | 20,000 | 2000 | Contains three levels of snowfall severity (light, medium, heavy). |
| Snow100K-Real [10] | Real | – | 1329 | Real snow scenes, often containing both falling and accumulated snow. |
| FallingSnow (ours) | Real | – | 1394 | Curated real-world dataset featuring only falling snow, eliminating the confounding factor of ground snow accumulation. |
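Each synthetic entry in Table 2 pairs a snowy image with its clean ground truth (and, for mask supervision, a snow mask). A loader for such pairs might look like the following sketch; the `snowy/`, `gt/`, and `mask/` directory layout with matching filenames is a hypothetical convention, not the datasets’ official structure.

```python
from pathlib import Path
from torch.utils.data import Dataset
from torchvision.io import read_image

class PairedSnowDataset(Dataset):
    """Loads (snowy, clean, mask) triplets with matching filenames (assumed layout)."""
    def __init__(self, root):
        self.root = Path(root)
        self.files = sorted((self.root / "snowy").glob("*.png"))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, i):
        name = self.files[i].name
        load = lambda d: read_image(str(self.root / d / name)).float() / 255.0
        return load("snowy"), load("gt"), load("mask")
```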
Table 3. Details of the comparison methods.

| Category | Method | Source |
|---|---|---|
| Desnowing Methods | DesnowNet [10] | TIP’2018 |
| | JSTASR [11] | ECCV’2020 |
| | HDCWNet [9] | ICCV’2021 |
| | SMGARN [12] | CVIU’2023 |
| Multiple Degradations Removal | MPRNet [25] | CVPR’2021 |
| | Chen et al. [26] | CVPR’2022 |
| | Uformer [13] | CVPR’2022 |
| | WGWS [14] | CVPR’2023 |
| | Patil et al. [15] | ICCV’2023 |
Table 4. Quantitative comparison of desnowing performance on three benchmark datasets. PSNR↑/SSIM↑ results are reported as mean ± standard deviation over multiple runs. The best results are in bold, and the second-best are underlined.

| Method | CSD [9] | SRRS [11] | Snow100K [10] |
|---|---|---|---|
| Snowy | 14.26/0.692 | 16.51/0.787 | 22.51/0.759 |
| DesnowNet [10] | 20.13/0.815 | 20.38/0.844 | 30.50/0.941 |
| JSTASR [11] | 27.96/0.883 | 25.82/0.896 | 23.12/0.866 |
| HDCWNet [9] | 29.06/0.914 | 27.78/0.928 | 31.54/0.951 |
| SMGARN [12] | 31.93/0.952 ± 0.10/0.002 | 29.14/0.947 ± 0.11/0.002 | 31.92/0.933 ± 0.07/0.003 |
| MPRNet [25] | 33.98/0.971 | 30.37/0.960 | 33.87/0.952 |
| Chen et al. [26] | 33.62/0.961 | 30.17/0.953 | 32.67/0.939 |
| Uformer [13] | 33.80/0.964 ± 0.08/0.002 | 30.72/0.968 ± 0.11/0.001 | 33.81/0.947 ± 0.07/0.001 |
| WGWS [14] | 33.96/0.968 ± 0.09/0.002 | 30.55/0.965 ± 0.05/0.002 | 34.21/0.953 ± 0.07/0.002 |
| Patil et al. [15] | 34.41/0.974 ± 0.07/0.003 | 31.21/0.967 ± 0.11/0.001 | 34.11/0.950 ± 0.09/0.002 |
| BAT-Net | **35.78/0.976 ± 0.06/0.001** | **32.13/0.971 ± 0.07/0.002** | **34.62/0.957 ± 0.10/0.002** |
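For reference, PSNR/SSIM numbers of the kind reported in Table 4 are typically computed as below (scikit-image shown here; the paper does not state which implementation it uses, and its ± terms come from repeating runs, not from per-image spread). The toy arrays are stand-ins for real restored/ground-truth pairs.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(restored, gt):
    """PSNR (dB) and SSIM for one restored/ground-truth pair of uint8 RGB images."""
    psnr = peak_signal_noise_ratio(gt, restored, data_range=255)
    ssim = structural_similarity(gt, restored, data_range=255, channel_axis=-1)
    return psnr, ssim

# Toy stand-ins for a restored image and its ground truth (replace with real data)
rng = np.random.default_rng(0)
gt = rng.integers(0, 256, (128, 128, 3), dtype=np.uint8)
restored = np.clip(gt.astype(int) + rng.integers(-5, 6, gt.shape), 0, 255).astype(np.uint8)
print("PSNR %.2f dB, SSIM %.3f" % evaluate_pair(restored, gt))
```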
Table 5. Quantitative comparison of snow mask prediction performance on the Snow100K test set. Results are presented as mean ± standard deviation, with the best results highlighted in bold. The symbol ↑ indicates that higher is better.

| Method | PSNR↑/SSIM↑ |
|---|---|
| JORDER [33] | 19.95/0.392 ± 0.14/0.004 |
| DesnowNet [10] | 22.01/0.566 ± 0.10/0.002 |
| JSTASR [11] | 23.67/0.621 ± 0.11/0.003 |
| SMGARN [12] | 24.95/0.667 ± 0.09/0.002 |
| BAT-Net | **26.67/0.732 ± 0.08/0.002** |
Table 6. Ablation study on the core components of BAT-Net. We report the PSNR (dB)/SSIM achieved on the CSD test set. ‘−’ indicates the removal of a component from the full model. The result of the component whose removal causes the largest performance drop is highlighted in bold.

| Model Variant | SCM | FAM | Snow Decoder | PSNR/SSIM |
|---|---|---|---|---|
| BAT-Net (Full) | ✓ | ✓ | ✓ | 35.78/0.976 |
| −SCM | – | ✓ | ✓ | 34.84/0.965 |
| −FAM | ✓ | – | ✓ | 35.05/0.967 |
| −Snow Decoder | ✓ | ✓ | – | **33.31/0.961** |
Table 7. Ablation study of the Bidirectional Attention Module (BAM) configurations evaluated on the CSD dataset [9]. Performance is measured by PSNR (dB) and SSIM, with the best results in bold.

| Configuration | Forward Attention | Reverse Attention | PSNR↑/SSIM↑ |
|---|---|---|---|
| (a) Baseline (w/o BAM) | – | – | 34.21/0.966 |
| (b) Reverse only | – | ✓ | 35.12/0.971 |
| (c) Forward only | ✓ | – | 35.34/0.973 |
| (d) Full BAM | ✓ | ✓ | **35.78/0.976** |
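The four configurations in Table 7 amount to enabling or disabling the two attention paths. In terms of the earlier BAMSketch-style gating, a hypothetical toggle could look like this:

```python
def bam_variant(bg, attn, use_forward=True, use_reverse=True):
    """Illustrative mapping of Table 7's configurations to path toggles."""
    parts = []
    if use_forward:                      # rows (c) and (d)
        parts.append(bg * attn)
    if use_reverse:                      # rows (b) and (d)
        parts.append(bg * (1.0 - attn))
    return sum(parts) if parts else bg   # row (a): BAM disabled entirely
```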
Table 8. Ablation study on hyperparameter configurations evaluated on the CSD dataset [9]. The configurations vary the number of Transformer layers and attention heads across four encoder stages. Performance is measured by PSNR (dB) and SSIM, with the best results in bold.

| Config. | Layer Numbers (4 Stages) | Attention Heads (4 Stages) | PSNR↑/SSIM↑ |
|---|---|---|---|
| (a) | 4, 4, 4, 4 | 2, 2, 4, 4 | 34.66/0.967 |
| (b) | 4, 4, 4, 4 | 1, 2, 4, 8 | 34.92/0.971 |
| (c) | 2, 4, 4, 6 | 2, 2, 4, 4 | 35.07/0.970 |
| (d) | 2, 4, 4, 6 | 1, 2, 4, 8 | **35.78/0.976** |
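Configuration (d) in Table 8, the best-performing setting, can be written as a compact config; the key names below are illustrative, not taken from the released code.

```python
# Illustrative encoder configuration matching Table 8, row (d)
encoder_cfg = {
    "stage_layers": (2, 4, 4, 6),  # Transformer blocks in encoder stages 1-4
    "stage_heads": (1, 2, 4, 8),   # attention heads in encoder stages 1-4
}
assert len(encoder_cfg["stage_layers"]) == len(encoder_cfg["stage_heads"]) == 4
```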
Table 9. Comparison of model complexity and runtime efficiency. FLOPs are measured in giga (G), parameters in millions (M), and inference time in seconds (s) per image. FPS is frames per second. The best results are highlighted in bold.

| Model | FLOPs (G) | Params (M) | Inference Time (s) | FPS |
|---|---|---|---|---|
| Uformer [13] | 347.6 | 50.9 | 0.1737 | 5.76 |
| WGWS [14] | 996.2 | 12.6 | 0.1919 | 5.21 |
| Patil et al. [15] | 1262.9 | **11.1** | 0.1098 | 9.11 |
| BAT-Net (ours) | **307.4** | 43.2 | **0.1073** | **9.32** |
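Runtime numbers like those in Table 9 are commonly obtained with a warmed-up, synchronized timing loop such as the sketch below. This is one possible protocol under assumed settings; the paper does not state its input resolution or measurement setup, and counting FLOPs would additionally require a tool such as `thop` or `fvcore`.

```python
import time
import torch

@torch.no_grad()
def profile(model, shape=(1, 3, 256, 256), warmup=10, iters=50, device="cuda"):
    """Return parameter count (M), seconds per image, and FPS (assumed protocol)."""
    model = model.to(device).eval()
    x = torch.randn(shape, device=device)
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    for _ in range(warmup):            # warm-up to exclude one-time setup costs
        model(x)
    torch.cuda.synchronize()           # make sure queued GPU work has finished
    t0 = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    sec = (time.perf_counter() - t0) / iters
    return params_m, sec, 1.0 / sec
```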