Article

SnowMamba: Achieving More Precise Snow Removal with Mamba

Guoqiang Wang, Yanyun Zhou, Fei Shi and Zhenhong Jia
1 School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China
2 Key Laboratory of Signal Detection and Processing, Xinjiang University, Urumqi 830046, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(10), 5404; https://doi.org/10.3390/app15105404
Submission received: 19 February 2025 / Revised: 28 April 2025 / Accepted: 8 May 2025 / Published: 12 May 2025

Abstract

Due to the diversity and semi-transparency of snowflakes, accurately locating and reconstructing background information during image restoration poses a significant challenge. Snowflakes obscure image details, thereby affecting downstream tasks such as object recognition and image segmentation. Although Convolutional Neural Networks (CNNs) and Transformers have achieved promising results in snow removal through local or global feature processing, residual snowflakes or shadows persist in restored images. Inspired by the recent popularity of State Space Models (SSMs), this paper proposes a Mamba-based multi-scale desnowing network (SnowMamba), which effectively models the long-range dependencies of snowflakes. This enables the precise localization and removal of snow particles, addressing the issue of residual snowflakes and shadows in images. Specifically, we design a four-stage encoder–decoder network that incorporates Snow Caption Mamba (SCM) and SE modules to extract comprehensive snowflake and background information. The extracted multi-scale snow and background features are then fed into the proposed Multi-Scale Residual Interaction Network (MRNet) to learn and reconstruct clear, snow-free background images. Extensive experiments demonstrate that the proposed method outperforms other mainstream desnowing approaches in both qualitative and quantitative evaluations on three standard image desnowing datasets.

1. Introduction

Images are an important way to record information, but their quality is often affected by weather conditions, especially snowy weather. The snowflakes scattered across the image can obscure other information, not only impacting the visual quality of the image but also severely interfering with downstream computer vision tasks such as object detection and recognition [1], image segmentation [2], and video surveillance [3]. Therefore, research on image desnowing is of great significance. Figure 1 shows the desnowing results of our method compared to other advanced methods on real images. It can be seen that our method achieves better desnowing performance.
Early on, researchers approached image desnowing by leveraging the physical characteristics of snowflakes, using image statistical features, frequency domain techniques, and filtering methods [4,5,6]. With the development of deep learning, image desnowing methods based on Convolutional Neural Networks (CNNs) [7,8,9,10] have shown significant performance advantages and have gradually become mainstream. To achieve better desnowing results, advanced visual techniques such as multi-scale modules, residual modules, and dense connections have been applied to more efficient desnowing networks [11,12,13]. Although these methods can remove most snowflakes, their ability to recognize snowflakes is limited due to the constraints of long-range feature extraction in CNNs, resulting in residual snowflakes or black shadows in the restored images [9,10,12] and, consequently, a decline in image quality.
The application of Vision Transformers (ViT) [14] addresses the inherent locality of CNNs by establishing global feature relationships, enabling the effective handling of image-related tasks, and has been widely adopted [15,16]. Transformers, leveraging the attention mechanism, combine contextual features for the more precise recognition and removal of snowflakes. Snow removal networks based on Transformers [17,18,19,20] demonstrate strong performance advantages. However, existing methods [17,18] show limited stability across different datasets and exhibit issues such as residual snowflakes and artifacts.
Recently, the introduction of Mamba [21,22,23] has provided new insights for long-range modeling, showing superior performance over CNNs and Transformers in image segmentation [22] and image restoration [24]. Due to the diversity and complexity of snowflakes, relying solely on local features or long-range modeling is insufficient for accurately identifying and removing snowflakes, reconstructing background information, and producing higher-quality images. Designing more efficient networks that combine local features with long-range modeling for image desnowing remains an open challenge.
Therefore, we propose a multi-scale image desnowing network based on Mamba, named SnowMamba, which better captures and integrates long-distance feature relationships among snowflakes, avoiding image quality issues still present in desnowing, as highlighted by [10,12,18]. Firstly, we design a four-layer encoder–decoder structure to identify and differentiate between snowflake particles and background information. During the information input and decoding stages, an SE module is introduced to enhance the network’s learning of snowflake features. Then, a Multi-Scale Residual Interaction Network (MRNet) is used to learn and reconstruct a clear background image without snow. For the encoder–decoder, a Snow Caption Mamba (SCM) module is designed for the more precise extraction of snowflake and background features. This module combines CNN and Mamba to fully utilize the long-distance feature relationships and local details of snowflakes in the image, improving snow removal and background restoration.
The contributions of this paper are summarized as follows:
  • This paper presents a novel multi-scale residual snow removal architecture based on Mamba, marking the first attempt to apply Mamba in the field of snow removal;
  • This paper designs the SCM module to combine local and contextual image features, to accurately identify snowflakes and background information, and to assist the network in removing snowflakes and restoring clear images;
  • Extensive experiments show that the method proposed in this paper outperforms existing approaches on three major synthetic datasets and real-world datasets, achieving higher-quality snow removal in images.

2. Methods

This section first introduces the design of the SnowMamba framework, followed by descriptions of the SEBlock, the Snow Caption Mamba (SCM) module, and the loss function.

2.1. Design of the SnowMamba Framework

The architecture of the designed SnowMamba snow removal network is shown in Figure 2. First, the network applies a 3 × 3 convolution module to the input snowflake image for preliminary processing, extracting detailed information and edge features to lay the foundation for subsequent image processing. Based on this, the snowflake image information processed by the convolution operation is fed into the SE module to further enhance the network’s ability to perceive snowflake features. The SE module adaptively adjusts the weights of each channel to strengthen critical snowflake features, thereby improving the overall feature representation capability. To retain more information and enhance the generalization ability of the network, this work adopts GeLU as the activation function in the SE module. Compared with the traditional ReLU activation function, GeLU demonstrates superior performance in nonlinear mapping and output smoothness. Its smooth transition characteristics enable the network to adjust channel weights with greater precision. GeLU effectively captures variations in features across different channels, enhancing the network’s sensitivity to subtle differences. This is particularly beneficial for processing complex snowflake images, as it helps to retain more useful feature information.
The snowflake and background feature information enhanced by the SE module is fed into the encoder–decoder structure composed of SCM modules for multi-scale feature extraction. The design of this structure aims to process snowflake features at different scales, enabling the network to simultaneously focus on the local details and global background information of the image. In the encoder stage, the network extracts multi-scale snowflake and background features through downsampling, progressively deepening its understanding of the image during the process. In the decoder stage, the network gradually restores the spatial resolution of the image through upsampling to further extract detailed features and background information from the snowflake image. To enable the network to capture complex snowflake features across a global range, the SE module is reintroduced in the decoder to enhance features during the transmission of information across different scales. This process ensures that the network dynamically adjusts channel weights when transferring features between scales, allowing it to learn more precise and effective snowflake mask image features.
The multi-scale features obtained from the encoder–decoder are fed into the MRNet for multi-scale feature fusion, enabling the network to learn the background features obscured by snowflakes and assist in reconstructing clear, snow-free background images. As shown in Figure 2, the MRNet architecture is composed of multi-scale feature residuals, ResBlocks, and multiple upsampling stages. First, the snowflake feature information and background feature information of different scales extracted by the encoder–decoder are fused through residual operations to obtain snow-free feature information at various scales. These features are then reconstructed through ResBlocks and multiple upsampling stages, allowing the network to learn detailed local background features. Since the process of restoring images using downsampling and upsampling modules inherently results in information loss, this work incorporates a residual connection between the initial features and the final extracted snowflake image features. This strategy captures more direct background information, mitigating the loss caused by layer-by-layer sampling. Finally, the MRNet performs feature fusion and reconstruction in its residual interaction stage, producing the final clear, snow-free image.
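The residual interaction described above can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' implementation: the channel widths follow the encoder–decoder configuration given in Section 3.1, while the bilinear upsampling, the per-scale subtraction of snow features from background features (the ⊖ in Figure 2), and the exact placement of the skip additions are assumptions where Figure 2 leaves details open.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Plain residual block, standing in for the ResBlock of Figure 2."""
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.GELU(),
            nn.Conv2d(c, c, 3, padding=1),
        )
    def forward(self, x):
        return x + self.body(x)

class MRNet(nn.Module):
    """Sketch of the Multi-Scale Residual Interaction Network (Sec. 2.1):
    four ResBlocks, two upsampling stages, and three residual connections,
    ending with a global residual to the input image."""
    def __init__(self, chans=(192, 96, 48)):
        super().__init__()
        c2, c1, c0 = chans
        self.res2 = ResBlock(c2)
        self.up2 = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(c2, c1, 1),
        )
        self.res1 = ResBlock(c1)
        self.up1 = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(c1, c0, 1),
        )
        self.res0 = ResBlock(c0)
        self.res_out = ResBlock(c0)
        self.to_rgb = nn.Conv2d(c0, 3, 3, padding=1)

    def forward(self, img, bg_feats, snow_feats):
        # residual fusion per scale: background minus snow features
        f2, f1, f0 = [b - s for b, s in zip(bg_feats, snow_feats)]
        x = self.up2(self.res2(f2)) + f1   # coarse -> middle scale
        x = self.up1(self.res1(x)) + f0    # middle -> full resolution
        x = self.res_out(self.res0(x))
        return img + self.to_rgb(x)        # global residual to the input image

# usage with dummy quarter-, half-, and full-resolution features
img = torch.rand(1, 3, 128, 128)
bg = [torch.rand(1, 192, 32, 32), torch.rand(1, 96, 64, 64), torch.rand(1, 48, 128, 128)]
snow = [torch.rand(1, 192, 32, 32), torch.rand(1, 96, 64, 64), torch.rand(1, 48, 128, 128)]
out = MRNet()(img, bg, snow)               # -> (1, 3, 128, 128)
```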

2.2. SEBlock

To enable the network to better differentiate snowflake particles and enhance the learning of global snowflake features, this paper introduces the SEBlock [25] to help the network to more accurately identify snowflakes and background information. In this work, SEBlock is employed to strengthen the model’s channel modeling and improve its perception of snowflake features, allowing the network to more precisely distinguish and identify the relationships between snowflakes and the background in images. As shown in the network architecture diagram (Figure 2), SEBlock adaptively weights the features of different channels. This enables the network to focus more precisely on features related to snowflake details in diverse image desnowing scenarios, reducing the influence of background and other irrelevant information on snowflake recognition, thereby improving the overall desnowing performance.
Specifically, during the squeeze phase, SEBlock aggregates global information from the input feature map by fusing each channel’s global features through global average pooling, achieving global perception for each channel. In the excitation phase, to enable the network to learn smoother and more detailed information and improve its capacity for capturing complex features, this paper sets the activation function in SEBlock to GeLU. The SEBlock generates a weight coefficient for each channel via a fully connected layer followed by the GeLU activation function. These weight coefficients reflect the importance of each channel. The network then adjusts the features of different channels based on these weights, enabling it to more accurately capture the intricate details of snowflake features and further enhance these relevant details during the desnowing process. Therefore, the SE module provides the network with more refined feature representations, ultimately making the image desnowing process more precise and efficient. The following describes the implementation process of squeeze and excitation in the SE module.
Squeeze: The spatial dimensions (height and width) of the input snowflake feature map are transformed into channel descriptors through global average pooling. Suppose that the input feature map is a tensor of shape $H \times W \times C$, where $H$ and $W$ are the height and width of the feature map, and $C$ is the number of channels. Global average pooling yields the average value of each channel:

$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_{ijc}$$

Here, $x_{ijc}$ represents the activation value of the $c$-th channel at position $(i, j)$, and $z_c$ represents the squeezed result of the $c$-th channel.
Excitation: The compressed channel descriptors are excited through a two-layer fully connected network to generate inter-channel dependencies and produce a weight for each channel:

$$s_c = \sigma\left(W_2\,\delta(W_1 z)\right)$$

where $W_1$ and $W_2$ are the weight matrices of the fully connected layers, $\delta$ is the ReLU activation function, and $\sigma$ is the Sigmoid activation function, which maps the output to the range $[0, 1]$. The resulting $s_c$ is the excitation coefficient of each channel.
Re-weighting Operation: Each channel of the original feature map is weighted by its excitation coefficient. This can be described as

$$\hat{x}_{ijc} = x_{ijc} \cdot s_c$$

where $\hat{x}_{ijc}$ is the re-weighted feature map and $s_c$ is the excitation coefficient of that channel.
In this paper, the image dataset is relatively large and the network is deep. To make the SE module better suited to the image desnowing task, the ReLU activation function is replaced with the smoother, better-performing GELU activation function. This helps the network avoid the dead-neuron problem, facilitates smoother gradient propagation, and enables more thorough model training, thus improving the stability of the training process.
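For concreteness, the squeeze, excitation, and re-weighting steps above map directly onto a few lines of PyTorch. The sketch below is an illustration rather than the authors' code; in particular, the channel-reduction ratio `r` is an assumed hyperparameter that the paper does not specify.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block with GELU, as described in Section 2.2."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # squeeze: global average pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.GELU(),                        # GELU replaces the usual ReLU (Sec. 2.2)
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),                     # excitation weights in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        z = self.pool(x).view(b, c)           # z_c: per-channel global average
        s = self.fc(z).view(b, c, 1, 1)       # s_c: per-channel excitation coefficient
        return x * s                          # re-weighting: x_hat = x * s

se = SEBlock(48)
y = se(torch.rand(1, 48, 64, 64))             # -> (1, 48, 64, 64)
```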

2.3. Snow Caption Mamba (SCM)

Figure 3 shows the SCM module designed in this paper, which consists of two branches: Mamba for global feature extraction and CNN for local detail extraction. First, the snowflake image features I s enhanced by the SE module are input into the SCM. Then, global and local snowflake and background feature extraction are performed through the Mamba and CNN branches, respectively. The two branches can be expressed as
$$I_M = \mathrm{Conv}\left(L_y[\mathrm{Ma}(I_s, \theta)]\right)$$

$$I_c = \mathrm{Dconv}_5(I_s) * \mathrm{Dconv}_3(I_s)$$

Here, $I_M$ and $I_c$ are the features extracted by the Mamba and CNN branches, respectively. $\mathrm{Ma}$ represents the Mamba module, $\theta$ denotes its parameters, $L_y$ represents layer normalization, $\mathrm{Dconv}_5$ and $\mathrm{Dconv}_3$ represent the 5 × 5 and 3 × 3 separable convolutions, respectively, and $*$ denotes channel concatenation.
The features obtained from I c are fed into Mamba for global feature mapping, helping the network to more accurately identify snowflake particles and background features. This part can be expressed as
$$I_{CM} = \mathrm{Conv}_3\left(\mathrm{Ma}(R_c(I_c), \theta)\right) + I_c$$

$I_{CM}$ represents a more refined snowflake and background feature representation that combines local and global features, while $R_c$ and $\mathrm{Conv}_3$ refer to convolution activation and a 3 × 3 convolution operation, respectively.
The snowflake and background features with rich and precise information obtained from the two branches are activated by the Sigmoid function, and then multiplied by the local detail feature I c to enhance the module’s focus and differentiation of different feature information. This process can be expressed as:
$$I_f = \mathrm{Sig}(I_M + I_{CM}) \otimes I_c$$

where $I_f$ represents the feature information that highlights the snowflakes, $\mathrm{Sig}$ denotes the Sigmoid activation function, and $\otimes$ represents the multiplication operation.
Finally, skip connections are added across the different branches to reduce information loss during network propagation, removing snowflakes while preserving more details and textures. To limit model parameters, separable convolutions at different scales are used for feature extraction.
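The two-branch structure above can be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions: the Mamba module is injected as any sequence model over flattened spatial tokens (for example, the `Mamba` class from the `mamba_ssm` package could be passed in), and the 1 × 1 fusion convolution after channel concatenation, the GELU stand-in for the $R_c$ activation, and the token scan order are choices the paper does not pin down.

```python
import torch
import torch.nn as nn

class SCMBlock(nn.Module):
    """Sketch of the Snow Caption Mamba (SCM) block of Section 2.3."""
    def __init__(self, channels: int, mamba: nn.Module):
        super().__init__()
        self.mamba = mamba                    # any (B, L, C) -> (B, L, C) sequence model
        self.norm = nn.LayerNorm(channels)
        self.dconv5 = nn.Sequential(          # 5x5 separable (depthwise + pointwise) conv
            nn.Conv2d(channels, channels, 5, padding=2, groups=channels),
            nn.Conv2d(channels, channels, 1),
        )
        self.dconv3 = nn.Sequential(          # 3x3 separable conv
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
        )
        self.fuse = nn.Conv2d(2 * channels, channels, 1)  # back to C after concatenation
        self.conv = nn.Conv2d(channels, channels, 1)      # Conv in the I_M equation
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.GELU()                  # stand-in for the R_c convolution activation

    def _mamba2d(self, x: torch.Tensor) -> torch.Tensor:
        # flatten the spatial grid to a token sequence, apply Mamba, restore layout;
        # LayerNorm after Mamba mirrors L_y[Ma(.)] and is reused for both branches
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)    # (B, H*W, C)
        seq = self.norm(self.mamba(seq))
        return seq.transpose(1, 2).reshape(b, c, h, w)

    def forward(self, i_s: torch.Tensor) -> torch.Tensor:
        i_m = self.conv(self._mamba2d(i_s))                                      # global branch
        i_c = self.fuse(torch.cat([self.dconv5(i_s), self.dconv3(i_s)], dim=1))  # local branch
        i_cm = self.conv3(self._mamba2d(self.act(i_c))) + i_c                    # refined features
        i_f = torch.sigmoid(i_m + i_cm) * i_c                                    # gated fusion
        return i_f + i_s                                                         # skip connection

# usage with an identity stand-in where a real Mamba module would go
scm = SCMBlock(48, mamba=nn.Identity())
y = scm(torch.rand(1, 48, 64, 64))            # -> (1, 48, 64, 64)
```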

2.4. Loss Function

To better preserve the original detail information of the image and to suit snowflake images with sparse features, this paper uses the $L_1$ loss to guide the network in snow removal. The formula is as follows:

$$L_1 = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right|$$

where $L_1$ represents the loss between the predicted image and the ground-truth image, $N$ denotes the total number of pixels, and $\hat{y}_i$ and $y_i$ represent the predicted value and the ground-truth value of the $i$-th pixel, respectively.
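As a quick sanity check, the formula above coincides with PyTorch's built-in L1 loss; the tensors here are random stand-ins for a prediction and its ground truth.

```python
import torch

pred = torch.rand(1, 3, 8, 8)             # stand-in predicted image
gt = torch.rand(1, 3, 8, 8)               # stand-in ground-truth image
builtin = torch.nn.L1Loss()(pred, gt)     # mean absolute error over all pixels
manual = (pred - gt).abs().mean()         # (1/N) * sum |y_hat_i - y_i|
assert torch.allclose(builtin, manual)
```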

3. Experiments

3.1. Experimental Configuration and Evaluation

Datasets. This study validates the effectiveness of the proposed algorithm on three benchmark datasets: CSD [10], Snow100K [7], and SRRS [9]. For Snow100K and CSD, 100 images were randomly selected from their training sets as validation sets, with the remaining images used for training. For testing, 2000 images were randomly sampled from the Snow100K test set, while the test set provided by the CSD dataset was used directly. For SRRS, the first 10,000 images were used as the training set, 100 images from the middle of the set were selected as the validation set, and the last 2000 images were used as the test set.
Implementation Details. The snow removal network designed in this paper adopts a four-level encoder–decoder progressive learning structure, which extracts snow image features at different scales through downsampling and upsampling. The number of SCM blocks is set according to the feature scale at each level: [2, 2, 9, 2] from the first to the fourth level, with encoder–decoder channel widths of [48, 96, 192, 384]. The initial input patch size is set to 128 × 128. As the input patches grow, more image information is fed into the network and the number of progressive-learning iterations is reduced; the correspondence between the two is shown in Table 1. The model is implemented in PyTorch and trained on a single NVIDIA RTX 3090, with the AdamW optimizer for updates and the $L_1$ loss guiding the model in learning the snow removal process. The initial learning rate is set to $10^{-4}$; it is multiplied by 0.7 each time the input patch size changes and is reduced to $10^{-7}$ through cosine annealing.
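A hedged sketch of this training schedule is shown below. The patch sizes and iteration counts follow Table 1; the batch size, the data loader, and the exact way the 0.7 stage decay composes with the cosine annealing are assumptions, and a single convolution stands in for the full network (shorten the iteration counts for an actual demo run).

```python
import math
import torch
from torch.optim import AdamW

# (patch size, iterations) per progressive-learning stage, from Table 1
patch_schedule = [(128, 128_000), (192, 128_000), (224, 92_000),
                  (256, 92_000), (384, 64_000), (464, 64_000)]
total_iters = sum(n for _, n in patch_schedule)

model = torch.nn.Conv2d(3, 3, 3, padding=1)    # stand-in for the SnowMamba network
optimizer = AdamW(model.parameters(), lr=1e-4)
criterion = torch.nn.L1Loss()

step, stage_lr = 0, 1e-4
for stage, (patch, iters) in enumerate(patch_schedule):
    if stage > 0:
        stage_lr *= 0.7                        # x0.7 decay when the patch size changes
    for _ in range(iters):
        # cosine annealing from the current stage lr down to 1e-7
        lr = 1e-7 + 0.5 * (stage_lr - 1e-7) * (1 + math.cos(math.pi * step / total_iters))
        for g in optimizer.param_groups:
            g["lr"] = lr
        snowy = torch.rand(2, 3, patch, patch)  # stand-in batch; a real loader goes here
        clean = torch.rand(2, 3, patch, patch)
        loss = criterion(model(snowy), clean)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1
```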
Evaluation Metrics. This work conducts a detailed qualitative analysis and comparison of the proposed desnowing algorithm on three public synthetic datasets and the real-world dataset provided by Snow100K, showcasing the desnowing effects of different methods through intuitive visual demonstrations. To further validate the effectiveness of the proposed algorithm, commonly used image quality evaluation metrics, including Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM) [26], Learned Perceptual Image Patch Similarity (LPIPS) [27], and Deep Image Structure and Texture Similarity (DISTS) [28], are employed for the quantitative assessment of the experimental results. These evaluations demonstrate the superiority and robustness of the proposed algorithm.
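For reference, PSNR and SSIM can be computed with scikit-image as sketched below; LPIPS and DISTS require learned perceptual models (e.g., the `lpips` package) and are omitted here. The arrays are random stand-ins for a restored image and its ground truth.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

restored = np.random.rand(256, 256, 3)   # stand-in desnowed image, values in [0, 1]
gt = np.random.rand(256, 256, 3)         # stand-in clean reference image

psnr = peak_signal_noise_ratio(gt, restored, data_range=1.0)
ssim = structural_similarity(gt, restored, channel_axis=-1, data_range=1.0)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.4f}")
```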

3.2. Qualitative Evaluation

Figure 4 and Figure 5 show the qualitative comparison results of our method with other state-of-the-art methods on synthetic and real datasets, including JSTASR [9] (ECCV’2020), HDCWNet [10] (ICCV’2021), TKL [17] (CVPR’2022), and LMQFormer [18] (TCSVT’2023). As shown in Figure 4, the images restored by the JSTASR method exhibit significant black artifacts and snowflake noise, resulting in poor visual quality. While HDCWNet and TKL are capable of removing a substantial portion of snowflake artifacts, residual noise remains visible. LMQFormer effectively eliminates snowflake interference and produces clean snow-free images, yet minor artifacts persist. In contrast, the algorithm proposed in this paper achieves more precise snow removal while preserving richer background details. The restored images are free from black artifacts and exhibit higher overall quality compared to existing methods.
Figure 5 demonstrates the comparative results between the proposed method and other state-of-the-art approaches on a real-world dataset. It is evident that some existing methods still leave noticeable snowflake particles unremoved. In contrast, our method demonstrates superior precision in identifying and eliminating snowflakes, exhibiting a clear advantage in desnowing performance.

3.3. Quantitative Evaluation

To demonstrate the effectiveness of the proposed algorithm, a quantitative comparison was conducted with other state-of-the-art algorithms on three major benchmark datasets, including single-image snow removal algorithms [7,9,10,12,18] and severe-weather image restoration algorithms [17,29,30,31]. Where the compared methods provided reproducible code, we reproduced the methods and obtained the results; where no code was provided, we referred to the results presented in the respective papers or other widely recognized experimental results. The comparison results are shown in Table 2. Compared with other state-of-the-art methods, the proposed method achieves the best results across the three benchmark datasets. In particular, on the Snow100K and CSD datasets, the PSNR and SSIM metrics exhibit outstanding performance. Compared to the second-best method, LMQFormer, the PSNR of the proposed algorithm improves by 1.12 dB and 2.03 dB on the Snow100K and CSD datasets, respectively. The SSIM improves by 0.028 and 0.011, respectively. This fully demonstrates the superiority and stability of the proposed algorithm.
To validate the generalization capability of the proposed algorithm, this paper further evaluates its performance on 2000 images randomly selected from each of the L, M, and S subsets of Snow100K. As shown in Table 3 (where ↑ indicates that higher values are better and ↓ indicates that lower values are better), the proposed algorithm achieves state-of-the-art results across all image evaluation metrics under different snowfall conditions. Compared to the second-best method, all metrics, including PSNR, SSIM, LPIPS, and DISTS, show improvements. The gains in PSNR and SSIM are particularly significant: PSNR improves by more than 1 dB on every subset, while SSIM improves by 0.022, 0.024, and 0.038 on the S, M, and L subsets, respectively, compared to the second-best algorithm. These results comprehensively demonstrate the stability and superiority of the proposed algorithm.

3.4. Ablation Study

An ablation study was conducted on the CSD dataset to verify the effectiveness of the proposed algorithm. First, each layer was given two SCM Blocks as the basic encoder–decoder structure, referred to as SCM Blockbase, and experiments were performed using the snow removal framework designed in this study, followed by quantitative evaluation. Then, the number of SCM Blocks was adjusted to [2, 2, 9, 2], matching the configuration used in the main experiments, while keeping the other settings of the snow removal framework unchanged. Finally, SEBlocks were introduced at the corresponding positions, with the activation function modified to GeLU for the image snow removal task. The quantitative evaluation and visual comparisons of the ablation study are presented in Table 4 and Figure 6, respectively.
When using the SCM Blockbase as the encoder–decoder framework, the PSNR and SSIM metrics already surpass some state-of-the-art methods. The reduction of residual snow particles in the images demonstrates that this module is well-suited to the snow removal framework designed in this study and exhibits more precise discrimination and removal capabilities for snowflakes. Furthermore, adding more SCM Blocks significantly improves both PSNR and SSIM, indicating the module’s strong effectiveness in identifying and removing snowflakes from images. When SEBlocks are incorporated, the PSNR and SSIM metrics further increase, and snow textures in the images are further reduced, resulting in enhanced image quality. This also confirms that the SEBlock positively contributes to the proposed algorithm.

4. Conclusions

This paper proposes a novel Mamba-based network architecture for image desnowing, called SnowMamba. This method represents the first application of Mamba in image desnowing. The design combines CNN’s local detail processing with Mamba’s long-range feature capturing capabilities to achieve higher-quality images. Specifically, the paper designs the SCM, a CNN–Mamba desnowing module, as the core component, using a four-level encoder–decoder structure combined with a multi-scale residual network for image desnowing. SE modules are added during upsampling and the initial input stages to enhance the model’s ability to distinguish and focus on different image features. Extensive experiments demonstrate that the proposed method achieves better desnowing results compared to existing approaches.

Author Contributions

Conceptualization, G.W. and F.S.; methodology, G.W.; software, G.W.; validation, G.W. and Y.Z.; formal analysis, G.W., F.S. and Y.Z.; investigation, G.W. and F.S.; resources, F.S.; data curation, G.W.; writing—original draft preparation, G.W.; writing—review and editing, G.W., F.S., Y.Z. and Z.J.; visualization, G.W.; supervision, F.S.; project administration, Z.J.; funding acquisition, F.S. and Z.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of Xinjiang Province under Grant 2022D01C60, the National Natural Science Foundation of China (No. 62261053), the Tianshan Talent Training Project—Xinjiang Science and Technology Innovation Team Program (2023TSYCTD0012), the Tianshan Innovation Team Program of Xinjiang Uygur Autonomous Region of China (2023D14012), and the Scientific Research Plan of Universities of Xinjiang under Grant XJEDU(2019Y006).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are all publicly available. The specific dataset information is as follows: the access link for the Snow100K dataset is https://sites.google.com/view/yunfuliu/desnownet, the access link for the CSD dataset is https://github.com/weitingchen83/ICCV2021-Single-Image-Desnowing-HDCWNet?tab=readme-ov-file, and the access link for the SRRS dataset is https://github.com/weitingchen83/JSTASR-DesnowNet-ECCV-2020?tab=readme-ov-file.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
  2. Liu, Q.; Xu, Z.; Bertasius, G.; Niethammer, M. Simpleclick: Interactive image segmentation with simple vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 22290–22300. [Google Scholar]
  3. Rezaee, K.; Rezakhani, S.M.; Khosravi, M.R.; Moghimi, M.K. A survey on deep learning-based real-time crowd anomaly detection for secure distributed video surveillance. Pers. Ubiquitous Comput. 2024, 28, 135–151. [Google Scholar] [CrossRef]
  4. Zheng, X.; Liao, Y.; Guo, W.; Fu, X.; Ding, X. Single-image-based rain and snow removal using multi-guided filter. In Proceedings of the Neural Information Processing: 20th International Conference, ICONIP 2013, Daegu, Republic of Korea, 3–7 November 2013; Proceedings, Part III 20. Springer: Berlin/Heidelberg, Germany, 2013; pp. 258–265. [Google Scholar]
  5. Ding, X.; Chen, L.; Zheng, X.; Huang, Y.; Zeng, D. Single image rain and snow removal via guided L0 smoothing filter. Multimed. Tools Appl. 2016, 75, 2697–2712. [Google Scholar] [CrossRef]
  6. Wang, Y.; Liu, S.; Chen, C.; Zeng, B. A hierarchical approach for rain or snow removing in a single color image. IEEE Trans. Image Process. 2017, 26, 3936–3950. [Google Scholar] [CrossRef] [PubMed]
  7. Liu, Y.F.; Jaw, D.W.; Huang, S.C.; Hwang, J.N. DesnowNet: Context-aware deep network for snow removal. IEEE Trans. Image Process. 2018, 27, 3064–3073. [Google Scholar] [CrossRef] [PubMed]
  8. Li, P.; Yun, M.; Tian, J.; Tang, Y.; Wang, G.; Wu, C. Stacked dense networks for single-image snow removal. Neurocomputing 2019, 367, 152–163. [Google Scholar] [CrossRef]
  9. Chen, W.T.; Fang, H.Y.; Ding, J.J.; Tsai, C.C.; Kuo, S.Y. JSTASR: Joint size and transparency-aware snow removal algorithm based on modified partial convolution and veiling effect removal. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXI 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 754–770. [Google Scholar]
  10. Chen, W.T.; Fang, H.Y.; Hsieh, C.L.; Tsai, C.C.; Chen, I.; Ding, J.J.; Kuo, S.Y. All snow removed: Single image desnowing algorithm using hierarchical dual-tree complex wavelet representation and contradict channel loss. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 4196–4205. [Google Scholar]
  11. Chen, Z.; Sun, Y.; Bi, X.; Yue, J. Lightweight image de-snowing: A better trade-off between network capacity and performance. Neural Netw. 2023, 165, 896–908. [Google Scholar] [CrossRef] [PubMed]
  12. Cheng, B.; Li, J.; Chen, Y.; Zeng, T. Snow mask guided adaptive residual network for image snow removal. Comput. Vis. Image Underst. 2023, 236, 103819. [Google Scholar] [CrossRef]
  13. Quan, Y.; Tan, X.; Huang, Y.; Xu, Y.; Ji, H. Image desnowing via deep invertible separation. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 3133–3144. [Google Scholar] [CrossRef]
  14. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  15. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5728–5739. [Google Scholar]
  16. Xiao, J.; Fu, X.; Liu, A.; Wu, F.; Zha, Z.J. Image de-raining transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 12978–12995. [Google Scholar] [CrossRef] [PubMed]
  17. Chen, W.T.; Huang, Z.K.; Tsai, C.C.; Yang, H.H.; Ding, J.J.; Kuo, S.Y. Learning multiple adverse weather removal via two-stage knowledge learning and multi-contrastive regularization: Toward a unified model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17653–17662. [Google Scholar]
  18. Lin, J.; Jiang, N.; Zhang, Z.; Chen, W.; Zhao, T. LMQFormer: A laplace-prior-guided mask query transformer for lightweight snow removal. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 6225–6235. [Google Scholar] [CrossRef]
  19. Zhang, T.; Jiang, N.; Wu, H.; Zhang, K.; Niu, Y.; Zhao, T. HCSD-Net: Single Image Desnowing with Color Space Transformation. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 8125–8133. [Google Scholar]
  20. Chen, S.; Ye, T.; Liu, Y.; Liao, T.; Jiang, J.; Chen, E.; Chen, P. Msp-former: Multi-scale projection transformer for single image desnowing. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar]
  21. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
  22. Ma, J.; Li, F.; Wang, B. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv 2024, arXiv:2401.04722. [Google Scholar]
  23. Wang, Z.; Zheng, J.Q.; Zhang, Y.; Cui, G.; Li, L. Mamba-unet: Unet-like pure visual mamba for medical image segmentation. arXiv 2024, arXiv:2402.05079. [Google Scholar]
  24. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417. [Google Scholar]
  25. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  26. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  27. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar]
  28. Ding, K.; Ma, K.; Wang, S.; Simoncelli, E.P. Image quality assessment: Unifying structure and texture similarity. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2567–2581. [Google Scholar] [CrossRef] [PubMed]
  29. Li, R.; Tan, R.T.; Cheong, L.F. All in one bad weather removal using architectural search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3175–3185. [Google Scholar]
  30. Valanarasu, J.M.J.; Yasarla, R.; Patel, V.M. Transweather: Transformer-based restoration of images degraded by adverse weather conditions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2353–2363. [Google Scholar]
  31. Chen, E.; Chen, S.; Ye, T.; Liu, Y. Degradation-adaptive neural network for jointly single image dehazing and desnowing. Front. Comput. Sci. 2024, 18, 182707. [Google Scholar] [CrossRef]
Figure 1. The visual comparison results with state-of-the-art methods on real-world datasets.
Figure 2. The SnowMamba framework includes modules such as SCMblock, SEBlock, and ResBlock. The two subfigures, respectively, show the detailed structure of SEBlock and ResBlock; ⊖ denotes the residual operation. The MRNet network consists of four ResBlocks, two upsampling layers, and three layers of brown residual information.
Figure 3. Structure of the Snow Caption Mamba (SCM) module. “×” denotes the multiplication operation, “+” denotes the addition operation, and “C” denotes channel-wise concatenation.
Figure 4. The qualitative comparison results on the three major benchmark synthetic datasets.
Figure 5. The qualitative comparison results on the real dataset.
Figure 6. Visual comparison results of the ablation study on the CSD dataset.
Table 1. The parameter setting table for iterations and patches; “K” represents thousand.

| Patches | 128 × 128 | 192 × 192 | 224 × 224 | 256 × 256 | 384 × 384 | 464 × 464 |
|---|---|---|---|---|---|---|
| Iterations (K) | 128 | 128 | 92 | 92 | 64 | 64 |
Table 2. Quantitative evaluation on the CSD, Snow100K, and SRRS datasets, with bold indicating the best results.

| Type | Method | Snow100K(2000) PSNR/SSIM | SRRS(2000) PSNR/SSIM | CSD(2000) PSNR/SSIM |
|---|---|---|---|---|
| Desnowing Task | DesnowNet (TIP’2018) | 30.80/0.92 | 20.38/0.84 | 20.13/0.81 |
| Desnowing Task | JSTASR (ECCV’2020) | 23.12/0.86 | 25.82/0.89 | 28.42/0.69 |
| Desnowing Task | HDCWNet (ICCV’2021) | 21.23/0.73 | 27.78/0.92 | 31.80/0.90 |
| Desnowing Task | SMGARN (CVIU’2023) | 31.92/0.93 | 29.14/0.94 | 31.93/0.95 |
| Desnowing Task | LMQFormer (TCSVT’2023) | 35.14/0.93 | 31.05/0.95 | 34.53/0.96 |
| Adverse Weather | All in One (CVPR’2020) | 26.07/0.88 | 24.98/0.88 | 26.31/0.87 |
| Adverse Weather | TransWeather (CVPR’2022) | 31.82/0.93 | 28.29/0.92 | 31.76/0.93 |
| Adverse Weather | DAN-Net (Front.Comput’2024) | 32.48/0.96 | 29.34/0.95 | 30.82/0.95 |
| Adverse Weather | TKL (CVPR’2022) | 31.27/0.90 | 29.37/0.93 | 33.01/0.93 |
| Ours | **SnowMamba** | **36.26/0.958** | **32.99/0.968** | **36.56/0.977** |
Table 3. Quantitative comparisons with other state-of-the-art methods on the Snow100K-S, Snow100K-M, and Snow100K-L datasets, with bold indicating the best results; “-” denotes unavailable results.

| Method | Snow100K-S PSNR↑/SSIM↑ | Snow100K-S LPIPS↓/DISTS↓ | Snow100K-M PSNR↑/SSIM↑ | Snow100K-M LPIPS↓/DISTS↓ | Snow100K-L PSNR↑/SSIM↑ | Snow100K-L LPIPS↓/DISTS↓ |
|---|---|---|---|---|---|---|
| JSTASR | 28.94/0.61 | 0.3650/0.1787 | 28.92/0.59 | 0.3854/0.1881 | 28.07/0.51 | 0.4415/0.2225 |
| HDCWNet | 28.54/0.61 | 0.3278/0.1692 | 21.66/0.72 | 0.3468/0.1779 | 28.74/0.53 | 0.4006/0.2116 |
| DesnowNet | 32.23/0.95 | - | 30.86/0.94 | - | 27.16/0.89 | - |
| TKL | 34.71/0.92 | 0.1188/0.0566 | 33.95/0.91 | 0.1367/0.0665 | 31.63/0.85 | 0.1964/0.1018 |
| LMQFormer | 36.74/0.95 | 0.0751/0.0381 | 35.74/0.94 | 0.0898/0.0460 | 32.95/0.90 | 0.1399/0.0748 |
| **SnowMamba (ours)** | **38.40/0.972** | **0.0667/0.0377** | **36.89/0.964** | **0.0832/0.0459** | **33.49/0.938** | **0.1376/0.0746** |
Table 4. Ablation study, where “✗” indicates exclusion from training, while “✓” signifies inclusion in training.

| Setting | SCM Blockbase | SCM Block | SEBlock | PSNR | SSIM |
|---|---|---|---|---|---|
| S1 | ✓ | ✗ | ✗ | 33.44 | 0.95 |
| S2 | ✗ | ✓ | ✗ | 35.61 | 0.96 |
| S3 | ✗ | ✓ | ✓ | 36.56 | 0.97 |