WeatherMAR: Complementary Masking of Paired Tokens for Adverse-Weather Image Restoration
Abstract
1. Introduction
- We propose WeatherMAR, a framework that formulates adverse-weather restoration as paired-domain completion in a shared continuous token space. By concatenating degraded and clean tokens into a joint sequence, the model enables direct cross-domain interaction through self-attention within a unified token-processing pipeline, without requiring additional fusion branches.
- We introduce complementary bidirectional masking, which enforces a strict location-wise constraint such that exactly one token in each degraded–clean pair is masked. This design preserves strong conditional evidence at every position, mitigates trivial correlations under paired supervision, and supports an optional reverse objective used only during training to encourage degradation-aware representations.
- We develop a progress-to-step guided sampling strategy to accelerate diffusion-enhanced masked autoregressive inference. This schedule allocates more denoising steps to early, high-uncertainty iterations and fewer steps to later iterations, thereby reducing redundant computation while maintaining restoration quality.
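The complementary constraint in the second contribution can be illustrated with a short sketch. It is not the authors' implementation: the helper names, the masking ratio of 0.5, and the toy size of 16 locations are assumptions made for illustration. The only properties taken from the text are that exactly one token of each degraded–clean pair is masked and that degraded tokens precede clean tokens in the joint sequence.

```python
import random


def complementary_masks(num_locations, degraded_mask_ratio, rng):
    """Return (mask_degraded, mask_clean); True means the token is masked.

    The clean-domain mask is the element-wise complement of the
    degraded-domain mask, so each location hides exactly one of its two tokens.
    """
    num_masked = round(num_locations * degraded_mask_ratio)
    masked_locations = set(rng.sample(range(num_locations), num_masked))
    mask_degraded = [i in masked_locations for i in range(num_locations)]
    mask_clean = [not m for m in mask_degraded]
    return mask_degraded, mask_clean


def joint_mask(mask_degraded, mask_clean):
    """Concatenate the two masks in joint-sequence order (degraded, then clean)."""
    return mask_degraded + mask_clean


rng = random.Random(0)          # fixed seed, for reproducibility only
N = 16                          # hypothetical number of spatial locations
mask_d, mask_c = complementary_masks(N, 0.5, rng)  # ratio is an assumption
mask_joint = joint_mask(mask_d, mask_c)
print(sum(mask_joint), "of", len(mask_joint), "joint tokens are masked")
```

Whatever ratio is chosen for the degraded domain, the joint sequence always contains N masked tokens out of 2N, because every location contributes exactly one.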
2. Related Work
2.1. Image Restoration in Adverse Weather Conditions
- Removing Raindrops. Single-image raindrop removal has been studied extensively, encompassing both classical pipelines based on hand-crafted priors and modern learning-based approaches. Early studies explored the use of temporal redundancy for video-based raindrop removal [44]. For still images, early learning-based methods investigated supervised CNN-based restoration using paired raindrop-degraded and clean images, although the reconstructed results were often over-smoothed. Subsequent work introduced dedicated datasets and attention-based frameworks to better localize and suppress raindrop regions while recovering background content [10]. Building on this line, later methods further improved localization by incorporating edge-aware cues or explicit raindrop representations, thereby enhancing boundary handling and detail recovery around droplet contours [11].
- Image Desnowing. Early deep-learning approaches to image desnowing typically treated snow as a learnable corruption and trained direct mappings from snowy inputs to clean targets. DesnowNet [8] is a representative CNN-based method that established paired-data learning for snow removal. Later studies showed that architectures originally developed for related restoration tasks can be effectively adapted to desnowing. For example, RESCAN [3] and SPANet [47] achieve strong performance on synthetic snow benchmarks. To better account for diverse snow appearances, Chen et al. [48] proposed JSTASR, which explicitly models different snow characteristics within a unified framework. Zhang et al. [9] introduced DDMSNet, a dense multi-scale network that leverages semantic and depth priors as auxiliary cues to improve robustness under heavy snow.
- Image Deraining & Dehazing. Traditional single-image deraining methods relied on hand-crafted priors and decomposition, whereas modern approaches use deep networks to suppress rain streaks while preserving fine details [1,49]. Recurrent or iterative designs improve robustness by progressively estimating rain layers and refining the clean image over multiple steps, which is particularly helpful when rain streaks vary in scale and density [2]. In real heavy-rain scenarios, rain streaks often co-exist with haze-like veiling, making joint deraining–dehazing more effective than treating the two degradations independently. Representative methods explicitly model the coupled “streak + veil” degradation and recover visibility and contrast together with rain streak suppression [20]. To mitigate the synthetic-to-real gap, several studies have explored transfer and adaptation strategies that better align training data with real rainy images [41]. DerainCycleGAN [50] pursues the same goal through rain-attentive cycle-consistent translation for unsupervised single-image deraining. More recently, transformer-based restoration models have leveraged long-range context to improve structural and textural coherence, and have been adopted or extended for unified adverse-weather restoration [13,26].
- Multi-Weather Restoration. Beyond task-specific restoration, recent studies have explored multi-weather restoration, in which a shared framework is designed to handle multiple weather-related degradations. Valanarasu et al. [13] introduced TransWeather, a transformer-based encoder–decoder that learns a unified restoration mapping across multiple atmospheric degradations. Zhu et al. [14] developed WGWSNet, which separates weather-general and weather-specific representations through a staged training procedure. More recently, multi-weather restoration has been studied from several additional perspectives, including knowledge distillation [51], diffusion-based probabilistic restoration [15], prior- or codebook-based modeling [16], prompt-based conditioning [22], and grid-structured feature interaction [52]. Related all-in-one restoration studies have also explored broader settings beyond the standard multi-weather benchmark protocol, including expert routing and degradation embedding [53], perception-guided coarse-to-fine restoration [54], and continual weather restoration with dynamic expert libraries [55]. Among the methods evaluated under the standard multi-weather benchmark setting, recent models such as CyclicPrompt [22] and GridFormer [52] serve as strong baselines for comparison. Despite these advances, multi-weather restoration still faces the challenge of reliably conditioning restoration on diverse weather-corrupted observations while preserving fine details. To address this challenge, we propose WeatherMAR, which performs paired-domain completion in a shared latent token space through joint-sequence self-attention and complementary masking, and further refines predictions with conditional token diffusion.
2.2. Autoregressive Models with Continuous Tokens
3. Methodology
3.1. Overall Framework
3.2. Paired-Domain Joint Sequence Modeling
3.3. Complementary Bidirectional Masking Strategy
3.3.1. Complementary Mask Construction
3.3.2. Bidirectional Completion Targets
3.3.3. Training and Inference Separation
3.4. Token Diffusion Objective with Conditional Denoising
- Conditional token distribution. We start from the masked joint sequence of Equation (8) and its contextual representation from Equation (4). Recall that the joint sequence places degraded tokens in the first $N$ positions and clean tokens in the last $N$ positions. For a spatial index $i \in \{1, \dots, N\}$, the degraded-domain token therefore corresponds to the joint index $i$, and the clean-domain token corresponds to the joint index $N + i$. For each masked position, the transformer feature at that joint index serves as a conditioning vector for token generation. The masked-token variable is the completion target of Equation (9) for whichever domain is masked at that location, and the corresponding conditioning feature is the contextual representation at the same joint index. This formulation provides a unified prediction interface across domains while remaining domain-aware through the joint index.
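- Forward noising process. For each masked token $\mathbf{u}$, we uniformly sample a diffusion step $t \in \{1, \dots, T\}$ and add Gaussian noise:
  $$\mathbf{u}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{u} + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),$$
  where $\bar{\alpha}_t$ denotes a predefined noise schedule and $\mathbf{u}_t$ denotes the noisy token at step $t$. Specifically, $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$ is the cumulative noise coefficient induced by the predefined schedule $\{\beta_s\}_{s=1}^{T}$. As $t$ increases, $\mathbf{u}_t$ becomes progressively less informative about $\mathbf{u}$.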
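- Conditional denoising objective. A lightweight denoising head $\boldsymbol{\epsilon}_{\phi}$ takes the noisy token $\mathbf{u}_{j,t}$ at joint index $j$, the diffusion step $t$, and the conditioning feature $\mathbf{z}_j$ as input and predicts the added noise. We minimize a noise-prediction objective over a masked index set $\mathcal{S}$:
  $$\mathcal{L}_{\mathrm{diff}}(\mathcal{S}) = \frac{1}{|\mathcal{S}|} \sum_{j \in \mathcal{S}} \mathbb{E}_{t, \boldsymbol{\epsilon}} \left[ \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\phi}(\mathbf{u}_{j,t}, t, \mathbf{z}_j) \right\|_2^2 \right], \tag{15}$$
  where $|\mathcal{S}|$ is the cardinality of $\mathcal{S}$. Minimizing Equation (15) trains a shared denoiser to model the conditional distribution of the masked tokens. Gradients are backpropagated through $\mathbf{z}_j$ to jointly optimize the transformer parameters $\theta$ and the denoising-head parameters $\phi$.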
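- Main and auxiliary objectives. Using Equation (15), we define two directional losses by evaluating the same diffusion objective on two disjoint index subsets, the masked clean positions $\mathcal{S}_{\mathrm{c}}$ and the masked degraded positions $\mathcal{S}_{\mathrm{d}}$:
  $$\mathcal{L}_{\mathrm{d} \to \mathrm{c}} = \mathcal{L}_{\mathrm{diff}}(\mathcal{S}_{\mathrm{c}}), \qquad \mathcal{L}_{\mathrm{c} \to \mathrm{d}} = \mathcal{L}_{\mathrm{diff}}(\mathcal{S}_{\mathrm{d}}), \tag{16}$$
  where $\mathcal{L}_{\mathrm{diff}}(\cdot)$ denotes the objective of Equation (15) evaluated on the given index set. These correspond to predicting masked clean tokens conditioned on visible degraded tokens and masked degraded tokens conditioned on visible clean tokens, respectively. Both terms share the same joint sequence and model parameters and differ only in the index subset used for loss evaluation. Substituting Equation (16) into Equation (11) instantiates the overall training objective as a diffusion-based masked-token objective.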
- Sampling. During inference, for each masked token, we run the reverse diffusion process conditioned on the transformer feature at its joint index, starting from Gaussian noise and producing a clean-token sample after $T_k$ reverse steps, where $T_k$ denotes the number of reverse diffusion steps allocated to the k-th inference iteration according to the progress-to-step schedule in Section 3.5. The sampled tokens are placed back into the clean-token positions and decoded into the image space. During restoration inference, we keep the degraded tokens from the input y fixed and generate only the missing clean tokens, consistent with the degraded-to-clean direction.
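The forward noising step and the noise-prediction objective described above can be sketched in plain Python. This is an illustration under stated assumptions, not the paper's code: the function names, the linear beta schedule, the toy dimensions, and the zero-output denoiser are all hypothetical stand-ins, and the indices are 0-based (degraded tokens at 0..N-1, clean tokens at N..2N-1).

```python
import math
import random


def cumulative_alphas(betas):
    """Return alpha-bar_t = prod_{s<=t} (1 - beta_s) for every step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(token, alpha_bar, rng):
    """Forward noising: sqrt(a) * token + sqrt(1 - a) * eps, eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in token]
    scale_signal, scale_noise = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    noisy = [scale_signal * u + scale_noise * e for u, e in zip(token, eps)]
    return noisy, eps


def masked_noise_loss(tokens, features, index_set, alpha_bars, denoiser, rng):
    """Average squared noise-prediction error over the joint indices in index_set."""
    total = 0.0
    for j in index_set:
        t = rng.randrange(len(alpha_bars))  # uniformly sampled diffusion step
        noisy, eps = add_noise(tokens[j], alpha_bars[t], rng)
        predicted = denoiser(noisy, t, features[j])
        total += sum((e - p) ** 2 for e, p in zip(eps, predicted))
    return total / len(index_set)


def zero_denoiser(noisy, t, feature):
    """Stand-in for the learned denoising head (always predicts zero noise)."""
    return [0.0] * len(noisy)


rng = random.Random(0)
N, dim, T = 4, 3, 50  # toy sizes, chosen only for illustration
betas = [1e-4 + (0.02 - 1e-4) * s / (T - 1) for s in range(T)]  # assumed linear schedule
alpha_bars = cumulative_alphas(betas)
tokens = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(2 * N)]
features = tokens  # placeholder for transformer features (hypothetical)

# Complementary masks: locations 0 and 2 hide the degraded token,
# locations 1 and 3 hide the clean token.
masked_degraded = [0, 2]
masked_clean = [N + 1, N + 3]
loss_d2c = masked_noise_loss(tokens, features, masked_clean, alpha_bars, zero_denoiser, rng)
loss_c2d = masked_noise_loss(tokens, features, masked_degraded, alpha_bars, zero_denoiser, rng)
print(loss_d2c, loss_c2d)
```

Both directional losses call the same function with the same parameters and differ only in the index set, which mirrors how the two terms are described in the text.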
3.5. Progress-to-Step Guided Sampling for Efficient Inference
4. Experiments
4.1. Datasets and Evaluation Metrics
- Snow100K [8] is a standard benchmark for image desnowing. It contains 50,000 training pairs and 50,000 test pairs. The synthetic test set is divided into three subsets, Snow100K-S/M/L, corresponding to light, medium, and heavy snow, with 16,611, 16,588, and 16,801 images, respectively. Snow100K also includes 1329 real snowy images (Snow100K-Real) without paired ground truth, which we use to assess real-world generalization.
- Outdoor-Rain [20] targets joint deraining and dehazing under heavy-rain conditions. The training set contains 9000 paired images. For evaluation, we follow the standard protocol and report results on the Test1 split, which contains 750 image pairs.
- RainDrop [10] focuses on raindrops adhered to a glass window or camera lens, which introduce localized occlusion-like artifacts. The dataset includes 861 training image pairs. For quantitative evaluation, we adopt the standard RainDrop-A test subset, which contains 58 image pairs and has been used in prior work for consistent comparison.
- Evaluation metrics. We report peak signal-to-noise ratio (PSNR) [74] and structural similarity (SSIM) [75] on the paired test sets. Following common image-restoration practice, we compute PSNR and SSIM on the luminance channel Y in the YCbCr color space for fair comparison, in accordance with prior convention [10,76,77]. To evaluate real-world restoration quality in the absence of ground truth, we additionally employ two no-reference image quality metrics, NIQE [78] and IL-NIQE [79]. Lower NIQE and IL-NIQE values indicate better perceptual image quality.
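To make the luminance-channel protocol concrete, the sketch below computes PSNR on the Y channel of YCbCr for 8-bit RGB images. The text does not state which RGB-to-Y conversion is used, so the BT.601 "studio range" formula (the convention of MATLAB's rgb2ycbcr, common in restoration benchmarks) and the peak value of 255 are assumptions; the function names and the nested-list image layout are illustrative only.

```python
import math


def rgb_to_y(r, g, b):
    """BT.601 luma in the range [16, 235] for 8-bit RGB inputs in [0, 255]."""
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr_y(restored, reference, peak=255.0):
    """PSNR between two images, computed on the Y channel only.

    Each image is a list of rows, and each row is a list of (r, g, b) tuples.
    """
    squared_error, count = 0.0, 0
    for row_a, row_b in zip(restored, reference):
        for (r1, g1, b1), (r2, g2, b2) in zip(row_a, row_b):
            diff = rgb_to_y(r1, g1, b1) - rgb_to_y(r2, g2, b2)
            squared_error += diff * diff
            count += 1
    mse = squared_error / count
    if mse == 0.0:
        return float("inf")  # identical luminance
    return 10.0 * math.log10(peak * peak / mse)


# Tiny 2x2 example: the restored image is off by a few intensity levels.
reference = [[(120, 130, 140), (10, 20, 30)], [(200, 210, 220), (255, 255, 255)]]
restored = [[(122, 129, 141), (12, 18, 30)], [(198, 212, 219), (250, 251, 252)]]
print(round(psnr_y(restored, reference), 2), "dB")
```

Because chroma is discarded, two images that differ only in color balance can score higher under this protocol than under RGB PSNR, which is one reason the convention has to match across compared methods.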
4.2. Training Details
- Tokenizer and inputs. We use a shared, frozen KL-regularized VAE tokenizer with a downsampling factor of 16 (KL-16) [17] for both degraded and clean images. All experiments and model variants in this work use the same frozen KL-16 tokenizer. Following standard MAR-style continuous latent modeling [18], we treat KL-16 as a fixed image-to-token interface rather than as a research variable or contribution of this paper. Consequently, the performance differences reported in this work primarily reflect the restoration design of WeatherMAR rather than differences in tokenizer design. For an input of size $H \times W$, the tokenizer outputs a continuous latent grid of spatial size $(H/16) \times (W/16)$, which is then flattened into a sequence of $N = HW/256$ tokens per image. During training, we extract aligned crops from each degraded–clean pair to preserve pixel-wise correspondence.
- Backbone and diffusion head. We adopt a MAR-style masked iterative transformer with a joint-sequence length of $2N$ ($N$ degraded tokens followed by $N$ clean tokens) and learnable positional embeddings. We use the mar_large [18] configuration, with embedding dimension 1024, depth 16, 16 attention heads, and an MLP ratio of 4, together with attention dropout of 0.1 and projection dropout of 0.1. Masked positions are represented by a learnable mask token. Masked-token generation uses a conditional diffusion head that follows the standard MAR design for continuous-valued visual tokens [18]: an AdaLN-conditioned MLP with depth 12 and width 1536, conditioned on transformer features at the corresponding joint indices. This choice suits WeatherMAR, which performs masked-token prediction in a continuous latent space and therefore benefits from conditional distribution modeling rather than deterministic token regression.
- Optimization. All experiments are implemented in PyTorch 2.8.0+cu128 [80] and trained on an NVIDIA RTX 4090 GPU. We use AdamW [81] with a learning rate of , weight decay of 0.02, and . We train for 400 epochs with a batch size of 16, enable mixed-precision training with bfloat16, apply gradient clipping with a threshold of 1.0, and maintain an exponential moving average (EMA) of the model parameters with a decay of 0.9999 for evaluation. For the main comparisons in Table 1, we follow the standard benchmark protocols for Snow100K, Outdoor-Rain, and RainDrop by training and evaluating WeatherMAR separately on each dataset under its corresponding setting. This ensures that comparisons are conducted under the same dataset-specific protocol as prior methods.
- Masking and inference defaults. During training, we apply complementary bidirectional masking (Section 3.3) with a masking ratio of to sample and set . This ensures each spatial location contributes supervision to exactly one domain and that the joint sequence contains N masked tokens in each forward pass. During inference, the model observes only y, keeps the degraded tokens fixed, initializes all clean-token positions with [MASK], and performs MAR parallel completion for iterations using a cosine unmasking schedule with randomized order [18]. Unless otherwise specified, we use the progress-to-step schedule (Equation (20)) with and in all reported results.
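The inference defaults above combine two schedules: how many clean tokens are generated at each iteration, and how many reverse diffusion steps each iteration receives. The sketch below illustrates both under explicit assumptions. The cosine rule follows the usual MAR-style form (the fraction of tokens still masked after iteration k is cos(π/2 · (k+1)/K)). The step allocation is a hypothetical linear decay used only to show the "more steps early, fewer steps late" behaviour; it is not Equation (20), which is not reproduced here. All names and numeric values are illustrative.

```python
import math


def cosine_unmask_counts(num_tokens, num_iters):
    """Number of tokens generated at each iteration under a cosine schedule."""
    counts, remaining = [], num_tokens
    for k in range(num_iters):
        ratio = math.cos(math.pi / 2.0 * (k + 1) / num_iters)
        keep_masked = math.floor(num_tokens * ratio)
        # Generate at least one token per iteration while any remain masked.
        keep_masked = max(0, min(remaining - 1, keep_masked))
        if k == num_iters - 1:
            keep_masked = 0  # the final iteration completes the sequence
        counts.append(remaining - keep_masked)
        remaining = keep_masked
    return counts


def steps_for_iteration(k, num_iters, max_steps, min_steps):
    """Hypothetical linear decay from max_steps (k = 0) to min_steps (last k)."""
    progress = k / max(1, num_iters - 1)
    return round(max_steps + (min_steps - max_steps) * progress)


N, K = 256, 8                     # assumed token count and iteration count
max_steps, min_steps = 100, 20    # assumed step budget, for illustration only
counts = cosine_unmask_counts(N, K)
steps = [steps_for_iteration(k, K, max_steps, min_steps) for k in range(K)]
for k, (c, s) in enumerate(zip(counts, steps)):
    print(f"iteration {k}: generate {c} tokens with {s} reverse steps")
```

Under a cosine schedule the early iterations generate only a few tokens and the late iterations generate many, so assigning the larger step counts to the early iterations concentrates the denoising budget on a small number of tokens.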
4.3. Multi-Weather Image Restoration Results
4.3.1. Comparison Baselines and Protocol
4.3.2. Quantitative Comparison
4.3.3. Qualitative Evaluation
4.3.4. No-Reference Quantitative Evaluation on Real Snow Images
4.4. Ablation Studies
4.4.1. Component-Wise Ablation
4.4.2. Masking Strategy Ablation
4.4.3. Efficiency Analysis of ProS Scheduling
4.4.4. Token Prediction Head Ablation
4.4.5. Key Hyperparameter Ablation
4.4.6. Higher-Resolution Feasibility Study
5. Discussion
5.1. Discussion on the Frozen KL-16 Tokenizer
5.2. Scope and Future Evaluation Directions
5.3. Failure Cases and Limitations
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
References
- Fu, X.; Huang, J.; Zeng, D.; Huang, Y.; Ding, X.; Paisley, J. Removing rain from single images via a deep detail network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3855–3863.
- Yang, W.; Tan, R.T.; Feng, J.; Liu, J.; Guo, Z.; Yan, S. Deep joint rain detection and removal from a single image. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1357–1366.
- Li, X.; Wu, J.; Lin, Z.; Liu, H.; Zha, H. Recurrent squeeze-and-excitation context aggregation net for single image deraining. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 254–269.
- Zhang, J.; Ren, W.; Zhang, S.; Zhang, H.; Nie, Y.; Xue, Z.; Cao, X. Hierarchical density-aware dehazing network. IEEE Trans. Cybern. 2021, 52, 11187–11199.
- Wu, H.; Qu, Y.; Lin, S.; Zhou, J.; Qiao, R.; Zhang, Z.; Xie, Y.; Ma, L. Contrastive learning for compact single image dehazing. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10551–10560.
- Sun, S.; Ren, W.; Wang, T.; Cao, X. Rethinking image restoration for object detection. Adv. Neural Inf. Process. Syst. 2022, 35, 4461–4474.
- Li, B.; Ren, W.; Fu, D.; Tao, D.; Feng, D.; Zeng, W.; Wang, Z. Benchmarking single-image dehazing and beyond. IEEE Trans. Image Process. 2018, 28, 492–505.
- Liu, Y.F.; Jaw, D.W.; Huang, S.C.; Hwang, J.N. Desnownet: Context-aware deep network for snow removal. IEEE Trans. Image Process. 2018, 27, 3064–3073.
- Zhang, K.; Li, R.; Yu, Y.; Luo, W.; Li, C. Deep dense multi-scale network for snow removal using semantic and depth priors. IEEE Trans. Image Process. 2021, 30, 7419–7431.
- Qian, R.; Tan, R.T.; Yang, W.; Su, J.; Liu, J. Attentive generative adversarial network for raindrop removal from a single image. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2482–2491.
- Quan, Y.; Deng, S.; Chen, Y.; Ji, H. Deep learning for seeing through window with raindrops. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2463–2471.
- Li, R.; Tan, R.T.; Cheong, L.F. All in one bad weather removal using architectural search. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 3175–3185.
- Valanarasu, J.M.J.; Yasarla, R.; Patel, V.M. Transweather: Transformer-based restoration of images degraded by adverse weather conditions. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2353–2363.
- Zhu, Y.; Wang, T.; Fu, X.; Yang, X.; Guo, X.; Dai, J.; Qiao, Y.; Hu, X. Learning weather-general and weather-specific features for image restoration under multiple adverse weather conditions. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21747–21758.
- Özdenizci, O.; Legenstein, R. Restoring vision in adverse weather conditions with patch-based denoising diffusion models. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10346–10357.
- Ye, T.; Chen, S.; Bai, J.; Shi, J.; Xue, C.; Jiang, J.; Yin, J.; Chen, E.; Liu, Y. Adverse weather removal with codebook priors. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 12653–12664.
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695.
- Li, T.; Tian, Y.; Li, H.; Deng, M.; He, K. Autoregressive Image Generation without Vector Quantization. Adv. Neural Inf. Process. Syst. 2024, 37, 56424–56445.
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
- Li, R.; Cheong, L.F.; Tan, R.T. Heavy rain image restoration: Integrating physics model and conditional adversarial learning. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1633–1642.
- Sun, S.; Ren, W.; Gao, X.; Wang, R.; Cao, X. Restoring images in adverse weather conditions via histogram transformer. In European Conference on Computer Vision 2024; Springer: Cham, Switzerland, 2024; pp. 111–129.
- Liao, R.; Li, F.; Wei, Y.; Shi, Z.; Zhang, L.; Bai, H.; Wang, M. Prompt to Restore, Restore to Prompt: Cyclic prompting for universal adverse weather removal. IEEE Trans. Image Process. 2025, 34, 7422–7435.
- Ye, T.; Zhang, Y.; Jiang, M.; Chen, L.; Liu, Y.; Chen, S.; Chen, E. Perceiving and modeling density for image dehazing. In European Conference on Computer Vision 2022; Springer: Cham, Switzerland, 2022; pp. 130–145.
- Liu, Y.; Yan, Z.; Wu, A.; Ye, T.; Li, Y. Nighttime image dehazing based on variational decomposition model. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 640–649.
- Liu, Y.; Yan, Z.; Chen, S.; Ye, T.; Ren, W.; Chen, E. Nighthazeformer: Single nighttime haze removal using prior query transformer. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 4119–4128.
- Chen, S.; Ye, T.; Shi, J.; Liu, Y.; Jiang, J.; Chen, E.; Chen, P. Dehrformer: Real-time transformer for depth estimation and haze removal from varicolored haze scenes. In ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: New York, NY, USA, 2023; pp. 1–5.
- Chen, S.; Ye, T.; Liu, Y.; Chen, E. SnowFormer: Context interaction transformer with scale-awareness for single image desnowing. arXiv 2022, arXiv:2208.09703.
- Chen, S.; Ye, T.; Liu, Y.; Liao, T.; Jiang, J.; Chen, E.; Chen, P. Msp-former: Multi-scale projection transformer for single image desnowing. In ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: New York, NY, USA, 2023; pp. 1–5.
- Ye, T.; Chen, S.; Liu, Y.; Ye, Y.; Bai, J.; Chen, E. Towards real-time high-definition image snow removal: Efficient pyramid network with asymmetrical encoder-decoder architecture. In Proceedings of the 2022 Asian Conference on Computer Vision, Macau, China, 4–8 December 2022; pp. 366–381.
- Jin, Y.; Yang, W.; Tan, R.T. Unsupervised night image enhancement: When layer decomposition meets light-effects suppression. In European Conference on Computer Vision 2022; Springer: Cham, Switzerland, 2022; pp. 404–421.
- Jin, Y.; Yan, W.; Yang, W.; Tan, R.T. Structure representation network and uncertainty feedback learning for dense non-uniform fog removal. In Asian Conference on Computer Vision 2022; Springer: Cham, Switzerland, 2022; pp. 155–172.
- Ren, J.; Zheng, Q.; Zhao, Y.; Xu, X.; Li, C. Dlformer: Discrete latent transformer for video inpainting. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3511–3520.
- Ye, T.; Chen, S.; Liu, Y.; Ye, Y.; Chen, E.; Li, Y. Underwater light field retention: Neural rendering for underwater imaging. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 488–497.
- Jin, Y.; Ye, W.; Yang, W.; Yuan, Y.; Tan, R.T. Des3: Adaptive attention-driven self and soft shadow removal using vit similarity. In Proceedings of the AAAI Conference on Artificial Intelligence; Association for the Advancement of Artificial Intelligence: Washington, DC, USA, 2024; Volume 38, pp. 2634–2642.
- Huang, J.; Zhao, F.; Zhou, M.; Xiao, J.; Zheng, N.; Zheng, K.; Xiong, Z. Learning sample relationship for exposure correction. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9904–9913.
- Jin, Y.; Li, R.; Yang, W.; Tan, R.T. Estimating reflectance layer from a single image: Integrating reflectance guidance and shadow/specular aware learning. In Proceedings of the AAAI Conference on Artificial Intelligence; Association for the Advancement of Artificial Intelligence: Washington, DC, USA, 2023; Volume 37, pp. 1069–1077.
- Huang, J.; Liu, Y.; Fu, X.; Zhou, M.; Wang, Y.; Zhao, F.; Xiong, Z. Exposure normalization and compensation for multiple-exposure correction. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 6043–6052.
- Yu, H.; Huang, J.; Liu, Y.; Zhu, Q.; Zhou, M.; Zhao, F. Source-free domain adaptation for real-world image dehazing. In Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal, 10–14 October 2022; pp. 6645–6654.
- Zhang, H.; Sindagi, V.; Patel, V.M. Image de-raining using a conditional generative adversarial network. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 3943–3956.
- Zhang, H.; Patel, V.M. Density-aware single image de-raining using a multi-stream dense network. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 695–704.
- Yasarla, R.; Patel, V.M. Uncertainty guided multi-scale residual learning-using a cycle spinning cnn for single image de-raining. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8405–8414.
- Ren, W.; Tian, J.; Han, Z.; Chan, A.; Tang, Y. Video desnowing and deraining based on matrix decomposition. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4210–4219.
- Li, M.; Cao, X.; Zhao, Q.; Zhang, L.; Meng, D. Online rain/snow removal from surveillance videos. IEEE Trans. Image Process. 2021, 30, 2029–2044.
- You, S.; Tan, R.T.; Kawakami, R.; Mukaigawa, Y.; Ikeuchi, K. Adherent raindrop modeling, detection and removal in video. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 1721–1733.
- Zhang, K.; Li, D.; Luo, W.; Ren, W. Dual attention-in-attention model for joint rain streak and raindrop removal. IEEE Trans. Image Process. 2021, 30, 7608–7619.
- Li, B.; Liu, X.; Hu, P.; Wu, Z.; Lv, J.; Peng, X. All-in-one image restoration for unknown corruption. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17452–17462.
- Wang, T.; Yang, X.; Xu, K.; Chen, S.; Zhang, Q.; Lau, R.W. Spatial attentive single-image deraining with a high quality real rain dataset. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12270–12279.
- Chen, W.T.; Fang, H.Y.; Ding, J.J.; Tsai, C.C.; Kuo, S.Y. JSTASR: Joint size and transparency-aware snow removal algorithm based on modified partial convolution and veiling effect removal. In European Conference on Computer Vision 2020; Springer: Cham, Switzerland, 2020; pp. 754–770.
- Kang, L.W.; Lin, C.W.; Fu, Y.H. Automatic single-image-based rain streaks removal via image decomposition. IEEE Trans. Image Process. 2011, 21, 1742–1755.
- Wei, Y.; Zhang, Z.; Wang, Y.; Xu, M.; Yang, Y.; Yan, S.; Wang, M. Deraincyclegan: Rain attentive cyclegan for single image deraining and rainmaking. IEEE Trans. Image Process. 2021, 30, 4788–4801.
- Chen, W.T.; Huang, Z.K.; Tsai, C.C.; Yang, H.H.; Ding, J.J.; Kuo, S.Y. Learning multiple adverse weather removal via two-stage knowledge learning and multi-contrastive regularization: Toward a unified model. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17653–17662.
- Wang, T.; Zhang, K.; Shao, Z.; Luo, W.; Stenger, B.; Lu, T.; Kim, T.K.; Liu, W.; Li, H. Gridformer: Residual dense transformer with grid structure for image restoration in adverse weather conditions. Int. J. Comput. Vis. 2024, 132, 4541–4563.
- Shurui, P.; Lin, X.; Luo, S.; Ou, J.; Zhang, D.; Qi, L.; Nguyen, T.; Ren, C. SLER-IR: Spherical Layer-wise Expert Routing for All-in-One Image Restoration. arXiv 2026, arXiv:2603.05940.
- Zhang, X.; Zhang, H.; Wang, G.; Zhang, Q.; Zhang, L. ClearAIR: A Human-Visual-Perception-Inspired All-in-One Image Restoration. arXiv 2026, arXiv:2601.02763.
- Liu, S.; Zuo, K.; Xiao, H. DELNet: Continuous All-in-One Weather Removal via Dynamic Expert Library. arXiv 2026, arXiv:2601.22573.
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
- Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. Palm: Scaling language modeling with pathways. J. Mach. Learn. Res. 2023, 24, 1–113.
- Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The llama 3 herd of models. arXiv 2024, arXiv:2407.21783.
- Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. Qwen3 technical report. arXiv 2025, arXiv:2505.09388.
- Chen, M.; Radford, A.; Child, R.; Wu, J.; Jun, H.; Luan, D.; Sutskever, I. Generative pretraining from pixels. In International Conference on Machine Learning; PMLR: London, UK, 2020; pp. 1691–1703.
- Van Den Oord, A.; Vinyals, O.; Kavukcuoglu, K. Neural discrete representation learning. Adv. Neural Inf. Process. Syst. 2017, 30.
- Esser, P.; Rombach, R.; Ommer, B. Taming transformers for high-resolution image synthesis. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12873–12883.
- Razavi, A.; Van den Oord, A.; Vinyals, O. Generating diverse high-fidelity images with vq-vae-2. Adv. Neural Inf. Process. Syst. 2019, 32.
- Yu, J.; Xu, Y.; Koh, J.Y.; Luong, T.; Baid, G.; Wang, Z.; Vasudevan, V.; Ku, A.; Yang, Y.; Ayan, B.K.; et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv 2022, arXiv:2206.10789.
- Tschannen, M.; Eastwood, C.; Mentzer, F. Givt: Generative infinite-vocabulary transformers. In European Conference on Computer Vision 2024; Springer: Cham, Switzerland, 2024; pp. 292–309.
- Tang, H.; Wu, Y.; Yang, S.; Xie, E.; Chen, J.; Chen, J.; Zhang, Z.; Cai, H.; Lu, Y.; Han, S. Hart: Efficient visual generation with hybrid autoregressive transformer. arXiv 2024, arXiv:2410.10812.
- Tschannen, M.; Pinto, A.S.; Kolesnikov, A. Jetformer: An autoregressive generative model of raw images and text. arXiv 2024, arXiv:2411.19722.
- Dong, R.; Han, C.; Peng, Y.; Qi, Z.; Ge, Z.; Yang, J.; Zhao, L.; Sun, J.; Zhou, H.; Wei, H.; et al. Dreamllm: Synergistic multimodal comprehension and creation. arXiv 2023, arXiv:2309.11499.
- Ge, Y.; Zhao, S.; Zhu, J.; Ge, Y.; Yi, K.; Song, L.; Li, C.; Ding, X.; Shan, Y. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv 2024, arXiv:2404.14396.
- He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009.
- Wei, C.; Mangalam, K.; Huang, P.Y.; Li, Y.; Fan, H.; Xu, H.; Wang, H.; Xie, C.; Yuille, A.; Feichtenhofer, C. Diffusion models as masked autoencoders. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 16284–16294.
- Li, Y.; Bornschein, J.; Chen, T. Denoising autoregressive representation learning. arXiv 2024, arXiv:2403.05196.
- Chi, C.; Feng, S.; Xu, Z.; Cousineau, E.A.; Burchfiel, B.; Song, S. Visuomotor Policy Learning via Action Diffusion. U.S. Patent Application 18/594842, 4 September 2025.
- Huynh-Thu, Q.; Ghanbari, M. Scope of validity of PSNR in image/video quality assessment. Electron. Lett. 2008, 44, 800–801.
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
- Xiao, J.; Fu, X.; Liu, A.; Wu, F.; Zha, Z.J. Image de-raining transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 12978–12995.
- Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H.; Shao, L. Multi-stage progressive image restoration. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14821–14831.
- Mittal, A.; Soundararajan, R.; Bovik, A.C. Making a “completely blind” image quality analyzer. IEEE Signal Process. Lett. 2012, 20, 209–212.
- Zhang, L.; Zhang, L.; Bovik, A.C. A feature-enriched completely blind image quality evaluator. IEEE Trans. Image Process. 2015, 24, 2579–2591.
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32.
- Loshchilov, I.; Hutter, F. Fixing weight decay regularization in adam. arXiv 2017, arXiv:1711.05101.
- Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232.
- Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
- Liu, X.; Suganuma, M.; Sun, Z.; Okatani, T. Dual residual networks leveraging the potential of paired operations for image restoration. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7007–7016. [Google Scholar]
- Jiang, K.; Wang, Z.; Yi, P.; Chen, C.; Wang, Z.; Wang, X.; Jiang, J.; Lin, C.W. Rain-free and residue hand-in-hand: A progressive coupled network for real-time image deraining. IEEE Trans. Image Process. 2021, 30, 7404–7418. [Google Scholar] [CrossRef]
- Chen, L.; Chu, X.; Zhang, X.; Sun, J. Simple baselines for image restoration. In European Conference on Computer Vision 2022; Springer: Cham, Switzerland, 2022; pp. 17–33. [Google Scholar]
- Tu, Z.; Talebi, H.; Zhang, H.; Yang, F.; Milanfar, P.; Bovik, A.; Li, Y. Maxim: Multi-axis mlp for image processing. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5769–5780. [Google Scholar]
- Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5728–5739. [Google Scholar]
- Chen, S.; Ye, T.; Zhang, K.; Xing, Z.; Lin, Y.; Zhu, L. Teaching tailored to talent: Adverse weather restoration via prompt pool and depth-anything constraint. In European Conference on Computer Vision 2024; Springer: Cham, Switzerland, 2024; pp. 95–115. [Google Scholar]








| Method (Image Desnowing) | Snow100K-S PSNR | Snow100K-S SSIM | Snow100K-L PSNR | Snow100K-L SSIM | Method (Deraining & Dehazing) | Outdoor-Rain PSNR | Outdoor-Rain SSIM | Method (Raindrop Removal) | RainDrop PSNR | RainDrop SSIM |
|---|---|---|---|---|---|---|---|---|---|---|
| SPANet [47] | 29.92 | 0.8260 | 23.70 | 0.7930 | CycleGAN [82] | 17.62 | 0.6560 | pix2pix [83] | 28.02 | 0.8547 |
| JSTASR [48] | 31.40 | 0.9012 | 25.32 | 0.8076 | pix2pix [83] | 19.09 | 0.7100 | DuRN [84] | 31.24 | 0.9259 |
| RESCAN [3] | 31.51 | 0.9032 | 26.08 | 0.8108 | HRGAN [20] | 21.56 | 0.8550 | RaindropAttn [11] | 31.44 | 0.9263 |
| DesnowNet [8] | 32.33 | 0.9500 | 27.17 | 0.8983 | PCNet [85] | 26.19 | 0.9015 | AttentiveGAN [10] | 31.59 | 0.9170 |
| DDMSNet [9] | 34.34 | 0.9445 | 28.85 | 0.8772 | MPRNet [77] | 28.03 | 0.9192 | IDT [76] | 31.87 | 0.9313 |
| NAFNet [86] | 34.79 | 0.9497 | 30.06 | 0.9017 | NAFNet [86] | 29.59 | 0.9027 | MAXIM [87] | 31.87 | 0.9352 |
| Restormer [88] | 36.01 | 0.9579 | 30.36 | 0.9068 | Restormer [88] | 30.03 | 0.9215 | Restormer [88] | 32.18 | 0.9408 |
| All-in-One [12] | – | – | 28.33 | 0.8820 | All-in-One [12] | 24.71 | 0.8980 | All-in-One [12] | 31.12 | 0.9268 |
| TransWeather [13] | 32.51 | 0.9341 | 29.31 | 0.8879 | TransWeather [13] | 28.83 | 0.9000 | TransWeather [13] | 30.17 | 0.9157 |
| Chen et al. [51] | 34.42 | 0.9469 | 30.22 | 0.9071 | Chen et al. [51] | 29.27 | 0.9147 | Chen et al. [51] | 31.81 | 0.9309 |
| WGWSNet [14] | 34.31 | 0.9460 | 30.16 | 0.9007 | WGWSNet [14] | 29.32 | 0.9207 | WGWSNet [14] | 32.38 | 0.9378 |
| WeatherDiff64 [15] | 35.83 | 0.9566 | 30.09 | 0.9041 | WeatherDiff64 [15] | 29.64 | 0.9312 | WeatherDiff64 [15] | 30.71 | 0.9312 |
| WeatherDiff128 [15] | 35.02 | 0.9516 | 29.58 | 0.8941 | WeatherDiff128 [15] | 29.72 | 0.9216 | WeatherDiff128 [15] | 29.66 | 0.9225 |
| AWRCP [16] | 36.92 | 0.9652 | 31.92 | 0.9341 | AWRCP [16] | 31.39 | 0.9329 | AWRCP [16] | 31.93 | 0.9314 |
| Histoformer [21] | 37.41 | 0.9656 | 32.16 | 0.9261 | Histoformer [21] | 32.08 | 0.9389 | Histoformer [21] | 33.06 | 0.9441 |
| T3-DiffWeather [89] | 37.55 | 0.9641 | 31.11 | 0.9180 | T3-DiffWeather [89] | 32.52 | 0.9339 | T3-DiffWeather [89] | 32.70 | 0.9414 |
| GridFormer [52] | 37.46 | 0.9640 | 31.71 | 0.9231 | GridFormer [52] | 31.87 | 0.9335 | GridFormer [52] | 32.39 | 0.9362 |
| CyclicPrompt [22] | 37.50 | 0.9655 | 32.16 | 0.9265 | CyclicPrompt [22] | 32.81 | 0.9371 | CyclicPrompt [22] | 32.57 | 0.9454 |
| WeatherMAR (Ours) | 38.14 | 0.9684 | 32.58 | 0.9274 | WeatherMAR (Ours) | 31.91 | 0.9396 | WeatherMAR (Ours) | 33.12 | 0.9452 |
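
For readers comparing the PSNR columns above, the following is a minimal sketch of the standard PSNR definition, together with a hypothetical helper (`mse_ratio`) that converts a PSNR gap into the corresponding mean-squared-error ratio. It assumes images scaled to [0, 1]; whether the reported numbers are computed on RGB or on the luminance channel is not stated in this section, so the sketch is illustrative only and is not the authors' evaluation code.

```python
import numpy as np


def psnr(reference, restored, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two same-shaped arrays."""
    reference = np.asarray(reference, dtype=np.float64)
    restored = np.asarray(restored, dtype=np.float64)
    mse = np.mean((reference - restored) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)


def mse_ratio(psnr_gain_db):
    """Factor by which MSE shrinks for a given PSNR gain (same peak value)."""
    return 10.0 ** (-psnr_gain_db / 10.0)


# Gap between WeatherMAR (38.14 dB) and T3-DiffWeather (37.55 dB) on Snow100K-S.
gain_db = 38.14 - 37.55
print(f"PSNR gain: {gain_db:.2f} dB, MSE ratio: {mse_ratio(gain_db):.3f}")
```

Read loosely, a 0.59 dB gain corresponds to an MSE ratio of about 0.87. Dataset-level PSNR is usually averaged per image, so this conversion is only a rough guide.
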

| Method | NIQE ↓ | IL-NIQE ↓ |
|---|---|---|
| TransWeather [13] | 3.161 | 22.207 |
| WeatherDiff64 [15] | 2.985 | 22.121 |
| WeatherDiff128 [15] | 2.964 | 21.976 |
| WeatherMAR (Ours) | 2.803 | 21.617 |

| Method | PSNR ↑ | SSIM ↑ |
|---|---|---|
| Conditional MAR baseline (mar_large) | 29.81 | 0.9204 |
| + Joint sequence modeling (Equation (3)) | 30.08 | 0.9232 |
| + Complementary masking (Section 3.3) | 31.64 | 0.9367 |
| + Reverse supervision (Equation (11)) | 31.92 | 0.9396 |
| + Progress-to-step schedule (Equation (20)) | 31.91 | 0.9396 |

| Method | PSNR ↑ | SSIM ↑ |
|---|---|---|
| Only-clean masking (standard conditional; same as the "+ Joint sequence modeling" variant) | 30.08 | 0.9232 |
| Independent masking (both domains, i.i.d.) | 30.56 | 0.9288 |
| Complementary masking | 31.64 | 0.9367 |
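
The three masking strategies compared above can be sketched as mask generators over N paired token positions. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, `True` means "masked", and treating the ratio as the fraction of clean tokens that are masked is an assumption.

```python
import numpy as np


def _random_subset_mask(n, ratio, rng):
    """Boolean mask with round(ratio * n) randomly chosen True entries."""
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=int(round(ratio * n)), replace=False)] = True
    return mask


def only_clean_masks(n, ratio, rng):
    """Standard conditional setting: degraded tokens are always visible."""
    return np.zeros(n, dtype=bool), _random_subset_mask(n, ratio, rng)


def independent_masks(n, ratio, rng):
    """Each domain is masked i.i.d.; a pair may be fully hidden or fully visible."""
    return rng.random(n) < ratio, rng.random(n) < ratio


def complementary_masks(n, ratio, rng):
    """Exactly one token of every degraded-clean pair is masked."""
    clean = _random_subset_mask(n, ratio, rng)
    return ~clean, clean


rng = np.random.default_rng(0)
degraded_mask, clean_mask = complementary_masks(256, 0.5, rng)
# XOR is True everywhere: each position hides exactly one of the two tokens.
print(bool(np.all(degraded_mask ^ clean_mask)))
```

Under independent masking, some positions lose both tokens and others keep both, which is the case the complementary constraint rules out by construction.
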

| Method | Step Schedule | Total Steps | Params (M) | Mem (GB) | Time (s) ↓ | Speed-Up ↑ | PSNR/SSIM ↑ |
|---|---|---|---|---|---|---|---|
| Outdoor-Rain | | | | | | | |
| WeatherMAR (fixed) | fixed | 3200 | 479 | 20.4 | 0.256 | 0.0% | 31.92/0.9396 |
| WeatherMAR + ProS | scheduled | 1788 | 479 | 20.3 | 0.224 | +12.5% | 31.91/0.9396 |
| Snow100K-L | | | | | | | |
| WeatherMAR (fixed) | fixed | 3200 | 479 | 22.5 | 0.287 | 0.0% | 32.60/0.9274 |
| WeatherMAR + ProS | scheduled | 1788 | 479 | 22.3 | 0.250 | +12.8% | 32.58/0.9274 |
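
The Speed-Up column above appears to be the relative reduction in inference time rather than a throughput ratio. The short check below recomputes it from the rounded table values, along with the relative reduction in total denoising steps. The helper name `relative_saving` is hypothetical, and the interpretation of the column is an assumption inferred from the numbers.

```python
def relative_saving(baseline, scheduled):
    """Fraction of the baseline cost removed by the scheduled variant."""
    return (baseline - scheduled) / baseline


# Rounded values copied from the efficiency table.
timings = {
    "Outdoor-Rain": (0.256, 0.224),
    "Snow100K-L": (0.287, 0.250),
}

for dataset, (fixed_time, pros_time) in timings.items():
    saving = relative_saving(fixed_time, pros_time)
    print(f"{dataset}: time reduced by {100 * saving:.1f}%")

step_saving = relative_saving(3200, 1788)
print(f"Total denoising steps reduced by {100 * step_saving:.1f}%")
```

From the rounded timings, the Snow100K-L figure comes out near 12.9% rather than the reported +12.8%; the difference is consistent with the table having been computed from unrounded times. The step count falls by roughly 44% while the time falls by roughly 12.5%, which suggests that the per-token denoising steps account for only part of the total inference cost.
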

| Variant | Token Prediction Head | PSNR/SSIM ↑ |
|---|---|---|
| WeatherMAR-Reg | direct L2 regression | 31.24/0.9315 |
| WeatherMAR | diffusion head | 31.91/0.9396 |

| Parameter | Setting | Time (s) ↓ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|
| Masking ratio r | 0.3 | – | 31.22 | 0.9335 |
| Masking ratio r | 0.5 | – | 31.91 | 0.9396 |
| Masking ratio r | 0.7 | – | 31.03 | 0.9312 |
| Loss weight | 0 | – | 31.64 | 0.9367 |
| Loss weight | 0.5 | – | 31.85 | 0.9384 |
| Loss weight | 1.0 | – | 31.91 | 0.9396 |
| Step range | | 0.132 | 31.84 | 0.9387 |
| Step range | | 0.224 | 31.91 | 0.9396 |
| Step range | | 0.265 | 31.95 | 0.9399 |

| Method | Resolution | Inference Mode | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|
| WeatherMAR | 256 × 256 | standard inference | 31.91 | 0.9396 |
| WeatherMAR | 720 × 480 | patch-based inference | 31.35 | 0.9314 |
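
The table above does not say how patch-based inference is carried out. One common approach, shown here purely as an assumption, is to tile the image with overlapping patches at the training resolution, restore each patch, and average the overlaps. The function names, the stride of 192, and the uniform averaging are all hypothetical choices; `restore_fn` stands in for any model that maps a patch to a restored patch of the same shape.

```python
import numpy as np


def tile_starts(length, patch, stride):
    """Start offsets covering [0, length), with the last tile flush to the edge."""
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] != length - patch:
        starts.append(length - patch)
    return starts


def restore_by_patches(image, restore_fn, patch=256, stride=192):
    """Apply restore_fn to overlapping patches and average overlapping outputs.

    image: H x W x C float array; restore_fn must preserve the patch shape.
    """
    h, w = image.shape[:2]
    out = np.zeros_like(image, dtype=np.float64)
    weight = np.zeros((h, w, 1), dtype=np.float64)
    for top in tile_starts(h, patch, stride):
        for left in tile_starts(w, patch, stride):
            tile = image[top:top + patch, left:left + patch]
            out[top:top + patch, left:left + patch] += restore_fn(tile)
            weight[top:top + patch, left:left + patch] += 1.0
    return out / weight  # every pixel is covered at least once


# A 720 x 480 (width x height) input with an identity stand-in for the model.
image = np.random.default_rng(0).random((480, 720, 3))
restored = restore_by_patches(image, lambda tile: tile)
print(restored.shape)
```

Uniform averaging is the simplest blending rule; weighted blending (for example, a window that down-weights patch borders) is a frequent alternative when seams are visible.
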

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Ma, J.; Lv, Q.; Tan, Z. WeatherMAR: Complementary Masking of Paired Tokens for Adverse-Weather Image Restoration. J. Imaging 2026, 12, 154. https://doi.org/10.3390/jimaging12040154

