DMET: Dynamic Mask-Enhanced Transformer for Generalizable Deep Image Denoising
Abstract
1. Introduction
- We propose a new Dynamic Mask-Enhanced Transformer for image denoising, dubbed DMET. The model combines texture-guided adaptive input masking with a masked hierarchical attention block to improve generalization across diverse noise types: the input mask adaptively corrupts pixels according to local texture complexity, while the attention mask regularizes feature learning to bridge the training–testing gap.
- We introduce a texture-guided adaptive masking mechanism that adapts masking to local texture complexity, enforcing robust semantic recovery in smooth regions while preserving details in texture-rich areas. We also design a masked hierarchical attention block that combines shifted-window multi-head self-attention for long-range dependencies with channel attention for fine-detail recovery; a texture-guided gating weight dynamically fuses these two feature streams.
- We conduct extensive experiments on public datasets; the results show that the proposed DMET achieves competitive performance against state-of-the-art models. An ablation study and visualization analysis further validate the design of the proposed method.
2. Related Work
2.1. Image Denoising
2.2. Generalization Problem
2.3. Channel Attention
2.4. Mask Modeling
3. Proposed Method
3.1. Texture-Guided Adaptive Masking
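The core idea, as summarized in the contributions above, is to mask the input more aggressively where local texture is simple. As a rough illustration only, the following PyTorch sketch uses local variance as the texture-complexity measure and converts it into a per-pixel masking probability; the function names, the window size `k`, and the ceiling `p_max` are illustrative choices on our part, not values taken from the paper.

```python
# Minimal sketch of texture-guided adaptive input masking (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

def texture_map(x: torch.Tensor, k: int = 7) -> torch.Tensor:
    """Local variance as a simple texture-complexity proxy.
    x: (B, C, H, W) image in [0, 1]; returns (B, 1, H, W) normalized to [0, 1]."""
    gray = x.mean(dim=1, keepdim=True)                        # luminance proxy
    mu = F.avg_pool2d(gray, k, stride=1, padding=k // 2)      # local mean
    var = F.avg_pool2d(gray ** 2, k, stride=1, padding=k // 2) - mu ** 2
    var = var.clamp(min=0)
    vmax = var.amax(dim=(2, 3), keepdim=True).clamp(min=1e-8)
    return var / vmax                                          # high where textured

def adaptive_mask(x: torch.Tensor, p_max: float = 0.8) -> torch.Tensor:
    """Mask pixels more aggressively where texture complexity is low, so the
    network must recover semantics in smooth regions while texture-rich
    areas keep most of their pixels."""
    t = texture_map(x)                     # high in texture-rich regions
    p_mask = p_max * (1.0 - t)             # high in smooth regions
    keep = (torch.rand_like(p_mask) >= p_mask).float()
    return x * keep                        # masked (zeroed-out) input
```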
3.2. Masked Hierarchical Attention Block
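As described in the contributions, this block pairs shifted-window multi-head self-attention (for long-range dependencies) with channel attention (for fine-detail recovery). The sketch below is a minimal, assumption-laden rendering of those two branches: the window partitioning follows the standard Swin recipe, but the relative position bias and the attention mask for wrapped shifted windows that a full implementation would include are omitted for brevity, and all module names and hyperparameters are illustrative.

```python
# Minimal sketch of the two attention branches (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowAttention(nn.Module):
    """(Shifted-)window multi-head self-attention, Swin-style; no relative
    position bias and no wrapped-window mask (simplifications)."""
    def __init__(self, dim, heads=4, win=8, shift=0):
        super().__init__()
        self.heads, self.win, self.shift = heads, win, shift
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                   # x: (B, C, H, W); H, W divisible by win
        B, C, H, W = x.shape
        w = self.win
        if self.shift:                      # cyclic shift links neighboring windows
            x = torch.roll(x, (-self.shift, -self.shift), dims=(2, 3))
        # partition into (B * num_windows, w * w, C) windows
        xw = x.view(B, C, H // w, w, W // w, w).permute(0, 2, 4, 3, 5, 1)
        xw = xw.reshape(-1, w * w, C)
        q, k, v = self.qkv(xw).chunk(3, dim=-1)
        def split(t):                       # -> (N, heads, w * w, C // heads)
            return t.view(t.shape[0], w * w, self.heads, -1).transpose(1, 2)
        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        out = out.transpose(1, 2).reshape(-1, w * w, C)
        out = self.proj(out)
        # merge windows back to (B, C, H, W)
        out = out.view(B, H // w, W // w, w, w, C).permute(0, 5, 1, 3, 2, 4)
        out = out.reshape(B, C, H, W)
        if self.shift:
            out = torch.roll(out, (self.shift, self.shift), dims=(2, 3))
        return out

class ChannelAttention(nn.Module):
    """SE-style channel attention for fine-detail recovery."""
    def __init__(self, dim, r=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, dim // r, 1), nn.ReLU(),
            nn.Conv2d(dim // r, dim, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(x)               # per-channel reweighting
```

The outputs of the two branches are combined by the texture-guided fusion sketched in Section 3.3.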
3.3. Dynamic Feature Fusion
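The contributions describe a texture-guided gating weight that dynamically fuses the window-attention and channel-attention features. A minimal sketch, assuming a sigmoid gate predicted from the texture map of Section 3.1; the convex-combination form below is our reading of "dynamic fusion", not a confirmed detail of the paper.

```python
# Minimal sketch of texture-guided dynamic feature fusion (illustrative).
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    """Fuse the spatial (window-attention) and channel-attention features with
    a texture-conditioned gate: out = g * f_spatial + (1 - g) * f_channel."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(1, dim, 3, padding=1), nn.Sigmoid())

    def forward(self, f_spatial, f_channel, texture):
        # texture: (B, 1, H, W) complexity map computed from the input
        g = self.gate(texture)              # per-pixel, per-channel weight in (0, 1)
        return g * f_spatial + (1.0 - g) * f_channel
```

Given features `f_sw` and `f_ca` from the two branches of Section 3.2, a call would look like `DynamicFusion(64)(f_sw, f_ca, texture_map(x))`.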
4. Experimental Results and Discussion
4.1. Experimental Settings
4.2. Quantitative Comparison
4.2.1. Generalization Performance
4.2.2. Evaluation on Speckle Noise
4.2.3. Evaluation on Salt-and-Pepper Noise
4.2.4. Evaluation on Monte Carlo Rendering Noise
4.3. Ablation Study
4.4. Computational Complexity
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Cheng, J.; Liang, D.; Tan, S. Transfer CLIP for generalizable image denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 25974–25984. [Google Scholar]
- Joshi, A.; Akalwadi, N.; Mandi, C.; Desai, C.; Tabib, R.A.; Patil, U.; Mudenagudi, U. HNN: Hierarchical Noise-Deinterlace Net Towards Image Denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 3007–3016. [Google Scholar]
- Zhou, Y.; Lin, J.; Ye, F.; Qu, Y.; Xie, Y. Efficient lightweight image denoising with triple attention transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 7704–7712. [Google Scholar]
- Hu, Y.; Tian, C.; Zhang, J.; Zhang, S. Efficient image denoising with heterogeneous kernel-based CNN. Neurocomputing 2024, 592, 127799. [Google Scholar] [CrossRef]
- Tian, C.; Zheng, M.; Lin, C.W.; Li, Z.; Zhang, D. Heterogeneous window transformer for image denoising. IEEE Trans. Syst. Man Cybern. Syst. 2024, 54, 6621–6632. [Google Scholar] [CrossRef]
- Brooks, T.; Mildenhall, B.; Xue, T.; Chen, J.; Sharlet, D.; Barron, J.T. Unprocessing images for learned raw denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11036–11045. [Google Scholar]
- Wei, K.; Fu, Y.; Yang, J.; Huang, H. A physics-based noise formation model for extreme low-light raw denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2758–2767. [Google Scholar]
- Zhang, K.; Zuo, W.; Chen, Y.; Meng, D.; Zhang, L. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Trans. Image Process. 2017, 26, 3142–3155. [Google Scholar] [CrossRef] [PubMed]
- Chen, J.; Chen, J.; Chao, H.; Yang, M. Image blind denoising with generative adversarial network based noise modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3155–3164. [Google Scholar]
- Abdelhamed, A.; Lin, S.; Brown, M.S. A high-quality denoising dataset for smartphone cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1692–1700. [Google Scholar]
- Yuan, Y.; Liu, S.; Zhang, J.; Zhang, Y.; Dong, C.; Lin, L. Unsupervised image super-resolution using cycle-in-cycle generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 701–710. [Google Scholar]
- Chen, H.; Gu, J.; Liu, Y.; Magid, S.A.; Dong, C.; Wang, Q.; Pfister, H.; Zhu, L. Masked image training for generalizable deep image denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1692–1703. [Google Scholar]
- Buades, A.; Coll, B.; Morel, J.M. A non-local algorithm for image denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–26 June 2005; Volume 2, pp. 60–65. [Google Scholar]
- Gu, S.; Zhang, L.; Zuo, W.; Feng, X. Weighted nuclear norm minimization with application to image denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2862–2869. [Google Scholar]
- Mairal, J.; Bach, F.; Ponce, J.; Sapiro, G.; Zisserman, A. Non-local sparse models for image restoration. In Proceedings of the IEEE International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; pp. 2272–2279. [Google Scholar]
- Dabov, K.; Foi, A.; Katkovnik, V.; Egiazarian, K. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Trans. Image Process. 2007, 16, 2080–2095. [Google Scholar] [CrossRef] [PubMed]
- Chen, C.; Xiong, Z.; Tian, X.; Wu, F. Deep boosting for image denoising. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–18. [Google Scholar]
- Jia, X.; Liu, S.; Feng, X.; Zhang, L. Focnet: A fractional optimal control network for image denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6054–6063. [Google Scholar]
- Zhang, Y.; Li, K.; Li, K.; Zhong, B.; Fu, Y. Residual non-local attention networks for image restoration. arXiv 2019, arXiv:1903.10082. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Chen, Z.; Zhang, Y.; Gu, J.; Kong, L.; Yuan, X. Cross aggregation transformer for image restoration. Adv. Neural Inf. Process. Syst. 2022, 35, 25478–25490. [Google Scholar]
- Yang, F.; Yang, H.; Fu, J.; Lu, H.; Guo, B. Learning texture transformer network for image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5791–5800. [Google Scholar]
- Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5728–5739. [Google Scholar]
- Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image restoration using Swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Montreal, BC, Canada, 11–17 October 2021; pp. 1833–1844. [Google Scholar]
- Krull, A.; Buchholz, T.O.; Jug, F. Noise2void-learning denoising from single noisy images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2129–2137. [Google Scholar]
- Quan, Y.; Chen, M.; Pang, T.; Ji, H. Self2self with dropout: Learning self-supervised denoising from single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1890–1898. [Google Scholar]
- Huang, T.; Li, S.; Jia, X.; Lu, H.; Liu, J. Neighbor2neighbor: Self-supervised denoising from single noisy images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14781–14790. [Google Scholar]
- Ponnambalam, M.; Ponnambalam, M.; Jamal, S.S. A robust color image encryption scheme with complex whirl wind spiral chaotic system and quadrant-wise pixel permutation. Phys. Scr. 2024, 99, 105239. [Google Scholar] [CrossRef]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Vedaldi, A. Gather-excite: Exploiting feature context in convolutional neural networks. Adv. Neural Inf. Process. Syst. 2018, 31, 9423–9433. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; Manzagol, P.A.; Bottou, L. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 2010, 11, 3371–3408. [Google Scholar]
- Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2536–2544. [Google Scholar]
- Chen, Y.; Pock, T. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1256–1272. [Google Scholar] [CrossRef] [PubMed]
- He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009. [Google Scholar]
- Xie, Z.; Zhang, Z.; Cao, Y.; Lin, Y.; Bao, J.; Yao, Z.; Dai, Q.; Hu, H. Simmim: A simple framework for masked image modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9653–9663. [Google Scholar]
- Timofte, R.; Gu, S.; Wu, J.; Van Gool, L.; et al. NTIRE 2018 challenge on single image super-resolution: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 852–863. [Google Scholar]
- Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144. [Google Scholar]
- Huang, J.B.; Singh, A.; Ahuja, N. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5197–5206. [Google Scholar]
- Martin, D.; Fowlkes, C.; Tal, D.; Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the IEEE International Conference on Computer Vision, Vancouver, BC, Canada, 7–14 July 2001; pp. 416–423. [Google Scholar]
- Anwar, S.; Barnes, N. Real image denoising with feature attention. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3155–3164. [Google Scholar]
- Kong, X.; Liu, X.; Gu, J.; Qiao, Y.; Dong, C. Reflash dropout in image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 6002–6012. [Google Scholar]
- Firmino, A.; Frisvad, J.R.; Jensen, H.W. Progressive denoising of Monte Carlo rendered images. Comput. Graph. Forum 2022, 41, 1–11. [Google Scholar] [CrossRef]
**Dataset: CBSD68 [43]. Noise type: speckle (noise levels 0.03 and 0.04).**

| Metric | Method | 0.03 | 0.04 |
|---|---|---|---|
| PSNR ↑ | DnCNN [8] | 26.90 | 24.84 |
| PSNR ↑ | RIDNet [44] | 27.03 | 24.87 |
| PSNR ↑ | RNAN [19] | 26.28 | 24.28 |
| PSNR ↑ | SwinIR [24] | 25.98 | 24.07 |
| PSNR ↑ | Restormer [23] | 26.84 | 25.17 |
| PSNR ↑ | Dropout [45] | 27.16 | 25.69 |
| PSNR ↑ | Masked Training [12] | 29.49 | 28.53 |
| PSNR ↑ | Ours | 29.68 | 28.99 |
| SSIM ↑ | DnCNN [8] | 0.7610 | 0.7035 |
| SSIM ↑ | RIDNet [44] | 0.7590 | 0.6999 |
| SSIM ↑ | RNAN [19] | 0.7451 | 0.6870 |
| SSIM ↑ | SwinIR [24] | 0.7362 | 0.6810 |
| SSIM ↑ | Restormer [23] | 0.7667 | 0.7202 |
| SSIM ↑ | Dropout [45] | 0.7804 | 0.7311 |
| SSIM ↑ | Masked Training [12] | 0.8502 | 0.8169 |
| SSIM ↑ | Ours | 0.8512 | 0.8277 |
**Dataset: Urban100 [42]. Noise type: salt-and-pepper (noise levels 0.002 and 0.004).**

| Metric | Method | 0.002 | 0.004 |
|---|---|---|---|
| PSNR ↑ | DnCNN [8] | 24.01 | 20.55 |
| PSNR ↑ | RIDNet [44] | 24.56 | 20.88 |
| PSNR ↑ | RNAN [19] | 23.01 | 19.87 |
| PSNR ↑ | SwinIR [24] | 22.90 | 19.74 |
| PSNR ↑ | Restormer [23] | 23.42 | 20.53 |
| PSNR ↑ | Dropout [45] | 26.33 | 23.48 |
| PSNR ↑ | Masked Training [12] | 28.58 | 26.93 |
| PSNR ↑ | Ours | 28.89 | 28.66 |
| SSIM ↑ | DnCNN [8] | 0.7372 | 0.5828 |
| SSIM ↑ | RIDNet [44] | 0.7372 | 0.5835 |
| SSIM ↑ | RNAN [19] | 0.7132 | 0.5582 |
| SSIM ↑ | SwinIR [24] | 0.7075 | 0.5507 |
| SSIM ↑ | Restormer [23] | 0.7145 | 0.5772 |
| SSIM ↑ | Dropout [45] | 0.7591 | 0.6279 |
| SSIM ↑ | Masked Training [12] | 0.8655 | 0.8074 |
| SSIM ↑ | Ours | 0.8905 | 0.8804 |
**Dataset: MC [46]. Noise type: Monte Carlo-rendered, at 128 and 64 samples per pixel (spp).**

| Metric | Method | 128 spp | 64 spp |
|---|---|---|---|
| PSNR ↑ | DnCNN [8] | 29.94 | 26.28 |
| PSNR ↑ | RIDNet [44] | 29.96 | 26.27 |
| PSNR ↑ | RNAN [19] | 29.86 | 26.26 |
| PSNR ↑ | SwinIR [24] | 24.98 | 24.59 |
| PSNR ↑ | Restormer [23] | 29.32 | 26.14 |
| PSNR ↑ | Dropout [45] | 28.85 | 26.10 |
| PSNR ↑ | Masked Training [12] | 30.62 | 28.25 |
| PSNR ↑ | Ours | 30.80 | 28.46 |
| SSIM ↑ | DnCNN [8] | 0.7883 | 0.6779 |
| SSIM ↑ | RIDNet [44] | 0.7921 | 0.6788 |
| SSIM ↑ | RNAN [19] | 0.7825 | 0.6743 |
| SSIM ↑ | SwinIR [24] | 0.6598 | 0.5880 |
| SSIM ↑ | Restormer [23] | 0.7627 | 0.6651 |
| SSIM ↑ | Dropout [45] | 0.7753 | 0.6696 |
| SSIM ↑ | Masked Training [12] | 0.8500 | 0.7694 |
| SSIM ↑ | Ours | 0.8535 | 0.7822 |
**Ablation study (TAM: texture-guided adaptive masking; MHAB: masked hierarchical attention block; DFF: dynamic feature fusion).**

| Baseline | TAM | MHAB | DFF | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|---|
| ✓ | | | | 27.34 | 0.7654 |
| ✓ | ✓ | | | 28.12 | 0.8015 |
| ✓ | ✓ | ✓ | | 28.55 | 0.8123 |
| ✓ | ✓ | ✓ | ✓ | 28.99 | 0.8277 |