M2UNet: A Segmentation-Guided GAN with Attention-Enhanced U2-Net for Face Unmasking
Abstract
1. Introduction
- 1.
- We introduce M2UNet, a segmentation-guided GAN framework tailored to face unmasking, which explicitly leverages binary mask priors to better localize and restore occluded facial regions.
- 2.
- We propose a novel adaptation of the nested U2-Net architecture, traditionally used for saliency detection, to the task of generative face inpainting. By integrating intra-stage multi-scale feature propagation, the CARU2-Net generator overcomes the information bottlenecks of standard U-Nets, enabling the precise recovery of fine-grained facial geometry and texture.
- 3.
- We provide a controlled evaluation setting by constructing a synthetic masked-face dataset using the MaskTheFace tool on CelebA, enabling a systematic assessment of face unmasking performance.
- 4.
- We conduct extensive experiments and demonstrate that M2UNet achieves superior performance compared to state-of-the-art methods, evaluated across five widely used metrics: the Peak Signal-to-Noise Ratio (PSNR), the Structural Similarity Index (SSIM), the Fréchet Inception Distance (FID), the L1 reconstruction error, and Learned Perceptual Image Patch Similarity (LPIPS).
2. Related Works
2.1. GAN-Based Methods
2.2. Transformer- and Diffusion-Based Methods
2.3. Multimodal-Based Methods
3. Method
3.1. Segmentation Stage
3.2. Inpainting Stage
- A.
- Generator (CARU2-Net). The generator reconstructs complete face images from masked inputs while preserving both global structure and local detail. Inspired by U2-Net, CARU2-Net nests Conv-Attention Residual U-blocks (CARU) inside a larger U-shaped encoder–decoder architecture, enabling multi-scale feature extraction and fine-grained detail reconstruction. The masked face image is concatenated with its binary mask map and passed to the network. The generator is organized into two encoder stages, a bottleneck, and two decoder stages, as illustrated in Figure 4. Each stage employs a CARU variant (CARU-7, CARU-6, or CARU-4) designed to capture multi-scale contextual information while preserving fine-grained features through residual and channel-attention mechanisms. A final convolutional layer followed by an output activation produces the three-channel unmasked output.
- Encoder. The encoder consists of two hierarchical levels. The first level employs a CARU-7 module, composed of an input CAR layer followed by six CAR layers in the encoder path, interleaved with max-pooling for progressive downsampling. A dilated CAR layer with a dilation factor of 2 is introduced to enlarge the receptive field and improve contextual modeling. In the decoder path of the CARU block, the six CAR layers are mirrored by six upsampling operations. Skip connections are employed within the block, and the output of the input CAR layer is added to the final output, facilitating gradient flow and preserving local information. The second level employs a CARU-6 module, structurally similar to CARU-7 but with one fewer CAR layer (five instead of six). Both CARU-7 and CARU-6 modules are followed by max-pooling operations to progressively encode spatial hierarchies while maintaining discriminative feature representations.
- Bottleneck. At the network core lies the CARU-4 module. It begins with an input CAR layer, followed by three CAR layers in the encoder path with increasing dilation factors (1, 2, and 4), and symmetric layers in the decoder path with the dilation order reversed. An additional CAR layer with a dilation factor of 8 is placed at the center of the module to further enlarge the receptive field without additional downsampling, enabling the robust modeling of long-range dependencies. Skip connections link the encoder and decoder sub-blocks within CARU-4 to facilitate effective information flow.
- Decoder. The decoder mirrors the encoder structure, using CARU-6 at the second level and CARU-7 at the first level. In contrast to the encoder, max-pooling operations are replaced by upsampling layers to recover spatial resolution. Skip connections concatenate encoder features with their corresponding decoder features, preserving fine spatial cues lost during downsampling. This design enables a sharper and more faithful reconstruction of facial details, even under large or irregular occlusions.
- Loss Function. The generator is trained with a compound objective that balances realism, fidelity, and perceptual quality. As defined in Equation (1), it is a weighted sum of an adversarial term, a reconstruction term, and a perceptual term. The reconstruction weight is set significantly higher than the other two to enforce strict structural fidelity, ensuring that global facial geometry is preserved under large-area occlusions. This configuration aligns with established practice in the face inpainting literature (e.g., GUMF [1], MuFIN [27]), where strong pixel-level supervision serves as an optimization anchor to prevent geometric distortion. The adversarial and perceptual terms act as complementary regularizers, refining high-frequency textures and enforcing semantic consistency to counteract the smoothing effect of the reconstruction loss.
- Adversarial Loss. The adversarial component drives the generator to synthesize outputs that the discriminator classifies as real. We employ the least-squares GAN (LSGAN) formulation [45] with one-sided label smoothing, as expressed in Equation (2), where D is the discriminator, evaluated on the generator output, and the real label is softened. The softened label stabilizes adversarial training and reduces the risk of gradient saturation.
- Reconstruction Loss. To enforce pixel-level fidelity, we combine Smooth L1 loss with structural similarity (SSIM) into a hybrid objective, shown in Equation (3), where y is the ground-truth unmasked image. This hybrid formulation promotes both local accuracy and structural consistency.
- Perceptual Loss. The perceptual term measures semantic similarity in a pretrained VGG feature space, as given in Equation (4), which compares the activation maps of the l-th VGG layer for the generated and ground-truth images. This encourages the generator to produce perceptually realistic textures and semantically faithful reconstructions.
Together, these architectural components and loss functions ensure that CARU2-Net achieves a balance between structural correctness, visual realism, and semantic coherence in face unmasking.
- B.
- Discriminator. The discriminator is implemented as a five-layer convolutional network with spectral normalization [46], as illustrated in Figure 4. Spectral normalization stabilizes adversarial training by constraining the Lipschitz constant of the discriminator, which helps to prevent exploding gradients and mode collapse. The first three convolutional layers use a stride of 2 and the last two a stride of 1, each combined with a LeakyReLU activation. The channel count grows from layer to layer, gradually increasing feature depth while reducing spatial resolution. To further improve stability, we adopt a least-squares adversarial formulation (LSGAN) with one-sided label smoothing [45]. Real samples are assigned a softened real label and fake samples a separate fake label, which reduces overconfidence in the discriminator and mitigates oscillations during training. The discriminator loss is defined in Equation (5) in terms of the discriminator output for a real image, x, the discriminator output for a generated image, and the softened real and fake labels. This design ensures that the discriminator effectively distinguishes real from generated samples while providing stable adversarial gradients that guide the generator toward photorealistic face reconstructions.
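To make the loss description above concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It shows a least-squares generator loss, a least-squares discriminator loss with one-sided label smoothing, a Smooth L1 term, and a weighted sum of generator terms. The label values (0.9 and 0.0), the 0.5 factor in the discriminator loss, the `beta` threshold, and the example weights are all assumptions chosen for illustration. The SSIM and VGG-based terms are omitted, so `perc` is passed in as a plain number.

```python
from statistics import fmean

# Assumed label values for illustration only; they are not taken from the paper.
REAL_LABEL = 0.9  # softened "real" target (one-sided label smoothing)
FAKE_LABEL = 0.0  # "fake" target


def lsgan_generator_loss(d_fake, real_label=REAL_LABEL):
    """Mean squared distance between D's scores on generated images and the real label."""
    return fmean([(s - real_label) ** 2 for s in d_fake])


def lsgan_discriminator_loss(d_real, d_fake, real_label=REAL_LABEL, fake_label=FAKE_LABEL):
    """Pull scores on real images toward real_label and scores on fakes toward fake_label."""
    real_term = fmean([(s - real_label) ** 2 for s in d_real])
    fake_term = fmean([(s - fake_label) ** 2 for s in d_fake])
    return 0.5 * (real_term + fake_term)  # the 0.5 averaging factor is an assumption


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss: quadratic for errors below beta, linear above it."""
    def elem(p, t):
        d = abs(p - t)
        return 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return fmean([elem(p, t) for p, t in zip(pred, target)])


def generator_objective(adv, rec, perc, w_adv, w_rec, w_perc):
    """Weighted sum of the adversarial, reconstruction, and perceptual terms."""
    return w_adv * adv + w_rec * rec + w_perc * perc


if __name__ == "__main__":
    d_fake = [0.4, 0.6]                       # hypothetical discriminator scores
    adv = lsgan_generator_loss(d_fake)
    rec = smooth_l1([0.2, 0.8], [0.25, 0.7])  # hypothetical pixel values
    # Hypothetical weights: reconstruction dominates, as the text describes.
    total = generator_objective(adv, rec, perc=0.3, w_adv=1.0, w_rec=100.0, w_perc=1.0)
    print(f"adv={adv:.4f} rec={rec:.4f} total={total:.4f}")
```

With a large reconstruction weight, the pixel-level term dominates the objective whenever the reconstruction error is not small, which matches the role of an optimization anchor described for Equation (1).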
4. Experiments and Results
4.1. Dataset and Training Details
4.2. Results
- 1.
- Quantitative Comparisons. We evaluate M2UNet using five widely adopted quality metrics: PSNR, SSIM [49], FID [50], L1 loss, and LPIPS [51]. PSNR and SSIM measure fidelity and structural similarity, where higher values indicate better reconstruction quality. Conversely, lower values for FID, L1 loss, and LPIPS correspond to more realistic image distributions, reduced pixel-wise errors, and improved perceptual similarity. Additionally, to address the requirements of practical deployment, we report three computational efficiency metrics: model parameters (Params), floating-point operations (FLOPs), and inference time. As shown in Table 1, M2UNet achieves state-of-the-art restoration quality, surpassing all competing approaches in PSNR (31.34 dB) and SSIM (0.9576). Notably, M2UNet outperforms recent diffusion-based models (RePaint, CoPaint) and multimodal models (MuFIN) in fidelity metrics at a fraction of their computational cost. The generator contains only 3.17 million parameters, approximately half the size of the second-smallest model, GLCIC (6.07 M), and over 100× smaller than RePaint (552 M). Its computational cost of 42.53 GFLOPs is the lowest among all compared models. Its inference time of 0.12 s is longer than that of the other non-diffusion baselines in Table 1 (0.004 to 0.025 s) but far shorter than that of the diffusion-based models (about 27 s). This balance supports the effectiveness of the proposed nested U2 architecture and CAR blocks, demonstrating that high-fidelity restoration can be achieved without the massive computational burden typically associated with modern diffusion or multimodal architectures.
- 2.
- Qualitative Comparisons. While quantitative metrics provide an objective evaluation, perceptual realism and visual fidelity are best assessed qualitatively. Figure 6 shows ten randomly selected masked face samples and the corresponding outputs of M2UNet compared with SOTA methods. M2UNet effectively restores occluded facial components such as the mouth, nose, and cheeks, producing sharper textures, consistent coloring, and natural expressions. In contrast, competing approaches often yield blurred, distorted, or artifact-prone outputs. Moreover, M2UNet demonstrates robustness across diverse mask types, sizes, and colors, as well as variations in age, gender, and skin tone. These observations confirm that the combined effect of segmentation guidance and the CARU2-Net generator leads to more perceptually convincing and identity-preserving reconstructions.
- 3.
- Limitations and Failure Cases. While M2UNet demonstrates robust performance across quantitative and qualitative benchmarks, it exhibits certain limitations that define its operational boundaries. As illustrated in Figure 7, the model encounters challenges in scenarios involving extreme facial poses (e.g., full profile views), excessively large masks that occlude the majority of facial landmarks, or complex cases featuring multiple overlapping occlusions, such as a mask combined with a microphone or a hand. In these instances, the generator may fail to preserve global structural coherence. Furthermore, a notable performance gap exists when applying the framework to real-world masked faces (Figure 7, bottom row). Because the model is trained on simulated data, it occasionally fails to generalize to the intricate textures, varying lighting conditions, and diverse geometries characteristic of physical masks. These failures are primarily rooted in the inherent bias of the synthetic training set, which lacks “in-the-wild” variance and predominantly features frontal orientations. Consequently, bridging this synthetic-to-real gap represents a primary direction for future research. Enhancing the model’s robustness through the incorporation of diverse real-world datasets and the exploration of domain-invariant training strategies will be essential for ensuring reliable, high-fidelity unmasking under unconstrained conditions.
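For reference, two of the pixel-level metrics used in the quantitative comparison, PSNR and the L1 error, can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the evaluation code behind the reported numbers. It assumes images flattened to lists of floats in the range [0, 1], hence `max_val=1.0`, and the function names are hypothetical.

```python
import math


def l1_error(pred, target):
    """Mean absolute pixel difference (lower is better)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB (higher is better)."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


if __name__ == "__main__":
    restored = [0.5, 0.5, 0.5, 0.5]      # hypothetical restored pixels
    reference = [0.6, 0.4, 0.6, 0.4]     # hypothetical ground-truth pixels
    print(f"L1 = {l1_error(restored, reference):.4f}")
    print(f"PSNR = {psnr(restored, reference):.2f} dB")
```

Because PSNR is logarithmic in the mean squared error, a gain of 1 dB corresponds to roughly a 21% reduction in mean squared error, which helps when reading differences between rows of Table 1.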
4.3. Ablation Study
5. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| M2UNet | Masked-to-Unmasked Network |
| CNN | Convolutional Neural Network |
| GAN | Generative Adversarial Network |
| M-Seg | Mask Segmentation |
| PSNR | Peak Signal-to-Noise Ratio |
| SSIM | Structural Similarity Index |
| FID | Fréchet Inception Distance |
| LPIPS | Learned Perceptual Image Patch Similarity |
| CAR | Conv-Attention Residual Block |
| CARU | Conv-Attention Residual U-blocks |
| CBAM | Convolutional Block Attention Module |
| SOTA | State-of-the-Art |
References
- [1] Din, N.U.; Javed, K.; Bae, S.; Yi, J. A novel GAN-based network for unmasking of masked face. IEEE Access 2020, 8, 44276–44287.
- [2] Mahmoud, M.; Kasem, M.S.; Kang, H.S. A comprehensive survey of masked faces: Recognition, detection, and unmasking. arXiv 2024, arXiv:2405.05900.
- [3] Suvorov, R.; Logacheva, E.; Mashikhin, A.; Remizova, A.; Ashukha, A.; Silvestrov, A.; Kong, N.; Goka, H.; Park, K.; Lempitsky, V. Resolution-robust large mask inpainting with Fourier convolutions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 2149–2159.
- [4] Senussi, M.F.; Abdalla, M.; Kasem, M.S.; Mahmoud, M.; Yagoub, B.; Kang, H.S. A Comprehensive Review on Light Field Occlusion Removal: Trends, Challenges, and Future Directions. IEEE Access 2025, 13, 42472–42493.
- [5] Zheng, C.; Cham, T.J.; Cai, J. Pluralistic free-form image completion. Int. J. Comput. Vis. 2021, 129, 2786–2805.
- [6] Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T.S. Free-form image inpainting with gated convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4471–4480.
- [7] Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695.
- [8] Lin, Q.; Yan, B.; Li, J.; Tan, W. MMFL: Multimodal fusion learning for text-guided image inpainting. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1094–1102.
- [9] Nichol, A.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; Chen, M. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv 2021, arXiv:2112.10741.
- [10] Jassim, F.A. Image inpainting by Kriging interpolation technique. arXiv 2013, arXiv:1306.0139.
- [11] Alsalamah, M.; Amin, S. Medical image inpainting with RBF interpolation technique. Int. J. Adv. Comput. Sci. Appl. 2016, 7, 91–99.
- [12] Criminisi, A.; Pérez, P.; Toyama, K. Region filling and object removal by exemplar-based image inpainting. IEEE Trans. Image Process. 2004, 13, 1200–1212.
- [13] Barnes, C.; Shechtman, E.; Finkelstein, A.; Goldman, D.B. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graph. 2009, 28, 24.
- [14] Liu, J.; Musialski, P.; Wonka, P.; Ye, J. Tensor completion for estimating missing values in visual data. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 208–220.
- [15] Simakov, D.; Caspi, Y.; Shechtman, E.; Irani, M. Summarizing visual data using bidirectional similarity. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 1–8.
- [16] Darabi, S.; Shechtman, E.; Barnes, C.; Goldman, D.B.; Sen, P. Image melding: Combining inconsistent images using patch-based synthesis. ACM Trans. Graph. (TOG) 2012, 31, 1–10.
- [17] Bertalmio, M.; Sapiro, G.; Caselles, V.; Ballester, C. Image inpainting. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, New Orleans, LA, USA, 23–28 July 2000; pp. 417–424.
- [18] Biradar, R.L.; Kohir, V.V. A novel image inpainting technique based on median diffusion. Sadhana 2013, 38, 621–644.
- [19] Prasath, V.S.; Thanh, D.N.; Hai, N.H.; Cuong, N.X. Image restoration with total variation and iterative regularization parameter estimation. In Proceedings of the 8th International Symposium on Information and Communication Technology, Nha Trang, Vietnam, 7–8 December 2017; pp. 378–384.
- [20] Iizuka, S.; Simo-Serra, E.; Ishikawa, H. Globally and locally consistent image completion. ACM Trans. Graph. (TOG) 2017, 36, 1–14.
- [21] Zuo, Z.; Zhao, L.; Li, A.; Wang, Z.; Zhang, Z.; Chen, J.; Xing, W.; Lu, D. Generative image inpainting with segmentation confusion adversarial training and contrastive learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; The Association for the Advancement of Artificial Intelligence: Washington, DC, USA, 2023; Volume 37, pp. 3888–3896.
- [22] Liu, J.; Gong, M.; Tang, Z.; Qin, A.K.; Li, H.; Jiang, F. Deep image inpainting with enhanced normalization and contextual attention. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6599–6614.
- [23] Yu, Y.; Zhan, F.; Wu, R.; Pan, J.; Cui, K.; Lu, S.; Ma, F.; Xie, X.; Miao, C. Diverse image inpainting with bidirectional and autoregressive transformers. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 69–78.
- [24] Li, W.; Lin, Z.; Zhou, K.; Qi, L.; Wang, Y.; Jia, J. MAT: Mask-aware transformer for large hole image inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10758–10768.
- [25] Lugmayr, A.; Danelljan, M.; Romero, A.; Yu, F.; Timofte, R.; Van Gool, L. RePaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11461–11471.
- [26] Liu, H.; Wang, Y.; Qian, B.; Wang, M.; Rui, Y. Structure matters: Tackling the semantic discrepancy in diffusion models for image inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 8038–8047.
- [27] Zhan, D.; Wu, J.; Luo, X.; Jin, Z. Learning from text: A multimodal face inpainting network for irregular holes. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 7484–7497.
- [28] Nazeri, K.; Ng, E.; Joseph, T.; Qureshi, F.; Ebrahimi, M. EdgeConnect: Structure guided image inpainting using edge prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27 October–2 November 2019.
- [29] Yang, Y.; Guo, X. Generative landmark guided face inpainting. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Nanjing, China, 16–18 October 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 14–26.
- [30] Li, Y.; Liu, S.; Yang, J.; Yang, M.H. Generative face completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3911–3919.
- [31] Mahmoud, M.; Kang, H.S. GANMasker: A two-stage generative adversarial network for high-quality face mask removal. Sensors 2023, 23, 7094.
- [32] Anwar, A.; Raychowdhury, A. Masked face recognition for secure authentication. arXiv 2020, arXiv:2008.11104.
- [33] Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Boston, MA, USA, 7–12 June 2015.
- [34] Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2536–2544.
- [35] Yeh, R.A.; Chen, C.; Yian Lim, T.; Schwing, A.G.; Hasegawa-Johnson, M.; Do, M.N. Semantic image inpainting with deep generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5485–5493.
- [36] Liu, G.; Reda, F.A.; Shih, K.J.; Wang, T.C.; Tao, A.; Catanzaro, B. Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 85–100.
- [37] Xu, S.; Liu, D.; Xiong, Z. E2I: Generative inpainting from edge to image. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 1308–1322.
- [38] Wan, Z.; Zhang, J.; Chen, D.; Liao, J. High-fidelity pluralistic image completion with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 4692–4701.
- [39] Liu, H.; Wang, Y.; Wang, M.; Rui, Y. Delving globally into texture and structure for image inpainting. In Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal, 10–14 October 2022; pp. 1270–1278.
- [40] Zhang, G.; Ji, J.; Zhang, Y.; Yu, M.; Jaakkola, T.; Chang, S. Towards coherent image inpainting using denoising diffusion implicit models. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 41164–41193.
- [41] Zhang, L.; Chen, Q.; Hu, B.; Jiang, S. Text-guided neural image inpainting. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1302–1310.
- [42] Xiao, J.; Zhan, D.; Qi, H.; Jin, Z. When face completion meets irregular holes: An attributes guided deep inpainting network. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 3202–3210.
- [43] Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
- [44] Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
- [45] Mao, X.; Li, Q.; Xie, H.; Lau, R.Y.; Wang, Z.; Paul Smolley, S. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2794–2802.
- [46] Miyato, T.; Kataoka, T.; Koyama, M.; Yoshida, Y. Spectral normalization for generative adversarial networks. arXiv 2018, arXiv:1802.05957.
- [47] Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 8024–8035.
- [48] Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
- [49] Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
- [50] Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Adv. Neural Inf. Process. Syst. 2017, 30, 6626–6637.
- [51] Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595.

Table 1. Restoration quality (PSNR, SSIM, FID, L1, LPIPS) and computational efficiency (Params, FLOPs, Time) of M2UNet and the compared methods.

| Method | PSNR↑ | SSIM↑ | FID↓ | L1↓ | LPIPS↓ | Params (M)↓ | FLOPs (G)↓ | Time (s)↓ |
|---|---|---|---|---|---|---|---|---|
| GLCIC [20] | 26.7798 | 0.9392 | 16.5023 | 0.0289 | 0.0479 | 6.07 | 45.44 | 0.004 |
| GatedConv [6] | 23.3402 | 0.9069 | 90.7754 | 0.0435 | 0.0824 | 35.58 | 193.01 | 0.025 |
| GUMF [1] | 18.4316 | 0.6825 | 58.6500 | 0.2272 | 0.1919 | 75.49 | 59.29 | 0.004 |
| RePaint [25] | 27.6341 | 0.9314 | 9.9996 | 0.0274 | 0.0331 | 552.81 | 278,440 | 27.340 |
| CoPaint [40] | 27.3555 | 0.9078 | 14.7592 | 0.0338 | 0.0393 | 552.81 | 278,440 | 27.400 |
| SCAT [21] | 29.3478 | 0.9455 | 9.9740 | 0.0177 | 0.0300 | 15.20 | 72.88 | 0.011 |
| GANMasker [31] | 30.7759 | 0.9516 | 10.3323 | 0.0159 | 0.0328 | 35.58 | 193.01 | 0.025 |
| MuFIN [27] | 29.8727 | 0.9480 | 9.2469 | 0.0172 | 0.0253 | 29.92 | 3,234 | 0.022 |
| Ours (M2UNet) | 31.3375 | 0.9576 | 7.7004 | 0.0143 | 0.0219 | 3.17 | 42.53 | 0.120 |
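The efficiency claims made about Table 1 can be checked with simple arithmetic. The sketch below copies the Params, FLOPs, and Time columns from the table into a dictionary and derives the ratios quoted in the text. The dictionary layout and helper names are illustrative choices, not part of the original work.

```python
# (Params in millions, FLOPs in G, inference time in s), copied from Table 1.
TABLE1 = {
    "GLCIC": (6.07, 45.44, 0.004),
    "GatedConv": (35.58, 193.01, 0.025),
    "GUMF": (75.49, 59.29, 0.004),
    "RePaint": (552.81, 278440.0, 27.340),
    "CoPaint": (552.81, 278440.0, 27.400),
    "SCAT": (15.20, 72.88, 0.011),
    "GANMasker": (35.58, 193.01, 0.025),
    "MuFIN": (29.92, 3234.0, 0.022),
    "Ours (M2UNet)": (3.17, 42.53, 0.120),
}
PARAMS, FLOPS, TIME = 0, 1, 2  # column indices into each tuple


def ratio(method, baseline, column):
    """How many times larger `method` is than `baseline` in the given column."""
    return TABLE1[method][column] / TABLE1[baseline][column]


def smallest(column):
    """Name of the method with the lowest value in the given column."""
    return min(TABLE1, key=lambda name: TABLE1[name][column])


if __name__ == "__main__":
    ours = "Ours (M2UNet)"
    print(f"GLCIC / ours params:   {ratio('GLCIC', ours, PARAMS):.2f}x")
    print(f"RePaint / ours params: {ratio('RePaint', ours, PARAMS):.1f}x")
    print(f"Fewest FLOPs:          {smallest(FLOPS)}")
    print(f"Ours / GLCIC time:     {ratio(ours, 'GLCIC', TIME):.0f}x")
```

The ratios agree with the text: GLCIC has about 1.9 times as many parameters as M2UNet, RePaint about 174 times as many, and M2UNet has the lowest FLOPs. The same data also shows that its 0.12 s inference time is not the shortest in the table.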
| Method Variant | PSNR ↑ | SSIM ↑ | FID ↓ | L1 ↓ | LPIPS ↓ |
|---|---|---|---|---|---|
| Base (5-Stage, No Seg., No Att.) | 30.0289 | 0.9524 | 13.6166 | 0.0177 | 0.0360 |
| Base + Att. (5-Stage, No Seg.) | 30.8619 | 0.9556 | 10.1852 | 0.0155 | 0.0291 |
| Base + Seg. (5-Stage, No Att.) | 31.1188 | 0.9571 | 9.3011 | 0.0149 | 0.0258 |
| Base + Seg. (3-Stage, No Att.) | 30.9331 | 0.9566 | 9.8715 | 0.0153 | 0.0275 |
| Base + Seg. + Att. (3-Stage) | 31.1333 | 0.9577 | 9.1963 | 0.0148 | 0.0261 |
| Full M2UNet (5-Stage, Seg. + Att.) | 31.3375 | 0.9576 | 7.7004 | 0.0143 | 0.0219 |
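To read the ablation rows as contributions of individual components, the differences relative to the base model can be computed directly. The following sketch copies the PSNR and FID columns from the ablation table above. The short variant keys are abbreviations introduced here for illustration.

```python
# (PSNR, FID) per variant, copied from the ablation table above.
ABLATION = {
    "base": (30.0289, 13.6166),        # 5-stage, no seg., no att.
    "base+att": (30.8619, 10.1852),    # 5-stage, no seg.
    "base+seg": (31.1188, 9.3011),     # 5-stage, no att.
    "3stage+seg": (30.9331, 9.8715),   # 3-stage, no att.
    "3stage+seg+att": (31.1333, 9.1963),
    "full": (31.3375, 7.7004),         # 5-stage, seg. + att.
}


def psnr_gain(variant, baseline="base"):
    """PSNR improvement in dB of `variant` over `baseline` (positive is better)."""
    return ABLATION[variant][0] - ABLATION[baseline][0]


def fid_drop(variant, baseline="base"):
    """FID reduction of `variant` relative to `baseline` (positive is better)."""
    return ABLATION[baseline][1] - ABLATION[variant][1]


if __name__ == "__main__":
    for name in ABLATION:
        if name != "base":
            print(f"{name:16s} PSNR +{psnr_gain(name):.4f} dB, FID -{fid_drop(name):.4f}")
```

In this reading, segmentation guidance alone accounts for a larger PSNR gain over the base model than attention alone, and the full model improves on both.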
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Mahmoud, M.; Senussi, M.F.; Abdalla, M.; Kasem, M.S.; Kang, H.-S. M2UNet: A Segmentation-Guided GAN with Attention-Enhanced U2-Net for Face Unmasking. Mathematics 2026, 14, 477. https://doi.org/10.3390/math14030477