1. Introduction
1.1. Motivation
Image inpainting is a sophisticated process for reconstructing missing or damaged areas of an image [1]. It also involves seamlessly removing unwanted elements, such as snow [2] or shadows [3,4]. It is essential in fields such as photography, film production, and digital art, where preserving the integrity of visual content is critical. By enabling the restoration of images to their original state or enhancing them for aesthetic appeal [5,6], image inpainting plays a vital role in applications such as object removal and image restoration.
Recent advances in AI have significantly improved inpainting techniques, surpassing traditional methods that rely only on local pixel information [7]. AI models offer greater accuracy and realism, particularly in reconstructing human faces, which demand high precision and structural integrity. However, inpainting faces under fully occluding semantic masks remains a challenging and underexplored problem. Unlike random masks, semantic masks completely obscure key facial components, requiring models to infer entire structures from context while preserving overall facial coherence.
Semantic masks are important because they simulate real-world scenarios where specific facial regions are systematically occluded—for instance, due to medical conditions, protective coverings, or privacy-preserving applications. Addressing this challenge pushes AI models to capture not only local textures but also global facial structures, enabling more reliable, accurate, and context-aware inpainting. Mastering semantic-mask inpainting has practical implications in facial editing, virtual try-on, augmented reality, privacy-preserving facial recognition, and medical reconstruction, highlighting its significance as an open and impactful research direction.
1.2. Literature Review
Recent research in image inpainting has evolved across several architectural paradigms, including CNN-based models, GANs, transformers, and diffusion models. Traditional inpainting methods rely on information from surrounding pixels but often fall short in capturing complex features [7]. Deep learning significantly enhances inpainting algorithms by enabling more accurate, detailed, and realistic reconstruction of missing regions. Models integrating convolutional operations [8] and attention mechanisms [9] effectively capture both fine-grained textures and high-level semantic features, thereby improving the accuracy of predictions in damaged regions. The encoder–decoder Convolutional Neural Network (CNN) model of [10] was trained to fill in the missing parts of an image with structures and textures that fit the existing image content. Generative Adversarial Networks (GANs) have further advanced inpainting by introducing adversarial training, which promotes more realistic and contextually accurate inpainted results [11]. These developments reflect the shift from pixel-based reconstruction toward learning-based approaches that capture high-level image structures.
Enhanced GAN architectures have been developed to improve inpainting quality and flexibility [12,13,14]. LaMa [12] uses Fourier convolutions to obtain an image-wide receptive field, which allows it to be trained with large masks. In [13], a GAN with gated convolution was trained to perform inpainting with free-form masks, which can be either freely drawn by a user or generated automatically; this flexibility facilitates more natural and context-aware image editing. In [14], an enhanced GAN-based model is proposed for high-resolution image inpainting, based on aggregated contextual transformations. Overall, these architectures aim to enhance GAN models’ ability to capture broader contextual information and handle increasingly complex patterns.
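The gating idea behind the free-form inpainting model of [13] can be sketched in a few lines. The following is a minimal single-channel NumPy illustration, not the authors' implementation: one convolution branch produces features and a second produces a per-pixel soft gate, so the network can learn where responses should be suppressed (e.g., inside holes).

```python
import numpy as np

def conv2d(x, w):
    """Valid 2-D cross-correlation of a single-channel map x with kernel w."""
    kh, kw = w.shape
    h, wd = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((h, wd))
    for i in range(h):
        for j in range(wd):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def gated_conv(x, w_feat, w_gate):
    """Gated convolution: features modulated by a learned soft mask.

    output = tanh(conv(x, w_feat)) * sigmoid(conv(x, w_gate)),
    so the gate (in (0, 1)) decides per pixel how much of the feature
    response passes through -- useful for free-form masks, where pixel
    validity is soft rather than binary.
    """
    feat = np.tanh(conv2d(x, w_feat))
    gate = 1.0 / (1.0 + np.exp(-conv2d(x, w_gate)))
    return feat * gate

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
y = gated_conv(x, rng.standard_normal((3, 3)), rng.standard_normal((3, 3)))
print(y.shape)  # (6, 6); every value lies strictly inside (-1, 1)
```

In a real network the two kernels are learned jointly, and the gating is applied at every layer rather than once.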
Transformers have become a powerful choice for image inpainting [15,16,17]. The T-former model [15] introduces a transformer-based approach with an efficient attention mechanism that reduces computational complexity, addressing CNN limitations such as local priors and fixed spatial parameters through resolution-dependent attention. Another transformer-based model [16] handles large image holes by effectively integrating both transformer and convolutional features. The Continuous-Mask-Aware Transformer (CMT) [17] introduces a continuous mask to capture token errors, improving masked self-attention with overlapping tokens and refining inpainting results iteratively. These transformer-based approaches emphasize global context modeling, which is particularly important when reconstructing large missing regions.
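The intuition behind mask-aware attention can be illustrated with a toy single-head example. The sketch below is a deliberate simplification, not CMT's actual formulation: each key token carries a reliability score in [0, 1], and its attention logits are biased by log(reliability), so fully masked tokens receive (numerically) no attention.

```python
import numpy as np

def masked_attention(q, k, v, reliability):
    """Single-head attention where each key token has a reliability score
    in [0, 1] (1 = fully visible, 0 = fully masked).

    Adding log(reliability) to the logits multiplies the (pre-softmax)
    attention weight by the reliability, so a token with reliability 0
    is effectively removed from the softmax. This illustrates the idea
    of continuous-mask-aware attention in simplified form.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + np.log(np.maximum(reliability, 1e-12))
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(1)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 8))
rel = np.array([1.0, 1.0, 0.5, 0.0])   # last token is fully occluded
out, w = masked_attention(q, k, v, rel)
print(w[:, 3].max())  # ~0: no query attends to the fully masked token
```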
More recently, diffusion models—an alternative class of generative models—have been employed in image inpainting tasks [18,19,20,21,22]. These models can be categorized as either preconditioned, which offer fast inference but are expensive to train, or postconditioned, which require no additional training but are computationally slower. LatentPaint [21] bridges these two paradigms by employing forward–backward fusion in a latent space, enhanced with a novel propagation module. Similarly, Latent Diffusion Models (LDMs) [19] exploit the latent space of powerful pretrained autoencoders, enabling high-resolution synthesis while reducing computational overhead. Together, these diffusion-based approaches demonstrate the growing role of probabilistic generative models in producing high-quality inpainting results.
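The postconditioned idea, conditioning a pretrained sampler on the known pixels at every reverse step (as popularized by RePaint), can be sketched with a placeholder denoiser. Everything below, including the toy `denoise` function and the linear noise schedule, is illustrative rather than any paper's implementation:

```python
import numpy as np

def repaint_step(x_t, x_known, mask, denoise, noise_level, rng):
    """One reverse step of postconditioned diffusion inpainting.

    mask == 1 marks missing pixels. The model only generates the hole;
    the known region is re-injected at the matching noise level, so no
    inpainting-specific training is needed (the RePaint idea, heavily
    simplified: `denoise` stands in for a pretrained diffusion model).
    """
    x_gen = denoise(x_t)                                     # model proposal
    x_keep = x_known + noise_level * rng.standard_normal(x_known.shape)
    return mask * x_gen + (1.0 - mask) * x_keep

rng = np.random.default_rng(2)
x_known = np.ones((16, 16))                 # "ground-truth" visible content
mask = np.zeros((16, 16))
mask[4:12, 4:12] = 1.0                      # square hole to be generated
x = rng.standard_normal((16, 16))           # start from pure noise
for noise_level in np.linspace(1.0, 0.0, 10):
    x = repaint_step(x, x_known, mask, lambda z: 0.9 * z, noise_level, rng)
print(np.allclose(x[0, 0], 1.0))  # True: the known region ends up intact
```

The key property shown is that the visible region is always consistent with the input, while the generator is free only inside the hole.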
Collectively, these studies illustrate the evolution of image inpainting from local pixel-based methods toward models capable of capturing complex semantic structures, global context, and high-quality probabilistic synthesis.
1.3. Applications of Semantic Masks
Semantic masks, which completely obscure specific facial components, may present a more challenging task for image inpainting compared to random masks that leave parts of the main components of the face visible. One key advantage of semantic masks is that they enable targeted modification, where a specific facial region can be altered or reconstructed while the remaining facial components remain unchanged. This capability requires more structured and context-aware restoration.
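The two masking regimes can be made concrete with a small sketch. The label IDs and the rectangle-based random-mask generator below are illustrative assumptions; real pipelines derive semantic masks from a face-parsing network and use free-form strokes for random masks:

```python
import numpy as np

def semantic_mask(labels, component_id):
    """Binary mask that fully occludes one facial component, given a
    face-parsing label map (component IDs are dataset-specific)."""
    return (labels == component_id).astype(np.float32)

def random_mask(shape, n_rects=3, rng=None):
    """Irregular mask from a few random rectangles -- a crude stand-in
    for the free-form random masks used in prior work."""
    if rng is None:
        rng = np.random.default_rng()
    m = np.zeros(shape, dtype=np.float32)
    h, w = shape
    for _ in range(n_rects):
        y, x = rng.integers(0, h // 2), rng.integers(0, w // 2)
        m[y:y + rng.integers(h // 8, h // 2),
          x:x + rng.integers(w // 8, w // 2)] = 1.0
    return m

# Toy 8x8 "parsing map": label 3 marks the mouth region
labels = np.zeros((8, 8), dtype=int)
labels[5:7, 2:6] = 3
sem = semantic_mask(labels, component_id=3)
rnd = random_mask((8, 8), rng=np.random.default_rng(0))
print(sem.sum())   # 8.0 -- exactly the mouth pixels, nothing else
```

The semantic mask covers one component completely and nothing else, whereas the random mask may cut across several components while leaving parts of each visible.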
In facial recognition, semantic masks are used to protect privacy by selectively masking specific facial features (e.g., eyes or mouth) while leaving other identity-related regions untouched, thereby preserving recognition utility [23,24]. In medical imaging, semantic masks allow the reconstruction of damaged or surgically altered facial components while maintaining the surrounding facial structure, which supports surgical planning and recovery visualization [25,26]. In creative industries, semantic masks enable precise modification or restoration of individual facial features for digital art and content creation, while preserving the integrity of the remaining facial regions [27]. These diverse applications highlight the broad utility of semantic masks in improving image restoration techniques across multiple domains.

Beyond these domains, semantic masks are increasingly relevant for face anonymization and data sharing, where sensitive facial attributes are obscured while non-sensitive regions are preserved to maintain data utility [28]. Additionally, in augmented reality and virtual try-on applications, semantic masks can enable the realistic modification of specific components while leaving others unchanged, ensuring visual consistency and realism. Although existing work on virtual try-on primarily focuses on clothing [29], similar semantic-masking principles can be applied to facial components (e.g., hair, makeup, or accessories) in future AR/face-editing applications.
Despite these applications, image inpainting under fully occluding semantic masks remains an open research direction, as accurately restoring a single facial component while preserving global facial coherence continues to pose significant challenges for current AI models.
1.4. Summary of Contributions
Performance Analysis of Pre-trained Models on Semantic Masks:
This paper evaluates state-of-the-art image inpainting methods for reconstructing human faces using semantic masks. We assess these methods’ capabilities to restore the main components of the human face, rather than relying on conventional random masks. While prior studies have examined large-hole and mask-aware inpainting, they predominantly employ irregular or random masks and do not systematically evaluate reconstruction when entire semantic facial components are fully occluded. Unlike random masking, which may leave portions of key facial structures visible, semantic masking completely obscures specific components, thereby removing structural cues and posing a fundamentally different reconstruction challenge.
Restoring human faces with semantic masks involves addressing the distinct challenges posed by each facial component. For example, hair requires modeling texture, flow, and color continuity; eyes demand symmetry, sharpness, and precise reconstruction of the iris and eyelashes; and the mouth involves the complex dynamics of teeth, lips, and proper alignment with the jawline.
Retraining Models to Improve Performance:
Additionally, we investigate the impact of different masking strategies on inpainting performance. To improve inpainting accuracy, we conduct three retraining processes using semantic masks, random masks, and a combination of both. The combined approach is proposed to improve the model’s contextual awareness and inpainting performance.
1.5. Paper Layout
Section 2 describes the benchmark setup, including model selection, dataset, computing machine, and evaluation metrics. In Section 3, the selected models are compared at different resolutions (Section 3.1), and their limitations are highlighted (Section 3.2). Section 4 focuses on retraining the best-performing model using various masking strategies: random, semantic, and mixed masks. Section 5 discusses structural limitations of current AI models for semantic mask inpainting, and Section 6 concludes the paper.
3. Results and Comparative Analysis
In this section, we compare the performance of various image inpainting models at different resolutions across multiple mask classes. Our analysis reveals that models perform differently depending on both the type of mask used and the resolution of the input images. The results highlight the strengths and weaknesses of each model in handling specific facial features, such as eyes, mouth, and hair. The following subsections provide a detailed comparison based on three key evaluation metrics: FID, P-IDS, and U-IDS. This is followed by a discussion of the strengths and weaknesses of each model as observed across various test conditions.
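For reference, FID compares Gaussian fits to feature embeddings of the real and generated image sets. The sketch below implements the standard formula on plain vectors; in practice the features come from a pretrained Inception network, which is omitted here:

```python
import numpy as np

def fid(feats_real, feats_fake):
    """Frechet Inception Distance between two feature sets.

    FID = ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^(1/2)).
    Tr((C_r C_f)^(1/2)) is computed from the eigenvalues of C_r @ C_f,
    which are real and non-negative for PSD covariance matrices.
    """
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    c_r = np.cov(feats_real, rowvar=False)
    c_f = np.cov(feats_fake, rowvar=False)
    eig = np.linalg.eigvals(c_r @ c_f)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    return float(np.sum((mu_r - mu_f) ** 2)
                 + np.trace(c_r) + np.trace(c_f) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
a = rng.standard_normal((2000, 4))
b = rng.standard_normal((2000, 4)) + 5.0   # shifted distribution
print(fid(a, a) < 1e-6)   # True: identical sets give (numerically) zero FID
print(fid(a, b) > 50.0)   # True: the large mean shift dominates the score
```

Lower FID means the generated distribution is closer to the real one; P-IDS and U-IDS instead measure how often generated images fool a linear classifier on such features.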
3.1. Scores Comparison
Comparing the results of this study with those in [16], we observe that semantic masks are more challenging to inpaint than random masks, as reflected in the evaluation metrics (higher FID values for semantic masks). Semantic masks fully obscure key facial features, whereas random masks may leave parts of these features visible, facilitating the inpainting process. The models selected for comparison operate at different resolutions, with some capable of handling multiple resolutions; they are categorized into three groups based on their resolution.
3.1.1. Resolution 256
The FID values for different methods at a low resolution of 256 across mask classes (A–L) are shown in Table 4. Based on these FID values, MI-GAN ranks among the top three performing methods in 10 mask classes, Latent-based in 9, and MAT in 8. MI-GAN is the top-performing method in 4 mask classes, while Latent-based leads in 6. As a relatively small model primarily designed for mobile devices, MI-GAN performs well. FID values vary consistently across the different mask classes and methods.

MI-GAN achieves the highest P-IDS value across all mask classes (see Table 4); MI-GAN, LaMa, and MAT generally emerge as the top three performing methods by P-IDS. For U-IDS (see Table 4), MI-GAN achieves the highest value in 10 mask classes, with MI-GAN, Latent-based, and MAT generally the top three performing methods. The P-IDS and U-IDS values occasionally collapse to zero because the corresponding semantic masks fully occlude identity-defining facial components. In such cases, the identity recognition network used to compute these metrics cannot extract reliable identity embeddings from the reconstructed region, resulting in near-zero similarity scores. This behavior reflects the extreme difficulty of reconstructing identity-specific features when all structural information for a component is removed.
3.1.2. Resolution 512
The FID, P-IDS, and U-IDS values for different methods at a resolution of 512 across various mask classes are shown in Table 4. Based on FID, MAT with the FFHQ model ranks among the top three performing methods across 12 mask classes, being the best-performing method in 7 of them. MAT with the CelebA model ranks among the top three across 11 mask classes, while the MADF model does so across 9. However, Co-Mod-GAN generally outperforms MADF in both P-IDS and U-IDS.
Figure 2 illustrates the performance of different models in restoring key facial components at a resolution of 512 and highlights some of their limitations. For example, in eye inpainting, LaMa and MADF generate blurry, mixed eyes. For nose masks, LaMa generates a small nose and a duplicated mouth. With face masks, MAT increases the thickness of the eyelashes; LaMa generates a shortened face with unnatural facial reflections; Co-Mod-GAN produces visible borders around the nose; and MADF generates an additional eye that overrides the existing one. For hair masks, MADF produces hair that lacks a realistic pattern. For mouth and nose masks, MADF generates teeth that override the lips.
3.1.3. Resolution 1024
Only two models, Co-Mod-GAN and LaMa, were trained to handle high-resolution images. The values for FID, P-IDS, and U-IDS are presented in Table 5. Co-Mod-GAN outperforms LaMa in 11 out of the 12 mask classes. The performance of Co-Mod-GAN and LaMa on high-resolution inpainting is illustrated in Figure 3 and Figure 4, demonstrating the capability of Co-Mod-GAN to better adapt to complex inpainting tasks at higher resolutions.
Co-Mod-GAN demonstrates reliable performance for high-quality image inpainting, while LaMa, though capable of handling various resolutions, does not scale well at higher resolutions. This results in Co-Mod-GAN being the more effective model for tasks requiring high-resolution image restoration.
3.2. Findings and Limitations Across Models
Observations and limitations (summarized in Table A1) were identified for each method based on an analysis of all the generated images. Although some methods achieve relatively low FID values, they still have limitations and may not always perform optimally. This indicates that no single method can be considered superior in all scenarios, and there is still room for improvement. Overall, these observations suggest that ongoing research and refinement are needed to develop more versatile and reliable inpainting methods. All inpainting models take less than 0.3 s to inpaint one image, except for RePaint, which requires approximately 5 min. The computation time for each model is summarized in Table 6.
The limitations include various inconsistencies across facial components. For example, eyes may appear unrealistic with issues like thick eyelashes, pupil misalignment, or full black eyes in some models. The mouth may exhibit problems such as merged lips, unrealistic teeth, or misaligned lower lips. Skin inconsistencies include visible mask borders, skin tones that do not blend well, and sometimes unnatural reflections or dark skin patches. Hair restoration often suffers from unrealistic textures, color continuity problems, or incomplete coverage, especially at higher resolutions. Ears may be missing, incomplete, or replaced by hair in some cases. Additionally, noses, while generally well restored in most models, sometimes exhibit incomplete or unrealistic results. These imperfections can significantly impact the overall realism of the generated images. Overall, while these models generally perform well in specific tasks, such limitations indicate there is still room for improvement in generating more accurate and seamless facial inpainting.
These observations indicate that current inpainting models often struggle to consistently restore semantic facial components, suggesting that relying on the current masking strategy during training may limit model robustness. Motivated by these findings, Section 4 explores alternative training strategies, while Section 5 highlights key architectural limitations reported in the literature that may further contribute to these challenges.
4. Evaluation of Models Retrained with Semantic, Random, and Mixed Masks
The resolution of the images used for this evaluation is 512. Since MAT has demonstrated the greatest potential for inpainting images at this resolution, as shown in Table 4, it was chosen for the experiments described in this section.
In this evaluation, MAT was retrained using semantic masks. For a fair comparison, another version of MAT was retrained with random masks under the same configuration. Notably, the MAT model retrained with random masks achieved FID, P-IDS, and U-IDS values close to those of the pretrained model provided by its authors. Both models were then evaluated on face inpainting with semantic masks.
MAT was retrained using 24,000 images and tested on a dataset of 6000 images. The model was retrained using the Adam optimizer. The batch size was set to 8, and the model was trained for 100 epochs. The batch size was limited to 8 due to hardware constraints, as larger batches exceeded the available 40 GB of GPU memory. The model parameters were initialized from scratch rather than from pretrained weights, as inpainting with semantic masks involves fully occluded facial regions and is more complex than inpainting with random masks. During training, standard data augmentation techniques were applied, including random horizontal flipping and rotation; no additional color jittering or geometric transformations were employed. Each training session took approximately 7 days and 13 h. The FID values achieved on semantic masks by MAT trained with semantic masks are compared with those achieved by MAT trained with random masks. The comparison shows that, for inpainting faces with semantic masks, MAT trained with random masks outperforms MAT trained with semantic masks.
MAT trained with random masks alone was observed to outperform MAT trained with semantic masks alone for inpainting tasks involving semantic masks. This result suggests that random masks may promote better generalization and reduce overfitting. Training with random masks exposes the model to a diverse range of missing regions, which may encourage it to learn broader contextual relationships across the entire image. This may lead to a more robust model capable of handling varied inpainting scenarios. In contrast, semantic masks provide predefined regions, which may lead to overfitting of specific features and reduce the model’s ability to generalize. As a result, MAT trained with random masks alone tends to produce more realistic inpainted images, achieving lower FID scores compared to MAT trained with semantic masks alone under the evaluated experimental setting.
To leverage the strengths of both random and semantic masks and improve inpainting performance, we retrain the MAT model using a combination of both mask types. This combined approach, referred to as the “mixed” masking strategy, exposes the model to a diverse set of regions for inpainting. By randomly selecting between random and semantic masks for each image, the model is forced to learn a broader range of contextual relationships. This allows it to adapt to varying types of missing information, which is particularly beneficial for inpainting complex features such as facial components.
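A minimal sketch of this per-image sampling is given below. The 50/50 sampling probability and the mask generators are assumptions for illustration; the actual training pipeline may differ:

```python
import numpy as np

def sample_mask(labels, component_ids, p_semantic=0.5, rng=None):
    """Mixed masking: per training image, draw either a semantic mask
    (one facial component fully occluded) or a random free-form mask.

    p_semantic is an assumed 50/50 split; the rectangle generator is a
    crude stand-in for real free-form random masks.
    """
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < p_semantic:
        cid = rng.choice(component_ids)          # pick one component class
        return (labels == cid).astype(np.float32), "semantic"
    mask = np.zeros(labels.shape, dtype=np.float32)
    h, w = labels.shape
    y, x = rng.integers(0, h // 2), rng.integers(0, w // 2)
    mask[y:y + h // 3, x:x + w // 3] = 1.0       # crude random rectangle
    return mask, "random"

# Toy 32x32 "parsing map": label 2 marks the eye region
labels = np.zeros((32, 32), dtype=int)
labels[10:16, 8:24] = 2
rng = np.random.default_rng(0)
kinds = [sample_mask(labels, [2], rng=rng)[1] for _ in range(200)]
print(sorted(set(kinds)))   # ['random', 'semantic']: both types appear
```

Because the mask type is resampled for every image, each training batch typically contains a mixture of structured (semantic) and unstructured (random) holes.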
The inclusion of semantic masks provides the model with more structured guidance for inpainting, while random masks encourage flexibility and better generalization by presenting the model with more unpredictable scenarios. This hybrid strategy enhances the model’s context-aware capabilities, leading to more realistic and accurate reconstructions of facial features, especially in regions where fine details are crucial. The model trained with mixed masks achieves a lower FID value across almost all mask indices, as shown in Table 7 (statistical validation in Appendix A.3). This suggests that combining the two masking strategies allows the model to outperform its counterparts trained with only random or only semantic masks, particularly for inpainting images with semantic masks. The mixed-mask approach therefore strikes a balance between flexibility and structure, enabling the model to handle a wider variety of inpainting tasks with greater fidelity and realism.
Based on Table 7, there is a notable difference between the FID values for MAT trained with random, semantic, and mixed masks for the mask classes with indices C, G, and I, corresponding to full mouth, hair, and face masks. To further illustrate these differences, we randomly select samples and visualize them in Figure 5.
For the face mask, as shown in Figure 5, MAT trained with random masks achieves a blended color, but the face appears short and inconsistent. MAT trained with semantic masks produces a face size and shape closer to the ground truth; however, it does not blend well with the surrounding color, and the face exhibits a noticeably different color tone compared to the nose. MAT trained with mixed masks achieves better results than MAT trained with only random masks or only semantic masks.

For the hair mask, MAT trained with random masks may struggle with hair flow, while MAT trained with semantic masks may face challenges with color continuity. MAT trained with mixed masks achieves a better balance between hair flow and color consistency, as shown in Figure 5.

For the mouth mask, MAT trained with mixed masks may outperform both MAT trained with semantic masks and MAT trained with random masks, as shown in Figure 5.
5. Structural Limitations of Current AI Models for Semantic Mask Inpainting
Even when trained or fine-tuned on datasets containing fully occluding semantic masks, current AI-based inpainting models such as GANs, CNNs, transformers, and diffusion models face inherent structural limitations. These limitations have been observed in prior studies [7,46,47,48]. Although most prior work evaluates irregular or partial masks, the same architectural constraints are likely to pose greater challenges in semantic-mask scenarios, where entire facial components are completely missing:
1. Local Receptive Fields and Convolutional Bias: GAN-based and CNN architectures rely on local neighborhoods to infer missing pixels. Fully occluding semantic masks remove all local cues for entire facial components, so these models struggle to reconstruct shapes and structures without any visible reference. Studies report structural distortions and texture inconsistencies in large masked regions [7,46].
2. Weak Global and Semantic Reasoning: Transformers and diffusion models capture some global context, but their training often emphasizes texture completion rather than high-level structural or semantic relationships. As a result, models can generate plausible textures but often misplace or deform components such as eyes, noses, or mouths [47,48].
3. Limited Structural Priors: Most models learn statistical correlations between observed pixels rather than explicit knowledge of facial anatomy [46,47]. Even with semantic-mask-specific training, the models lack encoded structural understanding, which is essential for accurate reconstruction of fully missing facial elements.
4. Ambiguity in Fully Occluded Regions: Semantic masks create regions where multiple plausible reconstructions exist. Current models are not designed to resolve such ambiguities and tend to produce averaged or unrealistic results rather than semantically correct features.
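The first limitation can be quantified with the standard receptive-field recurrence for a stack of convolutions (a generic calculation, not tied to any specific model in this paper):

```python
def receptive_field(layers):
    """Effective receptive field of a stack of conv layers.

    Each layer is (kernel_size, stride). Standard recurrence:
        rf += (k - 1) * jump;  jump *= stride
    where jump is the input-pixel distance between adjacent outputs.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Ten plain 3x3 stride-1 convolutions see only a 21-pixel window...
print(receptive_field([(3, 1)] * 10))   # 21
# ...while five strided 3x3 convolutions already cover 63 pixels.
print(receptive_field([(3, 2)] * 5))    # 63
```

A fully occluded facial component spanning well over a hundred pixels at a 512 resolution can therefore exceed the view of plain convolutional stacks entirely, which is why architectures widen it with striding, dilation, attention, or Fourier convolutions (as in LaMa).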
These structural challenges explain why training alone cannot fully overcome the difficulty of semantic masks. Achieving high-fidelity reconstruction likely requires architectures that explicitly encode facial geometry, component interdependencies, and semantic constraints, in addition to learning from data.
6. Conclusions
In conclusion, while AI-based inpainting methods show promising capabilities in restoring key facial components with realistic and contextually appropriate results, significant challenges remain, particularly in achieving seamless blending, handling complex facial structures, and preserving fine details such as eyes, teeth, hair, and eyelashes. These limitations indicate that fully accurate facial reconstruction under fully occluding semantic masks is still difficult. Addressing these challenges will likely require the development of new AI architectures or innovative model configurations specifically designed for fully occluded inpainting tasks. Additionally, while semantic masks pose a considerable obstacle, combining them with random masks in a hybrid approach enhances the model’s contextual understanding and improves inpainting performance. This strategy leads to more accurate and natural restoration of facial features across diverse settings, though complete reconstruction remains an open problem.
Future work may focus on developing new architectures that are component-aware, explicitly modeling individual facial regions (e.g., eyes, mouth, hair) while maintaining global facial coherence. Another promising direction is the investigation of adaptive masking strategies that dynamically combine random and semantic masks during training to further improve generalization across different occlusion types. Additionally, designing component-specific loss functions that prioritize accurate reconstruction of critical facial features may enhance the realism and fidelity of restored regions. Finally, extending the proposed mixed-masking strategy to other structured image domains beyond facial images can help evaluate its applicability and effectiveness in broader inpainting tasks.