1. Introduction
1.1. Motivation
Image inpainting is a sophisticated process for reconstructing missing or damaged areas of an image [1]. It also involves seamlessly removing unwanted elements, such as snow [2] or shadows [3,4]. It is essential in fields such as photography, film production, and digital art, where preserving the integrity of visual content is critical. By enabling the restoration of images to their original state or enhancing them for aesthetic appeal [5,6], image inpainting plays a vital role in applications such as object removal and image restoration.
Recent advances in AI have significantly improved inpainting techniques, surpassing traditional methods that rely only on local pixel information [7]. AI models offer greater accuracy and realism, particularly in reconstructing human faces, which demand high precision and structural integrity. However, inpainting faces under fully occluding semantic masks remains a challenging and underexplored problem. Unlike random masks, semantic masks completely obscure key facial components, requiring models to infer entire structures from context while preserving overall facial coherence.
Semantic masks are important because they simulate real-world scenarios where specific facial regions are systematically occluded—for instance, due to medical conditions, protective coverings, or privacy-preserving applications. Addressing this challenge pushes AI models to capture not only local textures but also global facial structures, enabling more reliable, accurate, and context-aware inpainting. Mastering semantic-mask inpainting has practical implications in facial editing, virtual try-on, augmented reality, privacy-preserving facial recognition, and medical reconstruction, highlighting its significance as an open and impactful research direction.
1.2. Literature Review
Recent research in image inpainting has evolved across several architectural paradigms, including CNN-based models, GANs, transformers, and diffusion models. Traditional inpainting methods rely on information from surrounding pixels but often fall short in capturing complex features [7]. Deep learning significantly enhances inpainting algorithms by enabling more accurate, detailed, and realistic reconstruction of missing regions. Models integrating convolutional operations [8] and attention mechanisms [9] effectively capture both fine-grained textures and high-level semantic features, thereby improving the accuracy of predictions in damaged regions. The encoder–decoder Convolutional Neural Network (CNN) model of [10] was trained to fill in the missing parts of an image with structures and textures that fit the existing image content. Generative Adversarial Networks (GANs) have further advanced inpainting by introducing adversarial training, which promotes more realistic and contextually accurate inpainted results [11]. These developments reflect the shift from pixel-based reconstruction toward learning-based approaches that capture high-level image structures.
Enhanced GAN architectures have been developed to improve inpainting quality and flexibility [12,13,14]. LaMa [12] uses Fourier convolutions to obtain an image-wide receptive field, which allows it to be trained with large masks. In [13], a GAN with gated convolution was trained to perform inpainting with free-form masks, which can be either freely drawn by a user or generated automatically; this flexibility facilitates more natural and context-aware image editing. In [14], an enhanced GAN-based model is proposed for high-resolution image inpainting, based on aggregated contextual transformations. Overall, these architectures aim to enhance GAN models’ ability to capture broader contextual information and handle increasingly complex patterns.
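The gating idea behind the free-form inpainting model of [13] can be sketched in a few lines. The following is a minimal single-channel NumPy illustration, not the authors' implementation: one convolution branch produces features and a second produces a per-pixel soft gate, so the network can learn where responses should be suppressed (e.g., inside holes).

```python
import numpy as np

def conv2d(x, w):
    """Valid 2-D cross-correlation of a single-channel map x with kernel w."""
    kh, kw = w.shape
    h, wd = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((h, wd))
    for i in range(h):
        for j in range(wd):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def gated_conv(x, w_feat, w_gate):
    """Gated convolution: features modulated by a learned soft mask.

    output = tanh(conv(x, w_feat)) * sigmoid(conv(x, w_gate)),
    so the gate (in (0, 1)) decides per pixel how much of the feature
    response passes through -- useful for free-form masks, where pixel
    validity is soft rather than binary.
    """
    feat = np.tanh(conv2d(x, w_feat))
    gate = 1.0 / (1.0 + np.exp(-conv2d(x, w_gate)))
    return feat * gate

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
y = gated_conv(x, rng.standard_normal((3, 3)), rng.standard_normal((3, 3)))
print(y.shape)  # (6, 6); every value lies strictly inside (-1, 1)
```

In a real network the two kernels are learned jointly, and the gating is applied at every layer rather than once.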
Transformers have become a powerful choice for image inpainting [15,16,17]. The T-former model [15] introduces a transformer-based approach with an efficient attention mechanism that reduces computational complexity, addressing CNN limitations such as local priors and fixed spatial parameters through resolution-dependent attention. Another transformer-based model [16] handles large image holes by effectively integrating both transformer and convolutional features. The Continuous-Mask-Aware Transformer (CMT) [17] introduces a continuous mask to capture token errors, improving masked self-attention with overlapping tokens and refining inpainting results iteratively. These transformer-based approaches emphasize global context modeling, which is particularly important when reconstructing large missing regions.
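The intuition behind mask-aware attention can be illustrated with a toy single-head example. The sketch below is a deliberate simplification, not CMT's actual formulation: each key token carries a reliability score in [0, 1], and its attention logits are biased by log(reliability), so fully masked tokens receive (numerically) no attention.

```python
import numpy as np

def masked_attention(q, k, v, reliability):
    """Single-head attention where each key token has a reliability score
    in [0, 1] (1 = fully visible, 0 = fully masked).

    Adding log(reliability) to the logits multiplies the (pre-softmax)
    attention weight by the reliability, so a token with reliability 0
    is effectively removed from the softmax. This illustrates the idea
    of continuous-mask-aware attention in simplified form.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + np.log(np.maximum(reliability, 1e-12))
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(1)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 8))
rel = np.array([1.0, 1.0, 0.5, 0.0])   # last token is fully occluded
out, w = masked_attention(q, k, v, rel)
print(w[:, 3].max())  # ~0: no query attends to the fully masked token
```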
More recently, diffusion models—an alternative class of generative models—have been employed in image inpainting tasks [18,19,20,21,22]. These models can be categorized as either preconditioned, which offer fast inference but are expensive to train, or postconditioned, which require no additional training but are computationally slower. LatentPaint [21] bridges these two paradigms by employing forward–backward fusion in a latent space, enhanced with a novel propagation module. Similarly, Latent Diffusion Models (LDMs) [19] exploit the latent space of powerful pretrained autoencoders, enabling high-resolution synthesis while reducing computational overhead. Together, these diffusion-based approaches demonstrate the growing role of probabilistic generative models in producing high-quality inpainting results.
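The postconditioned idea, conditioning a pretrained sampler on the known pixels at every reverse step (as popularized by RePaint), can be sketched with a placeholder denoiser. Everything below, including the toy `denoise` function and the linear noise schedule, is illustrative rather than any paper's implementation:

```python
import numpy as np

def repaint_step(x_t, x_known, mask, denoise, noise_level, rng):
    """One reverse step of postconditioned diffusion inpainting.

    mask == 1 marks missing pixels. The model only generates the hole;
    the known region is re-injected at the matching noise level, so no
    inpainting-specific training is needed (the RePaint idea, heavily
    simplified: `denoise` stands in for a pretrained diffusion model).
    """
    x_gen = denoise(x_t)                                     # model proposal
    x_keep = x_known + noise_level * rng.standard_normal(x_known.shape)
    return mask * x_gen + (1.0 - mask) * x_keep

rng = np.random.default_rng(2)
x_known = np.ones((16, 16))                 # "ground-truth" visible content
mask = np.zeros((16, 16))
mask[4:12, 4:12] = 1.0                      # square hole to be generated
x = rng.standard_normal((16, 16))           # start from pure noise
for noise_level in np.linspace(1.0, 0.0, 10):
    x = repaint_step(x, x_known, mask, lambda z: 0.9 * z, noise_level, rng)
print(np.allclose(x[0, 0], 1.0))  # True: the known region ends up intact
```

The key property shown is that the visible region is always consistent with the input, while the generator is free only inside the hole.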
Collectively, these studies illustrate the evolution of image inpainting from local pixel-based methods toward models capable of capturing complex semantic structures, global context, and high-quality probabilistic synthesis.
1.3. Applications of Semantic Masks
Semantic masks, which completely obscure specific facial components, may present a more challenging task for image inpainting compared to random masks that leave parts of the main components of the face visible. One key advantage of semantic masks is that they enable targeted modification, where a specific facial region can be altered or reconstructed while the remaining facial components remain unchanged. This capability requires more structured and context-aware restoration.
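The two masking regimes can be made concrete with a small sketch. The label IDs and the rectangle-based random-mask generator below are illustrative assumptions; real pipelines derive semantic masks from a face-parsing network and use free-form strokes for random masks:

```python
import numpy as np

def semantic_mask(labels, component_id):
    """Binary mask that fully occludes one facial component, given a
    face-parsing label map (component IDs are dataset-specific)."""
    return (labels == component_id).astype(np.float32)

def random_mask(shape, n_rects=3, rng=None):
    """Irregular mask from a few random rectangles -- a crude stand-in
    for the free-form random masks used in prior work."""
    if rng is None:
        rng = np.random.default_rng()
    m = np.zeros(shape, dtype=np.float32)
    h, w = shape
    for _ in range(n_rects):
        y, x = rng.integers(0, h // 2), rng.integers(0, w // 2)
        m[y:y + rng.integers(h // 8, h // 2),
          x:x + rng.integers(w // 8, w // 2)] = 1.0
    return m

# Toy 8x8 "parsing map": label 3 marks the mouth region
labels = np.zeros((8, 8), dtype=int)
labels[5:7, 2:6] = 3
sem = semantic_mask(labels, component_id=3)
rnd = random_mask((8, 8), rng=np.random.default_rng(0))
print(sem.sum())   # 8.0 -- exactly the mouth pixels, nothing else
```

The semantic mask covers one component completely and nothing else, whereas the random mask may cut across several components while leaving parts of each visible.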
In facial recognition, semantic masks are used to protect privacy by selectively masking specific facial features (e.g., eyes or mouth) while leaving other identity-related regions untouched, thereby preserving recognition utility [23,24]. In medical imaging, semantic masks allow the reconstruction of damaged or surgically altered facial components while maintaining the surrounding facial structure, which supports surgical planning and recovery visualization [25,26]. In creative industries, semantic masks enable precise modification or restoration of individual facial features for digital art and content creation, while preserving the integrity of the remaining facial regions [27]. These diverse applications highlight the broad utility of semantic masks in improving image restoration techniques across multiple domains.

Beyond these domains, semantic masks are increasingly relevant for face anonymization and data sharing, where sensitive facial attributes are obscured while non-sensitive regions are preserved to maintain data utility [28]. Additionally, in augmented reality and virtual try-on applications, semantic masks can enable the realistic modification of specific components while leaving others unchanged, ensuring visual consistency and realism. Although existing work on virtual try-on primarily focuses on clothing [29], similar semantic-masking principles can be applied to facial components (e.g., hair, makeup, or accessories) in future AR/face-editing applications.
Despite these applications, image inpainting under fully occluding semantic masks remains an open research direction, as accurately restoring a single facial component while preserving global facial coherence continues to pose significant challenges for current AI models.
1.4. Summary of Contributions
Performance Analysis of Pre-trained Models on Semantic Masks:
This paper evaluates state-of-the-art image inpainting methods for reconstructing human faces using semantic masks. We assess these methods’ capabilities to restore the main components of the human face, rather than relying on conventional random masks. While prior studies have examined large-hole and mask-aware inpainting, they predominantly employ irregular or random masks and do not systematically evaluate reconstruction when entire semantic facial components are fully occluded. Unlike random masking, which may leave portions of key facial structures visible, semantic masking completely obscures specific components, thereby removing structural cues and posing a fundamentally different reconstruction challenge.
Restoring human faces with semantic masks involves addressing the distinct challenges posed by each facial component. For example, hair requires modeling texture, flow, and color continuity; eyes demand symmetry, sharpness, and precise reconstruction of the iris and eyelashes; and the mouth involves the complex dynamics of teeth, lips, and proper alignment with the jawline.
Retraining Models to Improve Performance:
Additionally, we investigate the impact of different masking strategies on inpainting performance. To improve inpainting accuracy, we conduct three retraining processes using semantic masks, random masks, and a combination of both. The combined approach is proposed to improve the model’s contextual awareness and inpainting performance.
1.5. Paper Layout
Section 2 describes the benchmark setup, including model selection, dataset, computing machine, and evaluation metrics. In Section 3, the selected models are compared at different resolutions (Section 3.1), and their limitations are highlighted (Section 3.2). Section 4 focuses on retraining the best-performing model using various masking strategies: random, semantic, and mixed masks. Section 5 discusses structural limitations of current AI models for semantic mask inpainting, and Section 6 concludes the paper.
3. Results and Comparative Analysis
In this section, we compare the performance of various image inpainting models at different resolutions across multiple mask classes. Our analysis reveals that models perform differently depending on both the type of mask used and the resolution of the input images. The results highlight the strengths and weaknesses of each model in handling specific facial features, such as eyes, mouth, and hair. The following subsections provide a detailed comparison based on three key evaluation metrics: FID, P-IDS, and U-IDS. This is followed by a discussion of the strengths and weaknesses of each model as observed across various test conditions.
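For reference, FID compares Gaussian fits to feature embeddings of the real and generated image sets. The sketch below implements the standard formula on plain vectors; in practice the features come from a pretrained Inception network, which is omitted here:

```python
import numpy as np

def fid(feats_real, feats_fake):
    """Frechet Inception Distance between two feature sets.

    FID = ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^(1/2)).
    Tr((C_r C_f)^(1/2)) is computed from the eigenvalues of C_r @ C_f,
    which are real and non-negative for PSD covariance matrices.
    """
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    c_r = np.cov(feats_real, rowvar=False)
    c_f = np.cov(feats_fake, rowvar=False)
    eig = np.linalg.eigvals(c_r @ c_f)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    return float(np.sum((mu_r - mu_f) ** 2)
                 + np.trace(c_r) + np.trace(c_f) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
a = rng.standard_normal((2000, 4))
b = rng.standard_normal((2000, 4)) + 5.0   # shifted distribution
print(fid(a, a) < 1e-6)   # True: identical sets give (numerically) zero FID
print(fid(a, b) > 50.0)   # True: the large mean shift dominates the score
```

Lower FID means the generated distribution is closer to the real one; P-IDS and U-IDS instead measure how often generated images fool a linear classifier on such features.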
3.1. Scores Comparison
Comparing the results of this study with those in [16], we observe that semantic masks are more challenging to inpaint than random masks, as reflected in the evaluation metrics (higher FID values for semantic masks). Semantic masks fully obscure key facial features, whereas random masks may leave parts of these features visible, facilitating the inpainting process. The models selected for comparison operate at different resolutions, with some capable of handling multiple resolutions; they are categorized into three groups based on their resolution.
3.1.1. Resolution 256
The FID values for different methods at a low resolution of 256 across mask classes (A–L) are shown in Table 4. Based on these FID values, MI-GAN ranks among the top three performing methods in 10 mask classes, Latent-based in 9, and MAT in 8. MI-GAN is the top-performing method in 4 mask classes, while Latent-based leads in 6. As a relatively small model primarily designed for mobile devices, MI-GAN performs well. FID values vary consistently across the different mask classes and methods.

MI-GAN achieves the highest P-IDS value across all mask classes (see Table 4); MI-GAN, LaMa, and MAT generally emerge as the top three performing methods by P-IDS. For U-IDS (see Table 4), MI-GAN achieves the highest value in 10 mask classes, with MI-GAN, Latent-based, and MAT generally the top three performing methods. The P-IDS and U-IDS values occasionally collapse to zero because the corresponding semantic masks fully occlude identity-defining facial components. In such cases, the identity recognition network used to compute these metrics cannot extract reliable identity embeddings from the reconstructed region, resulting in near-zero similarity scores. This behavior reflects the extreme difficulty of reconstructing identity-specific features when all structural information for a component is removed.
3.1.2. Resolution 512
The FID, P-IDS, and U-IDS values for different methods at a resolution of 512 across various mask classes are shown in Table 4. Based on FID, MAT with the FFHQ model ranks among the top three performing methods across 12 mask classes, being the best-performing method in 7 of them. MAT with the CelebA model ranks among the top three across 11 mask classes, while the MADF model does so across 9. However, Co-Mod-GAN generally outperforms MADF in both P-IDS and U-IDS.
Figure 2 illustrates the performance of different models in restoring key facial components at a resolution of 512 and highlights some of their limitations. For example, in eye inpainting, LaMa and MADF generate blurry, mixed eyes. For nose masks, LaMa generates a small nose and a duplicated mouth. With face masks, MAT increases the thickness of the eyelashes; LaMa generates a shortened face with unnatural facial reflections; Co-Mod-GAN produces visible borders around the nose; and MADF generates an additional eye that overrides the existing one. For hair masks, MADF produces hair that lacks a realistic pattern. For mouth and nose masks, MADF generates teeth that override the lips.
3.1.3. Resolution 1024
Only two models, Co-Mod-GAN and LaMa, were trained to handle high-resolution images. The values for FID, P-IDS, and U-IDS are presented in Table 5. Co-Mod-GAN outperforms LaMa in 11 out of the 12 mask classes. The performance of Co-Mod-GAN and LaMa on high-resolution inpainting is illustrated in Figure 3 and Figure 4, demonstrating the capability of Co-Mod-GAN to better adapt to complex inpainting tasks at higher resolutions.
Co-Mod-GAN demonstrates reliable performance for high-quality image inpainting, while LaMa, though capable of handling various resolutions, does not scale well at higher resolutions. This results in Co-Mod-GAN being the more effective model for tasks requiring high-resolution image restoration.
3.2. Findings and Limitations Across Models
Observations and limitations (summarized in Table A1) were identified for each method based on an analysis of all the generated images. Although some methods achieve relatively low FID values, they still have limitations and may not always perform optimally. This indicates that no single method can be considered superior in all scenarios, and there is still room for improvement. Overall, these observations suggest that ongoing research and refinement are needed to develop more versatile and reliable inpainting methods. All inpainting models take less than 0.3 s to inpaint one image, except for RePaint, which requires approximately 5 min. The computation time for each model is summarized in Table 6.
The limitations include various inconsistencies across facial components. For example, eyes may appear unrealistic with issues like thick eyelashes, pupil misalignment, or full black eyes in some models. The mouth may exhibit problems such as merged lips, unrealistic teeth, or misaligned lower lips. Skin inconsistencies include visible mask borders, skin tones that do not blend well, and sometimes unnatural reflections or dark skin patches. Hair restoration often suffers from unrealistic textures, color continuity problems, or incomplete coverage, especially at higher resolutions. Ears may be missing, incomplete, or replaced by hair in some cases. Additionally, noses, while generally well restored in most models, sometimes exhibit incomplete or unrealistic results. These imperfections can significantly impact the overall realism of the generated images. Overall, while these models generally perform well in specific tasks, such limitations indicate there is still room for improvement in generating more accurate and seamless facial inpainting.
These observations indicate that current inpainting models often struggle to consistently restore semantic facial components, suggesting that relying on the current masking strategy during training may limit model robustness. Motivated by these findings, Section 4 explores alternative training strategies, while Section 5 highlights key architectural limitations reported in the literature that may further contribute to these challenges.
4. Evaluation of Models Retrained with Semantic, Random, and Mixed Masks
The resolution of the images used for this evaluation is 512. Since MAT has demonstrated the greatest potential for inpainting images at this resolution, as shown in Table 4, it was chosen for the experiments described in this section.
In this evaluation, MAT was retrained using semantic masks. For a fair comparison, another version of MAT was retrained with random masks under the same configuration. Notably, the MAT model retrained with random masks achieved FID, P-IDS, and U-IDS values close to those of the pretrained model provided by its authors. Both models were then evaluated on face inpainting with semantic masks.
MAT was retrained using 24,000 images and tested on a dataset of 6000 images. The model was retrained using the Adam optimizer. The batch size was set to 8, and the model was trained for 100 epochs. The batch size was limited to 8 due to hardware constraints, as larger batches exceeded the available 40 GB of GPU memory. The model parameters were initialized from scratch rather than from pretrained weights, as inpainting with semantic masks involves fully occluded facial regions and is more complex than inpainting with random masks. During training, standard data augmentation techniques were applied, including random horizontal flipping and rotation; no additional color jittering or geometric transformations were employed. Each training session took approximately 7 days and 13 h. The FID values achieved on semantic masks by MAT trained with semantic masks are compared with those achieved by MAT trained with random masks. The comparison shows that, for inpainting faces with semantic masks, MAT trained with random masks outperforms MAT trained with semantic masks.
MAT trained with random masks alone was observed to outperform MAT trained with semantic masks alone for inpainting tasks involving semantic masks. This result suggests that random masks may promote better generalization and reduce overfitting. Training with random masks exposes the model to a diverse range of missing regions, which may encourage it to learn broader contextual relationships across the entire image. This may lead to a more robust model capable of handling varied inpainting scenarios. In contrast, semantic masks provide predefined regions, which may lead to overfitting of specific features and reduce the model’s ability to generalize. As a result, MAT trained with random masks alone tends to produce more realistic inpainted images, achieving lower FID scores compared to MAT trained with semantic masks alone under the evaluated experimental setting.
To leverage the strengths of both random and semantic masks and improve inpainting performance, we retrain the MAT model using a combination of both mask types. This combined approach, referred to as the “mixed” masking strategy, exposes the model to a diverse set of regions for inpainting. By randomly selecting between random and semantic masks for each image, the model is forced to learn a broader range of contextual relationships. This allows it to adapt to varying types of missing information, which is particularly beneficial for inpainting complex features such as facial components.
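A minimal sketch of this per-image sampling is given below. The 50/50 sampling probability and the mask generators are assumptions for illustration; the actual training pipeline may differ:

```python
import numpy as np

def sample_mask(labels, component_ids, p_semantic=0.5, rng=None):
    """Mixed masking: per training image, draw either a semantic mask
    (one facial component fully occluded) or a random free-form mask.

    p_semantic is an assumed 50/50 split; the rectangle generator is a
    crude stand-in for real free-form random masks.
    """
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < p_semantic:
        cid = rng.choice(component_ids)          # pick one component class
        return (labels == cid).astype(np.float32), "semantic"
    mask = np.zeros(labels.shape, dtype=np.float32)
    h, w = labels.shape
    y, x = rng.integers(0, h // 2), rng.integers(0, w // 2)
    mask[y:y + h // 3, x:x + w // 3] = 1.0       # crude random rectangle
    return mask, "random"

# Toy 32x32 "parsing map": label 2 marks the eye region
labels = np.zeros((32, 32), dtype=int)
labels[10:16, 8:24] = 2
rng = np.random.default_rng(0)
kinds = [sample_mask(labels, [2], rng=rng)[1] for _ in range(200)]
print(sorted(set(kinds)))   # ['random', 'semantic']: both types appear
```

Because the mask type is resampled for every image, each training batch typically contains a mixture of structured (semantic) and unstructured (random) holes.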
The inclusion of semantic masks provides the model with more structured guidance for inpainting, while random masks encourage flexibility and better generalization by presenting the model with more unpredictable scenarios. This hybrid strategy enhances the model’s context-aware capabilities, leading to more realistic and accurate reconstructions of facial features, especially in regions where fine details are crucial. The model trained with mixed masks achieves a lower FID value across almost all mask indices, as shown in Table 7 (statistical validation in Appendix A.3). This suggests that combining the two masking strategies allows the model to outperform its counterparts trained with only random or only semantic masks, particularly for inpainting images with semantic masks. The mixed-mask approach therefore strikes a balance between flexibility and structure, enabling the model to handle a wider variety of inpainting tasks with greater fidelity and realism.
Based on Table 7, there is a notable difference between the FID values for MAT trained with random, semantic, and mixed masks for the mask classes with indices C, G, and I, corresponding to full mouth, hair, and face masks. To further illustrate these differences, we randomly select samples and visualize them in Figure 5.
For the face mask, as shown in Figure 5, MAT trained with random masks achieves a blended color, but the face appears short and inconsistent. MAT trained with semantic masks produces a face size and shape closer to the ground truth; however, it does not blend well with the surrounding color, and the face exhibits a noticeably different color tone compared to the nose. MAT trained with mixed masks achieves better results than MAT trained with only random masks or only semantic masks.

For the hair mask, MAT trained with random masks may struggle with hair flow, while MAT trained with semantic masks may face challenges with color continuity. MAT trained with mixed masks achieves a better balance between hair flow and color consistency, as shown in Figure 5.

For the mouth mask, MAT trained with mixed masks may outperform both MAT trained with semantic masks and MAT trained with random masks, as shown in Figure 5.
5. Structural Limitations of Current AI Models for Semantic Mask Inpainting
Even when trained or fine-tuned on datasets containing fully occluding semantic masks, current AI-based inpainting models such as GANs, CNNs, transformers, and diffusion models face inherent structural limitations. These limitations have been observed in prior studies [7,46,47,48]. Although most prior work evaluates irregular or partial masks, the same architectural constraints are likely to pose greater challenges in semantic-mask scenarios, where entire facial components are completely missing:
1. Local Receptive Fields and Convolutional Bias: GAN-based and CNN architectures rely on local neighborhoods to infer missing pixels. Fully occluding semantic masks remove all local cues for entire facial components, so these models struggle to reconstruct shapes and structures without any visible reference. Studies report structural distortions and texture inconsistencies in large masked regions [7,46].
2. Weak Global and Semantic Reasoning: Transformers and diffusion models capture some global context, but their training often emphasizes texture completion rather than high-level structural or semantic relationships. As a result, models can generate plausible textures but often misplace or deform components such as eyes, noses, or mouths [47,48].
3. Limited Structural Priors: Most models learn statistical correlations between observed pixels rather than explicit knowledge of facial anatomy [46,47]. Even with semantic-mask-specific training, the models lack encoded structural understanding, which is essential for accurate reconstruction of fully missing facial elements.
4. Ambiguity in Fully Occluded Regions: Semantic masks create regions where multiple plausible reconstructions exist. Current models are not designed to resolve such ambiguities and tend to produce averaged or unrealistic results rather than semantically correct features.
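The first limitation can be quantified with the standard receptive-field recurrence for a stack of convolutions (a generic calculation, not tied to any specific model in this paper):

```python
def receptive_field(layers):
    """Effective receptive field of a stack of conv layers.

    Each layer is (kernel_size, stride). Standard recurrence:
        rf += (k - 1) * jump;  jump *= stride
    where jump is the input-pixel distance between adjacent outputs.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Ten plain 3x3 stride-1 convolutions see only a 21-pixel window...
print(receptive_field([(3, 1)] * 10))   # 21
# ...while five strided 3x3 convolutions already cover 63 pixels.
print(receptive_field([(3, 2)] * 5))    # 63
```

A fully occluded facial component spanning well over a hundred pixels at a 512 resolution can therefore exceed the view of plain convolutional stacks entirely, which is why architectures widen it with striding, dilation, attention, or Fourier convolutions (as in LaMa).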
These structural challenges explain why training alone cannot fully overcome the difficulty of semantic masks. Achieving high-fidelity reconstruction likely requires architectures that explicitly encode facial geometry, component interdependencies, and semantic constraints, in addition to learning from data.
6. Conclusions
In conclusion, while AI-based inpainting methods show promising capabilities in restoring key facial components with realistic and contextually appropriate results, significant challenges remain, particularly in achieving seamless blending, handling complex facial structures, and preserving fine details such as eyes, teeth, hair, and eyelashes. These limitations indicate that fully accurate facial reconstruction under fully occluding semantic masks is still difficult. Addressing these challenges will likely require the development of new AI architectures or innovative model configurations specifically designed for fully occluded inpainting tasks. Additionally, while semantic masks pose a considerable obstacle, combining them with random masks in a hybrid approach enhances the model’s contextual understanding and improves inpainting performance. This strategy leads to more accurate and natural restoration of facial features across diverse settings, though complete reconstruction remains an open problem.
Future work may focus on developing new architectures that are component-aware, explicitly modeling individual facial regions (e.g., eyes, mouth, hair) while maintaining global facial coherence. Another promising direction is the investigation of adaptive masking strategies that dynamically combine random and semantic masks during training to further improve generalization across different occlusion types. Additionally, designing component-specific loss functions that prioritize accurate reconstruction of critical facial features may enhance the realism and fidelity of restored regions. Finally, extending the proposed mixed-masking strategy to other structured image domains beyond facial images can help evaluate its applicability and effectiveness in broader inpainting tasks.