Deep Learning Framework for Facial Reconstruction Outcome Prediction: Integrating Image Inpainting and Depth Estimation for Computer-Assisted Surgical Planning
Abstract
1. Introduction
2. Materials and Methods
2.1. Proposed System Architecture
2.1.1. Image Acquisition and Preprocessing
2.1.2. Facial Inpainting Module
2.1.3. Depth Estimation and 3D Reconstruction Module
2.2. Model Selection
2.2.1. Facial Inpainting Models
- LaMa [10] marked a significant advance in image inpainting through its ability to handle large missing regions, a critical capability for facial reconstruction tasks. It achieves this with a Fast Fourier Convolution-based architecture [20], which processes global image information efficiently while preserving local structural coherence; a sketch of the spectral operation at the core of this design follows the list.
- LGNet [11] maintains both local and global coherence through a multi-stage refinement approach, with progressive stages that first enforce local feature consistency and then overall structural harmony.
- MAT [12] introduces a transformer-based solution designed for complex mask shapes and feature consistency. Its attention mechanism lets the model reference distant facial features when reconstructing a missing region.
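The Fourier-domain mixing that gives LaMa its global receptive field can be illustrated compactly. Below is a minimal PyTorch sketch of the spectral branch of a Fast Fourier Convolution [20], not LaMa's exact implementation: the channel layout, normalization, and the single 1×1 frequency-domain convolution are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpectralTransform(nn.Module):
    """Sketch of the global (spectral) branch of a Fast Fourier Convolution.

    A 1x1 convolution applied in the frequency domain mixes information from
    every spatial location at once, which is how FFC-based models such as
    LaMa propagate context across large masked regions.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Real and imaginary parts are stacked along the channel axis,
        # so the frequency-domain convolution sees 2x the channels.
        self.freq_conv = nn.Sequential(
            nn.Conv2d(channels * 2, channels * 2, kernel_size=1),
            nn.BatchNorm2d(channels * 2),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, _, h, w = x.shape
        ffted = torch.fft.rfft2(x, norm="ortho")   # complex, shape (B, C, H, W//2 + 1)
        ffted = torch.cat([ffted.real, ffted.imag], dim=1)
        ffted = self.freq_conv(ffted)              # global mixing in frequency space
        real, imag = ffted.chunk(2, dim=1)
        # Inverse FFT back to the spatial domain at the original resolution.
        return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")

x = torch.randn(1, 64, 128, 128)
print(SpectralTransform(64)(x).shape)  # torch.Size([1, 64, 128, 128])
```

Because every output pixel depends on every input pixel after a single such block, the network can borrow structure from the intact side of a face when filling a large hole, without stacking many local convolutions.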
2.2.2. Monocular Depth Estimation Models
- ZoeDepth [13] advances monocular depth estimation through a metric-aware approach with robust performance across varied scenarios. Its architecture combines transformer-based feature extraction with a specialized metric bins module that captures the scale of the scene.
- Depth Anything V2 [14] is a recent general-purpose depth estimator that uses a teacher-student framework to achieve robust depth prediction. Its architecture leverages the semantic understanding of the DINOv2 backbone [23], enabling it to capture both coarse structure and fine detail; a minimal inference sketch follows this list.
- Depth Pro [15] is a foundation model that produces sharp, metric depth maps at high resolution with high-frequency detail. It uses an efficient multi-scale Vision Transformer, achieving state-of-the-art boundary accuracy, which is crucial for detailed 3D facial modeling.
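Inference with any of these models reduces to a single forward pass on an RGB image. The following is a minimal sketch using the Hugging Face `transformers` depth-estimation pipeline; the checkpoint identifier (the small public Depth Anything V2 variant) and the file names are illustrative assumptions, not necessarily the weights evaluated in this study.

```python
# Minimal monocular depth inference sketch via the Hugging Face
# `transformers` depth-estimation pipeline. Checkpoint id and file
# names are assumptions for illustration.
from transformers import pipeline
from PIL import Image

depth_estimator = pipeline(
    "depth-estimation",
    model="depth-anything/Depth-Anything-V2-Small-hf",
)

image = Image.open("face.png")  # hypothetical input photograph
result = depth_estimator(image)

# `predicted_depth` is the raw network output (relative depth for this
# checkpoint; metric models such as ZoeDepth or Depth Pro return scaled
# values); `depth` is a rescaled PIL image for quick visualization.
print(result["predicted_depth"].shape)
result["depth"].save("face_depth.png")
```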
2.3. Datasets
2.4. Evaluation Metrics
2.4.1. Inpainting Quality Metrics
2.4.2. Depth Estimation Metrics
3. Results
3.1. Inpainting Performance Analysis
3.2. Monocular Depth Estimation Performance Analysis
4. Discussion and Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Shaye, D.A.; Tollefson, T.T.; Strong, E.B. Use of intraoperative computed tomography for maxillofacial reconstructive surgery. JAMA Facial Plast. Surg. 2015, 17, 113–119.
2. Heiland, M.; Schulze, D.; Blake, F.; Schmelzle, R. Intraoperative imaging of zygomaticomaxillary complex fractures using a 3D C-arm system. Int. J. Oral Maxillofac. Surg. 2005, 34, 369–375.
3. Tarassoli, S.P.; Shield, M.E.; Allen, R.S.; Jessop, Z.M.; Dobbs, T.D.; Whitaker, I.S. Facial reconstruction: A systematic review of current image acquisition and processing techniques. Front. Surg. 2020, 7, 537616.
4. Afaq, S.; Jain, S.K.; Sharma, N.; Sharma, S. Acquisition of precision and reliability of modalities for facial reconstruction and aesthetic surgery: A systematic review. J. Pharm. Bioallied Sci. 2023, 15 (Suppl. S2), S849–S855.
5. Monini, S.; Ripoli, S.; Filippi, C.; Fatuzzo, I.; Salerno, G.; Covelli, E.; Bini, F.; Marinozzi, F.; Marchelletta, S.; Manni, G.; et al. An objective, markerless videosystem for staging facial palsy. Eur. Arch. Otorhinolaryngol. 2021, 278, 3541–3550.
6. Fuller, S.C.; Strong, E.B. Computer applications in facial plastic and reconstructive surgery. Curr. Opin. Otolaryngol. Head Neck Surg. 2007, 15, 233–237.
7. Scolozzi, P. Applications of 3D orbital computer-assisted surgery (CAS). J. Stomatol. Oral Maxillofac. Surg. 2017, 118, 217–223.
8. Davis, K.S.; Vosler, P.S.; Yu, J.; Wang, E.W. Intraoperative image guidance improves outcomes in complex orbital reconstruction by novice surgeons. J. Oral Maxillofac. Surg. 2016, 74, 1410–1415.
9. Luz, M.; Strauss, G.; Manzey, D. Impact of image-guided surgery on surgeons' performance: A literature review. Int. J. Hum. Factors Ergon. 2016, 4, 229–263.
10. Suvorov, R.; Logacheva, E.; Mashikhin, A.; Remizova, A.; Ashukha, A.; Silvestrov, A.; Kong, N.; Goka, H.; Park, K.; Lempitsky, V. Resolution-robust large mask inpainting with Fourier convolutions. arXiv 2021, arXiv:2109.07161.
11. Quan, W.; Zhang, R.; Zhang, Y.; Li, Z.; Wang, J.; Yan, D.-M. Image inpainting with local and global refinement. IEEE Trans. Image Process. 2022, 31, 2405–2420.
12. Li, W.; Lin, Z.; Zhou, K.; Qi, L.; Wang, Y.; Jia, J. MAT: Mask-aware transformer for large hole image inpainting. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10748–10758.
13. Bhat, S.F.; Birkl, R.; Wofk, D.; Wonka, P.; Müller, M. ZoeDepth: Zero-shot transfer by combining relative and metric depth. arXiv 2023, arXiv:2302.12288.
14. Yang, L.; Kang, B.; Huang, Z.; Xu, X.; Feng, J.; Zhao, H. Depth Anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024.
15. Bochkovskii, A.; Delaunoy, A.; Germain, H.; Santos, M.; Zhou, Y.; Richter, S.R.; Koltun, V. Depth Pro: Sharp monocular metric depth in less than a second. arXiv 2024, arXiv:2410.02073.
16. Han, J.J.; Acar, A.; Henry, C.; Wu, J.Y. Depth Anything in medical images: A comparative study. arXiv 2024, arXiv:2401.16600.
17. Manni, G.; Lauretti, C.; Prata, F.; Papalia, R.; Zollo, L.; Soda, P. BodySLAM: A generalized monocular visual SLAM framework for surgical applications. arXiv 2024, arXiv:2408.03078.
18. Zhou, Q.-Y.; Park, J.; Koltun, V. Open3D: A modern library for 3D data processing. arXiv 2018, arXiv:1801.09847.
19. Kazhdan, M.M.; Bolitho, M.; Hoppe, H. Poisson surface reconstruction. In Proceedings of the Eurographics Symposium on Geometry Processing, Sardinia, Italy, 26–28 June 2006.
20. Chi, L.; Jiang, B.; Mu, Y. Fast Fourier convolution. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020.
21. Quan, W.; Chen, J.; Liu, Y.; Yan, D.-M.; Wonka, P. Deep learning-based image and video inpainting: A survey. Int. J. Comput. Vis. 2024, 132, 2367–2400.
22. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015.
23. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.V.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning robust visual features without supervision. arXiv 2024, arXiv:2304.07193.
24. Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 4217–4228.
25. Kazemi, V.; Sullivan, J. One millisecond face alignment with an ensemble of regression trees. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 1867–1874.
26. Basak, S.; Khan, F.; Javidnia, H.; Corcoran, P.; McDonnell, R.; Schukat, M. C3I-SynFace: A synthetic head pose and facial depth dataset using seed virtual human models. Data Brief 2023, 48, 109087.
27. Gómez-Rodríguez, J.J.; Lamarca, J.; Morlana, J.; Tardós, J.D.; Montiel, J.M.M. SD-DefSLAM: Semi-direct monocular SLAM for deformable and intracorporeal scenes. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi'an, China, 30 May–5 June 2021; pp. 5170–5177.
28. Masoumian, A.; Rashwan, H.A.; Cristiano, J.; Asif, M.S.; Puig, D. Monocular depth estimation using deep learning: A review. Sensors 2022, 22, 5353.
29. Li, S.; Zhu, S.; Ge, Y.; Zeng, B.; Imran, M.A.; Abbasi, Q.H.; Cooper, J. Depth-guided deep video inpainting. IEEE Trans. Multimed. 2023, 26, 5860–5871.
30. Zhang, F.X.; Chen, S.; Xie, X.; Shum, H.P.H. Depth-aware endoscopic video inpainting. In Proceedings of the Medical Image Computing and Computer Assisted Intervention (MICCAI), Marrakesh, Morocco, 6–10 October 2024.




| Metric | Formula | Description |
|---|---|---|
| PSNR | $10\log_{10}\left(\frac{MAX_I^2}{MSE}\right)$ | Peak signal-to-noise ratio in decibels, measuring peak error. Higher values indicate better reconstruction quality. |
| SSIM | $\frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$ | Structural similarity index measuring perceived quality through luminance, contrast, and structure correlation. |
| RMSE-log | $\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\log\hat{y}_i - \log y_i\right)^2}$ | Root mean square error in logarithmic space, better handling a wide range of values and relative differences. |
| FID | $\lVert\mu_r - \mu_g\rVert_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2}\right)$ | Distance between real and generated image distributions in Inception feature space, correlating with human perception. |
| LPIPS | $\sum_l \frac{1}{H_l W_l}\sum_{h,w}\lVert w_l \odot (\hat{\phi}^{hw}_l - \phi^{hw}_l)\rVert_2^2$ | Learned perceptual similarity using weighted distances between deep network features. |
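As a concrete reference, the pixel-level metrics above can be computed as follows. This is a minimal sketch using scikit-image for PSNR and SSIM; the intensity rescaling used for RMSE-log is an assumption (the normalization is not stated here), and FID/LPIPS are omitted because they require learned feature extractors (e.g., the `clean-fid` and `lpips` packages).

```python
# Sketch of pixel-level inpainting metrics; RMSE-log normalization is
# an illustrative assumption, not necessarily the paper's choice.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def inpainting_metrics(pred: np.ndarray, gt: np.ndarray):
    """`pred`/`gt`: uint8 RGB arrays of identical shape."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    # RMSE in log space on intensities rescaled to (0, 1], with a small
    # epsilon so log(0) never occurs.
    p = pred.astype(np.float64) / 255.0 + 1e-8
    g = gt.astype(np.float64) / 255.0 + 1e-8
    rmse_log = np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2))
    return psnr, ssim, rmse_log
```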
| Metric | Formula | Description |
|---|---|---|
| Abs. Rel. Diff. | $\frac{1}{N}\sum_i \frac{\lvert\hat{d}_i - d_i\rvert}{d_i}$ | Average of relative depth prediction errors, providing a scale-invariant measure of accuracy relative to the true depth. |
| Sq. Rel. | $\frac{1}{N}\sum_i \frac{(\hat{d}_i - d_i)^2}{d_i}$ | Emphasizes larger errors by squaring the relative error; particularly sensitive to outliers in depth prediction. |
| RMSE | $\sqrt{\frac{1}{N}\sum_i (\hat{d}_i - d_i)^2}$ | Root mean square of prediction errors in metric space; heavily penalizes large deviations from the ground-truth depth. |
| RMSE-log | $\sqrt{\frac{1}{N}\sum_i (\log\hat{d}_i - \log d_i)^2}$ | Evaluates errors in logarithmic space, better handling wide depth ranges while remaining sensitive to small depth differences. |
| Accuracy (δ) | $\frac{1}{N}\sum_i \mathbf{1}\left[\max\left(\frac{\hat{d}_i}{d_i}, \frac{d_i}{\hat{d}_i}\right) < \delta\right]$ | Proportion of predictions within a relative threshold of the true values, commonly evaluated at δ = 1.25, 1.25², 1.25³. |
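A minimal sketch of these depth metrics, assuming prediction and ground truth are already aligned in scale and invalid pixels have been masked out:

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """Standard monocular depth metrics as defined in the table above.

    `pred` and `gt` are positive depth arrays of the same shape; invalid
    pixels are assumed to have been removed beforehand.
    """
    pred, gt = pred.astype(np.float64), gt.astype(np.float64)
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred + eps) - np.log(gt + eps)) ** 2))
    # Threshold accuracies at delta = 1.25, 1.25^2, 1.25^3.
    ratio = np.maximum(pred / gt, gt / pred)
    deltas = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
    return abs_rel, sq_rel, rmse, rmse_log, deltas
```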
| Masking | Model | PSNR ↑ | SSIM ↑ | RMSE-log ↓ | LPIPS ↓ | FID ↓ |
|---|---|---|---|---|---|---|
| Left Eyebrow | LaMa | 20.1925 | 0.5694 | 0.3422 | 0.0720 | 1.2552 |
| | LGNet | 25.1119 | 0.7542 | 0.2200 | 0.0696 | 2.8648 |
| | MAT | 22.9053 | 0.6943 | 0.2748 | 0.0719 | 2.6511 |
| Left Eye | LaMa | 16.7562 | 0.3420 | 0.7435 | 0.0994 | 1.6457 |
| | LGNet | 20.0839 | 0.5311 | 0.5334 | 0.0708 | 2.8785 |
| | MAT | 19.4222 | 0.5228 | 0.5539 | 0.0693 | 2.9148 |
| Nose | LaMa | 19.7346 | 0.7516 | 0.3319 | 0.1481 | 1.7137 |
| | LGNet | 25.6988 | 0.8803 | 0.1925 | 0.1193 | 2.7347 |
| | MAT | 23.6441 | 0.8527 | 0.2334 | 0.1186 | 2.6879 |
| Mouth | LaMa | 17.9572 | 0.5447 | 0.4530 | 0.1985 | 1.9385 |
| | LGNet | 22.3937 | 0.7456 | 0.2922 | 0.1567 | 3.0683 |
| | MAT | 21.0938 | 0.7092 | 0.3327 | 0.1492 | 3.0341 |
| Right Eye | LaMa | 16.7474 | 0.3342 | 0.7409 | 0.0940 | 1.7031 |
| | LGNet | 19.9782 | 0.5254 | 0.5363 | 0.0694 | 2.8752 |
| | MAT | 19.4241 | 0.5210 | 0.5524 | 0.0712 | 2.8912 |
| Right Eyebrow | LaMa | 20.0869 | 0.5690 | 0.3530 | 0.0737 | 1.2399 |
| | LGNet | 24.9909 | 0.7513 | 0.2253 | 0.0711 | 2.8796 |
| | MAT | 22.6738 | 0.6896 | 0.2828 | 0.0729 | 2.6594 |
| Masking | Model | PSNR ↑ | SSIM ↑ | RMSE-log ↓ | LPIPS ↓ | FID ↓ |
|---|---|---|---|---|---|---|
| 10% | LaMa | 29.2582 | 0.9538 | 0.2671 | 0.0198 | 0.0030 |
| | LGNet | 31.4162 | 0.9587 | 0.5945 | 0.0159 | 0.0020 |
| | MAT | 11.6165 | 0.3972 | 1.4479 | 0.3565 | 0.1571 |
| 25% | LaMa | 22.7765 | 0.8666 | 0.4563 | 0.0706 | 0.0529 |
| | LGNet | 25.3864 | 0.8839 | 0.6643 | 0.0462 | 0.0105 |
| | MAT | 13.3344 | 0.5346 | 1.2856 | 0.2658 | 0.0923 |
| 50% | LaMa | 18.0442 | 0.7130 | 0.7147 | 0.1826 | 0.3642 |
| | LGNet | 20.2653 | 0.7344 | 0.7976 | 0.1153 | 0.0573 |
| | MAT | 16.9843 | 0.7276 | 0.9723 | 0.1409 | 0.0436 |
| Model | Abs. Rel. ↓ | Sq. Rel. ↓ | RMSE ↓ | RMSE-log ↓ | δ₁ ↑ | δ₂ ↑ | δ₃ ↑ |
|---|---|---|---|---|---|---|---|
| Depth Anything V2 | 0.6453 | 0.5072 | 0.6411 | 0.6042 | 0.6697 | 0.7144 | 0.7233 |
| ZoeDepth | 0.6509 | 0.5013 | 0.8607 | 0.6061 | 0.5271 | 0.6811 | 0.7116 |
| Depth Pro | 0.1426 | 0.0773 | 0.5424 | 0.2282 | 0.8373 | 0.9778 | 0.9886 |

