Article

Geometric Feature Enhancement for Robust Facial Landmark Detection in Makeup Paper Templates

1 Graduate Institute of Applied Science and Engineering, Fu Jen Catholic University, New Taipei City 243062, Taiwan
2 Library and Information Center, Lee-Ming Institute of Technology, New Taipei City 243083, Taiwan
3 Department of Computer Science and Information Engineering, Fu Jen Catholic University, New Taipei City 242062, Taiwan
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(2), 977; https://doi.org/10.3390/app16020977
Submission received: 24 November 2025 / Revised: 4 January 2026 / Accepted: 16 January 2026 / Published: 18 January 2026
(This article belongs to the Special Issue Advances in Computer Vision and Digital Image Processing)

Abstract

Traditional scoring of makeup face templates in beauty skill assessments heavily relies on manual judgment, leading to inconsistencies and subjective bias. Hand-drawn templates often exhibit proportion distortions, asymmetry, and occlusions that reduce the accuracy of conventional facial landmark detection algorithms. This study proposes a novel approach that integrates Geometric Feature Enhancement (GFE) with Dlib’s 68-landmark detection to improve the robustness and precision of landmark localization. A comprehensive comparison among Haar Cascade, MTCNN-MobileNetV2, and Dlib was conducted using a curated dataset of 11,600 hand-drawn facial templates. The proposed GFE-enhanced Dlib achieved 60.5% accuracy—outperforming MTCNN (23.4%) and Haar (20.3%) by approximately 37 and 40 percentage points, respectively—with precision and F1-score improvements exceeding 20% and 25%, respectively. The results demonstrate that the proposed method significantly enhances detection accuracy and scoring consistency, providing a reliable framework for automated beauty skill evaluation, and laying a solid foundation for future applications such as digital archiving and style-guided synthesis.

1. Introduction

In global beauty skill certification examinations, the Makeup Face Template competition has become a crucial assessment for evaluating professional competencies. Participants are required to design makeup works that conform to standardized aesthetic guidelines while accounting for diverse facial shapes and contours. However, the interpretation of these standards is often subjective, varying considerably among evaluators and leading to inconsistencies in scoring. Conventional scoring methods predominantly rely on evaluators’ individual judgments, rendering results susceptible to personal bias and experiential differences. Such subjectivity raises concerns regarding fairness and may negatively influence candidates’ performance and career advancement [1,2].
Similar competitions are also conducted in Taiwan, where the increasing number of participants has exposed the inefficiencies and high costs of manual evaluation, particularly in large-scale examinations. Additionally, regional and cultural variations in beauty standards further contribute to inconsistent evaluation criteria and subjective interpretation [3].
With the rapid advancement of artificial intelligence (AI) and image recognition technologies, new solutions have emerged to address these challenges. Nonetheless, developing an automated scoring system for Makeup Face Templates remains difficult due to challenges in recognizing hand-drawn features [4,5], the absence of standardized scoring metrics [6], and limited training data. Unlike traditional facial recognition systems that depend on three-dimensional depth cues and natural facial symmetry, makeup templates are purely two-dimensional with exaggerated proportions, symbolic contours, and stylistic abstraction. These characteristics disrupt conventional feature extraction algorithms and significantly reduce recognition accuracy.
Existing facial feature detection technologies demonstrate substantial declines in performance when processing highly stylized or incomplete facial depictions. Standard deep learning models such as MTCNN and VGG-Face, which perform well in real-world facial recognition, often fail to generalize effectively to makeup template images due to inconsistent outlines, artistic distortions, and non-photorealistic textures [7,8,9,10]. These limitations underscore the necessity of developing a specialized approach tailored to the unique attributes of makeup face templates.
Therefore, this study aims to analyze the application of machine learning and facial feature detection techniques in evaluating makeup face templates, providing a technical foundation for future automated scoring systems. The objectives of this research are threefold: (1) to systematically compare facial detection methods—Haar Cascade, MTCNN-MobileNetV2, and Dlib—on stylized makeup templates; (2) to propose an enhanced detection pipeline incorporating geometric feature enhancement (GFE) to address stylization-related challenges; and (3) to validate the robustness of the proposed method across diverse artistic distortions.
The main contributions of this study are summarized as follows:
Proposing a hybrid facial landmark detection framework that emphasizes geometric structure over pixel-based intensity, thereby improving adaptability to artistic variation in makeup templates.
Demonstrating that the integration of GFE with Dlib significantly improves detection accuracy and stability under exaggerated or distorted facial configurations.
Establishing a technical baseline for future applications of automated facial analysis in beauty education and standardized assessment systems.
Recognizing facial structures in 2D makeup-paper templates differs markedly from general facial recognition tasks. Unlike photographic facial images, hand-drawn templates lack depth cues, contain stylized or exaggerated contours, and exhibit inconsistent line qualities and shading behaviors. These fundamental domain differences significantly weaken the reliability of conventional feature extraction and landmark localization algorithms. To clarify these distinctions, Table 1 summarizes the major differences between general facial recognition and 2D makeup-paper template analysis, highlighting the unique challenges addressed in this study.
The remainder of this paper is organized as follows: Section 2 describes the dataset and preprocessing procedures, Section 3 introduces the proposed geometric feature enhancement method and the experimental setup, Section 4 presents the experimental results and discussion, and Section 5 concludes the study.

2. Materials and Methods

2.1. Dataset Preparation

This study utilized an expanded dataset comprising 11,600 hand-drawn makeup face template images generated for non-commercial academic research purposes. The dataset was constructed by applying systematic image-processing-based data augmentation to an original set of 200 hand-drawn makeup face templates obtained from educational and beauty skill competition contexts and authorized for academic research use.
All augmented samples were generated by the authors using controlled transformations, including blurring, noise addition, contrast reduction, and geometric distortion, to simulate stylistic variations commonly observed in hand-drawn makeup illustrations. No additional third-party image sources were introduced during this process. All samples were anonymized and used solely for academic research and methodological validation. The dataset does not contain real facial photographs or personally identifiable information and is not publicly redistributed.
All images were digitized at a resolution of 300 dpi and categorized into five representative facial shapes: Circle, Diamond, Rectangle, Square, and Triangle. The resulting dataset exhibits substantial variation in line quality, shading intensity, color application, and stylistic exaggeration, reflecting real-world assessment conditions and aligning with prior discussions on stylistic diversity and aesthetic evaluation in visual media [3,11,12,13,14].
Each sample was manually annotated by trained experts using the Dlib 68-point facial landmark scheme, ensuring consistent landmark placement across different drawing styles [15,16]. Images exhibiting incomplete facial contours, missing key features, or extreme distortion were excluded from the dataset prior to analysis.
The data were partitioned into training and testing subsets with an 80:20 split. To improve generalization, extensive data augmentation was applied through rotation (±15°), scaling (0.8×–1.2×), horizontal flipping, brightness and contrast adjustment, cropping, affine transformation, and Gaussian noise injection, following common practices in face analysis and aesthetic assessment pipelines [11,17,18]. These augmentations produced the expanded dataset used in detection and classification experiments.
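For illustration, a minimal augmentation sketch under these settings is given below. It assumes OpenCV and NumPy, covers a representative subset of the transformations listed above, and uses an illustrative helper name (augment_template) and flip probability rather than the exact pipeline of this study.

```python
import cv2
import numpy as np

def augment_template(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply one random geometric/photometric augmentation pass (illustrative)."""
    h, w = img.shape[:2]

    # Rotation within +/-15 degrees and scaling within 0.8x-1.2x
    angle = rng.uniform(-15, 15)
    scale = rng.uniform(0.8, 1.2)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    out = cv2.warpAffine(img, M, (w, h), borderValue=(255, 255, 255))

    # Optional horizontal flip
    if rng.random() < 0.5:
        out = cv2.flip(out, 1)

    # Brightness/contrast adjustment: out = alpha * img + beta
    alpha = rng.uniform(0.8, 1.2)   # contrast factor
    beta = rng.uniform(-20, 20)     # brightness offset
    out = cv2.convertScaleAbs(out, alpha=alpha, beta=beta)

    # Gaussian noise injection
    noise = rng.normal(0, 10, out.shape).astype(np.float32)
    return np.clip(out.astype(np.float32) + noise, 0, 255).astype(np.uint8)
```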
Table 2 summarizes the distribution of the 11,600-image dataset across training, validation, and test subsets for the five face-shape categories. This balanced allocation ensures coverage consistency during model evaluation.
Figure 1 presents representative examples of the five facial-shape categories used in this study—Circle, Diamond, Rectangle, Square, and Triangle. Each template is paired with its corresponding makeup design. These hand-drawn samples illustrate the geometric diversity, stylized contour variation, and artistic exaggeration present in the dataset, which collectively define the challenge addressed in this work.

2.2. Baseline Algorithms

Three commonly used facial detection and landmark localization algorithms were selected for comparative evaluation, representing traditional cascaded features, deep learning-based multi-stage detection, and regression-based landmark prediction [7,15,16,18,19,20,21].

2.2.1. Haar Cascade

A traditional classifier based on Haar-like features and integral image computation, originally proposed for rapid object and face detection in real-time applications [20]. Although computationally lightweight, Haar Cascade is known to degrade under stylized or exaggerated line drawings, especially when contours deviate from natural facial statistics [18,19].

2.2.2. MTCNN-MobileNetV2

A deep learning-based multi-stage detector combining Multi-task Cascaded Convolutional Neural Networks (MTCNN) with MobileNetV2 for efficient facial structure extraction [7,19,21]. Its hierarchical feature representation enables robust detection on photographic faces but may struggle with abstract or sparse contour drawings, particularly in non-photorealistic or hand-drawn domains [18,22]. The three-stage pipeline of MTCNN (P-Net, R-Net, and O-Net) is illustrated in Figure 2, highlighting the hierarchical processing flow used for candidate proposal, refinement, and final landmark localization.

2.2.3. Dlib

Dlib employs an ensemble of regression trees for 68-point landmark prediction, implemented in the widely used Dlib C++ library [15,16]. Known for its stability and computational efficiency, Dlib serves both as a baseline model and as the foundational predictor for the proposed geometric enhancement framework. All models were evaluated under identical preprocessing procedures—including grayscale conversion, histogram equalization, and geometric alignment—to ensure fair comparison across detection methods [18]. An example of the baseline Dlib landmark output is shown in Figure 3, illustrating its point-wise localization capability prior to geometric refinement.
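A minimal sketch of this baseline workflow (grayscale conversion, histogram equalization, detection, and 68-point prediction) is shown below. It assumes the standard Dlib shape_predictor_68_face_landmarks.dat model file; the wrapper name detect_landmarks is illustrative.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # standard Dlib model

def detect_landmarks(image_path: str) -> np.ndarray | None:
    """Return a (68, 2) array of landmark coordinates, or None if no face is found."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)      # histogram equalization, as in the shared preprocessing
    faces = detector(gray, 1)          # upsample once to help with faint line drawings
    if len(faces) == 0:
        return None
    shape = predictor(gray, faces[0])
    return np.array([(shape.part(i).x, shape.part(i).y) for i in range(68)])
```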

2.3. Proposed Method: Geometric Feature Enhancement (GFE)

Hand-drawn makeup templates often contain stylized distortions, such as exaggerated contours, uneven proportions, inconsistent shading, or incomplete features. These distortions weaken landmark consistency for conventional detectors, similar to robustness issues reported in face analysis under occlusion, heavy makeup, and non-standard imaging conditions [23,24,25,26,27]. To address this, we propose Geometric Feature Enhancement (GFE), a refinement layer designed to regularize facial geometry after initial landmark prediction. The complete procedural steps of GFE are provided in Appendix A, Algorithm A1.
The GFE module operates as a hybrid extension to Dlib’s 68-point predictor and incorporates three geometric constraints, conceptually aligned with prior work that combines pixel-level and geometric features to improve facial representation stability [28,29,30].
  • Region Segmentation
    Facial landmarks are grouped into seven structural zones: forehead, eyebrows, eyes, nose, lips, cheeks, and jawline. Each region is processed independently to preserve local shape consistency.
  • Proportional Normalization
    Inter-region distances (e.g., brow–eye height ratio, jaw width ratio, nose–lip distance) are corrected based on reference facial proportions, compensating for stylization-induced distortion.
  • Curvature Correction
    Cubic Bézier curve fitting is applied along contour landmarks (jawline, eyebrows, lips) to smooth irregularities and restore plausible curvature.
These geometric priors are embedded into Dlib’s iterative regression loop [16], yielding refined landmarks that exhibit improved robustness against artistic abstraction, shape exaggeration, and partial occlusion frequently observed in makeup template drawings [18,25,31].
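To make these constraints concrete, the sketch below illustrates one possible realization of the curvature correction and proportional normalization steps. The region index groups follow the standard Dlib-68 convention, while the reference eye-to-jaw ratio, the clipping range, and the least-squares Bézier fit are illustrative assumptions rather than the exact published settings.

```python
import numpy as np

# Illustrative grouping of Dlib-68 indices into structural zones
REGIONS = {
    "jawline": list(range(0, 17)),
    "eyebrows": list(range(17, 27)),
    "nose": list(range(27, 36)),
    "eyes": list(range(36, 48)),
    "lips": list(range(48, 68)),
}

def fit_cubic_bezier(points: np.ndarray) -> np.ndarray:
    """Least-squares fit of a cubic Bezier curve; returns the smoothed contour points."""
    # Chord-length parameterization in [0, 1]
    d = np.cumsum(np.r_[0.0, np.linalg.norm(np.diff(points, axis=0), axis=1)])
    t = d / d[-1]
    # Bernstein design matrix (n x 4) for a cubic curve
    A = np.stack([c * (1 - t) ** (3 - k) * t ** k
                  for k, c in enumerate([1.0, 3.0, 3.0, 1.0])], axis=1)
    ctrl, *_ = np.linalg.lstsq(A, points, rcond=None)
    return A @ ctrl

def refine_landmarks(pts: np.ndarray, ref_eye_jaw_ratio: float = 0.62) -> np.ndarray:
    """Curvature correction on the jawline plus a gentle proportional correction."""
    refined = pts.astype(float).copy()
    jaw_idx = REGIONS["jawline"]
    refined[jaw_idx] = fit_cubic_bezier(refined[jaw_idx])   # curvature correction

    # Proportional normalization: nudge jaw width toward an assumed eye/jaw reference ratio
    eye_dist = np.linalg.norm(refined[45] - refined[36])
    jaw_width = np.linalg.norm(refined[16] - refined[0])
    if jaw_width > 0:
        factor = np.clip((eye_dist / ref_eye_jaw_ratio) / jaw_width, 0.9, 1.1)
        center = refined[jaw_idx].mean(axis=0)
        refined[jaw_idx] = center + factor * (refined[jaw_idx] - center)
    return refined
```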
Unlike prior studies, our GFE focuses specifically on hand-drawn makeup templates, introducing region-specific geometric corrections that have not been explored in the context of beauty skill assessments.

2.4. Geometric Methods for Facial Landmark Refinement in 2D Makeup-Paper Templates

Facial feature recognition in 2D makeup paper templates presents distinct challenges compared to conventional facial recognition tasks. Traditional methods, including pixel-based classical algorithms and deep learning models, typically rely on clear and detailed facial features. However, in the context of 2D makeup paper templates, diverse artistic styles, facial occlusions, and the absence of three-dimensional depth information significantly degrade recognition performance. In makeup paper competitions, participants may deliberately exaggerate, abstract, or simplify facial features, further limiting the efficacy of conventional approaches. Additionally, occluded facial contours, varying lighting conditions, and stylistic inconsistencies further complicate accurate recognition. To address these limitations, this study shifts from traditional pixel-based methodologies to a geometry-centric approach.
This research introduces a novel framework based on geometric feature analysis, specifically designed for facial recognition in 2D makeup paper templates. The proposed method extracts stable geometric parameters, such as interocular distance, facial boundary ratios, chin curvature, and cheekbone width. These features effectively capture structural facial characteristics while demonstrating robustness against lighting variations, occlusions, and stylistic diversity. By overcoming the limitations of conventional methods, this approach provides a stable and accurate solution for facial feature recognition in this unique domain. Furthermore, by integrating geometric feature extraction with regional analysis, the proposed method enhances recognition accuracy and offers reliable technical support for aesthetic scoring and artistic evaluation, contributing to advancements in both research and practical applications.
The overall processing pipeline of the proposed method is illustrated in Figure 4, detailing the step-by-step workflow from image input, preprocessing, face detection, landmark-based region extraction, to coverage analysis and optional scoring. This structured framework enhances interpretability and supports reproducibility in practical deployment.

2.4.1. Proposed Geometric Features and Methods

This study presents a geometric feature analysis method tailored for facial recognition in 2D makeup paper templates, addressing the limitations of traditional image feature extraction in the presence of diverse artistic styles and blurred facial features. The key geometric analysis and processing techniques implemented in this study are described in the following subsections.
Interocular Distance Measurement
Using the get_face_landmarks function, 68 facial landmark positions are accurately extracted from the input image. The get_face_landmarks function is an author-defined wrapper implemented on top of the Dlib 68-point facial landmark detection library, rather than a native Python function. The geometric distance between the outer corners of the eyes (Landmarks 36 and 45) is computed as a stable feature for facial proportion estimation. This feature exhibits strong robustness against variations in lighting, viewing angles, and artistic styles, ensuring consistent recognition accuracy across diverse visual conditions.
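A minimal sketch of this measurement is shown below; it assumes a (68, 2) landmark array such as the one returned by the author-defined get_face_landmarks wrapper.

```python
import numpy as np

def interocular_distance(landmarks: np.ndarray) -> float:
    """Euclidean distance between the outer eye corners (Dlib points 36 and 45)."""
    return float(np.linalg.norm(landmarks[45] - landmarks[36]))

# Example usage (landmarks obtained from the Dlib-based wrapper described above):
# landmarks = get_face_landmarks(image)
# d_eye = interocular_distance(landmarks)
```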
Facial Boundary Measurement
Landmarks 0 and 16 are designated as the left and right facial boundary points, while Landmark 8 (chin midpoint) is used to measure facial length and width, forming the outer facial boundary and facilitating basic face shape estimation. In 2D makeup paper templates, where facial contours are generally stable, this method enables extended geometric analyses, such as contour curvature evaluation, to enhance the recognition of detailed facial features.
Chin Curvature Measurement
Curve fitting is applied to the chin area to generate a mask, followed by curvature analysis to extract additional geometric information. Chin boundary points (Landmarks 6 to 10) are utilized to compute curvature values, effectively distinguishing subtle variations in facial shape. In 2D makeup paper templates, chin curvature is especially critical, as it closely correlates with overall facial structure, enhancing recognition accuracy.
Cheekbone Width Measurement
Key landmarks (Landmarks 4 and 12) are used to determine the maximum cheekbone width, serving as a crucial geometric parameter for face shape recognition. By integrating outer boundary points with inward landmarks, the cheek area is modeled, and its maximum width is measured, further improving recognition accuracy.
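The boundary, chin curvature, and cheekbone measurements can be sketched in the same style. The quadratic fit used to approximate chin curvature below is an illustrative choice rather than the exact fitting procedure used in this study.

```python
import numpy as np

def facial_boundary(landmarks: np.ndarray) -> tuple[float, float]:
    """Face width (points 0 and 16) and height (nose bridge 27 to chin 8)."""
    width = float(np.linalg.norm(landmarks[16] - landmarks[0]))
    height = float(np.linalg.norm(landmarks[8] - landmarks[27]))
    return width, height

def chin_curvature(landmarks: np.ndarray) -> float:
    """Approximate chin curvature from points 6-10 via a quadratic fit y = ax^2 + bx + c."""
    chin = landmarks[6:11].astype(float)
    a, _, _ = np.polyfit(chin[:, 0], chin[:, 1], deg=2)
    return float(abs(2 * a))          # second derivative of the fitted parabola

def cheekbone_width(landmarks: np.ndarray) -> float:
    """Maximum cheek width approximated from points 4 and 12."""
    return float(np.linalg.norm(landmarks[12] - landmarks[4]))
```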

2.4.2. Facial Symmetry Analysis

The geometric center of the face is defined to assess the left–right symmetry of facial features, a key criterion in aesthetic evaluation. Geometric center analysis further facilitates calculations of cheekbone width and other facial shape characteristics, providing quantitative support for aesthetic scoring and artistic assessment. Additionally, contour analysis refines face shape classification—such as distinguishing between round and square faces—by examining the ratio of chin to cheekbone width.
Leveraging the stable facial contours in 2D makeup paper templates, this study introduces extended geometric calculations for regional feature expansion along facial contours. By incorporating curvature variations and proportional measurements, this method significantly enhances the recognition of structural facial features. Even in cases where facial features are blurred or occluded, precise feature extraction is achieved through geometric structure localization. The modular implementation ensures high stability and adaptability, making this approach a robust solution for facial feature analysis in specialized applications. The geometric segmentation of each facial feature is illustrated in Figure 5.
By integrating the aforementioned methods, features from key facial regions—including the forehead, eyes, nose, lips, chin, and cheeks—are extracted and visualized using polygonal segmentation. This approach effectively delineates facial boundaries, allowing for precise handling of diverse face contours in 2D makeup paper templates and improving recognition accuracy. Through masking techniques and region-based geometric feature extraction, this method enables detailed analysis and aesthetic scoring.
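The masking step can be sketched with OpenCV polygon filling, as shown below. The region index groups and the coverage test are illustrative; the intensity bounds correspond to the τ range reported later in this subsection.

```python
import cv2
import numpy as np

REGION_POINTS = {            # Dlib-68 index groups (illustrative grouping)
    "left_eye": range(36, 42),
    "right_eye": range(42, 48),
    "nose": range(27, 36),
    "lips": range(48, 60),
    "jawline": range(0, 17),
}

def region_mask(shape: tuple[int, int], landmarks: np.ndarray, region: str) -> np.ndarray:
    """Binary mask for one facial region built from its landmark polygon."""
    mask = np.zeros(shape, dtype=np.uint8)
    pts = landmarks[list(REGION_POINTS[region])].astype(np.int32)
    hull = cv2.convexHull(pts)                 # close the polygon robustly
    cv2.fillPoly(mask, [hull], 255)
    return mask

def coverage_ratio(gray: np.ndarray, mask: np.ndarray,
                   tau_low: int = 90, tau_high: int = 180) -> float:
    """Fraction of region pixels whose intensity falls in the cosmetic range [tau_low, tau_high]."""
    region = gray[mask > 0]
    if region.size == 0:
        return 0.0
    return float(np.logical_and(region >= tau_low, region <= tau_high).mean())
```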
The proposed geometric feature analysis provides a stable and efficient solution, thereby ensuring accurate facial recognition within the specific context of 2D makeup paper templates. As illustrated in Figure 6, the geometric feature-based approach clearly defines distinct facial segments across different face shapes. Future advancements may incorporate a broader range of facial models and integrate deep learning techniques to further enhance adaptability and recognition accuracy.
To ensure robustness and interpretability, the proposed Geometric Feature Enhancement (GFE) method incorporates several structural parameters derived from the Dlib-68 facial landmark model and custom-defined facial zones. These include interocular distance, chin curvature, cheek width, facial symmetry, and geometric center alignment—all of which play a critical role in verifying landmark plausibility and computing feature coverage. Each parameter was configured based on established anthropometric standards and facial geometry analysis literature. For example, the normalized interocular distance threshold was set at ±10%, and the symmetry deviation was bounded within a 25% tolerance of bilateral landmark pairs, enabling discrimination between natural variance and structural distortion.
In addition, we incorporated three geometry-based parameters—forehead height ratio (α), rotation tolerance angle (θ), and cosmetic pixel intensity threshold (τ)—which were extrapolated from the spatial segmentation and region grouping strategies presented by Liu et al. [28]. The α parameter defines the estimated height of the forehead as a proportion of the face height (set at α = 0.25), extending beyond the brow region as suggested by landmark groupings. The θ parameter (±10°) accounts for head pose variations and ensures robustness against in-plane rotations of the face. The τ parameter (90–180) defines a luminance threshold for identifying cosmetic coverage in facial regions, inspired by intensity-based attention mechanisms discussed in the cited work. These parameters reflect the adaptation of landmark-driven regional analysis to the hand-drawn and stylistically variable nature of makeup templates.
To evaluate the sensitivity of these parameters, we conducted a univariate sensitivity analysis on a validation set, modifying each threshold incrementally (e.g., ±10%, ±20%) while monitoring detection stability and region coverage rate. Results showed that the GFE method remained stable under moderate deviations, with significant performance drops only when thresholds exceeded physiologically plausible ranges (e.g., 50% symmetry deviation). These parameter settings were not only literature-informed and empirically validated but also emphasize reproducibility and explainability. This approach improves transparency in parameter design and ensures stable algorithmic behavior. Furthermore, as a deterministic module relying on basic geometric calculations without iterative optimization, the GFE process introduces negligible computational overhead, ensuring high efficiency for real-time assessment.
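A compact sketch of such a plausibility check is given below, using the threshold values quoted above. The reference eye-to-face-width ratio and the exact pass/fail logic are illustrative assumptions; the forehead ratio α is listed for completeness but used only for region construction, not in this check.

```python
import numpy as np

THRESHOLDS = {
    "interocular_tolerance": 0.10,   # +/-10% of the reference ratio
    "symmetry_tolerance": 0.25,      # 25% tolerance on bilateral deviation
    "forehead_alpha": 0.25,          # forehead height as a fraction of face height
    "rotation_theta_deg": 10.0,      # tolerated in-plane rotation
}

def landmarks_plausible(pts: np.ndarray, ref_eye_face_ratio: float = 0.45) -> bool:
    """Check interocular proportion, symmetry, and in-plane rotation against the thresholds."""
    face_width = np.linalg.norm(pts[16] - pts[0])
    eye_dist = np.linalg.norm(pts[45] - pts[36])
    ratio_ok = abs(eye_dist / face_width - ref_eye_face_ratio) <= \
        THRESHOLDS["interocular_tolerance"] * ref_eye_face_ratio

    # Symmetry: compare mirrored x-offsets of the outer eye corners about the face midline
    midline = (pts[0][0] + pts[16][0]) / 2.0
    sym_dev = abs((midline - pts[36][0]) - (pts[45][0] - midline)) / face_width
    symmetry_ok = sym_dev <= THRESHOLDS["symmetry_tolerance"]

    # In-plane rotation estimated from the eye-corner line
    dx, dy = pts[45] - pts[36]
    angle = abs(np.degrees(np.arctan2(dy, dx)))
    rotation_ok = angle <= THRESHOLDS["rotation_theta_deg"]

    return bool(ratio_ok and symmetry_ok and rotation_ok)
```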
To further formalize the Geometric Feature Enhancement (GFE) strategy, we introduce a set of mathematical formulations that quantitatively characterize the structural relationships among critical facial landmarks. These formulations serve as fundamental metrics for evaluating landmark localization accuracy, facial proportion consistency, and geometric symmetry, particularly under stylized or distorted facial conditions.
By translating visual features into robust geometric descriptors, the proposed framework enhances the stability of facial feature alignment and establishes a solid theoretical basis for future scoring and assessment models. The key symbols and their corresponding definitions employed in the GFE-based evaluation process are systematically summarized in Table 3, as shown below.

2.4.3. Geometric Feature Enhancement (GFE)—Formal Representation

Let the set of facial landmarks provided by the Dlib-68 model be:
$$P = \{\,p_1, p_2, \ldots, p_{68}\,\}, \qquad p_i \in \mathbb{R}^2$$
Each point $p_i$ represents a facial landmark with Cartesian coordinates $(x_i, y_i)$.
Interocular distance:
$$D_{\text{eye}} = \lVert p_{36} - p_{45} \rVert$$
Measures the Euclidean distance between the outer corners of both eyes to assess horizontal face alignment.
Facial aspect ratio:
$$r_f = \frac{w_f}{h_f} = \frac{\lVert p_1 - p_{17} \rVert}{\lVert p_{27} - p_8 \rVert}$$
Defines the ratio between facial width and height, supporting face-shape classification and normalization.
Facial symmetry index:
$$R_{\text{sym}} = \frac{1}{n} \sum_{i=1}^{n} \frac{\lvert x_i^{L} - x_i^{R} \rvert}{w_f}$$
where $x_i^{L}$ and $x_i^{R}$ are the x-coordinates of symmetrical landmark pairs, normalized by the facial width $w_f$. This metric captures horizontal symmetry deviations.
Chin curvature estimation: given five landmark points on the chin $\{p_6, p_7, p_8, p_9, p_{10}\}$, the curvature is approximated by
$$C_{\text{chin}} = \max_{i=6,\ldots,10} \left\lvert \frac{d^2 y_i}{d x_i^2} \right\rvert$$
A higher curvature index may indicate a sharper or rounder chin structure.
Geometric center offset:
$$\Delta_{gc} = \left\lVert \frac{1}{n} \sum_{i=1}^{n} p_i - c_{\text{ideal}} \right\rVert$$
where $c_{\text{ideal}}$ represents the theoretical center point based on a reference template. This metric quantifies deviations in overall landmark alignment.
Weighted geometric scoring function (for future automation):
$$S_{\text{geo}} = \sum_{i=1}^{5} w_i \, f_i(g_i)$$
where $g_i \in \{D_{\text{eye}}, r_f, R_{\text{sym}}, C_{\text{chin}}, \Delta_{gc}\}$, $f_i$ are normalization functions, and $w_i \in [0, 1]$ are learned or predefined weights summing to 1.
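These formulations translate directly into code. The sketch below implements the symmetry index, the geometric center offset, and the weighted scoring function; the choice of bilateral landmark pairs, the mirroring of pairs about the jaw midline, and the normalization used for S_geo are illustrative interpretations rather than the exact published implementation.

```python
import numpy as np

# Illustrative bilateral pairs from the Dlib-68 scheme (jaw, brows, eye corners, mouth corners)
SYMMETRIC_PAIRS = [(0, 16), (1, 15), (2, 14), (3, 13), (17, 26), (36, 45), (39, 42), (48, 54)]

def symmetry_index(pts: np.ndarray) -> float:
    """R_sym: mean mirrored x-deviation of bilateral pairs, normalized by face width."""
    w_f = np.linalg.norm(pts[16] - pts[0])
    midline = (pts[0][0] + pts[16][0]) / 2.0
    devs = [abs((midline - pts[l][0]) - (pts[r][0] - midline)) for l, r in SYMMETRIC_PAIRS]
    return float(np.mean(devs) / w_f)

def center_offset(pts: np.ndarray, c_ideal: np.ndarray) -> float:
    """Delta_gc: distance between the landmark centroid and a reference center."""
    return float(np.linalg.norm(pts.mean(axis=0) - c_ideal))

def weighted_geo_score(features: dict[str, float],
                       weights: dict[str, float],
                       norms: dict[str, float]) -> float:
    """S_geo: weighted sum of normalized geometric descriptors (weights assumed to sum to 1)."""
    return sum(weights[k] * min(features[k] / norms[k], 1.0) for k in features)
```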

2.5. Full Detection–Refinement–Classification Pipeline (GFE-Dlib System)

To ensure experimental consistency, all models were evaluated under a unified workflow incorporating detection, landmark prediction, geometric refinement, and face-shape classification. This design follows common face analysis pipelines that decompose processing into detection, alignment/landmarking, feature extraction, and classification stages [17,18,32]. A full description of the end-to-end workflow is provided in Algorithm A2 (Appendix A). Briefly, the pipeline integrates baseline Dlib landmarking with GFE refinement, then extracts geometric descriptors, and finally performs face-shape classification.
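A self-contained skeleton of this workflow is given below. The stage functions are passed in as parameters (for example, the Dlib wrapper, the GFE refinement, and the descriptor extraction sketched in earlier subsections), and the classifier is assumed to follow the common fit/predict convention; none of these names are prescribed by the original implementation.

```python
import numpy as np

def classify_face_shape(image_path: str,
                        get_landmarks,     # e.g., the Dlib wrapper sketched in Section 2.2.3
                        refine,            # e.g., the GFE refinement sketched in Section 2.3
                        extract_features,  # geometric descriptor computation (Section 2.4)
                        classifier) -> str | None:
    """Detection -> landmark prediction -> GFE refinement -> descriptors -> shape label."""
    landmarks = get_landmarks(image_path)
    if landmarks is None:
        return None                        # recorded as an unclassified case
    refined = refine(landmarks)
    feats = np.asarray(extract_features(refined), dtype=float).reshape(1, -1)
    return classifier.predict(feats)[0]    # e.g., an SVM trained on the geometric descriptors
```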

2.6. Experimental Setup

All experiments were conducted on a Windows 11 workstation equipped with:
  • Intel i7-13700 CPU (Intel Corporation, Santa Clara, CA, USA)
  • 32 GB RAM
  • NVIDIA RTX 4050 GPU (8 GB VRAM) (NVIDIA Corporation, Santa Clara, CA, USA)
  • Python 3.11/Anaconda environment
  • OpenCV 4.9.0
  • Dlib 19.24 [15]
A five-fold cross-validation scheme was employed to evaluate generalization, consistent with standard practices in face recognition and landmark detection research [17,18,32]. Models were trained for 150 epochs with:
  • batch size = 32
  • adaptive learning rate = 0.001
  • early stopping to prevent overfitting
  • fixed random seeds for reproducibility
All models used identical data splits, augmentations, and preprocessing pipelines to ensure fair comparison across methods [18].
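The evaluation protocol can be sketched as follows; the seed value and helper names are illustrative, while the five-fold setting mirrors the scheme described above.

```python
import random
import numpy as np
from sklearn.model_selection import KFold

SEED = 42                      # fixed seed for reproducibility (illustrative value)
random.seed(SEED)
np.random.seed(SEED)

def cross_validate(samples: np.ndarray, labels: np.ndarray, build_and_train) -> list[float]:
    """Five-fold cross-validation; build_and_train fits one model and returns its test accuracy."""
    kf = KFold(n_splits=5, shuffle=True, random_state=SEED)
    scores = []
    for train_idx, test_idx in kf.split(samples):
        acc = build_and_train(samples[train_idx], labels[train_idx],
                              samples[test_idx], labels[test_idx])
        scores.append(acc)
    return scores
```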

2.7. Evaluation Metrics

Quantitative performance was evaluated using:
  • Accuracy
  • Precision
  • Recall
  • F1-score
  • Mean Normalized Error (MNE)
These metrics are widely adopted in facial recognition, landmark detection, and aesthetic evaluation studies [11,17,18,24]. MNE is defined as the mean Euclidean distance between predicted and ground-truth landmarks, normalized by the interocular distance, a common normalization strategy in landmark localization benchmarks [16,18,24].
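The MNE computation reduces to a short helper, sketched below under the assumption of (68, 2) landmark arrays indexed in the Dlib convention.

```python
import numpy as np

def mean_normalized_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean point-to-point error normalized by the ground-truth interocular distance."""
    interocular = np.linalg.norm(gt[45] - gt[36])
    per_point = np.linalg.norm(pred - gt, axis=1)
    return float(per_point.mean() / interocular)
```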
Statistical significance between competing models was evaluated using paired t-tests (p < 0.05), following typical evaluation protocols for comparing face analysis methods [18,24].
Additional qualitative assessments were conducted through landmark overlay comparisons, error heatmaps, and stylization-intensity analyses, which are commonly used to visually interpret model behavior and error patterns in face and aesthetic analysis [11,12,13,33].
This study does not involve human participants, animals, or any personally identifiable information. Ethical approval and informed consent statements are therefore not applicable.

3. Results

3.1. Performance of Haar Cascade

This study evaluates the performance of the Haar [20] feature-based face detection algorithm in recognizing makeup face templates, focusing on its accuracy and precision across various face shapes and attributes. The assessment utilizes confusion matrices and performance metrics. Originally developed for object detection, Haar features have been widely applied in facial recognition. In this study, they were used to analyze key facial features in hand-drawn makeup images, with detection performance—measured through accuracy, recall, and F1-score—evaluated using a support vector machine (SVM) classifier optimized via GridSearchCV. The dataset consisted of 11,600 augmented hand-drawn makeup face images spanning five distinct face shapes, with comprehensive evaluation conducted using confusion matrices, precision, recall, F1-scores, and ROC-AUC metrics.
Figure 7 presents the confusion matrix, which highlights the algorithm’s limitations in multi-class classification tasks. The horizontal axis represents predicted labels, while the vertical axis denotes true labels, with diagonal values indicating correctly classified instances. The algorithm correctly classified 5 Circle, 20 Diamond, 89 Rectangle, 330 Square, and 97 Triangle instances. However, notable misclassification patterns emerged, revealing a bias in handling structurally similar face shapes. Specifically, the high rate of misclassification into the Square category suggests that extracted features from other shapes often overlap with or generalize to Square features.
These findings indicate that the Haar feature-based algorithm struggles with multi-class classification involving similar structural features in 2D hand-drawn images. The significant classification bias, particularly toward Square features, underscores the algorithm’s limitations in distinguishing nuanced geometric variations. While Haar features remain efficient for basic detection tasks, they lack the robustness needed for complex applications. This suggests that more advanced feature extraction techniques, particularly deep learning-based methods, could be explored in the future to further enhance classification accuracy and robustness in makeup template recognition.
Table 4 presents the performance of the Haar algorithm across four key metrics: accuracy, precision, recall, and F1-score. The algorithm achieves an accuracy of 0.3109, indicating low overall correctness in multi-class classification and highlighting its limited adaptability to complex datasets. The precision, at 0.5319, suggests that while predictions for certain classes are relatively accurate, the algorithm struggles with consistent recognition across all categories. The recall, which stands at 0.3109, is notably lower than precision, reflecting poor sensitivity in detecting all relevant samples, particularly in cases where similar structural features lead to frequent misclassification. The F1-score, recorded at 0.2523, further underscores the algorithm’s difficulty in balancing precision and recall, reinforcing its shortcomings in handling diverse datasets.
Overall, the Haar algorithm exhibits imbalanced performance in multi-class classification, with a noticeable disparity between precision and recall. While it demonstrates better predictive accuracy for certain classes, its ability to provide comprehensive coverage across all categories remains inadequate. The low F1-score further confirms its struggles in distinguishing between similar facial structures, aligning with the confusion matrix analysis, which indicates a bias in recognizing structurally related face shapes in hand-drawn makeup templates.
Figure 8 illustrates the ROC-AUC analysis of the Haar algorithm across five classification targets: Circle, Diamond, Rectangle, Square, and Triangle. The ROC curve evaluates classifier performance at various thresholds, while the area under the curve (AUC) measures its overall discriminatory ability, with values ranging from 0 to 1, where higher scores indicate better classification performance. Among all categories, the algorithm performs best in recognizing Rectangle face shapes, achieving the highest AUC value of 0.65, while the Diamond category records the lowest AUC value of 0.57, indicating weaker classification performance. Additionally, portions of the ROC curve that are close to the diagonal (dashed line) suggest that, for certain classes, the algorithm performs at a level similar to random guessing. In general, the Haar algorithm’s AUC values remain below 0.7, indicating only moderate classification performance on the hand-drawn makeup dataset.

3.2. Performance of MTCNN-MobileNetV2

This study investigates the application of MTCNN-MobileNetV2 (Multi-task Cascaded Convolutional Neural Networks combined with MobileNetV2) [7,21] in facial detection and makeup face template scoring systems. With its multi-stage convolutional architecture, MTCNN-MobileNetV2 enables high-precision facial landmark detection and alignment, making it a core component of the system. A comprehensive evaluation was conducted to assess its effectiveness in multi-class face shape classification, utilizing confusion matrices and key performance metrics, including Accuracy, Recall, F1-score, and ROC-AUC. The study also examined MTCNN-MobileNetV2’s adaptability and challenges in handling hand-drawn makeup applications, identifying both its strengths and limitations to inform system optimization.
Figure 9 presents the confusion matrix analysis, where the horizontal axis represents predicted labels, and the vertical axis denotes true labels. Each cell value indicates the number of predictions per class, with darker colors representing higher counts. Correct classifications are shown along the diagonal, including 345 Circle, 120 Diamond, 151 Rectangle, 134 Square, and 113 Triangle instances. The most frequent misclassification occurred in the Square category, which was incorrectly classified as Circle 214 times. Notably, Circle achieved the highest precision and lowest error rate.
The confusion matrix highlights the performance of MTCNN-MobileNetV2 in multi-class classification for hand-drawn makeup applications. While the algorithm performs best in Circle classification, it exhibits higher error rates in distinguishing Rectangle and Square face shapes, underscoring its difficulty in handling categories with similar structural features. These findings provide critical insights into the algorithm’s limitations and suggest specific directions for future improvements to enhance classification accuracy and robustness.
Table 5 presents the performance evaluation of MTCNN-MobileNetV2 in 2D makeup face template classification using four key metrics. The model achieves an accuracy of 0.4960, indicating moderate overall correctness and suggesting acceptable adaptability to multi-class classification. The precision, at 0.8385, demonstrates the model’s strong ability to correctly classify positive predictions. However, the recall of 0.4960 highlights its limited capability in identifying all true positive samples, indicating potential issues with sensitivity. The F1-score, approximately 0.5188, suggests room for improvement in balancing precision and recall. Overall, while MTCNN-MobileNetV2 excels in precision, its accuracy and recall remain suboptimal, indicating instability in handling diverse and complex datasets. The observed classification biases in certain categories provide valuable insights for optimizing classification algorithms in makeup face template scoring systems.
Figure 10 illustrates the overall performance of MTCNN-MobileNetV2, with AUC values ranging from 0.77 to 0.79, reflecting stable but moderate classification performance in multi-class tasks. While the model exhibits similar performance across most categories, lower AUC values in certain classes suggest a difficulty in distinguishing class boundaries. The Rectangle category demonstrates the best discrimination ability, likely due to its distinctive structural features. The ROC curve analysis confirms that while MTCNN-MobileNetV2 exhibits some recognition capability in 2D makeup face template classification, it has not yet reached high accuracy levels. These findings provide critical insights for further model optimization to enhance classification accuracy and robustness.

3.3. Performance of GFE-Enhanced Dlib

This study evaluates the performance of Dlib’s 68 landmark detection technique in facial feature detection and makeup face template classification. By comparing confusion matrices and key performance metrics, including Accuracy, Recall, Precision, and F1-score, the practical applicability and adaptability of Dlib were thoroughly analyzed. The findings highlight Dlib’s strengths and limitations in planar makeup image processing, providing a valuable reference for future research.
Figure 11 illustrates the confusion matrix for the Dlib-based classifier, offering detailed insights into its facial shape recognition performance. The matrix’s diagonal elements indicate correctly classified instances, with Rectangle (301) and Square (295) showing the highest accuracy. This suggests that Dlib, supported by its 68-point landmark model, excels in identifying categories with strong geometric regularity and well-distributed landmarks.
Circle and Diamond face shapes also achieve reasonably high correct classification rates (Circle = 246, Diamond = 246), yet they exhibit higher dispersion in predictions across adjacent classes. For example:
  • Circle was often confused with Square (91 cases), suggesting difficulty in differentiating between round and softly angular contours.
  • Diamond showed misclassifications into Square (50) and Triangle (24), which may stem from overlaps in cheekbone prominence and chin angles.
The Triangle category, though generally well recognized (213 correct), suffered from notable misclassification into Square (90 cases) and Diamond (24 cases), hinting at geometric ambiguity introduced by artistic variations in chin-point definition and jaw structure.
Overall, Dlib demonstrates superior robustness compared to Haar and MTCNN-MobileNetV2, particularly in geometric consistency. However, its performance is still affected by inter-class shape proximity, especially in categories sharing visual symmetry or curvature. These results validate the need for geometric feature enhancement (GFE) strategies—such as symmetry indexing and curvature analysis—as adopted in this study.
Table 6 presents the performance evaluation of the Dlib algorithm in makeup face template classification, assessed using four key metrics: Accuracy, Precision, Recall, and F1-score. The results indicate that Dlib achieves stable and high performance, with all metrics approaching 0.8. Specifically, the accuracy of 0.7477 reflects a high proportion of correct predictions in multi-class classification tasks, while the precision of 0.7833 demonstrates the algorithm’s effectiveness in correctly identifying positive samples. The recall of 0.7477 confirms its capability to detect all relevant samples, and the F1-score of 0.7527, as the harmonic mean of precision and recall, highlights its balanced performance across these metrics. The close alignment of these values further underscores Dlib’s stability and reliability in multi-class classification for makeup face templates, particularly when processing images with distinct features.
Despite its strong overall performance, Dlib exhibits some limitations in handling extreme boundary samples or images with blurred features, where deep learning-based models may perform better. These findings offer valuable insights for further optimizing Dlib or integrating it with other model technologies to enhance classification performance.
Figure 12 illustrates the ROC curve (Receiver Operating Characteristic Curve) of the Dlib algorithm in makeup face template classification. The Rectangle category achieves the highest AUC value of 0.97, indicating the algorithm’s strongest discriminatory ability in this category. Other categories also achieve high AUC values, with all curves positioned well above the random classification baseline (diagonal line), confirming Dlib’s consistent recognition capability across multiple face shape categories.
Overall, the ROC curve and AUC analysis highlight Dlib’s ability to effectively distinguish multi-class features using its 68 landmark detection technique, with exceptional performance in the Rectangle category. While Dlib exhibits slight performance variations across categories, it consistently outperforms Haar and MTCNN-MobileNetV2. Therefore, Dlib provides reliable technical support for multi-class makeup face template classification.
Based on the above analysis, the Dlib method demonstrates stable and superior performance in makeup face template classification, particularly for categories with distinct geometric features, where its recognition capability surpasses traditional algorithms. However, challenges persist in high-confusion categories, likely due to geometric feature similarities between classes. Future enhancements could focus on increasing dataset diversity or integrating deep learning techniques to further improve overall recognition accuracy, ensuring broader applicability in multi-class makeup face template classification. Figure 13 illustrates the Dlib 68 landmark detection technique’s ability to accurately detect all five face shape categories, reinforcing its effectiveness in facial feature analysis within makeup template scoring systems.
According to the analysis of the three feature extraction techniques, this study systematically evaluated the performance, adaptability, and limitations of various facial recognition technologies on makeup face template samples. The findings reveal that while FaceNet and VGG-Face excel in traditional photo-based recognition, their performance declines in makeup face template classification due to the limited information provided by color blocks and geometric features, restricting their feature extraction capabilities.
The Haar Cascade method, leveraging local feature extraction, enhances classifier performance with computational efficiency. However, it struggles with samples exhibiting high feature similarity, limiting its effectiveness in complex classification tasks. In contrast, MTCNN-MobileNetV2, with its hierarchical neural network architecture, demonstrated superior adaptability in makeup face template applications, excelling in facial feature localization under diverse sample conditions and outperforming Haar Cascade in both precision and speed.
The Dlib 68 landmark detection technique exhibited outstanding performance in facial alignment and feature localization, particularly for challenging datasets and varied conditions. As a widely applied method in facial expression analysis, face recognition, virtual makeup, and pose estimation, Dlib’s 68 landmark technique showcased exceptional stability and reliability in this study. Its performance in makeup face template classification surpassed that of Haar Cascade and MTCNN-MobileNetV2, particularly in multi-class scenarios, establishing a robust technical foundation for future system optimization and applications.

3.4. Experimental Results

The objective of this experimental phase is to evaluate the performance of three facial detection techniques—Haar Cascade, MTCNN-MobileNetV2, and Dlib—on 200 original makeup face template samples, comparing different feature extraction methods to determine the most effective approach. The evaluation is based on several key metrics. Total detections measure the number of successfully identified facial feature points, providing an overall assessment of detection capability across diverse facial layouts. True Positives (TP) quantify the correctly detected features aligned with ground-truth annotations, serving as an indicator of geometric accuracy. False Positives (FP) represent incorrectly detected or misaligned feature points, helping to assess the false detection rate and model precision. False Negatives (FN), where existing features were not detected, are also recorded to reflect recall sensitivity.
In addition, landmark matching accuracy is calculated based on the percentage of Dlib 68 facial keypoints correctly aligned within a pre-defined Euclidean distance threshold from the annotated reference. These metrics are averaged across all face shapes and makeup styles, including cases with varying occlusions and stylized distortions. Together, this comprehensive evaluation framework enables a rigorous comparison of each technique’s effectiveness and robustness in detecting facial features on hand-drawn makeup face templates.
Table 7 presents the performance comparison of three facial detection techniques, Haar Cascade, MTCNN-MobileNetV2, and Dlib, under standard conditions. The table includes data on the total number of detected faces, true positives, and false positives across 200 images. Haar Cascade detected 36 faces, of which 23 were true positives and 13 were false positives. MTCNN-MobileNetV2 identified 30 faces, with 28 true positives and two false positives. Dlib, in contrast, successfully detected all 200 faces, achieving 100% true positives with no false positives.
These results highlight the accuracy and reliability of each technique under standard conditions. Figure 14 further illustrates that both Dlib and MTCNN-MobileNetV2 maintain consistent performance in true positive detections, with Dlib demonstrating the best results in terms of false positive rate, as it produced no false detections. While Haar Cascade exhibited a higher number of false positives, it still retained a certain level of detection capability, making it a viable option in specific scenarios despite its limitations in accuracy.
Table 8 presents the performance comparison of three facial detection techniques, Haar Cascade, MTCNN-MobileNetV2, and Dlib, under augmented image conditions using a sample size of 11,600 images. The evaluation examines each technique’s robustness and adaptability to various augmentation scenarios, simulating diverse application conditions. The augmentations applied included rotation (angles from −30° to 30° at 15° intervals), scaling (factors of 0.6, 0.8, 1.2, and 1.4), translation (horizontal or vertical shifts of −40 to 40 pixels at 20-pixel intervals), horizontal flipping, brightness and contrast adjustments (−40 to 40 in steps of 20), random noise addition (standard deviations of 5, 10, and 15), random cropping (80% and 70% of the image height and width), and random affine transformations, including angle, scaling, and shearing. These augmentation techniques were designed to assess the ability of each detection method to maintain accuracy under challenging conditions.
Figure 15 illustrates that both Dlib and MTCNN-MobileNetV2 retained stable detection accuracy, comparable to their performance under standard conditions. Among the three techniques, Dlib exhibited the best results, maintaining the lowest false positive rate and minimizing unclassified cases, reinforcing its robustness and reliability in complex facial detection tasks.
Through the comparison of standard sample conditions and augmented techniques, this study observed that while Dlib exhibits excellent recognition accuracy, it faces challenges when facial features are not clearly drawn. To address this limitation, an innovative geometric feature analysis method was proposed to enhance the stability and accuracy of facial recognition under adverse conditions. This approach emphasizes geometric calculations in assisting with the identification of facial regions, rather than relying solely on image clarity.
The method begins by selecting the distance between the eyes as a stable geometric feature. Facial detection techniques are used to locate the eyes and measure their distance, a feature that remains highly consistent despite variations in lighting and angles. Additionally, facial boundaries are analyzed by measuring face length and width, with their ratio calculated to determine basic face shapes.
Chin curvature is identified as another key feature, where curve fitting techniques are applied to calculate chin edge curvature, providing further insights into facial structure. The maximum distance between cheekbones is also measured to enhance facial shape recognition. For symmetry analysis, a geometric facial center point is defined as a fundamental reference for feature extraction and alignment. This center point further facilitates the calculation of cheekbone width and other facial proportions.
To refine contour features, the chin-to-cheekbone width ratio is analyzed to distinguish different face shapes, such as round or square. Symmetry evaluation includes the alignment of eyes, eyebrows, nose, and mouth, which serves as a crucial metric for aesthetic assessment.
By integrating these geometric-based methods, this study demonstrates the ability to accurately identify facial features, even when blurred or incomplete, providing a robust technical foundation and a valuable experimental reference for future facial recognition applications.

4. Discussion

This section examines the performance variations in the proposed Geometric Feature Enhancement (GFE) method across different face shapes and makeup styles, as well as the challenges and limitations encountered in practical applications. To simulate the impact of unclear facial features in hand-drawn makeup face templates, a dataset of 11,600 original samples, categorized into five face shapes—Circle, Diamond, Rectangle, Square, and Triangle—was subjected to various image processing transformations. These transformations were designed to replicate common distortions that may arise in artistic face illustrations:
  • Blurring (Blur): Gaussian Blur and Average Blur were applied to simulate the loss of clarity in facial boundaries, a frequent issue in hand-drawn illustrations.
  • Noise Addition (Add Noise): Gaussian Noise and Salt & Pepper Noise were introduced to replicate disturbances caused by paper texture inconsistencies or uneven brush strokes.
  • Contrast Reduction (Reduce Contrast): Image contrast was reduced to obscure fine facial details, mimicking issues such as unclear ink traces or inconsistent lighting.
  • Random Distortion (Random Distortion): Minor Perspective Transform and Elastic Distortion were applied to simulate misalignments in facial feature placement due to artistic variations in hand-drawn templates.
These experimental conditions provide a systematic framework for evaluating the robustness and adaptability of the GFE-based approach, offering insights into its effectiveness in handling various facial distortions in makeup face template recognition. To further analyze the impact of these distortions, five sets of processed samples (Circle_R, Diamond_R, Rectangle_R, Square_R, Triangle_R) were generated using different image processing techniques. A comparative analysis was conducted to assess the recognition performance of these samples across multiple algorithms, highlighting the strengths and limitations of the proposed method in diverse conditions. It should be noted that the hand-drawn templates were produced within specific educational and competition contexts, which may introduce stylistic or training-related biases that should be considered when interpreting the evaluation results.
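The degradations listed above can be reproduced with standard OpenCV and NumPy operations, as sketched below. The kernel size, noise density, contrast factor, and warp magnitude are illustrative values within the described categories rather than the exact settings used to build the processed sample sets.

```python
import cv2
import numpy as np

def degrade(img: np.ndarray, mode: str, rng: np.random.Generator) -> np.ndarray:
    """Simulate one of the hand-drawing degradations used in the robustness study."""
    if mode == "blur":
        return cv2.GaussianBlur(img, (7, 7), 0)
    if mode == "noise":                               # salt & pepper noise
        out = img.copy()
        mask = rng.random(img.shape[:2])
        out[mask < 0.02] = 0
        out[mask > 0.98] = 255
        return out
    if mode == "low_contrast":                        # pull intensities toward the mean
        mean = img.mean()
        return np.clip(mean + 0.5 * (img.astype(np.float32) - mean), 0, 255).astype(np.uint8)
    if mode == "distort":                             # mild perspective warp
        h, w = img.shape[:2]
        src = np.float32([[0, 0], [w, 0], [0, h], [w, h]])
        jitter = rng.uniform(-0.03, 0.03, (4, 2)) * [w, h]
        M = cv2.getPerspectiveTransform(src, (src + jitter).astype(np.float32))
        return cv2.warpPerspective(img, M, (w, h), borderValue=(255, 255, 255))
    return img
```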

4.1. Overall Model Performance Comparison

This study evaluates the performance of three facial recognition methods—Dlib with Geometric Feature Enhancement (GFE), Haar Cascade, and MTCNN-MobileNetV2—on makeup face template samples featuring five distinct face shapes and various makeup styles. To assess the effectiveness of each method in makeup face template recognition, key evaluation metrics were computed, including recognition accuracy (Accuracy), precision (Precision), recall (Recall), and F1-score. These metrics provide a comprehensive analysis of each method’s suitability for handling the challenges posed by artistic facial illustrations and diverse stylistic variations. Specifically, accuracy reflects the overall proportion of correctly predicted samples, precision indicates the proportion of true positive predictions among all positive predictions, recall measures the model’s ability to capture all true positive cases, and F1-score provides a balanced measure by combining precision and recall into a single harmonic average. These definitions offer a more interpretable basis for comparing classifier performance across varying input conditions.
To further validate the reliability of the proposed method, we conducted a 10-fold Stratified K-Fold cross-validation using the GFE-enhanced Dlib framework. In each fold, the dataset was randomly shuffled and partitioned, and a new SVM classifier was trained and evaluated. Performance metrics—including accuracy, precision, recall, and F1-score—were computed for each fold, and their mean and standard deviation were calculated to assess statistical robustness.
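This validation loop can be expressed compactly with scikit-learn, as in the sketch below. The SVM hyperparameters and the weighted averaging of per-class metrics are illustrative assumptions, and the input features are assumed to be the GFE geometric descriptors described in Section 2.4.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def stratified_cv(features: np.ndarray, labels: np.ndarray, n_splits: int = 10):
    """10-fold stratified cross-validation of an SVM on GFE geometric descriptors."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    records = []
    for train_idx, test_idx in skf.split(features, labels):
        clf = SVC(kernel="rbf", C=1.0)                # illustrative hyperparameters
        clf.fit(features[train_idx], labels[train_idx])
        pred = clf.predict(features[test_idx])
        acc = accuracy_score(labels[test_idx], pred)
        prec, rec, f1, _ = precision_recall_fscore_support(
            labels[test_idx], pred, average="weighted", zero_division=0)
        records.append((acc, prec, rec, f1))
    arr = np.array(records)
    return arr.mean(axis=0), arr.std(axis=0)          # mean and std per metric across folds
```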
As shown in Table 9, the three detection methods exhibited distinct strengths and weaknesses across all evaluation metrics.
  • Haar Cascade exhibited the lowest performance across all evaluation metrics, indicating its inefficacy in recognizing facial features within makeup face templates. Its recall value (<0.21) suggests a high rate of false negatives, meaning it fails to correctly detect most facial features. This result underscores the limitations of Haar Cascade in handling artistic and stylistically diverse face illustrations.
  • MTCNN-MobileNetV2 achieved the highest precision (0.7232) among the evaluated methods, indicating its ability to accurately classify certain sample types. However, its low recall (0.2345) suggests that it fails to detect a substantial portion of actual facial features. This trade-off indicates that while MTCNN-MobileNetV2 performs well when it does identify a feature, it struggles with overall facial feature detection consistency in makeup face templates.
  • Dlib with Geometric Feature Enhancement (GFE) outperformed the other methods by a wide margin, with accuracy reaching 0.6052. Notably, its precision (0.6087) and recall (0.6052) are both above 0.60, indicating a good balance between correctness and completeness. This demonstrates superior adaptability and robustness in facial feature recognition for the makeup templates. Overall, integrating geometric feature analysis into Dlib significantly enhanced performance, making it a more reliable approach for artistic facial illustrations.

4.2. GFE Performance by Face Shape

This subsection analyzes the impact of face shape variations on recognition accuracy under blurred conditions. The performance of three facial recognition methods—Haar Cascade, MTCNN-MobileNetV2, and Dlib with Geometric Feature Enhancement (GFE)—is evaluated across different face shapes. By assessing recognition accuracy for each method, this study aims to determine how variations in facial structure influence recognition robustness and effectiveness in makeup face template analysis.
As shown in Table 10, the GFE-enhanced Dlib model demonstrated the highest level of robustness across all face-shape categories under reduced-resolution conditions.
  • For Haar Cascade, the accuracy on Circle and Diamond shapes was 0%, indicating a failure to correctly identify any sample of those shapes. It performed somewhat better on Square-shaped faces (33.41% accuracy), but overall, this method was unable to effectively recognize facial features in the makeup template images.
  • MTCNN-MobileNetV2 showed the best result on Circle-shaped faces (34.28% accuracy), but its accuracy dropped sharply for Rectangle (8.79%) and Triangle (7.12%) shapes. This suggests that MTCNN-MobileNetV2 has difficulty handling the highly exaggerated or abstract features in certain hand-drawn face shapes.
  • The Dlib with Geometric Feature Enhancement (GFE) approach achieved the highest accuracy on all face shapes, demonstrating consistently strong performance even on blurred samples. In particular, it handled Rectangle (77.72%) and Diamond (60.78%) shapes very well—far exceeding the other methods’ accuracies on those shapes—suggesting that Dlib + GFE is especially effective for face shapes with clear geometric structures. Overall, these results highlight the GFE-based approach’s superior adaptability and robustness on stylized makeup face templates.
Although Dlib with Geometric Feature Enhancement (GFE) demonstrated superior performance in this study, several challenges and limitations persist. The hand-drawn nature of makeup face templates introduces significant variability in artistic styles, which can result in slight misalignments of facial feature points. These inconsistencies may negatively impact the accuracy of geometric feature extraction, leading to occasional recognition errors.
Despite these challenges, the findings in this section confirm that Dlib + GFE consistently outperforms Haar Cascade and MTCNN-MobileNetV2 across various conditions. Its ability to adapt to different face shapes and artistic styles highlights its higher robustness and stability, making it a more effective solution for recognizing facial features in makeup face templates.

4.3. Key Insights from Experimental Results

The experimental findings across the 200-image original dataset and the 11,600-image augmented dataset reveal consistent performance trends among the evaluated detection and recognition methods. The original Dlib detector [15,16] demonstrated superior detection stability, producing the lowest false-positive and unclassified counts under both standard and augmented conditions. The GFE-enhanced Dlib model further strengthened recognition performance, achieving balanced classification metrics (Accuracy = 0.6052, Precision = 0.6087, Recall = 0.6052, F1 = 0.6012) and outperforming Haar Cascade [20] and MTCNN-MobileNetV2 [7,21] across all face-shape categories. These results indicate that geometric refinement plays a critical role in stabilizing landmark detection for stylized, hand-drawn makeup templates characterized by exaggerated proportions, inconsistent contours, and variations in line clarity.
The stratified analysis by face shape further highlights the robustness of the GFE-enhanced Dlib model, particularly for face types with clearer or elongated geometric structures, such as Rectangle_R (77.72%) and Diamond_R (60.78%). In contrast, conventional photorealistic detectors—including Haar Cascade [20] and MTCNN-MobileNetV2 [7,21]—degraded substantially under stylization, underscoring the challenge of adapting existing facial detection models to non-photorealistic artistic domains.

4.4. Comparison with Previous Studies

The findings of this study align with and extend existing research on facial analysis under non-standard imaging conditions. Previous studies have highlighted the inherent challenges of aesthetic scoring and feature extraction when models are applied to inputs exhibiting stylistic variation, abstract contours, or uneven score distributions [1,2]. These works also observed that deep-learning-based methods often overfit to photorealistic features and struggle to generalize in the presence of artistic distortions or symbolic visual representations.
Studies focusing on masked or occluded faces, including Eman et al. (2023) [34], demonstrated the need for feature-level compensation strategies (e.g., PCA-based dimensionality reduction) when key visual cues are obstructed. Likewise, Saabia et al. (2018) [35] and Jaber et al. (2022) [30] highlighted the effectiveness of optimization-driven feature selection and filter-based geometric enhancement when facial structures deviate from standard proportions. These prior findings resonate with the improved performance of the GFE-based approach in the present study, which applies region segmentation, proportional normalization, and curvature correction to counteract artistic distortions in makeup template drawings.
Furthermore, recent advances in explainable AI (XAI) for educational applications—such as rubric-based scoring frameworks and attention-guided modeling—highlight the need for stable geometric grounding when building interpretable scoring mechanisms. Our GFE strategy similarly enhances geometric consistency, offering a more reliable basis for classroom feedback, automated evaluation, and standardization in artistic assessment contexts. To synthesize representative prior work and clarify the remaining research gaps, Table 11 summarizes key contributions and limitations of related studies in facial analysis and aesthetic assessment.

4.5. Contributions of This Study

This work offers several notable contributions to the field of facial analysis for non-photorealistic artistic illustrations:
  • A systematic evaluation of classic and modern face detectors on stylized, hand-drawn makeup templates, providing one of the first large-scale comparisons in this domain.
  • The development of a geometric feature enhancement (GFE) module that improves landmark stability by applying proportion-based corrections and curvature adjustments across seven semantic facial regions.
  • A demonstration of significant performance gains by the GFE-enhanced Dlib model on both the original and augmented datasets, where it consistently outperforms Haar Cascade and MTCNN-MobileNetV2.
  • A stratified analysis by face shape, revealing how geometric structure influences recognition performance under stylization.
  • A technical foundation for automated scoring systems in beauty skill education, offering pathways to standardized evaluation, instructor feedback tools, and potential formative assessment interfaces.
These contributions collectively advance the feasibility of AI-driven assessment in skill-competition environments where drawings follow stylized, symbolic, or exaggerated design conventions rather than natural facial imagery.

4.6. Limitations

Despite the promising outcomes, several limitations should be acknowledged.
First, extremely unclear, low-contrast, or partially missing facial contours remain challenging for all evaluated models, including the GFE-enhanced Dlib.
Second, the dataset, although large after augmentation (11,600 images), originates from a limited set of competitions and drawing styles, which may not fully capture the diversity of global artistic conventions. In addition, the performance of the proposed geometry-based framework may vary under different drawing styles or template variations. Templates with relatively clear structural outlines and consistent proportional relationships tend to benefit more from geometric constraints, whereas highly abstract, loosely sketched, or stylistically exaggerated drawings may introduce ambiguity in landmark localization. Variations in line thickness, contour continuity, and regional exaggeration can affect the stability of geometric feature extraction, highlighting the inherent difficulty of achieving uniformly robust performance across all artistic styles.
Third, the current system focuses exclusively on landmark detection and classification accuracy; it does not yet perform full scoring rubric integration, shading consistency assessment, or detailed region-by-region stroke quality analysis.
Fourth, the GFE module, while effective, is still dependent on the initial Dlib landmark estimates and may thus inherit any upstream detection errors.
These limitations suggest the need for broader datasets, stronger robustness methods, and tighter integration between geometric reasoning and deep feature extraction.

4.7. Future Research Directions

Future work may proceed along several promising directions. First, expanding the dataset to include cross-cultural and cross-institutional drawing styles would enhance model generalization and reduce evaluation bias. Second, future work may explore the integration of deep-learning-based face mesh methods (e.g., MediaPipe Face Mesh with 468 landmarks), which could potentially improve region-level precision and support finer-grained scoring systems, subject to their adaptability to stylized, non-photorealistic inputs. Third, incorporating explainable AI (XAI) techniques, such as SHAP (Shapley Additive Explanations) or Grad-CAM (Gradient-weighted Class Activation Mapping), could provide interpretable region-weighting maps to assist educators in giving feedback. Fourth, combining recognition with generative models may enable automatic suggestion of corrected facial proportions or idealized reference templates for formative instruction. Finally, future research may explore multimodal scoring pipelines that integrate geometry, shading analysis, symmetry assessment, and texture consistency for more holistic beauty-skill evaluation. These future directions aim to address the current limitations and further enhance the system’s reliability and applicability in diverse contexts.

5. Conclusions

In summary, this study is the first to integrate geometry-based facial landmark refinement into the evaluation of hand-drawn makeup templates, demonstrating its effectiveness in improving scoring consistency. This work investigated the challenges of facial landmark detection in stylized, hand-drawn makeup face templates and showed that integrating Geometric Feature Enhancement (GFE) with Dlib substantially improves detection robustness under artistic distortions. Across both the original (200 images) and augmented (11,600 images) datasets, the GFE-enhanced Dlib model consistently outperformed Haar Cascade and MTCNN-MobileNetV2 in accuracy, false-positive control, and stability across diverse face shapes. Stratified analysis further confirmed that geometric refinement is particularly effective for face shapes with clearer structural cues, such as Rectangle and Diamond, while maintaining stable performance under highly stylized contours.
From an experimental perspective, the analyses reported in Section 4 further demonstrate that the proposed GFE-enhanced Dlib framework maintains more stable recognition performance under a wide range of stylization and distortion conditions. The results indicate that performance varies across face shapes, with particularly strong robustness observed for geometrically structured templates such as Rectangle and Diamond, while highly stylized or abstract face drawings remain more challenging for all evaluated methods. These observations are consistent across both the original and augmented datasets and reflect the practical constraints encountered in hand-drawn makeup template analysis.
These findings underscore the importance of geometry-aware processing for non-photorealistic facial illustrations, providing a solid foundation for the future development of automated scoring systems in beauty skill assessments. The demonstrated improvements in landmark stability, error reduction, and cross-style consistency position the proposed approach as a practical technical baseline for standardized evaluation, educational feedback, and broader applications involving stylized human-face representations. Beyond makeup face templates, the geometry-aware design of the proposed framework suggests potential applicability to other non-photorealistic facial representations, such as sketches or educational illustrations, although empirical validation on these domains remains a topic for future research.

Author Contributions

Conceptualization, C.C. and C.-H.H.; Methodology, C.C. and Y.-Y.F.; Software, C.C.; Writing—original draft, C.C.; Writing—review & editing, C.C., Y.-Y.F. and C.-H.H.; Supervision, Y.-Y.F.; Project administration, Y.-Y.F.; Funding acquisition, Y.-Y.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science and Technology Council (Taiwan) under Grant numbers NSTC 113-2221-E-030-015 and NSTC 114-2221-E-030-017.

Institutional Review Board Statement

Not applicable. This study did not involve human participants or animals.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors acknowledge the use of ChatGPT-5 (OpenAI) as a language assistance tool to improve the English language and readability of this manuscript. All content has been carefully reviewed and revised by the authors to ensure its accuracy, validity, and academic rigor.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
GFE: Geometric Feature Enhancement
Dlib: Dlib Facial Landmark Detection Library
MTCNN: Multi-Task Cascaded Convolutional Neural Network
SVM: Support Vector Machine
MNE: Mean Normalized Error
FP: False Positive
TP: True Positive
FN: False Negative
AI: Artificial Intelligence
XAI: Explainable Artificial Intelligence

Appendix A

This appendix provides the full procedural specifications of the algorithms used in this study. Algorithm A1 presents the complete geometric enhancement process, while Algorithm A2 summarizes the full detection–refinement–classification pipeline used for evaluation.
Algorithm A1. Geometric Feature Enhancement (GFE) Procedure
Input:
 L—initial 68-point landmark predictions
Output:
 L*—refined landmark set after geometric enhancement
Step 1: Region Segmentation
1.1 Group landmarks into semantic regions
 R = {forehead, eyebrows, eyes, nose, lips, cheeks, jawline}.
Step 2: Proportional Normalization
2.1 Compute structural ratios:
 –brow–eye height
 –nose–lip distance
 –cheekbone width
 –chin depth
2.2 Normalize each region to reference geometric proportions.
Step 3: Curvature Correction
3.1 Fit cubic Bézier curves along eyebrow, lip, and jawline contours.
3.2 Adjust outlier points toward the smoothed curve paths.
Return: L*
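To make Algorithm A1 concrete, the following Python sketch implements a simplified version of the three steps, assuming NumPy only. The landmark index ranges follow the standard Dlib 68-point convention; the reference aspect ratio and the cubic polynomial fit (used here in place of an explicit cubic Bézier fit) are illustrative placeholders rather than the authors' exact parameters.

import numpy as np

# Contours refined in Step 3, indexed by the standard Dlib 68-point convention.
CONTOURS = {
    "jawline":    list(range(0, 17)),
    "left_brow":  list(range(17, 22)),
    "right_brow": list(range(22, 27)),
    "upper_lip":  list(range(48, 55)),   # outer upper-lip contour
}

def smooth_contour(points, degree=3, pull=0.5):
    # Fit a cubic curve along the contour and pull each point part-way toward it.
    t = np.arange(len(points))
    fitted = np.stack([np.polyval(np.polyfit(t, points[:, 0], degree), t),
                       np.polyval(np.polyfit(t, points[:, 1], degree), t)], axis=1)
    return (1.0 - pull) * points + pull * fitted

def geometric_feature_enhancement(landmarks, ref_aspect_ratio=1.3):
    # landmarks: (68, 2) array of initial Dlib predictions; returns a refined copy.
    L = np.asarray(landmarks, dtype=float).copy()

    # Step 2 (simplified): rescale vertical proportions so width/height matches a reference ratio.
    width = L[16, 0] - L[0, 0]            # jaw corner to jaw corner
    height = L[8, 1] - L[27, 1]           # nose bridge to chin
    if height > 0 and width > 0:
        scale = (width / ref_aspect_ratio) / height
        center_y = L[:, 1].mean()
        L[:, 1] = center_y + (L[:, 1] - center_y) * scale

    # Step 3: curvature correction along jawline, eyebrows, and lips.
    for idx in CONTOURS.values():
        L[idx] = smooth_contour(L[np.array(idx)])
    return L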
Algorithm A2. Full Detection–Refinement–Classification Pipeline
Input:
 I—hand-drawn face template
 M_det—face detector (Haar, MTCNN_MobileNetV2, or original Dlib)
 M_lmk—Dlib 68-point landmark predictor
 GFE()—geometric enhancement module (Algorithm A1)
 M_cls—classifier for face-shape prediction
Output:
 y_pred—predicted face-shape label
 L*—refined landmark set
Step 1: Preprocessing
1.1 Convert image to grayscale and normalize.
1.2 Apply brightness/contrast correction.
1.3 Resize for detector compatibility.
Step 2: Face Detection
2.1 Perform detection using M_det.
2.2 If detection fails → return unclassified.
Step 3: Landmark Initialization
3.1 Crop detected face region.
3.2 Predict initial 68 landmarks with M_lmk.
3.3 Segment landmarks into structural facial regions.
Step 4: Geometric Enhancement
4.1 Apply GFE to obtain refined landmarks L*.
Step 5: Feature Extraction
5.1 Compute geometric descriptors:
 –interocular distance
 –boundary ratios
 –chin curvature
 –cheekbone width
 –global symmetry index
5.2 Construct feature vector F.
Step 6: Shape Classification
6.1 Predict face-shape label: y_pred = M_cls(F).
6.2 Return y_pred and L*.
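For reference, a compact end-to-end sketch of Algorithm A2 using the public Dlib API (frontal face detector plus the standard 68-point shape predictor file) together with the GFE sketch after Algorithm A1; the model file path, the extract_descriptors helper, and the pre-trained classifier are assumptions for illustration only.

import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # standard Dlib model file

def detect_refine_classify(image_path, classifier):
    # Steps 1-6 of Algorithm A2 for a single hand-drawn template image.
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)                       # Step 1: simple contrast correction

    faces = detector(gray, 1)                           # Step 2: face detection (one upsampling pass)
    if len(faces) == 0:
        return "unclassified", None

    shape = predictor(gray, faces[0])                   # Step 3: initial 68-point landmarks
    landmarks = np.array([[p.x, p.y] for p in shape.parts()], dtype=float)

    refined = geometric_feature_enhancement(landmarks)  # Step 4: GFE (see the sketch above)

    features = extract_descriptors(refined)             # Step 5: hypothetical descriptor helper
    label = classifier.predict([features])[0]           # Step 6: face-shape prediction
    return label, refined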

References

  1. Virakul, S. Makeup: A Genderless Form of Artistic Expression Explored by Content Creators and Their Followers. J. Stud. Res. 2023, 11, 1–16. [Google Scholar] [CrossRef]
  2. Gemtou, E. Subjectivity in Art History and Art Criticism. Rupkatha J. Interdiscip. Stud. Humanit. 2010, 2, 2–13. [Google Scholar] [CrossRef]
  3. Valencia, J.; Pineda, G.G.; Pineda, V.G.; Valencia-Arias, A.; Arcila-Diaz, J.; de la Puente, R.T. Using machine learning to predict artistic styles: An analysis of trends and the research agenda. Artif. Intell. Rev. 2024, 57, 118. [Google Scholar] [CrossRef]
  4. Egon, K.; Potter, K.; Lord, M.L. AI in Art and Creativity: Exploring the Boundaries of Human–Machine Collaboration. OSF Preprint 2023, preprint. [Google Scholar] [CrossRef]
  5. Ugail, H.; Stork, D.G.; Edwards, H.; Seward, S.C.; Brooke, C. Deep Transfer Learning for Visual Analysis and Attribution of Paintings by Raphael. Herit. Sci. 2023, 11, 268. [Google Scholar] [CrossRef]
  6. Gupta, A.; Mithun, N.C.; Rudolph, C.; Roy-Chowdhury, A.K. Deep Learning Based Identity Verification in Renaissance Portraits. In Proceedings of the 2018 IEEE International Conference on Multimedia and Expo (ICME), San Diego, CA, USA, 23–27 July 2018; IEEE: New York, NY, USA, 2018; pp. 1–6. [Google Scholar] [CrossRef]
  7. Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Process. Lett. 2016, 23, 1499–1503. [Google Scholar] [CrossRef]
  8. Deng, J.; Guo, J.; Ververas, E.; Kotsia, I.; Zafeiriou, S. RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: New York, NY, USA, 2020; pp. 5202–5211. [Google Scholar] [CrossRef]
  9. Al-Nuimi, A.M.; Mohammed, G.J. Face Direction Estimation based on Mediapipe Landmarks. In Proceedings of the 2021 7th International Conference on Contemporary Information Technology and Mathematics (ICCITM), Mosul, Iraq, 25–26 August 2021; IEEE: New York, NY, USA, 2021; pp. 185–190. [Google Scholar] [CrossRef]
  10. Parkhi, O.M.; Vedaldi, A.; Zisserman, A. Deep Face Recognition. In Proceedings of the British Machine Vision Conference (BMVC), Swansea, UK, 7–10 September 2015; Xie, X., Jones, M.W., Tam, G.K.L., Eds.; BMVA Press: Malvern, UK, 2015; pp. 41.1–41.12. [Google Scholar] [CrossRef]
  11. Li, C.; Chen, T. Aesthetic Visual Quality Assessment of Paintings. IEEE J. Sel. Top. Signal Process. 2009, 3, 236–252. [Google Scholar] [CrossRef]
  12. Liu, L.; Guo, X.; Bai, R.; Li, W. Image Aesthetic Assessment Based on Attention Mechanisms and Holistic Nested Edge Detection. In Proceedings of the 2022 Asia Conference on Advanced Robotics, Automation, and Control Engineering (ARACE), Qingdao, China, 26–28 August 2022; IEEE: New York, NY, USA, 2022; pp. 70–75. [Google Scholar] [CrossRef]
  13. Lee, J.-T.; Lee, C.; Kim, C.-S. Property-Specific Aesthetic Assessment with Unsupervised Aesthetic Property Discovery. IEEE Access 2019, 7, 114349–114362. [Google Scholar] [CrossRef]
  14. Kao, Y.; Wang, C.; Huang, K. Visual aesthetic quality assessment with a regression model. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; IEEE: New York, NY, USA, 2015; pp. 1583–1587. [Google Scholar] [CrossRef]
  15. Dlib C++ Library. Available online: http://dlib.net (accessed on 1 February 2025).
  16. Kazemi, V.; Sullivan, J. One millisecond face alignment with an ensemble of regression trees. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; IEEE: New York, NY, USA, 2014; pp. 1867–1874. [Google Scholar] [CrossRef]
  17. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE: New York, NY, USA, 2015; pp. 815–823. [Google Scholar] [CrossRef]
  18. Feng, Y.; Yu, S.; Peng, H.; Li, Y.-R.; Zhang, J. Detect Faces Efficiently: A Survey and Evaluations. IEEE Trans. Biom. Behav. Identity Sci. 2022, 4, 1–18. [Google Scholar] [CrossRef]
  19. Kaur, S.; Sharma, D. Comparative Study of Face Detection Using Cascaded Haar, Hog and MTCNN Algorithms. In Proceedings of the 2023 3rd International Conference on Advancement in Electronics & Communication Engineering (AECE), Ghaziabad, India, 23–24 November 2023; IEEE: New York, NY, USA, 2023; pp. 536–541. [Google Scholar] [CrossRef]
  20. Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, Kauai, HI, USA, 8–14 December 2001; IEEE: New York, NY, USA, 2001; p. I. [Google Scholar] [CrossRef]
  21. Zhang, N.; Luo, J.; Gao, W. Research on Face Detection Technology Based on MTCNN. In Proceedings of the 2020 International Conference on Computer Network, Electronic and Automation (ICCNEA), Xi’an, China, 25–27 September 2020; IEEE: New York, NY, USA, 2020; pp. 154–158. [Google Scholar] [CrossRef]
  22. Areeb, Q.M.; Imam, R.; Fatima, N.; Nadeem, M. AI Art Critic: Artistic Classification of Poster Images using Neural Networks. In Proceedings of the 2021 International Conference on Data Analytics for Business and Industry (ICDABI), Sakheer, Bahrain, 25–26 October 2021; IEEE: New York, NY, USA, 2021; pp. 37–41. [Google Scholar] [CrossRef]
  23. Lyu, Y.; Jiang, Y.; He, Z.; Peng, B.; Liu, Y.; Dong, J. 3D-Aware Adversarial Makeup Generation for Facial Privacy Protection. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13438–13453. [Google Scholar] [CrossRef]
  24. Rathnayake, R.; Madhushan, N.; Jeeva, A.; Darshani, D.; Subasinghe, A.; Silva, B.N.; Wijesinghe, L.P.; Wijenayake, U. Current Trends in Human Pupil Localization: A Review. IEEE Access 2023, 11, 115836–115853. [Google Scholar] [CrossRef]
  25. Zhang, W.; Zhao, X.; Morvan, J.-M.; Chen, L. Improving Shadow Suppression for Illumination Robust Face Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 611–624. [Google Scholar] [CrossRef]
  26. Zhang, H.; Wang, Z.; Hou, J. Makeup Removal for Face Verification Based Upon Deep Learning. In Proceedings of the 2021 IEEE 6th International Conference on Signal and Image Processing (ICSIP), Nanjing, China, 22–24 October 2021; IEEE: New York, NY, USA, 2021; pp. 446–450. [Google Scholar] [CrossRef]
  27. Padmashree, G.; Kotegar, K.A. Skin Segmentation-Based Disguised Face Recognition Using Deep Learning. IEEE Access 2024, 12, 51056–51072. [Google Scholar] [CrossRef]
  28. Liu, C.; Hirota, K.; Ma, J.; Jia, Z.; Dai, Y. Facial Expression Recognition Using Hybrid Features of Pixel and Geometry. IEEE Access 2021, 9, 18876–18889. [Google Scholar] [CrossRef]
  29. Kim, J.-H.; Kim, B.-G.; Roy, P.P.; Jeong, D.-M. Efficient Facial Expression Recognition Algorithm Based on Hierarchical Deep Neural Network Structure. IEEE Access 2019, 7, 41273–41285. [Google Scholar] [CrossRef]
  30. Jaber, A.G.; Muniyandi, R.C.; Usman, O.L.; Singh, H.K.R. A Hybrid Method of Enhancing Accuracy of Facial Recognition System Using Gabor Filter and Stacked Sparse Autoencoders Deep Neural Network. Appl. Sci. 2022, 12, 11052. [Google Scholar] [CrossRef]
  31. Zheng, S.; Xu, Z.; Li, Z.; Cai, Y.; Han, M.; Ji, Y. An Intelligent Scoring Method for Sketch Portrait Based on Attention Convolution Neural Network. In Proceedings of the 2022 IEEE Smartworld, Ubiquitous Intelligence & Computing, Scalable Computing & Communications, Digital Twin, Privacy Computing, Metaverse, Autonomous & Trusted Vehicles (SmartWorld/UIC/ScalCom/DigitalTwin/PriComp/Meta), Haikou, China, 15–18 December 2022; IEEE: New York, NY, USA, 2022; pp. 1058–1064. [Google Scholar] [CrossRef]
  32. Shreya, R.; Mulgund, A.P.; Hiremath, S.; A, S.H.; Koundinya, A.K. Comparative Analysis of Traditional and Machine Learning Based Face Recognition Models. In Proceedings of the 2023 IEEE 2nd International Conference on Data, Decision and Systems (ICDDS), Mangaluru, India, 1–2 December 2023; IEEE: New York, NY, USA, 2023; pp. 1–6. [Google Scholar] [CrossRef]
  33. John, T.A.; Balasubramanian, V.N.; Jawahar, C.V. Canonical Saliency Maps: Decoding Deep Face Models. IEEE Trans. Biom. Behav. Identity Sci. 2021, 3, 561–572. [Google Scholar] [CrossRef]
  34. Eman, M.; Mahmoud, T.M.; Ibrahim, M.M.; Abd El-Hafeez, T. Innovative Hybrid Approach for Masked Face Recognition Using Pretrained Mask Detection and Segmentation, Robust PCA, and KNN Classifier. Sensors 2023, 23, 6727. [Google Scholar] [CrossRef] [PubMed]
  35. Saabia, A.A.R.; El-Hafeez, T.A.; Zaki, A.M. Face Recognition Based on Grey Wolf Optimization for Feature Selection. In Proceedings of the International Conference on Advanced Intelligent Systems and Informatics (AISI), Cairo, Egypt, 1–3 September 2018; Springer: Cham, Switzerland, 2018; pp. 273–283. [Google Scholar] [CrossRef]
Figure 1. Representative portrait templates for five face shape categories—Circle, Diamond, Rectangle, Square, and Triangle—each paired with a corresponding makeup design. These hand drawn templates form the basis of the proposed dataset and illustrate the diversity of facial structures addressed in this study.
Figure 2. Author-generated illustration of the MTCNN face-detection architecture used as a baseline model in this study. This figure is an original illustration created by the authors and does not involve any third-party copyrighted material. The system consists of three cascaded stages: (1) P-Net for candidate proposal, (2) R-Net for refinement, and (3) O-Net for final facial landmark localization. The illustration demonstrates the hierarchical processing flow used by the baseline MTCNN model [7]. Different colored bounding boxes indicate the outputs of the P-Net (yellow), R-Net (blue), and O-Net (green) stages, while red dots denote the detected facial landmarks.
Figure 3. Author-generated illustration of Dlib’s 68-point facial landmark detection. The red dots represent the detected facial landmarks.
Figure 4. Workflow of the proposed GFE-based facial recognition system for 2D makeup face template analysis. The system begins with image input and preprocessing, followed by facial landmark detection using Dlib. Geometric regions are extracted and evaluated based on makeup pixel coverage and structural rules. All outputs are logged, and optional scoring modules may be integrated in future implementations.
Figure 5. Geometric feature segmentation results for makeup face template analysis. The figure illustrates the regional extraction of structural facial components—such as eyes, eyebrows, nose, lips, and jawline—based on contour curvature and proportional measurements, supporting robust facial feature recognition under varied conditions. The green frames indicate the segmented facial regions corresponding to different geometric features.
Figure 6. Geometric feature-based definition of facial regions across five distinct face shapes. Each region—such as the forehead, eyes, nose, lips, and chin—is delineated based on contour geometry and facial proportions, enabling consistent structural representation across shape categories in 2D makeup face template analysis. Different colored frames indicate the segmented facial regions corresponding to specific geometric features.
Figure 7. Confusion Matrix of the Haar Cascade classifier across five face shape categories in the 11,600-image augmented dataset. The diagonal entries represent correctly classified instances, with the Haar model achieving highest accuracy on the Square shape (330 correct), while exhibiting the most severe misclassification for Circle and Diamond, which were frequently predicted as Square. The dominance of misclassification in the Square column suggests feature overlap and low discriminative power under artistic distortion.
Figure 8. ROC curves for the Haar Cascade classifier across five facial shape categories. They illustrate the trade-off between true positive rate and false positive rate at varying thresholds. The highest AUC value (0.65) was recorded for Rectangle, whereas Diamond showed the weakest performance (AUC = 0.57). All AUC values remain below 0.70, underscoring the limited discriminative capability of the algorithm on hand-drawn facial templates.
Figure 9. Confusion matrix illustrating the prediction performance of the MTCNN-MobileNetV2 model across five face shape categories. Correct classifications appear along the diagonal, with the Circle category achieving the highest accuracy (345 true positives). The Square category, however, exhibited a high rate of misclassification as Circle (214 cases), suggesting potential overlap in low-level feature representations.
Figure 10. ROC curve analysis for MTCNN-MobileNetV2 across five facial shape categories. While AUC scores remain in a narrow range (0.77–0.79), the subtle differences reflect the model’s varying ability to distinguish between classes, with Rectangle achieving the highest class separability.
Figure 11. Confusion matrix of the Dlib-based classifier for 2D hand-drawn facial shapes. Dlib demonstrates high accuracy in identifying face shape templates, but shows confusion between structurally similar categories.
Figure 12. ROC curve of the Dlib-based classifier for multi-class 2D face shape classification. The Rectangle category achieved the highest AUC (0.97), followed by Circle (0.94), Triangle (0.93), Diamond (0.91), and Square (0.90). All curves lie significantly above the random baseline, indicating high discriminative power and consistent recognition performance across categories.
Figure 13. Detection results of the Dlib 68-point landmark model across five face shape templates. The figure demonstrates accurate facial feature localization for all categories, highlighting the method’s effectiveness in supporting automated scoring of hand-drawn makeup designs. The green frames denote the detected facial regions, while the dots indicate the localized facial landmark points.
Figure 14. Detection performance comparison of three facial detection techniques—Dlib, MTCNN-MobileNetV2, and Haar Cascade—on 200 hand-drawn face templates. While all methods achieved similar true positive rates, Dlib produced no false positives, indicating the highest precision. Haar Cascade exhibited the most false detections, but still maintained a basic level of detection performance.
Figure 15. Detection performance comparison of three facial detection techniques—Dlib, MTCNN-MobileNetV2, and Haar Cascade—on the full dataset of 11,600 hand-drawn face templates. Dlib maintained the lowest false positive rate and produced the fewest unclassified cases, demonstrating the highest overall robustness. Both Dlib and MTCNN-MobileNetV2 showed stable accuracy comparable to their performance under standard conditions.
Table 1. Comparison between general facial recognition and 2D Makeup paper template recognition.
Aspect | General Facial Recognition | 2D Makeup Paper Template Recognition
Data Dimension | 3D images (with depth information) | 2D planar images
Texture Information | Rich features such as skin details, pores, and wrinkles | Relies on hand-drawn lines and color expressions
Impact of Lighting Variations | Shadows and lighting variations assist in depth estimation | Fixed lighting; shadows cannot provide depth cues
Dynamic Features | Facial expressions, lip movements, and other micro-dynamics | Completely static, unable to capture facial dynamics
Feature Stability | Standardized landmark detection via deep learning | Affected by drawing errors; facial proportions may be inaccurate
Color Consistency | Uniform skin tones, influenced by environmental lighting | Dependent on drawing tools (e.g., pencils, watercolor, markers), leading to potential color inconsistencies
Table 2. Dataset distribution by category and dataset split.
Category | Training Set | Validation Set | Test Set | Total
Circle | 1280 | 480 | 560 | 2320
Diamond | 1280 | 480 | 560 | 2320
Rectangle | 1280 | 480 | 560 | 2320
Square | 1280 | 480 | 560 | 2320
Triangle | 1280 | 480 | 560 | 2320
Total | 6400 | 2400 | 2800 | 11,600
Table 3. Symbols and Definitions Used in the GFE-Based Landmark Evaluation.
Symbol | Meaning
P | The complete set of facial landmarks, consisting of 68 two-dimensional coordinate points.
{p1, p2, …, p68} | Elements of the set P, representing individual facial feature points from point 1 to 68.
pi | A single facial landmark point (e.g., eye corner, nose tip, mouth corner), where i = 1 to 68.
∈ | Mathematical symbol meaning "is an element of" or "belongs to."
ℝ² | Denotes the 2D Cartesian coordinate space: each point has coordinates (xi, yi).
Deye | Interocular distance: Euclidean distance between eye corner landmarks p36 and p45.
rf | Facial aspect ratio: width-to-height ratio based on selected facial landmarks.
wf, hf | Facial width and height, respectively, used in computing rf and symmetry metrics.
Rsym | Facial symmetry index based on normalized deviation of paired horizontal landmarks.
xiL, xiR | Horizontal coordinates of symmetric landmark pairs (left/right), used in symmetry calculation.
Cchin | Chin curvature index: second derivative of vertical coordinates over a range of chin landmarks.
Δgc | Geometric center offset: distance between the average landmark center and the ideal facial center.
cideal | Ideal geometric center from a template face, used for landmark alignment comparison.
Sgeo | Overall geometric scoring function combining all metrics with weights.
wi | Normalized weight coefficient (wi ∈ [0, 1]) for each geometric metric, summing to 1.
fi(gi) | Normalization function applied to each geometric metric gi to scale it into a unified score range.
gi | A geometric metric used in the scoring function (e.g., Deye, rf, Rsym, Cchin, Δgc).
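For illustration, the quantities defined above can be computed from a (68, 2) landmark array roughly as follows; the symmetry pairs, the chin index range, and the normalization functions fi are the authors' design choices and are only approximated in this sketch.

import numpy as np

def geometric_descriptors(L, c_ideal):
    # L: (68, 2) landmark array; c_ideal: ideal geometric center from a template face.
    L = np.asarray(L, dtype=float)
    d_eye = np.linalg.norm(L[45] - L[36])                 # D_eye: distance between p36 and p45
    w_f = L[16, 0] - L[0, 0]                              # facial width (jaw corners)
    h_f = L[8, 1] - L[27, 1]                              # facial height (nose bridge to chin)
    r_f = w_f / h_f                                       # facial aspect ratio
    mid_x = 0.5 * (L[0, 0] + L[16, 0])
    pairs = [(i, 16 - i) for i in range(8)]               # mirrored jawline pairs (illustrative)
    r_sym = np.mean([abs((mid_x - L[i, 0]) - (L[j, 0] - mid_x))
                     for i, j in pairs]) / w_f            # normalized symmetry deviation
    c_chin = np.mean(np.abs(np.diff(L[5:12, 1], n=2)))    # second difference over chin landmarks
    delta_gc = np.linalg.norm(L.mean(axis=0) - np.asarray(c_ideal, dtype=float))
    return np.array([d_eye, r_f, r_sym, c_chin, delta_gc])

def geometric_score(g, weights, normalizers):
    # S_geo = sum_i w_i * f_i(g_i), with the weights w_i summing to 1.
    return sum(w * f(gi) for w, f, gi in zip(weights, normalizers, g))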
Table 4. Performance Metrics of the Haar Cascade Classifier.
Model | Accuracy | Precision | Recall | F1-Score
Haar | 0.3109 | 0.5319 | 0.3109 | 0.2523
Table 5. Model performance comparison for MTCNN-MobileNetV2 in 2D facial feature classification.
Model | Accuracy | Precision | Recall | F1-Score
MTCNN-MobileNet | 0.4960 | 0.5188 | 0.8385 | 0.4960
Table 6. Model performance metrics for Dlib algorithm in 2D facial feature classification.
Model | Accuracy | Precision | Recall | F1-Score
Dlib | 0.7477 | 0.7833 | 0.7477 | 0.7527
Table 7. Detection results on 200 sample images.
Technology | Total Images | Total Detections | True Positives | False Positives
Haar Cascade | 200 | 36 | 23 | 13
MTCNN-MobileNetV2 | 200 | 30 | 28 | 2
Dlib | 200 | 200 | 200 | 0
Table 8. Detection Results on 11,600 Sample Images.
Technology | Total Images | Total Detections | True Positives | False Positives | Unclassified
Haar Cascade | 11,600 | 1768 | 1372 | 396 | 9832
MTCNN-MobileNetV2 | 11,600 | 1980 | 1784 | 196 | 9620
Dlib | 11,600 | 11,216 | 11,188 | 28 | 384
Table 9. Overall model performance.
Model | Accuracy | Precision | Recall | F1-Score
Dlib + GFE | 0.6052 | 0.6087 | 0.6052 | 0.6012
MTCNN-MobileNetV2 | 0.2345 | 0.7232 | 0.2345 | 0.1372
Haar Cascade | 0.2029 | 0.2601 | 0.2029 | 0.0736
Table 10. Recognition Accuracy by Face Shape.
Face Shape | Haar Cascade | MTCNN-MobileNetV2 | Dlib + GFE
Circle_R | 0.0000 | 0.3428 | 0.5619
Diamond_R | 0.0000 | 0.1069 | 0.6078
Rectangle_R | 0.0169 | 0.0879 | 0.7772
Square_R | 0.3341 | 0.0769 | 0.5440
Triangle_R | 0.0170 | 0.0712 | 0.5149
Table 11. Summary of representative related studies, highlighting their key contributions, identified limitations, and how the proposed GFE approach addresses existing gaps.
Study | Key Contributions | Identified Limitations | How This Study Addresses the Gaps
Zheng et al. [31] | Sketch portrait scoring using attention-based CNN models | Focuses on sketch aesthetics; lacks stable landmark detection for stylized templates | Introduces geometric enhancement to stabilize landmark localization under stylistic distortions
Lyu et al. [23] | 3DAM-GAN for adversarial makeup image generation | Designed for generation/translation tasks, not for landmark analysis | Leverages geometric modeling to maintain landmark structure under stylized or occluded regions
Shreya et al. [32] | Comparative evaluation of face-recognition models (Haar, MTCNN, Dlib) | Tested on real-face datasets only; limited coverage of hand-drawn or stylized faces | Extends the evaluation to artistic/stylized templates and validates robustness under distortions
Rathnayake et al. [24] | Comprehensive review of pupil localization | Focuses on partial-region detection; lacks full-face geometric alignment | Incorporates holistic geometric reasoning to support landmark stability
Liu et al. [28] | Hybrid pixel–geometric models for facial expression recognition | Not optimized for stylized drawings or face-template distortion | Extends geometric modeling to handle asymmetric contours and proportion exaggeration
Eman et al. [34] | Mask-face recognition using PCA, RPCA, KNN | Models trained on partially occluded faces; performance unstable under abstraction | Applies geometric enhancement to mitigate structural loss in hand-drawn faces
Saabia et al. [35] | Feature selection using GWO with PCA and Gabor filters | Focuses on stylized abstraction but lacks landmark-level constraints | Incorporates geometric constraints to improve landmark reliability under stylized variations